DGX Spark (GB10)¶
The fairy site runs three NVIDIA DGX Spark nodes (fairy-r02-dgx01 through fairy-r02-dgx03) as GPU worker nodes in the fairy-k8s01 Kubernetes cluster, providing inference and model loading capabilities.
Hardware¶
- SoC: NVIDIA GB10 (Grace CPU + Blackwell GPU via C2C interconnect)
- CPU: Grace, 12-core ARM64
- GPU: Blackwell, 128 TFLOPS
- Memory: 128 GB unified (shared CPU+GPU)
- Storage: 4 TB NVMe
- NICs: 2x Mellanox ConnectX-7 QSFP56 (exposed as 4 interfaces via PCIe lane splitting), 1x Realtek RTL8127 10G copper
Driver Selection¶
GB10 devices require the proprietary NVIDIA driver on Talos. The open-source kernel modules are not currently usable here: the production branch breaks CUDA and the LTS branch is untested.
| Driver | Package | Version | Status |
|---|---|---|---|
| Proprietary production | nonfree-kmod-nvidia-production | 595.58.03 | In use; works correctly |
| Open production | nvidia-open-gpu-kernel-modules-production | 595.x | CUDA broken (cuInit() error 3) |
| Open LTS | nvidia-open-gpu-kernel-modules-lts | 580.126.20 | Untested on Talos (stock DGX OS ships 580.126.09) |
As of Talos v1.13.0-beta.1 (tested April 2026), the open-source production
driver causes CUDA initialization failures (cuInit() error 3) in containers.
This may be resolved in future Talos or driver releases.
See the Talos proprietary GPU driver docs for setup instructions.
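The cuInit() failure described above can be checked from inside a GPU container with a few lines of Python. This is a minimal sketch, assuming that "error 3" refers to the CUDA driver API result code 3 (CUDA_ERROR_NOT_INITIALIZED) and that the driver library is exposed to the container as libcuda.so.1. The helper names `describe_cu_result` and `check_cuinit` are hypothetical.

```python
import ctypes

# Subset of CUresult codes from the CUDA driver API header (cuda.h).
CU_RESULT_NAMES = {
    0: "CUDA_SUCCESS",
    1: "CUDA_ERROR_INVALID_VALUE",
    2: "CUDA_ERROR_OUT_OF_MEMORY",
    3: "CUDA_ERROR_NOT_INITIALIZED",
    100: "CUDA_ERROR_NO_DEVICE",
}


def describe_cu_result(code):
    """Return a readable name for a CUresult code, or a generic fallback."""
    return CU_RESULT_NAMES.get(code, f"CUresult {code}")


def check_cuinit(library="libcuda.so.1"):
    """Call cuInit(0) and return its CUresult, or None if the library is absent."""
    try:
        libcuda = ctypes.CDLL(library)
    except OSError:
        return None  # driver library is not visible in this environment
    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    return libcuda.cuInit(0)


result = check_cuinit()
if result is None:
    print("libcuda.so.1 not found")
else:
    print(f"cuInit(0) -> {result} ({describe_cu_result(result)})")
```

A result of 0 indicates that the driver initialized. Any other value can be looked up in the CUresult enumeration of the installed CUDA headers.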
Talos Extensions¶
| Extension | Purpose |
|---|---|
| siderolabs/nonfree-kmod-nvidia-production | Proprietary NVIDIA kernel module |
| siderolabs/nvidia-container-toolkit-production | NVIDIA container runtime |
| siderolabs/iscsi-tools | iSCSI client |
| siderolabs/lldpd | LLDP discovery |
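One way to bake these extensions into a boot image is a Talos Image Factory schematic. The sketch below builds such a schematic as a plain dictionary and prints it as JSON, which is also valid YAML. It assumes the nodes are imaged through Image Factory, which this page does not state. The `build_schematic` helper is hypothetical, and passing a kernel argument through the schematic is only one possible place to set it.

```python
import json

EXTENSIONS = [
    "siderolabs/nonfree-kmod-nvidia-production",
    "siderolabs/nvidia-container-toolkit-production",
    "siderolabs/iscsi-tools",
    "siderolabs/lldpd",
]


def build_schematic(extensions, extra_kernel_args=()):
    """Build an Image Factory schematic as a plain dict."""
    customization = {
        "systemExtensions": {"officialExtensions": sorted(extensions)},
    }
    # Kernel arguments are optional; include the key only when some are given.
    if extra_kernel_args:
        customization["extraKernelArgs"] = list(extra_kernel_args)
    return {"customization": customization}


# Print the schematic so it can be saved to a file and submitted to the factory.
print(json.dumps(build_schematic(EXTENSIONS, ["arm64.nobti"]), indent=2))
```

Sorting the extension list keeps the generated document stable across runs, which makes diffs easier to review.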
GPU Operator¶
GPU Operator v26.3.0 cannot parse Talos version strings (e.g., v1.13.0-beta.1).
Workaround: use the patched operator image ghcr.io/nvidia/gpu-operator:670be908.
Kernel Parameters¶
arm64.nobti: required on GB10 devices. Without it, the system may crash and CUDA libraries will not load. See the Talos proprietary GPU docs and talos#13019.
Parameters from the stock DGX OS investigation (not currently applied on Talos, documented for reference):
- nvidia_drm modeset=0
- init_on_alloc=0
- pci=pcie_bus_safe
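Whether a required kernel argument is actually active on a node can be verified against the kernel command line. The following is a minimal sketch with hypothetical helper names. It treats only arm64.nobti as required, since the other parameters on this page are reference material, and the sample command line is invented for illustration.

```python
REQUIRED_ARGS = ("arm64.nobti",)


def missing_kernel_args(cmdline, required=REQUIRED_ARGS):
    """Return the required arguments that are absent from a kernel command line."""
    present = set(cmdline.split())
    return [arg for arg in required if arg not in present]


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().strip()


# Illustrative sample; on a real node, pass read_cmdline() instead.
sample = "talos.platform=metal console=tty0 arm64.nobti"
print(missing_kernel_args(sample))
```

An empty list means every required argument is present. The check matches whole whitespace-separated tokens, so a flag that appears only as a substring of another argument is not counted.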
Node Taints¶
All DGX nodes are tainted with nvidia.com/gpu=NoSchedule so only workloads
that explicitly tolerate GPU scheduling land on them. Typical workloads include
vLLM inference and model loading.
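To see which pod tolerations would allow scheduling onto the DGX nodes, the Kubernetes matching rule can be sketched in plain Python. This is a simplified illustration, not the scheduler's code. It assumes the taint has key nvidia.com/gpu, an empty value, and effect NoSchedule; the exact value is not stated on this page. The function names are hypothetical.

```python
# Assumed shape of the DGX node taint (value not confirmed by this page).
GPU_TAINT = {"key": "nvidia.com/gpu", "value": "", "effect": "NoSchedule"}


def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint."""
    # An empty effect on the toleration matches every taint effect.
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    # An empty key on the toleration matches every taint key.
    key = toleration.get("key", "")
    if key and key != taint["key"]:
        return False
    # Operator defaults to Equal; Exists ignores the value entirely.
    operator = toleration.get("operator") or "Equal"
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def can_schedule(tolerations, taints):
    """Return True if every NoSchedule taint is matched by some toleration."""
    return all(
        any(tolerates(t, taint) for t in tolerations)
        for taint in taints
        if taint["effect"] == "NoSchedule"
    )


gpu_toleration = {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
print(can_schedule([gpu_toleration], [GPU_TAINT]))
print(can_schedule([], [GPU_TAINT]))
```

Using the Exists operator in the toleration avoids depending on the taint's value, which is convenient when the value is empty or may change.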
Network Interfaces¶
Each DGX Spark has 2 physical ConnectX-7 QSFP56 ports and 1 Realtek RTL8127 10G copper port. Each physical CX-7 port is split across two PCIe lanes, presenting as two OS interfaces per physical port.
Physical Port 0 — GPU fabric via fairy-r02-fsw01¶
Connected via a 400G to 2x200G DAC breakout cable to the GPU fabric switch (MikroTik CRS804-4DDQ, fairy-r02-fsw01) for inter-node GPU communication (GPUDirect RDMA, tensor/pipeline parallelism).
| Interface | MTU |
|---|---|
| enp1s0f0np0 | 9000 |
| enP2p1s0f0np0 | 9000 |
Physical Port 1 — Not connected¶
| Interface | MTU |
|---|---|
| enp1s0f1np1 | 9000 |
| enP2p1s0f1np1 | 9000 |
Realtek RTL8127 — Primary network¶
| Interface | Speed | MTU |
|---|---|---|
| enP7s7 | 10G | 1500 |
Handles all Kubernetes and management traffic via fairy-r02-tor01 (Netgear XS724EMv2).
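The MTU values listed in this section can be compared against a running node by reading the Linux sysfs entries under /sys/class/net. The sketch below is one possible check with hypothetical function names. It assumes it runs on the node itself, or in a container that shares the host network namespace.

```python
from pathlib import Path

# Expected MTU per interface, taken from the tables in this section.
EXPECTED_MTU = {
    "enp1s0f0np0": 9000,
    "enP2p1s0f0np0": 9000,
    "enp1s0f1np1": 9000,
    "enP2p1s0f1np1": 9000,
    "enP7s7": 1500,
}


def read_mtus(names, sysfs_root="/sys/class/net"):
    """Read the current MTU of each interface; None if it cannot be read."""
    result = {}
    for name in names:
        mtu_file = Path(sysfs_root) / name / "mtu"
        try:
            result[name] = int(mtu_file.read_text().strip())
        except (OSError, ValueError):
            result[name] = None
    return result


def mtu_mismatches(expected, actual):
    """Return {interface: (expected, actual)} for every interface that differs."""
    return {
        name: (want, actual.get(name))
        for name, want in expected.items()
        if actual.get(name) != want
    }


current = read_mtus(EXPECTED_MTU)
for name, (want, got) in mtu_mismatches(EXPECTED_MTU, current).items():
    print(f"{name}: expected MTU {want}, found {got}")
```

Separating the sysfs read from the comparison keeps the comparison easy to test with fixed data and makes the reader reusable for other interface attributes.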