Introduction

Distributed training is necessary for large model training tasks such as neural architecture search supernets, diffusion models, or large language models. This document describes the steps to configure machines and launch distributed training tasks with the PyTorch framework on GPU servers.


Server Configuration

Step 1. Disable Firewall and IPv6

[WARNING] This step may expose the machines to certain threats and attacks.

Disable firewall

$ sudo ufw status
Status: active
$ sudo ufw disable
$ sudo ufw status
Status: inactive

Disable IPv6 (note that these sysctl -w settings do not persist across reboots):

$ sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
$ sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1

Step 2. Export NCCL_SOCKET_IFNAME

Some of our machines do not use the default eth0 interface, so we have to set NCCL_SOCKET_IFNAME correctly on all of the machines.

To check the interface name:

$ ifconfig     
docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500                            
        inet xx.xx.xx.xx  netmask 255.255.0.0  broadcast xx.xx.xx.xx          
        ether 02:42:fe:49:77:fe  txqueuelen 0  (Ethernet)                        
        RX packets 0  bytes 0 (0.0 B)                                            
        RX errors 0  dropped 0  overruns 0  frame 0                              
        TX packets 0  bytes 0 (0.0 B)                                            
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0               
                                                                                 
eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500                       
        inet xx.xx.123.214  netmask 255.255.255.192  broadcast xx.xx.xx.xx   
        ether 3c:ec:ef:05:1f:ca  txqueuelen 1000  (Ethernet)                     
        RX packets 478564990  bytes 700182821494 (700.1 GB)                      
        RX errors 0  dropped 6  overruns 2135  frame 0                           
        TX packets 880702  bytes 274251519 (274.2 MB)                            
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0               
        device memory 0xc1320000-c133ffff                                        
                                                                                 
eno2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500                       
        inet xx.xx.123.215  netmask 255.255.255.192  broadcast xx.xx.xx.xx  
        ether 3c:ec:ef:05:1f:cb  txqueuelen 1000  (Ethernet)                     
        RX packets 136141633  bytes 187956572012 (187.9 GB)                      
        RX errors 0  dropped 6  overruns 2335  frame 0                           
        TX packets 178699579  bytes 168433035752 (168.4 GB)                      
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0               
        device memory 0xc1300000-c131ffff         

Install net-tools if you see the following error:

Command 'ifconfig' not found, but can be installed with:
apt install net-tools
Please ask your administrator.

Export the corresponding interface name:

# Both eno1 and eno2 should work.
$ export NCCL_SOCKET_IFNAME=eno2
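
Alternatively, the interface can be set from inside the training script before torch.distributed initializes NCCL. A minimal sketch, assuming the interface is named eno2 as in the ifconfig output above:

import os

# Must run before torch.distributed initializes NCCL.
# "eno2" is the interface name from the ifconfig output above;
# substitute your own.
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eno2")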


Step 3. Collect the Cluster IP Addresses

Get the IP addresses of the machines in the cluster. Note that each IP address should be the one bound to the interface we exported via NCCL_SOCKET_IFNAME. For example:

lovelace: xx.xx.120.112
hopper: xx.xx.121.104
borg: xx.xx.122.109
johnson: xx.xx.123.214
allen: xx.xx.124.214
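
Before launching, it can save time to verify that each worker node can reach the rendezvous endpoint. A minimal sketch; run it on a worker after torchrun has started on the host, since the port only listens once the rendezvous is up. The address below is a placeholder for lovelace:

import socket

# Placeholder for lovelace's address; use the real IP bound to
# your NCCL_SOCKET_IFNAME interface.
MASTER_ADDR, MASTER_PORT = "10.0.120.112", 29500

try:
    with socket.create_connection((MASTER_ADDR, MASTER_PORT), timeout=5):
        print("rendezvous endpoint reachable")
except OSError as e:
    print(f"cannot reach {MASTER_ADDR}:{MASTER_PORT}: {e}")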


Step 4. Run the Training Scripts

Lei Mao wrote a good toy example that we can start with. Here we use 2 machines (--nnodes) as an example. On each node, 4 GPUs (--nproc-per-node) will be used, so the world size is 8 (8 GPUs in total).

# Host: (--node-rank 0)
torchrun --nnodes=2 --nproc-per-node=4 --node-rank=0 \
        --rdzv-id=456 --rdzv-backend=c10d   \
        --rdzv-endpoint=xx.xx.120.112:29500 \
        train.py --param1 xxx --param2 yyy …
# Client: (--node-rank 1)
torchrun --nnodes=2 --nproc-per-node=4 --node-rank=1 \
        --rdzv-id=456 --rdzv-backend=c10d   \
        --rdzv-endpoint=xx.xx.120.112:29500 \
        train.py --param1 xxx --param2 yyy …
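
For reference, a minimal train.py that works with the torchrun commands above could look like the sketch below. This is the standard DDP setup rather than Lei Mao's exact example; torchrun exports LOCAL_RANK, RANK, and WORLD_SIZE into each process's environment.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK for each spawned process.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Toy model; DDP all-reduces the gradients across all 8 GPUs.
    model = torch.nn.Linear(10, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(10):
        inputs = torch.randn(32, 10, device=local_rank)
        targets = torch.randn(32, 1, device=local_rank)
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if dist.get_rank() == 0:
            print(f"step {step} loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()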


Network Bandwidth

PyTorch relies on the network connection to exchange gradients and updated weights between nodes. Therefore, a high-bandwidth machine serving as the master will speed up the training.

Test Bandwidth without sudo

$ curl -s https://raw.githubusercontent.com/sivel/speedtest-cli/master/speedtest.py | python -
...
...
Testing download speed................................................................................
Download: 88.85 Mbit/s
Testing upload speed......................................................................................................
Upload: 11.87 Mbit/s

Test Bandwidth with sudo

With sudo privilege, we can check the rated link speed of a specific network interface, for example enp194s0f0:

$ sudo ethtool enp194s0f0 | grep -i speed
        Speed: 10000Mb/s
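
Note that ethtool reports the rated link speed, not what the cluster actually sustains. To measure effective throughput between nodes, one option is to time a large all_reduce over NCCL. A minimal sketch to launch with the same torchrun commands as above; the 256 MB payload size is an arbitrary choice:

import os
import time
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # 64M float32 elements = 256 MB; large enough to amortize latency.
    payload = torch.randn(64 * 1024 * 1024, device=local_rank)

    # Warm up NCCL before timing.
    for _ in range(3):
        dist.all_reduce(payload)
    torch.cuda.synchronize()

    iters = 10
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(payload)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    if dist.get_rank() == 0:
        gbytes = payload.numel() * 4 * iters / 1e9
        print(f"~{gbytes / elapsed:.2f} GB/s effective all_reduce throughput")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()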


Bandwidth Monitoring Tools

Bmon is a good network bandwidth monitoring tool. Install it with the command sudo apt install bmon.

Here is a snapshot of bmon taken during distributed training; the weights pass through enp194s0f1.


Troubleshooting


Hostname Not Known Issue

Error message:

[W socket.cpp:601] [c10d] The IPv6 network addresses of (ece-a58489.austin.utexas.edu, 44685) cannot be retrieved (gai error: -2 - Name or service not known).

Solution: add the hostname and IP address to /etc/hosts. This requires sudo privileges.

$ cat /etc/hosts
127.0.0.1 localhost
127.0.1.1 ece-a55028
xx.xx.120.112 ece-a58489.yyy.zzz # add ip for lovelace

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
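
To confirm the fix took effect, here is a quick sketch that checks whether each hostname now resolves. The names below are examples from this cluster; substitute your own:

import socket

for host in ["ece-a58489.austin.utexas.edu", "localhost"]:
    try:
        print(host, "->", socket.gethostbyname(host))
    except socket.gaierror as e:
        print(host, "-> unresolved:", e)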


Debugging

To dump more information for debugging distributed training, we can export a few environment variables:

$ export NCCL_DEBUG=INFO
$ export TORCH_CPP_LOG_LEVEL=INFO
$ export TORCH_DISTRIBUTED_DEBUG=INFO

Debugging messages:

[I debug.cpp:49] [c10d] The debug level is set to INFO.                                                                                                                       
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.                                                                                     
WARNING:torch.distributed.run:                                                                                                                                                
*****************************************                                                                                                                                     
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************                                                                                                                                     
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (10.157.244.213, 29500).                                                    
[I socket.cpp:787] [c10d] The client socket has connected to [allen.ece.utexas.edu]:29500 on [johnson.ece.utexas.edu]:46672.                                                  
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (10.157.244.213, 29500).                                                    
[I socket.cpp:787] [c10d] The client socket has connected to [allen.ece.utexas.edu]:29500 on [johnson.ece.utexas.edu]:46682.                                                  
[I debug.cpp:49] [c10d] The debug level is set to INFO.                                                                                                                       
[I debug.cpp:49] [c10d] The debug level is set to INFO.                                                                                                                       
[I debug.cpp:49] [c10d] The debug level is set to INFO.                                                                                                                       
[I debug.cpp:49] [c10d] The debug level is set to INFO.                                                                                                                       
[I debug.cpp:49] [c10d] The debug level is set to INFO.                                                                                                                       
[I debug.cpp:49] [c10d] The debug level is set to INFO.                                                                                                                       
[I debug.cpp:49] [c10d] The debug level is set to INFO.                                                                                                                       
[I debug.cpp:49] [c10d] The debug level is set to INFO.                                                                                                                       
2023-10-09 23:39:49.119410: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-09 23:39:49.170222: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-09 23:39:49.170405: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-09 23:39:49.170545: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-09 23:39:49.851032: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-10-09 23:39:49.855787: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-10-09 23:39:49.965667: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
| distributed init (world_size 24, global rank 20, local gpu id 4): env://
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (allen.local, 32927).
[I socket.cpp:787] [c10d] The client socket has connected to [allen.ece.utexas.edu]:32927 on [johnson.ece.utexas.edu]:35816.
| distributed init (world_size 24, global rank 18, local gpu id 2): env://
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (allen.local, 32927).
[I socket.cpp:787] [c10d] The client socket has connected to [allen.ece.utexas.edu]:32927 on [johnson.ece.utexas.edu]:35822.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (allen.local, 32927).
[I socket.cpp:787] [c10d] The client socket has connected to [allen.ece.utexas.edu]:32927 on [johnson.ece.utexas.edu]:35834.
[I ProcessGroupNCCL.cpp:665] [Rank 20] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:842] [Rank 20] NCCL watchdog thread started!
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (allen.local, 32927).
[I socket.cpp:787] [c10d] The client socket has connected to [allen.ece.utexas.edu]:32927 on [johnson.ece.utexas.edu]:35840.
[I ProcessGroupNCCL.cpp:665] [Rank 18] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:842] [Rank 18] NCCL watchdog thread started!
| distributed init (world_size 24, global rank 21, local gpu id 5): env://
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (allen.local, 32927).
[I socket.cpp:787] [c10d] The client socket has connected to [allen.ece.utexas.edu]:32927 on [johnson.ece.utexas.edu]:35852.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (allen.local, 32927).
[I socket.cpp:787] [c10d] The client socket has connected to [allen.ece.utexas.edu]:32927 on [johnson.ece.utexas.edu]:35858.
[I ProcessGroupNCCL.cpp:665] [Rank 21] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1

