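All of my multi-node, multi-GPU nccl-tests runs use the same build script with identical build parameters, so today's problem was puzzling. When I ran a multi-node, multi-GPU test with openmpi + nccl-tests, it printed the following output and then exited.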
A compressed message was received by the Open MPI run time system
(PMIx) that could not be decompressed. This means that Open MPI has
compression support enabled on one node and not enabled on another.
This is an unsupported configuration.
Compression support is enabled when both of the following conditions
are met:
1. The Open MPI run time system (PMIx) is built with compression
support.
2. The necessary compression libraries (e.g., libz) can be found at
run time.
You should check that both of these conditions are true on both the
node where mpirun is invoked and all the nodes where MPI processes
will be launched. The node listed below does not have both conditions
met:
node without compression support: gn-10-25-201-1
NOTE: There may also be other nodes without compression support.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
My test command was as follows:
export LD_LIBRARY_PATH=/root/build/openmpi/lib:/root/build/nccl/lib:$LD_LIBRARY_PATH
/root/build/openmpi/bin/mpirun -np 16 \
-H node1:8,node2:8 \
--allow-run-as-root -bind-to numa -map-by slot \
-x NCCL_DEBUG=INFO \
-x NCCL_ALGO=Ring \
-x NCCL_MAX_NCHANNELS=16 \
-x NCCL_MIN_NCHANNELS=16 \
-x NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_13:1,mlx5_16:1,mlx5_17:1,mlx5_4:1,mlx5_5:1,mlx5_6:1 \
-x NCCL_IB_GID_INDEX=3 \
-x NCCL_IB_DISABLE=0 \
-x NCCL_IB_RETRY_CNT=7 \
-x NCCL_IB_TIMEOUT=23 \
-x NCCL_SOCKET_IFNAME=bond1 \
-x NCCL_NET_GDR_LEVEL=2 \
-x NCCL_IB_QPS_PER_CONNECTION=4 \
-x NCCL_IB_TC=160 \
-x NCCL_CHECKS_DISABLE=1 \
-x LD_LIBRARY_PATH=/root/build/openmpi/lib:/root/build/nccl/lib:$LD_LIBRARY_PATH \
-x PATH=$PATH \
-mca coll_hcoll_enable 0 \
-mca pml ob1 \
-mca btl_tcp_if_include bond1 \
-mca btl ^openib \
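/root/toolkit/nccl-tests-master/build/all_reduce_perf -b 1M -e 8G -f 2 -g 1
The test command itself is not the problem. Following the hint in the error message, the likely cause is the libz compression library: Open MPI was built separately on each of the two nodes, and the two builds picked up libz differently.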
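One way to probe this hypothesis directly is to ask each node's Open MPI install whether any of its libraries link against zlib. The helper below is only a sketch: `links_zlib` is a name made up for illustration, it reads `ldd` output on stdin, and a "yes" is a hint that the build picked up zlib rather than proof that PMIx compression is active.

```shell
# Sketch: report whether `ldd` output (read from stdin) mentions zlib.
# links_zlib is a hypothetical helper name, not part of Open MPI.
links_zlib() {
    # Match "libz.so" or "libz.so.1" as a whole library name, so that
    # libzstd.so.1 does not count as zlib.
    if grep -Eq '(^|[[:space:]])libz\.so(\.[0-9]+)*[[:space:]]'; then
        echo "zlib: yes"
    else
        echo "zlib: no"
    fi
}

# Example usage on one node, using the install prefix from the command above:
# find /root/build/openmpi/lib -name '*.so' -exec ldd {} + 2>/dev/null | links_zlib
```

Running the commented usage line on both nodes and getting different answers would support the idea that the two builds disagree about compression support.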
Next, check the compression libraries on each node with the following command:
ldconfig -p | grep -E "libz|liblz4|libzstd"
#node1
libzstd.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libzstd.so.1
libz.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libz.so.1
liblz4.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/liblz4.so.1
#node2
libzstd.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libzstd.so.1
libz.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libz.so.1
libz.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libz.so
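liblz4.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/liblz4.so.1
node2 has an extra libz.so entry that node1 lacks. The unversioned libz.so is normally a symlink to libz.so.1 installed by the development package, and it is what a build links against. Because it was missing on node1, the Open MPI builds on the two nodes ended up referencing the compression libraries differently, which is what triggered the error. The fix is simply to install the development packages on the node that lacks them: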
sudo apt-get install zlib1g-dev liblz4-dev libzstd-dev
Then check the libraries again:
ldconfig -p | grep -E "libz|liblz4|libzstd"
libz.so is now listed. After rebuilding Open MPI, the test command above ran to completion.
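The check after installation can also be done mechanically instead of by eye. The sketch below assumes the three unversioned names that the packages above are expected to provide (libz.so, liblz4.so, libzstd.so). `missing_dev_libs` is a hypothetical helper that reads `ldconfig -p` output on stdin and prints any of those names that are absent.

```shell
# Sketch: list the unversioned development libraries missing from
# `ldconfig -p` output supplied on stdin.
# missing_dev_libs is a hypothetical helper name used for illustration.
missing_dev_libs() {
    listing=$(cat)
    for lib in libz.so liblz4.so libzstd.so; do
        # The first field of each ldconfig -p line is the library name.
        if ! printf '%s\n' "$listing" |
            awk -v want="$lib" '$1 == want { found = 1 } END { exit !found }'; then
            echo "missing: $lib"
        fi
    done
}

# Example usage on each node:
# ldconfig -p | missing_dev_libs
```

Empty output on every node means all three development symlinks are visible to the linker cache, which is the state the nodes should be in before Open MPI is rebuilt.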
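Besides the approach above, it should also be possible to disable compression entirely at run time by adding the three pmix_base parameters shown in the last three -mca lines of the following command: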
export LD_LIBRARY_PATH=/root/build/openmpi/lib:/root/build/nccl/lib:$LD_LIBRARY_PATH
/root/build/openmpi/bin/mpirun -np 16 \
-H node1:8,node2:8 \
--allow-run-as-root -bind-to numa -map-by slot \
-x NCCL_DEBUG=INFO \
-x NCCL_ALGO=Ring \
-x NCCL_MAX_NCHANNELS=16 \
-x NCCL_MIN_NCHANNELS=16 \
-x NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_13:1,mlx5_16:1,mlx5_17:1,mlx5_4:1,mlx5_5:1,mlx5_6:1 \
-x NCCL_IB_GID_INDEX=3 \
-x NCCL_IB_DISABLE=0 \
-x NCCL_IB_RETRY_CNT=7 \
-x NCCL_IB_TIMEOUT=23 \
-x NCCL_SOCKET_IFNAME=bond1 \
-x NCCL_NET_GDR_LEVEL=2 \
-x NCCL_IB_QPS_PER_CONNECTION=4 \
-x NCCL_IB_TC=160 \
-x NCCL_CHECKS_DISABLE=1 \
-x LD_LIBRARY_PATH=/root/build/openmpi/lib:/root/build/nccl/lib:$LD_LIBRARY_PATH \
-x PATH=$PATH \
-mca coll_hcoll_enable 0 \
-mca pml ob1 \
-mca btl_tcp_if_include bond1 \
-mca btl ^openib \
-mca pmix_base_compress 0 \
-mca pmix_base_decompress 0 \
-mca pmix_base_compress_support false \
/root/toolkit/nccl-tests-master/build/all_reduce_perf -b 1M -e 8G -f 2 -g 1
Another option is to exclude the compression libraries at build time with the following configure options:
./configure --prefix=${CURRENT_PATH:=.}/build/openmpi \
--with-cuda=${CUDA_PATH} \
--without-fs-gpfs \
--without-gpfs \
--with-pmix=internal \
--without-zlib \
--without-libz \
--without-lz4 \
--without-zstd \
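--enable-mpirun-prefix-by-default
I have not tested either of these two methods yet. I found them while researching the problem and will try them when I get the chance.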
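Whichever approach is used, the underlying issue was that the two nodes did not expose the same libraries at build time. A rough sketch of a pre-build comparison follows, assuming the `ldconfig -p` listing from each node has been saved to a file. `lib_names` and `diff_libs` are hypothetical helper names, and the file names in the usage comment are placeholders.

```shell
# Sketch: show library names that appear in only one of two saved
# `ldconfig -p` listings. Helper names are hypothetical.
lib_names() {
    # The first field of each non-empty, non-comment line is the library name.
    awk 'NF && $1 !~ /^#/ { print $1 }' "$1" | sort -u
}

diff_libs() {
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 1
    lib_names "$1" > "$tmp_a"
    lib_names "$2" > "$tmp_b"
    # comm needs sorted input; lib_names already sorts both lists.
    comm -23 "$tmp_a" "$tmp_b" | sed 's/^/only in first: /'
    comm -13 "$tmp_a" "$tmp_b" | sed 's/^/only in second: /'
    rm -f "$tmp_a" "$tmp_b"
}

# Example usage (assumes passwordless ssh to the hosts named in -H):
# ssh node1 'ldconfig -p | grep -E "libz|liblz4|libzstd"' > node1.txt
# ssh node2 'ldconfig -p | grep -E "libz|liblz4|libzstd"' > node2.txt
# diff_libs node1.txt node2.txt
```

With the listings shown earlier for node1 and node2, this would print a single line reporting libz.so as present only in the second file.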
Copyright notice: unless otherwise noted, all articles on this site are original.
When reposting, please credit the source: https://sulao.cn/post/1155