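I run this kind of test often, so I wrote a bash script for single-node multi-GPU NCCL testing. It assumes the NVIDIA driver and the CUDA libraries are already installed, with CUDA in its default location under /usr/local/. I downloaded NCCL as a zip archive named nccl-master.zip, and nccl-tests likewise as nccl-tests-master.zip; both file names are hard-coded in the script. The script also sources a common.sh file from its own directory. Save the script below to a file, put the two archives in the same directory, and run it with bash.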
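Since the archive names and the CUDA location are fixed, it can help to check them before starting a long build. The following is only a minimal sketch and not part of the original script: the function names `missing_archives` and `have_gpu_stack` are made up for illustration, and the `/usr/local/cuda*/bin/nvcc` layout is an assumption about how the toolkit is installed.

```shell
#!/bin/bash
# Hypothetical pre-flight helpers; these names do not appear in the original script.

# Print the name of every required archive that is missing from directory $1.
missing_archives() {
    local dir="$1" name
    for name in nccl-master.zip nccl-tests-master.zip; do
        [ -f "${dir}/${name}" ] || echo "${name}"
    done
}

# Return 0 when nvidia-smi is on PATH and some /usr/local/cuda*/bin/nvcc is executable.
# The directory layout is an assumption about a default CUDA installation.
have_gpu_stack() {
    local nvcc
    command -v nvidia-smi >/dev/null 2>&1 || return 1
    for nvcc in /usr/local/cuda*/bin/nvcc; do
        [ -x "${nvcc}" ] && return 0
    done
    return 1
}
```

For example, `missing_archives "$(pwd)"` prints nothing when both archives are present, and `have_gpu_stack || echo "driver or CUDA not found"` reports a missing driver or toolkit.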
The script is as follows:
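#!/bin/bash
#set -xe
# Resolve the absolute directory that contains this script.
CURRENT_PATH=$(readlink -f "$(dirname "$0")")
# Load the shared helpers; INFO, WARNING, ERROR and the CHECK_* functions are not defined in this file.
if [ -f "${CURRENT_PATH}/common.sh" ]; then
    . "${CURRENT_PATH}/common.sh"
else
    echo "Cannot find the common configuration file common.sh!"
    exit 1
fi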
function BUILD_NCCL_TESTS(){
    # Compute capability of the first GPU, e.g. "8.9".
    COMPUTE_CAP=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
    # Convert it to the SM number used by nvcc, e.g. 8.9 -> 89 (requires bc).
    COMPUTE_SM=$(echo "($COMPUTE_CAP * 10)/1" | bc)
    # Use a smaller maximum message size on SM 120 and SM 89 GPUs.
    if [ "$COMPUTE_SM" -eq 120 ] || [ "$COMPUTE_SM" -eq 89 ]; then
        export TEST_TOTAL="2G"
    fi
    INFO "Current compute capability: ${COMPUTE_SM}"
    cd "${CURRENT_PATH}" || exit 1
    if [ ! -d "${CURRENT_PATH}/nccl-master" ]; then
        WARNING "${CURRENT_PATH}/nccl-master does not exist, extracting the archive!"
        unzip "${CURRENT_PATH}/nccl-master.zip"
    fi
    cd "${CURRENT_PATH}/nccl-master" || exit 1
    mkdir -p "${CURRENT_PATH}/nccl"
    if [ -d "${CURRENT_PATH}/nccl/lib" ]; then
        INFO "Build output ${CURRENT_PATH}/nccl/lib already exists, cleaning it first!"
        # Pass BUILDDIR so that the custom output directory is cleaned, not the default ./build.
        make clean BUILDDIR="${CURRENT_PATH}/nccl"
    fi
    INFO "Building nccl..."
    # Build only for the detected GPU architecture to shorten compile time.
    if make -j"$(nproc)" src.build BUILDDIR="${CURRENT_PATH}/nccl" CUDA_HOME="${CUDA_PATH}" NVCC_GENCODE="-gencode=arch=compute_${COMPUTE_SM},code=sm_${COMPUTE_SM}"; then
        INFO "nccl build finished!"
    else
        ERROR "nccl build failed!"
        exit 1
    fi
    cd "${CURRENT_PATH}" || exit 1
    if [ ! -d "${CURRENT_PATH}/nccl-tests-master" ]; then
        WARNING "${CURRENT_PATH}/nccl-tests-master does not exist, extracting the archive!"
        unzip "${CURRENT_PATH}/nccl-tests-master.zip"
    fi
    cd "${CURRENT_PATH}/nccl-tests-master" || exit 1
    if [ -d "${CURRENT_PATH}/nccl-tests-master/build" ]; then
        INFO "Build output ${CURRENT_PATH}/nccl-tests-master/build already exists, cleaning it first!"
        make clean
    fi
    INFO "Building nccl-tests..."
    # Link nccl-tests against the NCCL build produced above.
    if make CUDA_HOME="${CUDA_PATH}" NCCL_HOME="${CURRENT_PATH}/nccl"; then
        INFO "nccl-tests build finished!"
    else
        ERROR "nccl-tests build failed!"
        exit 1
    fi
    export NCCL_TESTS_PATH=${CURRENT_PATH}/nccl-tests-master
}
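function NCCL_COMP_TESTS(){
    INFO "Current LD_LIBRARY_PATH: ${LD_LIBRARY_PATH}"
    cd "${CURRENT_PATH}" || exit 1
    # The multi-GPU all_reduce test only makes sense with at least two cards.
    if [ "${GPU_TOTAL:-0}" -gt 1 ]; then
        # Make the freshly built libnccl visible to the test binary.
        export LD_LIBRARY_PATH=${CURRENT_PATH}/nccl/lib:$LD_LIBRARY_PATH
        INFO "Starting single-node multi-GPU communication test..."
        # Sweep message sizes from 8 bytes up to TEST_TOTAL (default 4G), doubling at each step.
        if "${NCCL_TESTS_PATH}/build/all_reduce_perf" -b 8 -e "${TEST_TOTAL:-4G}" -f 2 -g "${GPU_TOTAL}"; then
            INFO "nccl-tests run finished!"
        else
            ERROR "nccl-tests run failed. You can try running it manually: first export LD_LIBRARY_PATH=${CURRENT_PATH}/nccl/lib:\$LD_LIBRARY_PATH , then run: ${NCCL_TESTS_PATH}/build/all_reduce_perf -b 8 -e ${TEST_TOTAL:-4G} -f 2 -g ${GPU_TOTAL}"
        fi
    else
        WARNING "GPU count is ${GPU_TOTAL}; the NCCL-TESTS run needs more than one GPU!"
    fi
}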
CHECK_GPU_COMMAND
CHECK_CUDA_PATH
BUILD_NCCL_TESTS
NCCL_COMP_TESTS
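The script calls `INFO`, `WARNING`, `ERROR`, `CHECK_GPU_COMMAND` and `CHECK_CUDA_PATH`, and reads `GPU_TOTAL` and `CUDA_PATH`, all of which have to come from common.sh. That file is not shown in this post, so the following is only a guess at a minimal stand-in that would satisfy the main script. Everything in it is an assumption, including the log format, the use of `nvidia-smi --list-gpus` to count cards, and the `/usr/local/cuda` and `/usr/local/cuda-*` search paths.

```shell
#!/bin/bash
# common.sh: hypothetical stand-in; the real file is not shown in this post.

# Timestamped log helpers used by the main script (the format is an assumption).
_log()    { echo "$(date '+%F %T') [$1] $2"; }
INFO()    { _log INFO "$*"; }
WARNING() { _log WARNING "$*"; }
ERROR()   { _log ERROR "$*" >&2; }

# Abort unless nvidia-smi is available, then export the number of GPUs as GPU_TOTAL.
CHECK_GPU_COMMAND() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        ERROR "nvidia-smi not found; is the NVIDIA driver installed?"
        exit 1
    fi
    GPU_TOTAL=$(nvidia-smi --list-gpus | wc -l)
    export GPU_TOTAL
}

# Locate a CUDA toolkit under /usr/local and export its path as CUDA_PATH.
CHECK_CUDA_PATH() {
    local dir
    for dir in /usr/local/cuda /usr/local/cuda-*; do
        if [ -x "${dir}/bin/nvcc" ]; then
            export CUDA_PATH="${dir}"
            return 0
        fi
    done
    ERROR "No CUDA toolkit found under /usr/local"
    exit 1
}
```

The file only defines functions, so sourcing it has no side effects until the main script calls the two check functions.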
Save the main script as single_nccl_test.sh, then run it with the following command:
bash single_nccl_test.sh
Test results are written to the result_$(hostname) directory under the current directory.
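The script turns the compute capability reported by nvidia-smi (for example 8.9) into the SM number (89) with bc, and then builds the `NVCC_GENCODE` value from it. If bc is not installed, the same conversion could be sketched without it as shown here. The function names `cap_to_sm` and `gencode_for_sm` are hypothetical, and the conversion assumes the capability always has exactly one digit after the decimal point.

```shell
#!/bin/bash
# Hypothetical helpers; they mirror what the script does with bc.

# Convert a compute capability such as "8.9" into the SM number "89".
# Assumes one digit after the decimal point, so dropping the dot equals multiplying by 10.
cap_to_sm() {
    printf '%s\n' "$1" | tr -d '. '
}

# Build the NVCC_GENCODE value passed to the NCCL Makefile for one SM number.
gencode_for_sm() {
    echo "-gencode=arch=compute_$1,code=sm_$1"
}
```

With these, `gencode_for_sm "$(cap_to_sm 8.9)"` yields `-gencode=arch=compute_89,code=sm_89`, the same string the script passes to make on an SM 89 card.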

Copyright notice: unless otherwise stated, all articles on this site are original.
Please credit the source when reposting: https://sulao.cn/post/1125