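NVIDIA A100 GPUs that use NVLink need the nvidia-fabricmanager service, at a version matching the installed driver, so that the GPUs can communicate with each other through NVSwitch. If only the NVIDIA GPU driver is installed, the GPUs will not work properly.
The installation steps for Ubuntu and CentOS 7 are listed below. Both install from online repositories.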
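1. Installation on Ubuntu:
First install NVIDIA Data Center GPU Manager (DCGM). This walkthrough was done on Ubuntu 20.04, as the repository URLs show. Fetch the repository pin file and signing key: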
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
apt-key adv --keyserver-options http-proxy=http://proxy-chain.intel.com:911 --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
Add the repository and install the package:
add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
apt update -y
apt install -y datacenter-gpu-manager
Install Fabric Manager:
apt install cuda-drivers-fabricmanager-535
2. Installation on CentOS 7:
yum -y install yum-utils
yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo
# Change the version to match your own NVIDIA driver; here it is 535.146.02
yum install -y nvidia-fabric-manager-535.146.02-1
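Since the Fabric Manager version has to match the driver, it can help to derive the package name from the installed driver version rather than typing it by hand. The following is a minimal sketch, not a definitive tool: `fm_package` is a hypothetical helper name, and it simply assumes the two package naming patterns used in the commands above (driver branch only on Ubuntu, full version plus `-1` on CentOS 7).

```shell
# Hypothetical helper: print the Fabric Manager package name for a driver version.
# $1 = distro (ubuntu|centos), $2 = full driver version, e.g. 535.146.02
fm_package() {
    case "$1" in
        # Ubuntu pattern from this article: driver branch only, e.g. 535
        ubuntu) echo "cuda-drivers-fabricmanager-${2%%.*}" ;;
        # CentOS 7 pattern from this article: full version plus release "-1"
        centos) echo "nvidia-fabric-manager-${2}-1" ;;
        *)      echo "unsupported distro: $1" >&2; return 1 ;;
    esac
}

# Query the installed driver version only when nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    driver_version="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    echo "driver version: ${driver_version}"
    echo "ubuntu package: $(fm_package ubuntu "${driver_version}")"
    echo "centos package: $(fm_package centos "${driver_version}")"
fi
```

The printed name can then be passed to `apt install` or `yum install` as in the steps above. Check that the package actually exists in the repository before relying on the generated name.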
3. Enable the service at boot and start it:
systemctl enable nvidia-fabricmanager.service
systemctl start nvidia-fabricmanager.service   # start the service
systemctl status nvidia-fabricmanager.service  # check its status
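After starting the service, one way to confirm that the driver and Fabric Manager versions really match is sketched below. It rests on assumptions: that the `nv-fabricmanager` binary is on the PATH and prints its version when called with `--version`, and that the version appears as a dotted number somewhere in that output. `extract_version` is a hypothetical helper name.

```shell
# Hypothetical helper: pull the first dotted version number (e.g. 535.146.02)
# out of arbitrary text.
extract_version() {
    printf '%s\n' "$1" | grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1
}

# Run the comparison only where both tools are installed.
if command -v nvidia-smi >/dev/null 2>&1 && command -v nv-fabricmanager >/dev/null 2>&1; then
    driver="$(extract_version "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader)")"
    # Assumption: --version prints a line that contains the version number.
    fm="$(extract_version "$(nv-fabricmanager --version 2>&1)")"
    if [ "$driver" = "$fm" ]; then
        echo "OK: driver and Fabric Manager are both ${driver}"
    else
        echo "MISMATCH: driver=${driver} fabricmanager=${fm}" >&2
    fi
    # is-active prints the unit state and exits non-zero unless it is "active".
    systemctl is-active nvidia-fabricmanager.service
fi
```

If the versions differ, reinstall the Fabric Manager package that corresponds to the driver and restart the service.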
If the tools above are not installed, the environment reports an error similar to the following:
RuntimeError: cuda runtime error: system not yet initialized