Fixing a GPU environment that had quietly broken

2020-08-03

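When you open up an Ubuntu GPU machine you haven't touched for a long time, the NVIDIA driver or something around it has sometimes just died, right? Why is that?

Error details

I connected for the first time in a while, ran the nvidia-smi command, and got the usual error.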

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

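Not again.

It feels as though the driver dies quickly whenever the machine isn't powered off cleanly or goes unused for a long time, but I don't know the actual cause.

On top of that, I didn't set this environment up myself and there are no notes from that time, so I don't even know how it was originally configured.

PC environment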

Ubuntu 16.04
NVIDIA Tesla P100 16GB x2

Checking the GPU

If you don't know which GPUs are installed, you can find out with the following command.

terminal

$ lspci | grep -i nvidia
02:00.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] (rev a1)
03:00.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] (rev a1)

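Sometimes the product name is not shown and only something like a device ID appears instead. lspci resolves names from its local PCI ID database, so this usually points to an outdated database rather than to the driver itself.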
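As a small supplement, the same check can be scripted. This is only a sketch and not part of the original procedure: count_nvidia_devices is a hypothetical helper that counts NVIDIA entries in lspci output read from stdin, matching either the vendor name or NVIDIA's PCI vendor ID (10de) as printed by lspci -nn, so it still works when product names are not resolved.

```shell
# Hypothetical helper: count NVIDIA devices in `lspci` output read from stdin.
# Matches the vendor name or NVIDIA's PCI vendor ID (10de) as printed by `lspci -nn`.
count_nvidia_devices() {
  # grep -c prints 0 and exits non-zero when nothing matches; `|| true` hides that exit code.
  grep -ciE 'nvidia|\[10de:' || true
}

# Usage on a real machine (not executed here):
#   lspci -nn | count_nvidia_devices
#   lspci -nn -d 10de:        # list only devices whose vendor ID is 10de
```

On the machine in this article the count should be 2, one per Tesla P100.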

Checking the CUDA and cuDNN versions

Before fixing the NVIDIA driver, check whether CUDA and cuDNN are properly installed.

https://blog.mktia.com/get-cuda-and-cudnn-version

Here is what I found:

CUDA 8.0.0.44
cuDNN 6.0.21

Both are old.
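The linked article covers the actual procedure. As a rough illustration of one common approach (which may differ from what that article does), the cuDNN version can be read from the version macros in cudnn.h. The helper name cudnn_version_from_header and the header path are assumptions for this sketch.

```shell
# Hypothetical helper: build "MAJOR.MINOR.PATCH" from the version macros in cudnn.h (read from stdin).
cudnn_version_from_header() {
  awk '
    /^#define CUDNN_MAJOR/      { major = $3 }
    /^#define CUDNN_MINOR/      { minor = $3 }
    /^#define CUDNN_PATCHLEVEL/ { patch = $3 }
    END { print major "." minor "." patch }
  '
}

# Usage (the header location is an assumption; it depends on how cuDNN was installed):
#   cudnn_version_from_header < /usr/include/cudnn.h
#   cat /usr/local/cuda/version.txt    # CUDA toolkit version on CUDA 10.x and earlier
```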

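Updating the drivers

CUDA is too old to pair with a current NVIDIA driver, so I'll update everything to newer versions.

The versions have to be ones that support Ubuntu 16.04; this time I'll follow the TensorFlow documentation and install CUDA 10.1.

But when I tried to run apt-get install, I got an error.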

terminal

$ sudo apt-get install gnupg-curl
Reading package lists... Done
Building dependency tree       
Reading state information... Done
You might want to run 'apt-get -f install' to correct these:
The following packages have unmet dependencies:
 linux-image-generic : Depends: linux-image-4.4.0-170-generic but it is not going to be installed or
                                linux-image-unsigned-4.4.0-170-generic but it is not going to be installed
 linux-modules-extra-4.4.0-169-generic : Depends: linux-image-4.4.0-169-generic but it is not going to be installed or
                                                  linux-image-unsigned-4.4.0-169-generic but it is not going to be installed
 linux-modules-extra-4.4.0-170-generic : Depends: linux-image-4.4.0-170-generic but it is not going to be installed or
                                                  linux-image-unsigned-4.4.0-170-generic but it is not going to be installed
E: Unmet dependencies. Try 'apt-get -f install' with no packages (or specify a solution).

Some packages that others depend on are apparently not installed.

I ran the command that is supposed to resolve this, and got yet another error.

terminal

$ sudo apt-get -f install
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Correcting dependencies... Done
The following packages were automatically installed and are no longer required:
  linux-headers-4.4.0-150 linux-headers-4.4.0-150-generic linux-headers-4.4.0-151
  linux-headers-4.4.0-151-generic linux-headers-4.4.0-157 linux-headers-4.4.0-157-generic

  ...

After this operation, 436 MB of additional disk space will be used.
Do you want to continue? [Y/n] y
Get:1 http://jp.archive.ubuntu.com/ubuntu xenial-updates/main amd64 linux-modules-4.4.0-186-generic amd64 4.4.0-186.216 [12.0 MB]

...

Unpacking linux-modules-4.4.0-186-generic (4.4.0-186.216) ...
dpkg: error processing archive /var/cache/apt/archives/linux-modules-4.4.0-186-generic_4.4.0-186.216_amd64.deb (--unpack):
 cannot copy extracted data for './boot/System.map-4.4.0-186-generic' to '/boot/System.map-4.4.0-186-generic.dpkg-new': failed to write (No space left on device)
No apport report written because the error message indicates a disk full error

...

E: Sub-process /usr/bin/dpkg returned an error code (1)

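The message says "No space left on device". The path in the error is under /boot, so that is the location that has filled up.

In that case, the following approach may resolve it.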

https://blog.mktia.com/apt-get-install-is-not-working
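The linked article describes the actual fix. As a hedged sketch of the usual idea (freeing /boot by removing old kernels), the helper below only lists candidates and removes nothing. list_removable_kernels is a hypothetical name, and the policy of keeping the running kernel plus the newest of the rest is an assumption of this example, not something taken from the original procedure.

```shell
# Hypothetical helper: given installed kernel versions on stdin (one per line),
# drop the running kernel ($1), then hold back the newest of the remaining
# versions as a fallback, and print whatever is left as removal candidates.
list_removable_kernels() {
  running="$1"
  grep -vxF -- "$running" | sort -V | sed '$d'
}

# Usage on a real machine (review the list before removing anything):
#   df -h /boot
#   dpkg -l 'linux-image-[0-9]*' | awk '/^ii/ { sub("linux-image-", "", $2); print $2 }' \
#     | list_removable_kernels "$(uname -r)"
```

Listing first and deleting by hand keeps the risky step explicit, which seems safer on a machine whose history is unknown.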

Once apt-get install works again, fetch the packages and install them.

The number in nvidia-418 on the last line is the driver version.

The required version depends on the OS and GPU, and you can look it up on NVIDIA's driver download site.

terminal

# Add NVIDIA package repositories
# Add HTTPS support for apt-key
$ sudo apt-get install gnupg-curl
$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_10.1.243-1_amd64.deb
$ sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
$ sudo dpkg -i cuda-repo-ubuntu1604_10.1.243-1_amd64.deb
$ sudo apt-get update
$ wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
$ sudo apt install ./nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
$ sudo apt-get update

# Install NVIDIA driver
# Issue with driver install requires creating /usr/lib/nvidia
# (-p makes this a no-op if the directory already exists)
$ sudo mkdir -p /usr/lib/nvidia
$ sudo apt-get install --no-install-recommends nvidia-418

Reboot Ubuntu at this point.

terminal

$ sudo reboot

After the reboot, reconnect and run nvidia-smi again.

terminal

$ nvidia-smi
Mon Aug  3 03:38:10 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.152.00   Driver Version: 418.152.00   CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:02:00.0 Off |                    0 |
| N/A   29C    P0    32W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:03:00.0 Off |                    0 |
| N/A   31C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Fixed!!

Next, update CUDA and cuDNN as well.
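To check the result from a script rather than by eye, nvidia-smi's query mode can print just the driver version. The sketch below is an illustration only: driver_branch is a hypothetical helper that reduces a version string to its branch number, the same number that appears in the package name nvidia-418.

```shell
# Hypothetical helper: reduce a driver version such as "418.152.00" (first line of stdin)
# to its branch number ("418").
driver_branch() {
  head -n 1 | cut -d. -f1
}

# Usage on a machine with a working driver (not executed here):
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader | driver_branch
```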

terminal

# Install development and runtime libraries (~4GB)
$ sudo apt-get install --no-install-recommends \
    cuda-10-1 \
    libcudnn7=7.6.4.38-1+cuda10.1 \
    libcudnn7-dev=7.6.4.38-1+cuda10.1

After installation, check the versions; if they have been updated, you're done.

Finally, I ran nvidia-smi once more just to admire the output, and to my surprise got an error.

Failed to initialize NVML: Driver/library version mismatch

A reboot fixes this.

Here is the output after the reboot.
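This error typically appears when the kernel still has the old driver module loaded (418 here) while the user-space libraries on disk have just been replaced by a newer version, which is why a reboot clears it. The sketch below shows one way to confirm that before rebooting. loaded_driver_version is a hypothetical helper that parses the contents of /proc/driver/nvidia/version.

```shell
# Hypothetical helper: extract the version of the currently loaded NVIDIA kernel module
# from the contents of /proc/driver/nvidia/version (read from stdin).
loaded_driver_version() {
  sed -n 's/^NVRM version:.*Kernel Module  *\([0-9.]*\).*/\1/p'
}

# Usage (not executed here):
#   loaded_driver_version < /proc/driver/nvidia/version   # version loaded in the kernel
#   dpkg -l | grep -i nvidia                               # driver packages now installed on disk
```

If the two versions differ, the mismatch message is expected until the new module is loaded.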

terminal

$ nvidia-smi
Mon Aug  3 03:56:09 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:02:00.0 Off |                    0 |
| N/A   29C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:03:00.0 Off |                    0 |
| N/A   31C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

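This raised two questions for me.

Updating CUDA and cuDNN also upgraded the NVIDIA driver
The CUDA field now shows 11.0

As for the first, it may be that a matching NVIDIA driver gets installed along with CUDA, so you don't have to reinstall the driver yourself on every upgrade.

As for the second, the labeling is confusing and I wish they would change it.

As I understand it, what is shown here is the CUDA version that the NVIDIA driver supports, not the version of the toolkit installed on the PC.

In fact, what I had just installed was CUDA 10.1.

Some articles seem to get this wrong, so be careful. (That said, I could be the one who is mistaken.)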
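One way to see the difference is to ask the toolkit itself for its version and compare that with the nvidia-smi header. This is a minimal sketch: toolkit_release is a hypothetical helper, and it assumes nvcc was installed with the toolkit and can be reached.

```shell
# Hypothetical helper: pull the release number out of `nvcc --version` output (read from stdin).
toolkit_release() {
  sed -n 's/.*release \([0-9.]*\),.*/\1/p'
}

# Usage (nvcc may not be on PATH; /usr/local/cuda/bin/nvcc is a common location, assumed here):
#   nvcc --version | toolkit_release                  # toolkit actually installed
#   nvidia-smi | grep -o 'CUDA Version: [0-9.]*'      # version reported in the nvidia-smi header
```

On the machine in this article, the first command would be expected to report 10.1 while the nvidia-smi header shows 11.0.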

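Closing thoughts

In any case, I'm glad it's fixed.

I was writing this article on Windows while SSHing from macOS into Ubuntu, so it wasn't efficient, but even so I think four to five hours should be enough to finish the work.

To avoid this kind of tragedy, using the NVIDIA Container Toolkit also seems like a good option.

References

GPU Support | TensorFlow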
