1. free() invalid pointer error (suspected cause: the gcc libraries in the Docker image are incomplete)
*** Error in `python': free(): invalid pointer: 0x00000000020663b0 ***
Fix:
apt-get install libtcmalloc-minimal4
vim ~/.bashrc
Append the following line to the end of the file:
export LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4"
Reload the environment variables:
source ~/.bashrc
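
As a quick sanity check after reloading, a short script can confirm that every library named in LD_PRELOAD actually exists on disk. This is only a sketch: the helper name missing_preload_libs is hypothetical, and it assumes the entries are absolute paths, as in the export line above. If the file is reported missing, the package may have installed it under a different directory, which dpkg -L libtcmalloc-minimal4 would show.

```python
import os
import re


def missing_preload_libs(ld_preload):
    """Return the entries of an LD_PRELOAD string that do not exist on disk.

    Hypothetical helper: it only makes sense for absolute paths, because
    bare library names are resolved by the dynamic loader, not by this check.
    """
    # LD_PRELOAD entries may be separated by spaces or colons.
    entries = [p for p in re.split(r"[ :]+", ld_preload.strip()) if p]
    return [p for p in entries if not os.path.exists(p)]


if __name__ == "__main__":
    value = os.environ.get("LD_PRELOAD", "")
    if not value:
        print("LD_PRELOAD is not set")
    for path in missing_preload_libs(value):
        print("missing:", path)
```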
2. Insufficient shared memory at runtime
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f660c343898>>
RuntimeError: DataLoader worker (pid 23776) is killed by signal: Bus error. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.
Use fewer GPUs (for example, with four cards, use only three):
CUDA_VISIBLE_DEVICES=0,1,2 python -m multiproc train.py --output_directory=outdir --log_directory=logdir --hparams=distributed_run=True
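
To see how much shared memory is actually available before launching, something like the following sketch may help. It assumes a Linux system where shared memory is mounted at /dev/shm; the function name free_gib is hypothetical. If training runs inside Docker, note that the container's /dev/shm defaults to 64 MB unless it was started with a larger --shm-size, so raising that value is another option worth considering.

```python
import shutil


def free_gib(path="/dev/shm"):
    """Return the free space, in GiB, of the filesystem that contains path."""
    usage = shutil.disk_usage(path)
    return usage.free / 2**30


if __name__ == "__main__":
    # Assumes shared memory is mounted at /dev/shm (typical on Linux).
    print(f"/dev/shm free: {free_gib():.2f} GiB")
```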
3. Port already in use
RuntimeError: Address already in use at /opt/conda/conda-bld/pytorch_1532581333611/work/torch/lib/THD/process_group/General.cpp:17
Change the port in hparams.py to one that is free.
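
To pick a replacement port, it can help to check whether a given port is taken and to ask the OS for a free one. The sketch below uses only the standard socket module; the function names port_in_use and find_free_port are hypothetical, and which setting in hparams.py holds the port is left to the reader to confirm.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds.
        return s.connect_ex((host, port)) == 0


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    print("suggested free port:", find_free_port())
```

The port returned by find_free_port is only free at the moment of the call, so another process could still claim it before training starts.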
4. Not enough GPU memory
RuntimeError: CUDA error: out of memory
Reduce batch_size in hparams.py.
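
Rather than guessing a new value, one option is to halve the batch size until a single training step fits. The following is a minimal sketch of that idea, not part of the original training script: first_fitting_batch_size and run_step are hypothetical names, and run_step stands for any callable that runs one step at the given batch size and raises RuntimeError with "out of memory" in its message on failure.

```python
def first_fitting_batch_size(run_step, batch_size, min_batch_size=1):
    """Halve batch_size until run_step(batch_size) stops raising an OOM error."""
    while batch_size >= min_batch_size:
        try:
            run_step(batch_size)
            return batch_size
        except RuntimeError as err:
            # Only retry on the out-of-memory message; re-raise anything else.
            if "out of memory" not in str(err):
                raise
            batch_size //= 2
    raise RuntimeError("no batch size fits in GPU memory")


if __name__ == "__main__":
    # Stand-in for a real training step: pretend anything above 16 runs out of memory.
    def fake_step(bs):
        if bs > 16:
            raise RuntimeError("CUDA error: out of memory")

    print(first_fitting_batch_size(fake_step, 64))
```

With a real PyTorch step, memory from the failed attempt may need to be released (for example with torch.cuda.empty_cache()) before retrying.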
5. Other odd problems that may occur
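If memory runs out, or training suddenly hangs and the iteration stops advancing, leftover (zombie) processes spawned by PyTorch may still be holding resources.
First run ps aux to list the running processes,
then run kill -9 <pid> to kill the leftover process,
and finally run nvidia-smi to check GPU usage.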
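
Scanning the process list by eye is error-prone, so the lookup can be scripted. The sketch below filters ps aux output for commands containing a keyword and prints the matching PIDs; find_processes is a hypothetical helper, and the keyword "train.py" is only an assumption based on the launch command shown earlier. It deliberately does not kill anything, leaving that step manual.

```python
import subprocess


def find_processes(ps_output, keyword):
    """Return (pid, command) pairs from `ps aux` output whose command contains keyword."""
    matches = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        # ps aux prints 11 columns; the last one (COMMAND) may contain spaces.
        fields = line.split(None, 10)
        if len(fields) == 11 and keyword in fields[10]:
            matches.append((int(fields[1]), fields[10]))
    return matches


if __name__ == "__main__":
    out = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    ).stdout
    for pid, cmd in find_processes(out, "train.py"):
        print(pid, cmd)
```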