[Fixing CUDA errors] RuntimeError: CUDA error: out of memory / For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

감._.자 2022. 4. 14. 22:53


First error

RuntimeError: CUDA out of memory. Tried to allocate 94.00 MiB (GPU 1; 23.65 GiB total capacity; 0 bytes already allocated; 30.31 MiB free; 0 bytes reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
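The numbers in this message already hint at the cause: PyTorch itself has allocated and reserved 0 bytes, yet only 30.31 MiB of 23.65 GiB is free, so almost all of the memory must be held by something outside this process. Below is a minimal sketch that pulls those numbers out of the message and does the subtraction. `summarize_oom` and the `used_elsewhere` key are hypothetical names made up for illustration, and the regular expressions assume the message wording shown above.

```python
import re

# Binary unit multipliers, matching the units that appear in the message.
_UNITS = {"bytes": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}


def summarize_oom(message):
    """Extract the sizes (in bytes) from a 'CUDA out of memory' message.

    Hypothetical helper: it only understands the wording quoted in this post.
    """
    size = r"([\d.]+) (bytes|KiB|MiB|GiB)"
    patterns = {
        "requested": rf"Tried to allocate {size}",
        "total": rf"{size} total capacity",
        "allocated": rf"{size} already allocated",
        "free": rf"{size} free",
        "reserved": rf"{size} reserved",
    }
    info = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            info[key] = float(match.group(1)) * _UNITS[match.group(2)]
    # Memory that is neither free nor reserved by this PyTorch process.
    if {"total", "free", "reserved"} <= info.keys():
        info["used_elsewhere"] = info["total"] - info["free"] - info["reserved"]
    return info
```

For the message above, `used_elsewhere` comes out at roughly 23.6 GiB, which points to another process rather than to this run's own allocations or to fragmentation.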

Second error

RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
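The message suggests CUDA_LAUNCH_BLOCKING=1 so that the stack trace points at the call that actually failed. The usual way is to prefix the command (`CUDA_LAUNCH_BLOCKING=1 python main.py`). As a sketch, the same variable can also be set at the very top of the script, assuming nothing has touched CUDA yet at that point:

```python
import os

# Set this before the first CUDA call (safest: before `import torch`),
# since the variable is read when CUDA is initialized.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # in a real script the torch import would come after this line
```

This only makes the error report more precise. It does not free any memory.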

Causes

1. The batch size is too large, so CUDA runs out of memory (reduce the batch size)

2. A previous process was not terminated properly before a new one was started (GPU memory is already almost full, so there is hardly any left)

In my case it was cause 2.
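For cause 1, the fix is simply a smaller batch size. A minimal sketch of automating that: keep halving the batch size until one step runs without an out-of-memory error. `fit_batch_size` and `run_step` are hypothetical names. `run_step` stands for whatever callable runs one training step at a given batch size, and the check relies on the error text containing "out of memory" as in the messages above.

```python
def fit_batch_size(run_step, batch_size, min_batch_size=1):
    """Halve batch_size until run_step(batch_size) no longer raises an OOM error.

    run_step is a hypothetical callable supplied by the caller.
    """
    while True:
        try:
            run_step(batch_size)
            return batch_size
        except RuntimeError as err:
            # Re-raise anything that is not an OOM error, or if we cannot shrink further.
            if "out of memory" not in str(err) or batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)
```

This does nothing for cause 2: if another process is holding the memory, even a tiny batch can fail.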

Looking into cause 2, I found a post saying to check the GPU memory allocation first, so open a terminal and enter the command below!

 watch -n 1 nvidia-smi

It turned out that a huge amount of memory was in use even though I wasn't running any program..!

(Under Memory-Usage, GPU 0 shows 10332MiB of 24576MiB in use and GPU 1 shows 23005MiB of 24576MiB.)
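The same check can be done from Python instead of reading the table by eye. A minimal sketch, assuming `nvidia-smi` is on the PATH: it uses the CSV query mode of `nvidia-smi` and flags GPUs that are nearly full. The function names and the 90% threshold are arbitrary choices for illustration.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' rows (MiB) into (index, used, total) int tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        rows.append((index, used, total))
    return rows


def query_gpu_memory():
    """Run nvidia-smi once and return (gpu_index, used_mib, total_mib) per GPU."""
    output = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(output)


def busy_gpus(rows, threshold=0.9):
    """Return the indices of GPUs whose used/total ratio is at or above threshold."""
    return [index for index, used, total in rows if used / total >= threshold]
```

With the numbers above, `busy_gpus([(0, 10332, 24576), (1, 23005, 24576)])` flags only GPU 1, the one named in the first error.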

-> Apparently, when code started from a terminal is force-stopped with ctrl+c, the process sometimes does not exit completely and its data stays in GPU memory!!!

Conclusion: let's clean up the memory!!

1. Check the running processes with the ps aux | grep python command (type it in the terminal)

ps aux | grep python

2. In the output of step 1, find the python file related to deep learning training and so on

For example, suppose the output looks like this:

name 22310 0.0 0.1 1869124 232116 pts/10 T 21:58 0:00 python main.py

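The line above means that main.py is still around. (The T in the status column means the process is stopped rather than exited, so it still holds its memory.) To get rid of it, use the command below.

The PID of main.py is 22310, so just enter kill -9 22310!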

kill -9 PID
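Steps 1 and 2 can also be sketched in Python: read the PID out of a `ps aux` row, then stop that process. `parse_ps_line` and `stop_process` are hypothetical helper names, and the column layout is assumed to match the sample row above. The sketch sends SIGTERM first and falls back to SIGKILL (the `-9` above) only if the process is still there after a grace period.

```python
import os
import signal
import time


def parse_ps_line(line):
    """Split one `ps aux` row into the fields used here."""
    fields = line.split(None, 10)  # 11 columns; COMMAND may contain spaces
    return {
        "user": fields[0],
        "pid": int(fields[1]),
        "stat": fields[7],
        "command": fields[10],
    }


def stop_process(pid, grace_seconds=5.0):
    """Ask a process to exit, then force-kill it if it is still present."""
    try:
        os.kill(pid, signal.SIGTERM)  # polite request, like plain `kill PID`
    except ProcessLookupError:
        return  # already gone
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        try:
            os.kill(pid, 0)  # signal 0 sends nothing, it only checks the PID exists
        except ProcessLookupError:
            return
        time.sleep(0.1)
    try:
        os.kill(pid, signal.SIGKILL)  # same effect as `kill -9 PID`
    except ProcessLookupError:
        pass
```

A stopped process (status T) does not act on SIGTERM until it is resumed, so for a row like the sample the SIGKILL fallback is what actually ends it. Killing another user's process raises PermissionError, which this sketch leaves unhandled.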

Once steps 1 and 2 are done, enter watch -n 1 nvidia-smi in the terminal again and check the memory!

You can see that it has been cleaned up, and the code now runs fine again!
