Running Meta LLaMA in CPU memory, without a GPU

Machine Learning / AI · 2023. 6. 6. 22:48


This post introduces the llama.cpp project. The project's original goal was to run LLaMA on a MacBook, but fortunately it can also run the large LLaMA models on Linux x86, as long as the machine has enough memory.

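LLaMA can be loaded and run from ordinary system memory instead of GPU memory! The walkthrough below is based on Ubuntu Linux (20.04, though newer versions should work just as well). It is not hard to follow. On an Apple MacBook, you can follow the same guide, pick the appropriate build option, compile, and run.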

https://github.com/ggerganov/llama.cpp


1) First, compile.

1.1) CPU-only build

$ git clone https://github.com/ggerganov/llama.cpp

$ cd llama.cpp

$ make

$ ./main

main: build = 626 (2d43387)
main: seed = 1686058067
..

If you see output like this, the build is working.

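1.2) Build with GPU acceleration (only if you have a GPU)

If you have a GPU, the project says you can use GPU acceleration through cuBLAS. In that case, build with cmake. This step is not needed just to run the model, so feel free to skip it.

# Assumes llama.cpp is installed under /work/llama.cpp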

$ cd /work

$ git clone https://github.com/ggerganov/llama.cpp
$ cd llama.cpp

$ mkdir build

$ cd build

$ cmake .. -DLLAMA_CUBLAS=ON

$ cmake --build . --config Release

$ cd bin

$ ./main

If it prints the same kind of build and seed lines as in 1.1, the build is working.

2) Download the LLaMA model.

If you already have the weights, skip this step; if not, download them as follows. The larger models can take quite a while. Refer to the LLaMA model sizes below.

Model	Original Size	Quantized Size(4bit)
7B	13 GB	3.9 GB
13B	24 GB	7.8 GB
30B	60 GB	19.5 GB
65B	120 GB	38.5 GB
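
As a rough aid for choosing a model, the table above can be turned into a small lookup. This is only a sketch: the sizes are copied from the table, while the function name `models_that_fit` and the `overhead_gb` allowance (a guess at extra room for runtime buffers, not a measured value) are assumptions made for illustration.

```python
# File sizes in GB, copied from the table above: (original, 4-bit quantized).
MODEL_SIZES_GB = {
    "7B": (13.0, 3.9),
    "13B": (24.0, 7.8),
    "30B": (60.0, 19.5),
    "65B": (120.0, 38.5),
}


def models_that_fit(available_gb, quantized=True, overhead_gb=2.0):
    """Return the model names whose file size plus overhead fits in available_gb.

    overhead_gb is an assumed allowance for runtime buffers (hypothetical value).
    """
    fits = []
    for name, (original, four_bit) in MODEL_SIZES_GB.items():
        size = four_bit if quantized else original
        if size + overhead_gb <= available_gb:
            fits.append(name)
    return fits


if __name__ == "__main__":
    # Example: which quantized models fit in a machine with 16 GB of free memory?
    print(models_that_fit(16.0))
```

Under these assumptions, 16 GB of free memory is enough for the quantized 7B and 13B models but not for 30B. The actual requirement is printed by llama.cpp itself in the `mem required` log line when a model is loaded.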

# Assume the models are downloaded to /work/llama_data. Any other location works too.

$ cd /work

$ mkdir llama_data

$ git clone https://github.com/juncongmoo/pyllama

$ cd pyllama

$ pip install pyllama -U

$ pip install -r requirements.txt # installing pyllama may already cover this, so it is unclear whether it was needed; run it anyway

# Pick the size(s) you need from the commands below

$ python -m llama.download --model_size 7B --folder ../llama_data

$ python -m llama.download --model_size 13B --folder ../llama_data

$ python -m llama.download --model_size 30B --folder ../llama_data

3) Now run llama.cpp.

The example here runs the 7B model; for any other model you downloaded, just replace 7B in the commands with that model's name.

$ cd /work/llama.cpp

$ python -m pip install -r requirements.txt

$ python convert.py ../llama_data/7B/

$ ./quantize ../llama_data/7B/ggml-model-f16.bin ../llama_data/7B/ggml-model-q4_0.bin q4_0

$ ./main -m ../llama_data/7B/ggml-model-q4_0.bin -n 128

main: build = 626 (2d43387)
main: seed  = 1686058857
llama.cpp: loading model from ../../../llama_data/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
llama_model_load_internal: offloading 0 layers to GPU
llama_model_load_internal: total VRAM used: 0 MB
.
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0

1960: The Beatles at Blackpool
1960 was an eventful year for the Beatles. John Lennon's father, a merchant seaman, died in January and his mother was admitted to hospital after suffering a nervous breakdown. As a result of these family crises they had to give up their studies at college and work full-time ..
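
The convert, quantize, and run steps above differ between models only in the size name, so they can be generated from one parameter. The following is a minimal sketch using only the commands and flags shown in this post; the function name `pipeline_commands` is hypothetical, and the `ggml-model-{quant}.bin` naming for quantization types other than q4_0 is an assumption extrapolated from the q4_0 example.

```python
import shlex
from pathlib import PurePosixPath


def pipeline_commands(model_size, data_dir="../llama_data", quant="q4_0", n_predict=128):
    """Build the convert, quantize, and run commands for one model size.

    The commands are meant to be run from the llama.cpp directory.
    """
    model_dir = PurePosixPath(data_dir) / model_size
    f16 = model_dir / "ggml-model-f16.bin"
    # Assumed naming pattern; only the q4_0 file name appears in this post.
    quantized = model_dir / f"ggml-model-{quant}.bin"
    return [
        ["python", "convert.py", f"{model_dir}/"],
        ["./quantize", str(f16), str(quantized), quant],
        ["./main", "-m", str(quantized), "-n", str(n_predict)],
    ]


if __name__ == "__main__":
    # Print the three commands for the 13B model instead of running them.
    for cmd in pipeline_commands("13B"):
        print("$ " + shlex.join(cmd))
```

To actually execute the steps, each list could be passed to `subprocess.run(cmd, check=True)` from inside the llama.cpp directory. Printing first makes it easy to compare against the hand-typed 7B commands.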

# Interactive chat is also possible

$ ./main -m ../llama_data/7B/ggml-model-q4_0.bin -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt

It is slow, but it works fine even on an ordinary Linux PC. (Once you get to around the 30B model, generation becomes frustratingly slow.)
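
When judging speed on a given machine, it helps to see which CPU features the build picked up. The `system_info` line in the log above lists them as `NAME = 0/1` fields separated by `|`. Below is a small sketch that parses that line; the function names are hypothetical, and the format is taken from the build shown in this post (build 626), so other versions may print it differently.

```python
def parse_system_info(line):
    """Parse a llama.cpp `system_info:` log line into a dict of strings."""
    body = line.split("system_info:", 1)[1]
    info = {}
    for field in body.split("|"):
        if "=" not in field:
            continue  # skips the empty piece after the trailing "|"
        key, value = field.split("=", 1)
        info[key.strip()] = value.strip()
    return info


def enabled_features(info):
    """Return the names of the fields reported as 1, in log order."""
    return [name for name, value in info.items() if value == "1"]


# The system_info line copied from the log output in this post.
SAMPLE = (
    "system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | "
    "AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | "
    "F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |"
)

if __name__ == "__main__":
    info = parse_system_info(SAMPLE)
    print("threads:", info["n_threads"])
    print("enabled:", enabled_features(info))
```

For the log in this post, this reports 4 of 8 threads in use and AVX, AVX2, FMA, F16C, BLAS, and SSE3 enabled.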

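4) MacBook M1 and M2 users can also use the MacBook's GPU acceleration.

If you have a MacBook with an M1 or M2 chip, compile with Metal at the make step above to get noticeably faster generation. See the llama.cpp GitHub repository for details.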

$ LLAMA_METAL=1 make
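
The three build variants in this post (plain `make`, cuBLAS via cmake, and Metal via `LLAMA_METAL=1 make`) can be summarized as a selection by platform. This is a sketch, not part of llama.cpp: the function name `build_commands` and the `use_cublas` switch are hypothetical, `env LLAMA_METAL=1 make` is used as an equivalent of setting the variable inline, and the cmake calls use the `-S`/`-B` form in place of the `mkdir build` and `cd build` steps shown earlier.

```python
import platform
import shlex


def build_commands(system, machine, use_cublas=False):
    """Pick build commands for llama.cpp, to be run from the repository root."""
    if system == "Darwin" and machine == "arm64":
        # Apple Silicon (M1/M2): build with Metal support.
        return [["env", "LLAMA_METAL=1", "make"]]
    if use_cublas:
        # NVIDIA GPU: configure and build with cuBLAS through cmake.
        return [
            ["cmake", "-S", ".", "-B", "build", "-DLLAMA_CUBLAS=ON"],
            ["cmake", "--build", "build", "--config", "Release"],
        ]
    # Default: CPU-only build.
    return [["make"]]


if __name__ == "__main__":
    # Print the commands for the current machine (CPU-only unless on Apple Silicon).
    for cmd in build_commands(platform.system(), platform.machine()):
        print("$ " + shlex.join(cmd))
```

Whether a CUDA-capable GPU is present is left to the caller, since detecting it reliably depends on the local driver setup.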

5) Shall we try other chat models?

Many quantized LLaMA models are published on Hugging Face. For example, take a look at the following model from TheBloke.

https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML


Download the chat model file llama-2-13b-chat.ggmlv3.q4_0.bin from there, and you can experiment with it as follows.

$ ./main -m llama-2-13b-chat.ggmlv3.q4_0.bin -n 256 --repeat_penalty 1.0 --color -i -r "User:" --prompt "could you generate python code for generating prime numbers?"

... and so on. It is slow on a CPU, but the output comes out fine.


Posted by 작동미학

Information engineer (Maxwell's demon, quantum computers)