Building it for Linux against CUDA 12.x looked like a long haul, so I just tried it on Windows instead.

Maybe because the model is small, there doesn't seem to be any performance gain.. should I try the -sm option?

 

Googling it turned up a weird keyword, which turned out to be unusable -_-

Anyway, the story goes: layer is the default~ row is parallelized~ tensor is parallelized too~

At least on my aging cards, layer generally comes out on top.

D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64>llama-cli -m ..\gemma-4-E4B-it-UD-Q8_K_XL.gguf -sm graph
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 22527 MiB):
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
  Device 1: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
load_backend: loaded CUDA backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
error while handling argument "-sm": invalid value

usage:
-sm,   --split-mode {none,layer,row,tensor}
                                        how to split the model across multiple GPUs, one of:
                                        - none: use one GPU only
                                        - layer (default): split layers and KV across GPUs (pipelined)
                                        - row: split weight across GPUs by rows (parallelized)
                                        - tensor: split weights and KV across GPUs (parallelized,
                                        EXPERIMENTAL)
                                        (env: LLAMA_ARG_SPLIT_MODE)


to show complete usage, run with -h
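Since I'm reading the t/s off llama-cli's timing summary by hand, a small helper to pull it out of the log might be handy. The exact line prefix differs between builds (llama_print_timings vs. llama_perf_context_print), so this is a sketch keyed only on the "tokens per second" suffix; the sample line is made up:

```python
import re

# Sketch: extract the eval "tokens per second" figure from llama-cli's
# timing summary line. The sample line below is fabricated, not from my run.
sample = ("llama_perf_context_print:        eval time =    1000.00 ms / "
          "   40 runs   (   25.00 ms per token,    40.00 tokens per second)")

def eval_tps(log_text):
    # Key only on the "tokens per second" suffix, since the line's
    # prefix varies between llama.cpp builds.
    m = re.search(r"([\d.]+)\s+tokens per second", log_text)
    return float(m.group(1)) if m else None

print(eval_tps(sample))  # 40.0
```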

 

Maybe because of the PCIe generation, and the two cards being split across x8 lanes? Here layer is even more dismal.

gemma-4-E4B-it-UD-Q8_K_XL.gguf 8.05GB

none 40 t/s
layer 36 t/s
row 9 t/s
tensor 24 t/s

 

Qwen3.6-35B-A3B-UD-IQ1_M.gguf 9.35GB

none 40 t/s
layer 44 t/s
row 9 t/s
tensor 21 t/s

 

Does layer only start paying off once the model gets bigger?

gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf 15.9GB

none 13 t/s
layer 43 t/s
row -
tensor -
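Putting a number on that hunch — at 15.9 GB, layer is the first mode to clearly beat a single GPU (quick check on the figures above):

```python
# Speedup of layer mode over single-GPU (none) for the 15.9 GB model,
# using the t/s figures measured above.
none_tps, layer_tps = 13, 43
speedup = layer_tps / none_tps
print(round(speedup, 1))  # ~3.3x
```

Presumably because at this size a single 11 GiB card can no longer hold the whole model, so "none" is paying for spillover that layer split avoids.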

 

row fails to load:
D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64>llama-cli.exe -m ..\gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf -sm row
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 22527 MiB):
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
  Device 1: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
load_backend: loaded CUDA backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll

Loading model... -D:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:924: GGML_ASSERT(tensor->view_src == nullptr) failed

 

tensor fails to load:

D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64>llama-cli.exe -m ..\gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf -sm tensor
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 22527 MiB):
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
  Device 1: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
load_backend: loaded CUDA backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll

Loading model... -ggml_backend_cuda_buffer_type_alloc_buffer: allocating 2560.00 MiB on device 0: cudaMalloc failed: out of memory
D:/a/llama.cpp/llama.cpp/ggml/src/ggml-backend.cpp:119: GGML_ASSERT(buffer) failed

 

Qwen3.6-27B-Q5_K_M.gguf 18.1GB

none < 2 t/s
layer 7 t/s
row 5 t/s
tensor -

 

tensor fails to load:
D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64>llama-cli.exe -m ..\Qwen3.6-27B-Q5_K_M.gguf -sm tensor
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 22527 MiB):
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
  Device 1: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
load_backend: loaded CUDA backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll

Loading model... -ggml_backend_cuda_buffer_type_alloc_buffer: allocating 8192.00 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 8589934592
D:/a/llama.cpp/llama.cpp/ggml/src/ggml-backend.cpp:119: GGML_ASSERT(buffer) failed
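Sanity-checking the failed allocation size from that log — pure arithmetic on the logged numbers:

```python
# The assert above reports an 8589934592-byte buffer that could not be
# allocated on device 0; converting to MiB matches the 8192.00 MiB in the
# log, i.e. a single buffer eating most of one 11263 MiB card by itself.
failed_alloc_bytes = 8589934592
failed_alloc_mib = failed_alloc_bytes / 2**20
print(failed_alloc_mib)  # 8192.0

per_card_vram_mib = 11263
print(round(failed_alloc_mib / per_card_vram_mib, 2))  # 0.73, ~73% of one card
```

No wonder it dies once the rest of the model and the KV cache also want room on that card.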
Posted by 구차니