llama.cpp Windows CUDA 12, 1080 Ti 11GB * 2 test
구차니
2026. 4. 25. 15:50
Building it for Linux with CUDA 12.x looked like a long haul, so I just tried Windows instead.
Maybe it's because the model is small, but there doesn't seem to be any performance gain.. do I need to try the -sm option?

Googling turned up a bogus keyword (graph), which turned out to be unusable -_-
Anyway, the gist is: layer is the default~ row is parallelized~ tensor is parallelized too~
but, maybe because my cards are old, layer generally comes out ahead.
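As a rough mental model (this is only an illustration, not llama.cpp's actual implementation): layer split hands each GPU a contiguous block of layers that run one after another (pipelined), while row split slices every weight matrix across the GPUs so all cards work on every layer at once (parallelized), at the cost of a gather over PCIe per layer.

```python
# Toy sketch of the two split strategies; illustration only,
# not how llama.cpp actually partitions tensors.

def layer_split(n_layers, n_gpus):
    """Assign contiguous layer ranges to GPUs (pipelined):
    GPU i is only busy while its own layers are running."""
    per = n_layers // n_gpus
    return {g: list(range(g * per, (g + 1) * per)) for g in range(n_gpus)}

def row_split(n_rows, n_gpus):
    """Slice one weight matrix by rows (parallelized):
    every GPU touches every layer, then partial results are gathered."""
    per = n_rows // n_gpus
    return {g: (g * per, (g + 1) * per) for g in range(n_gpus)}

# 48-layer model on 2 GPUs: GPU 0 gets layers 0-23, GPU 1 gets 24-47
print(layer_split(48, 2))
# one 4096-row weight matrix: each GPU multiplies its own half of the rows
print(row_split(4096, 2))
```

The per-layer gather in the row scheme is the part that hammers the PCIe link, which would explain row being the slowest mode on this box.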
D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64>llama-cli -m ..\gemma-4-E4B-it-UD-Q8_K_XL.gguf -sm graph
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 22527 MiB):
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
  Device 1: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
load_backend: loaded CUDA backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
error while handling argument "-sm": invalid value
usage:
-sm, --split-mode {none,layer,row,tensor}
        how to split the model across multiple GPUs, one of:
          - none: use one GPU only
          - layer (default): split layers and KV across GPUs (pipelined)
          - row: split weight across GPUs by rows (parallelized)
          - tensor: split weights and KV across GPUs (parallelized, EXPERIMENTAL)
        (env: LLAMA_ARG_SPLIT_MODE)
to show complete usage, run with -h
Is it the PCIe generation, and the two cards each only getting x8 lanes? layer actually comes out worse than none here.
gemma-4-E4B-it-UD-Q8_K_XL.gguf 8.05GB
| split mode (-sm) | speed |
|---|---|
| none | 40 t/s |
| layer | 36 t/s |
| row | 9 t/s |
| tensor | 24 t/s |
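The 8.05 GB file fits comfortably on a single 11 GB card, which is presumably why -sm none wins here: no inter-GPU traffic at all. A quick back-of-the-envelope fit check (the overhead allowance for KV cache and compute buffers is a guessed number, not something measured):

```python
def fits_on_one_gpu(model_gb, vram_mib, overhead_gb=1.5):
    """Rough single-GPU fit check. overhead_gb is an assumed
    allowance for KV cache and compute buffers, not a measured value."""
    vram_gb = vram_mib / 1024
    return model_gb + overhead_gb <= vram_gb

print(fits_on_one_gpu(8.05, 11263))  # True  -> -sm none can keep it on one card
print(fits_on_one_gpu(15.9, 11263))  # False -> must split, or spill to CPU
```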
Qwen3.6-35B-A3B-UD-IQ1_M.gguf 9.35GB
| split mode (-sm) | speed |
|---|---|
| none | 40 t/s |
| layer | 44 t/s |
| row | 9 t/s |
| tensor | 21 t/s |
Does layer start to pay off once the model gets bigger?
gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf 15.9GB
| split mode (-sm) | speed |
|---|---|
| none | 13 t/s |
| layer | 43 t/s |
| row | - |
| tensor | - |
row load failure:

D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64>llama-cli.exe -m ..\gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf -sm row
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 22527 MiB):
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
  Device 1: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
load_backend: loaded CUDA backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
Loading model... -
D:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:924: GGML_ASSERT(tensor->view_src == nullptr) failed
tensor load failure:

D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64>llama-cli.exe -m ..\gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf -sm tensor
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 22527 MiB):
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
  Device 1: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
load_backend: loaded CUDA backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
Loading model... -
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 2560.00 MiB on device 0: cudaMalloc failed: out of memory
D:/a/llama.cpp/llama.cpp/ggml/src/ggml-backend.cpp:119: GGML_ASSERT(buffer) failed
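layer at 43 t/s against none at 13 t/s makes sense for this model: 15.9 GB cannot fit on one 11 GB card, so none has to spill part of the model to CPU, while layer puts roughly half the weights on each card. Back-of-the-envelope numbers (assuming a clean 50/50 split, ignoring KV cache and buffers):

```python
model_gb = 15.9
vram_gb = 11263 / 1024          # per-card VRAM from the log, ~11.0 GB

# -sm none: everything targets one card -> the remainder spills to CPU
spill_gb = max(0.0, model_gb - vram_gb)

# -sm layer: weights split roughly in half across the two cards
per_card_gb = model_gb / 2

print(f"spill with none : {spill_gb:.1f} GB")     # ~4.9 GB left on CPU
print(f"per card (layer): {per_card_gb:.1f} GB")  # ~8 GB each, fits in VRAM
```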
Qwen3.6-27B-Q5_K_M.gguf 18.1GB
| split mode (-sm) | speed |
|---|---|
| none | < 2 t/s |
| layer | 7 t/s |
| row | 5 t/s |
| tensor | - |
tensor load failure:

D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64>llama-cli.exe -m ..\Qwen3.6-27B-Q5_K_M.gguf -sm tensor
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 22527 MiB):
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
  Device 1: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
load_backend: loaded CUDA backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
Loading model... -
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 8192.00 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 8589934592
D:/a/llama.cpp/llama.cpp/ggml/src/ggml-backend.cpp:119: GGML_ASSERT(buffer) failed
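The failing allocation size in the log checks out: 8589934592 bytes is exactly 8192 MiB, an 8 GiB buffer that tensor mode wants on device 0 on top of whatever is already resident there; on an 11263 MiB card that leaves far too little for the rest, hence the OOM. (Why tensor mode sizes the buffer this way is my guess, not something the log states.)

```python
failed_alloc_bytes = 8_589_934_592        # size from the log
mib = failed_alloc_bytes / (1024 ** 2)
print(mib)                                # 8192.0 -> matches "8192.00 MiB"

vram_mib = 11263                          # per 1080 Ti, from ggml_cuda_init
print(vram_mib - mib)                     # 3071.0 MiB would remain for
                                          # everything else on that card
```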