load_tensors: offloading output layer to GPU
load_tensors: offloading 19 repeating layers to GPU
load_tensors: offloaded 20/29 layers to GPU
llama_context: CUDA_Host output buffer size = 0.49 MiB
llama_kv_cache: layer 0: dev = CPU
llama_kv_cache: layer 1: dev = CPU
llama_kv_cache: layer 2: dev = CPU
llama_kv_cache: layer 3: dev = CPU
llama_kv_cache: layer 4: dev = CPU
llama_kv_cache: layer 5: dev = CPU
llama_kv_cache: layer 6: dev = CPU
llama_kv_cache: layer 7: dev = CPU
llama_kv_cache: layer 8: dev = CPU
llama_kv_cache: layer 9: dev = CUDA0
llama_kv_cache: layer 10: dev = CUDA0
llama_kv_cache: layer 11: dev = CUDA0
llama_kv_cache: layer 12: dev = CUDA0
llama_kv_cache: layer 13: dev = CUDA0
llama_kv_cache: layer 14: dev = CUDA0
llama_kv_cache: layer 15: dev = CUDA0
llama_kv_cache: layer 16: dev = CUDA0
llama_kv_cache: layer 17: dev = CUDA0
llama_kv_cache: layer 18: dev = CUDA0
llama_kv_cache: layer 19: dev = CUDA0
llama_kv_cache: layer 20: dev = CUDA0
llama_kv_cache: layer 21: dev = CUDA0
llama_kv_cache: layer 22: dev = CUDA0
llama_kv_cache: layer 23: dev = CUDA0
llama_kv_cache: layer 24: dev = CUDA0
llama_kv_cache: layer 25: dev = CUDA0
llama_kv_cache: layer 26: dev = CUDA0
llama_kv_cache: layer 27: dev = CUDA0
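How many layers land on the GPU (and, with them, where each layer's KV cache lives) can also be pinned explicitly with -ngl / --n-gpu-layers instead of relying on whatever the loader decides. A minimal sketch, with the model path borrowed from the tests below; the exact flag value that produced the log above is an assumption:

REM cap GPU offload at 20 layers; the remaining layers and their KV cache stay on the CPU
llama-cli.exe -m ..\gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf -ngl 20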
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A             3040      C   ...n-cuda-12.4-x64\llama-cli.exe          N/A |
|    1   N/A  N/A             3040      C   ...n-cuda-12.4-x64\llama-cli.exe          N/A |
+-----------------------------------------------------------------------------------------+
D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64>llama-cli --list-devices
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 17407 MiB):
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
  Device 1: NVIDIA GeForce GTX 1060 6GB, compute capability 6.1, VMM: yes, VRAM: 6143 MiB
load_backend: loaded CUDA backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
Available devices:
  CUDA0: NVIDIA GeForce GTX 1080 Ti (11263 MiB, 10200 MiB free)
  CUDA1: NVIDIA GeForce GTX 1060 6GB (6143 MiB, 5197 MiB free)
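The CUDA0/CUDA1 names printed by --list-devices are exactly what -dev expects, which is how the single-GPU "none" rows below were measured per card. A sketch (combining -sm none with -dev here is my assumption about how those runs were invoked):

REM run everything on the 1080 Ti only; swap in CUDA1 to test the 1060 by itself
llama-cli.exe -m ..\gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf -sm none -dev CUDA0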
1080 Ti 11GB + 1060 6GB

gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf (15.9 GB)
  -sm none   : 17 t/s with -dev CUDA0, 10 t/s with -dev CUDA1
  -sm layer  : 19 t/s
  -sm row    : - (load failed, log below)
  -sm tensor : - (load failed, log below)
row: load failed

D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64>llama-cli.exe -m ..\gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf -sm row
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 17407 MiB):
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
  Device 1: NVIDIA GeForce GTX 1060 6GB, compute capability 6.1, VMM: yes, VRAM: 6143 MiB
load_backend: loaded CUDA backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
Loading model...
D:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:97: CUDA error
CUDA error: out of memory
tensor: load failed

D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64>llama-cli.exe -m ..\gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf -sm tensor
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 17407 MiB):
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
  Device 1: NVIDIA GeForce GTX 1060 6GB, compute capability 6.1, VMM: yes, VRAM: 6143 MiB
load_backend: loaded CUDA backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
Loading model...
/D:/a/llama.cpp/llama.cpp/ggml/src/ggml-backend.cpp:119: GGML_ASSERT(buffer) failed
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 8537.78 MiB on device 1: cudaMalloc failed: out of memory
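Both failures are the 6 GB card choking on its share of the split; the tensor-mode log shows an 8537.78 MiB allocation attempted on device 1. llama.cpp's -ts / --tensor-split flag lets you weight the split toward the larger card, though whether that rescues these particular models is untested here. A sketch:

REM hypothetical: weight the split roughly 2:1 toward the 11 GB 1080 Ti so device 1 gets a smaller share
llama-cli.exe -m ..\gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf -sm row -ts 11,6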
Qwen3.6-27B-Q5_K_M.gguf (18.1 GB)
  -sm none   : < 2 t/s with -dev CUDA0, < 0.1 t/s with -dev CUDA1
  -sm layer  : 2 t/s
  -sm row    : - (load failed, log below)
  -sm tensor : -
row: load failed

D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64>llama-cli.exe -m ..\Qwen3.6-27B-Q5_K_M.gguf -sm row
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 17407 MiB):
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
  Device 1: NVIDIA GeForce GTX 1060 6GB, compute capability 6.1, VMM: yes, VRAM: 6143 MiB
load_backend: loaded CUDA backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
Loading model...
CUDA error: out of memory
  current device: 1, in function ggml_backend_cuda_split_buffer_init_tensor at D:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:956
  ggml_cuda_device_malloc((void**)&buf, size, id)
D:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:97: CUDA error
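The t/s figures in these tables read like interactive generation speeds from llama-cli. For a more repeatable comparison, the llama-bench tool shipped in the same package can test the split modes with fixed prompt and generation lengths (whether it accepts the experimental tensor mode is not checked here). A sketch:

REM hypothetical sweep: one run per split mode, fixed prompt/generation lengths
llama-bench.exe -m ..\gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf -sm none -p 512 -n 128
llama-bench.exe -m ..\gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf -sm layer -p 512 -n 128
llama-bench.exe -m ..\gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf -sm row -p 512 -n 128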
Maybe because the model is small, there doesn't seem to be much of a performance gain.. should I try it with the -sm option?
Googling it just brings up weird, unrelated keywords, so search is no help -_-
Anyway, the story is: layer is the default, row is parallel, and tensor is parallel too.
For now, maybe because my cards are old, layer generally comes out ahead.
D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64>llama-cli -m ..\gemma-4-E4B-it-UD-Q8_K_XL.gguf -sm graph
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 22527 MiB):
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
  Device 1: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
load_backend: loaded CUDA backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
error while handling argument "-sm": invalid value
usage:
  -sm, --split-mode {none,layer,row,tensor}
      how to split the model across multiple GPUs, one of:
        - none: use one GPU only
        - layer (default): split layers and KV across GPUs (pipelined)
        - row: split weight across GPUs by rows (parallelized)
        - tensor: split weights and KV across GPUs (parallelized, EXPERIMENTAL)
      (env: LLAMA_ARG_SPLIT_MODE)
to show complete usage, run with -h
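The usage text also mentions the LLAMA_ARG_SPLIT_MODE environment variable, so a default split mode can be set once per cmd session instead of passing -sm on every run. A sketch:

REM set the default split mode for this cmd session; individual runs can still override it with -sm
set LLAMA_ARG_SPLIT_MODE=layer
llama-cli.exe -m ..\gemma-4-E4B-it-UD-Q8_K_XL.gguf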
Maybe it's because of the PCIe generation and the lanes being split x8/x8 across the two cards? layer comes out even worse here.
gemma-4-E4B-it-UD-Q8_K_XL.gguf (8.05 GB)
  -sm none   : 40 t/s
  -sm layer  : 36 t/s
  -sm row    : 9 t/s
  -sm tensor : 24 t/s
Qwen3.6-35B-A3B-UD-IQ1_M.gguf (9.35 GB)
  -sm none   : 40 t/s
  -sm layer  : 44 t/s
  -sm row    : 9 t/s
  -sm tensor : 21 t/s
Once the model gets bigger, does layer start to pay off?
gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf (15.9 GB)
  -sm none   : 13 t/s
  -sm layer  : 43 t/s
  -sm row    : - (load failed, log below)
  -sm tensor : -
row: load failed

D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64>llama-cli.exe -m ..\gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf -sm row
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 22527 MiB):
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
  Device 1: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
load_backend: loaded CUDA backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
Might also be worth trying to build with an older architecture, e.g. -DCMAKE_CUDA_ARCHITECTURES="75" (which will be run via PTX JIT compilation), to check whether the issue is related to building with 80.
D:\study\llm\llama-b8916-bin-win-cuda-13.1-x64>llama-cli.exe -m ..\gemma-4-E2B-it-Q4_K_M.gguf
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 11263 MiB):
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
load_backend: loaded CUDA backend from D:\study\llm\llama-b8916-bin-win-cuda-13.1-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\study\llm\llama-b8916-bin-win-cuda-13.1-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\study\llm\llama-b8916-bin-win-cuda-13.1-x64\ggml-cpu-haswell.dll
Loading model...
ggml_cuda_compute_forward: SCALE failed
CUDA error: no kernel image is available for execution on the device
  current device: 0, in function ggml_cuda_compute_forward at D:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:2962
  err
D:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:97: CUDA error
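The "no kernel image is available" error means the cuda-13.1 binary ships no code for the 1080 Ti's compute capability 6.1; CUDA 13 dropped Pascal-era targets, which is also why the cuda-12.4 package used in the transcript below works. Following the quoted suggestion, one alternative is rebuilding from source with a CUDA 12.x toolkit targeting Pascal directly. A sketch, assuming a checkout of the llama.cpp source and default generator settings:

REM hypothetical rebuild with a CUDA 12.x toolkit, emitting real device code for compute capability 6.1 (Pascal)
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=61
cmake --build build --config Release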
build      : b8918-e583f3b4f
model      : gemma-4-E2B-it-Q4_K_M.gguf
modalities : text
available commands:
  /exit or Ctrl+C   stop or exit
  /regen            regenerate the last response
  /clear            clear the chat history
  /read <file>      add a text file
  /glob <pattern>   add text files using globbing pattern
> 안녕?
안녕하세요! 저는 사용자님을 돕기 위해 여기 있습니다. 😊
궁금한 점이 있으시거나 도움이 필요하시면 언제든지 말씀해 주세요!
[ Prompt: 182.3 t/s | Generation: 84.7 t/s ]
> 헬로?
[Start thinking] Thinking Process:
1. **Analyze the input:** The user input is "헬로?" (Hello?). It's a very informal greeting, essentially a variation of "Hello?".
2. **Determine the context:** This is a casual interaction following a previous greeting ("안녕?" -> "안녕하세요!"). The user is acknowledging or testing the connection.
3. **Formulate the response strategy:**
   * Acknowledge the greeting warmly.
   * Match the casual tone (but remain polite and helpful).
   * Reiterate availability.
   * Invite further interaction.
4. **Draft potential responses (in Korean):**
   * (Option 1 - Simple echo): 네, 안녕하세요. (Yes, hello.)
   * (Option 2 - Enthusiastic): 네, 안녕하세요! 또 무엇을 도와드릴까요? (Yes, hello! What else can I help you with?)
   * (Option 3 - Casual/Friendly): 네! 저는 준비되어 있습니다. 😊 (Yes! I am ready.)
5. **Select the best response:** Option 2 is the most effective as it confirms the interaction and immediately prompts the user to state their need, moving the conversation forward.
6. **Final Polish:** (Translate the chosen response into natural Korean.) [End thinking]
네, 안녕하세요! 😊
저는 언제든지 사용자님의 질문에 답하거나 필요한 정보를 제공해 드릴 준비가 되어 있습니다.
궁금한 것이 있으신가요?
[ Prompt: 321.5 t/s | Generation: 84.8 t/s ]
> 으아앙
[Start thinking] Thinking Process:
1. **Analyze the input:** The user input is "으아앙" (Eu-aang).
2. **Determine the nature of the input:** "으아앙" is an interjection or sound, often used to express various emotions (pain, excitement, frustration, distress, playful sounds, etc.). In this context, it's highly informal and emotional.
3. **Assess the context:** The previous interactions were simple greetings ("안녕?", "헬로?"). The current input is a sudden, emotional expression.
4. **Determine the appropriate response strategy:**
   * Since the input is purely emotional and doesn't pose a question or request a task, the response should be empathetic, neutral, and open-ended.
   * The tone should be gentle and non-judgmental.
   * Acknowledge the emotion without demanding further explanation, allowing the user to guide the conversation.
5. **Draft potential responses (in Korean):**
   * *Option A (Direct inquiry):* 무슨 일 있으세요? (What's going on?) - A bit too direct.
   * *Option B (Empathy + Open invitation):* 무슨 일이 있으신가요? 괜찮으세요? (Is something happening? Are you okay?) - Good balance.
   * *Option C (Simple acknowledgment):* 네? (Yes?) / 어떠세요? (How are you?) - Too brief.
6. **Select and refine the best response (B is the most helpful):** Acknowledge the sound/feeling and gently ask if the user needs assistance or wants to talk about it.
7. **Final Check:** The response is safe, empathetic, and keeps the door open for the user to continue the conversation on their terms. (Self-correction: Avoid sounding too robotic. Keep the tone friendly.) [End thinking]
무슨 일이 있으신가요? 😊
혹시 힘든 일이 있으시거나, 이야기하고 싶으신 것이 있다면 편하게 말씀해 주세요. 제가 들어드릴게요.
build      : b8916-ffdd983fb
model      : gemma-4-E2B-it-Q4_K_M.gguf
modalities : text
available commands:
  /exit or Ctrl+C   stop or exit
  /regen            regenerate the last response
  /clear            clear the chat history
  /read <file>      add a text file
  /glob <pattern>   add text files using globbing pattern
> 안녕?
[Start thinking] Thinking Process:
1. **Analyze the input:** The input is "안녕?" (Annyeong?), which means "Hello?" or "Hi?". It is a casual greeting in Korean.
2. **Determine the user's intent:** The user is initiating a conversation or simply saying hello.
3. **Formulate an appropriate response (in Korean):**
   * Acknowledge the greeting.
   * Return the greeting.
   * (Optional but good) Ask how the user is doing or what they need.
4. **Draft potential responses:**
   * *Simple:* 안녕하세요. (Annyeonghaseyo.) - Slightly formal.
   * *Casual (matching the user's tone):* 안녕! / 안녕. (Annyeong!)
   * *More engaging:* 안녕하세요! 저는 잘 지내요. (Hello! I am doing well.)
   * *Standard AI response:* 안녕하세요! 무엇을 도와드릴까요? (Hello! What can I help you with?)
5. **Select the best response:** Since the user used a casual greeting ("안녕?"), a friendly and equally approachable response is best.
6. **Final Korean output generation:** 안녕하세요! 저는 준비되어 있습니다. (Hello! I am ready.) or simply 안녕하세요! (Hello!) combined with a question.
*(Self-Correction: Keep it simple and friendly.)*
*Selected response:* 안녕하세요! 😊 (Hello! 😊) or 안녕하세요! 반갑습니다. (Hello! Nice to meet you.)
*(Final output generation based on standard conversational flow.)* [End thinking]
안녕하세요! 😊
[ Prompt: 132.7 t/s | Generation: 85.2 t/s ]
> 헬로?
[Start thinking] Thinking Process:
1. **Analyze the input:** The user input is "헬로?" (Hello?). This is a casual greeting, often used in a slightly informal or friendly manner. 2. **Determine the user's intent:** The user is initiating a conversation or simply saying hello. 3. **Formulate an appropriate response:** * Acknowledge the greeting. * Return a friendly greeting. * (Optional but good practice) Ask how the user is doing or invite them to state their purpose. 4. **Draft the response (in Korean):** * A simple greeting is best. (e.g., "안녕하세요.") * A slightly warmer response is good. (e.g., "네, 안녕하세요!") * Ask what they need. (e.g., "무엇을 도와드릴까요?") 5. **Final selection:** Combine these elements into a natural and polite conversational flow. (The response should match the casual tone set by the user.) [End thinking]
네, 안녕하세요! 😊
저는 사용자님을 도와드릴 준비가 되어 있어요. 무엇을 도와드릴까요?
[ Prompt: 285.9 t/s | Generation: 84.6 t/s ]
> 으아앙
[Start thinking] Thinking Process:
1. **Analyze the input:** The user input is "으아앙" (Eu-aang).
2. **Determine the meaning/context:** "으아앙" is an interjection or sound often used to express various emotions, such as:
   * Crying/Whining (often cute or dramatic).
   * Excitement/Joy.
   * Pain/Distress (less common in casual chat unless context is provided).
   * A general sound of emotion.
3. **Identify the required response strategy:** Since the input is purely emotional/vocal, the response should be empathetic, gentle, and open-ended, inviting the user to share what they are feeling or what they need.
4. **Draft potential responses (Korean):**
   * *Option 1 (Direct empathy):* 무슨 일 있으세요? (Is something wrong?)
   * *Option 2 (Gentle inquiry):* 무슨 기분이신가요? (How are you feeling?)
   * *Option 3 (Simple acknowledgment):* 네, 무슨 일이에요? (Yes, what is it?)
   * *Option 4 (Playful/Calm):* 괜찮으세요? (Are you okay?)
5. **Select the best response:** Option 2 and 4 are the most supportive. Combining them makes the response warm and responsive.
6. **Final Polish:** (Choose a friendly, open response.) [End thinking]
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A             2697    C+G   ../llama-server                       6771MiB |
+-----------------------------------------------------------------------------------------+
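This last process list shows llama-server rather than llama-cli, so a model was also being served over HTTP at that point. A hypothetical launch for reference only; the model path, offload count, and port are assumptions, not the exact command behind this snapshot:

REM hypothetical: serve the small Gemma model with full GPU offload on a local port
llama-server.exe -m ..\gemma-4-E2B-it-Q4_K_M.gguf -ngl 99 --port 8080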