llama.cpp on Windows / CUDA vs Vulkan

구차니 2026. 4. 24. 23:00

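I expected a performance difference between the two (I assumed CUDA would be faster),

but the results came out nearly identical, within the margin of error.

So does that mean it is usable on Linux too, even without a CUDA build?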

However, perhaps because this is a GTX 1080 Ti, the CUDA 13 build failed to run, while the CUDA 12 build worked.

D:\study\llm\llama-b8916-bin-win-cuda-13.1-x64>llama-cli.exe -m ..\gemma-4-E2B-it-Q4_K_M.gguf
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 11263 MiB):
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
load_backend: loaded CUDA backend from D:\study\llm\llama-b8916-bin-win-cuda-13.1-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\study\llm\llama-b8916-bin-win-cuda-13.1-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\study\llm\llama-b8916-bin-win-cuda-13.1-x64\ggml-cpu-haswell.dll

Loading model... /ggml_cuda_compute_forward: SCALE failed
CUDA error: no kernel image is available for execution on the device
  current device: 0, in function ggml_cuda_compute_forward at D:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:2962
  err
D:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:97: CUDA error
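
The "no kernel image is available" error generally means the binary has no kernels compiled for this GPU's architecture. Below is a minimal sketch that reads the compute capability from the `ggml_cuda_init` device line and compares it to a minimum. The 7.5 threshold is an assumption, based on the CUDA 13 toolkit dropping Pascal-generation targets; it has not been checked against how this particular llama.cpp release was built. The names `ASSUMED_MIN_CC_CUDA13`, `parse_devices`, and `is_supported` are made up for illustration.

```python
import re

# Assumption: the CUDA 13 build only ships kernels for compute capability 7.5+.
ASSUMED_MIN_CC_CUDA13 = (7, 5)

# Matches device lines as printed by ggml_cuda_init in the log above.
DEVICE_RE = re.compile(r"Device (\d+): (.+?), compute capability (\d+)\.(\d+)")


def parse_devices(log_text):
    """Return (index, name, (major, minor)) for each device line in the log."""
    return [
        (int(m.group(1)), m.group(2), (int(m.group(3)), int(m.group(4))))
        for m in DEVICE_RE.finditer(log_text)
    ]


def is_supported(cc, min_cc=ASSUMED_MIN_CC_CUDA13):
    """Tuple comparison: (6, 1) < (7, 5), so a 1080 Ti falls below the threshold."""
    return cc >= min_cc


log = "  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB"

for index, name, cc in parse_devices(log):
    status = "ok" if is_supported(cc) else "below assumed minimum"
    print(f"Device {index}: {name} (cc {cc[0]}.{cc[1]}): {status}")
```

If that assumption holds, it would explain why the CUDA 12.4 build runs on the same card.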

This is the gemma4 e2b model, a 2.89 GB file, and it seems to take up memory fairly honestly, about as much as its size.

Fri Apr 24 22:57:19 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 582.28                 Driver Version: 582.28         CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti   WDDM  |   00000000:01:00.0 Off |                  N/A |
| 27%   39C    P8             11W /  250W |    2936MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            1400    C+G   ...-win-vulkan-x64\llama-cli.exe      N/A      |
|    0   N/A  N/A            1452      C   ...al\Programs\Ollama\ollama.exe      N/A      |
|    0   N/A  N/A            9728    C+G   ...em32\Kinect\KinectService.exe      N/A      |
|    0   N/A  N/A           13036    C+G   ...s\Win64\EpicGamesLauncher.exe      N/A      |
+-----------------------------------------------------------------------------------------+

D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64>llama-cli.exe -m ..\gemma-4-E2B-it-Q4_K_M.gguf
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 11263 MiB):
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
load_backend: loaded CUDA backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll

Loading model...

▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8918-e583f3b4f
model      : gemma-4-E2B-it-Q4_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern

> 안녕?

안녕하세요! 저는 사용자님을 돕기 위해 여기 있습니다. 😊

궁금한 점이 있으시거나 도움이 필요하시면 언제든지 말씀해 주세요!

[ Prompt: 182.3 t/s | Generation: 84.7 t/s ]

> 헬로?

[Start thinking]
Thinking Process:

1.  **Analyze the input:** The user input is "헬로?" (Hello?). It's a very informal greeting, essentially a variation of "Hello?".
2.  **Determine the context:** This is a casual interaction following a previous greeting ("안녕?" -> "안녕하세요!"). The user is acknowledging or testing the connection.
3.  **Formulate the response strategy:**
    *   Acknowledge the greeting warmly.
    *   Match the casual tone (but remain polite and helpful).
    *   Reiterate availability.
    *   Invite further interaction.
4.  **Draft potential responses (in Korean):**
    *   (Option 1 - Simple echo): 네, 안녕하세요. (Yes, hello.)
    *   (Option 2 - Enthusiastic): 네, 안녕하세요! 또 무엇을 도와드릴까요? (Yes, hello! What else can I help you with?)
    *   (Option 3 - Casual/Friendly): 네! 저는 준비되어 있습니다. 😊 (Yes! I am ready.)
5.  **Select the best response:** Option 2 is the most effective as it confirms the interaction and immediately prompts the user to state their need, moving the conversation forward.

6.  **Final Polish:** (Translate the chosen response into natural Korean.)
[End thinking]

네, 안녕하세요! 😊

저는 언제든지 사용자님의 질문에 답하거나 필요한 정보를 제공해 드릴 준비가 되어 있습니다.

궁금한 것이 있으신가요?

[ Prompt: 321.5 t/s | Generation: 84.8 t/s ]

> 으아앙

[Start thinking]
Thinking Process:

1.  **Analyze the input:** The user input is "으아앙" (Eu-aang).
2.  **Determine the nature of the input:** "으아앙" is an interjection or sound, often used to express various emotions (pain, excitement, frustration, distress, playful sounds, etc.). In this context, it's highly informal and emotional.
3.  **Assess the context:** The previous interactions were simple greetings ("안녕?", "헬로?"). The current input is a sudden, emotional expression.
4.  **Determine the appropriate response strategy:**
    *   Since the input is purely emotional and doesn't pose a question or request a task, the response should be empathetic, neutral, and open-ended.
    *   The tone should be gentle and non-judgmental.
    *   Acknowledge the emotion without demanding further explanation, allowing the user to guide the conversation.
5.  **Draft potential responses (in Korean):**
    *   *Option A (Direct inquiry):* 무슨 일 있으세요? (What's going on?) - A bit too direct.
    *   *Option B (Empathy + Open invitation):* 무슨 일이 있으신가요? 괜찮으세요? (Is something happening? Are you okay?) - Good balance.
    *   *Option C (Simple acknowledgment):* 네? (Yes?) / 어떠세요? (How are you?) - Too brief.
6.  **Select and refine the best response (B is the most helpful):** Acknowledge the sound/feeling and gently ask if the user needs assistance or wants to talk about it.

7.  **Final Check:** The response is safe, empathetic, and keeps the door open for the user to continue the conversation on their terms. (Self-correction: Avoid sounding too robotic. Keep the tone friendly.)
[End thinking]

무슨 일이 있으신가요? 😊

혹시 힘든 일이 있으시거나, 이야기하고 싶으신 것이 있다면 편하게 말씀해 주세요. 제가 들어드릴게요.

[ Prompt: 1014.9 t/s | Generation: 84.9 t/s ]

>

Exiting...
common_memory_breakdown_print: | memory breakdown [MiB]  | total   free    self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - CUDA0 (GTX 1080 Ti) | 11263 = 7192 + (2981 =  1407 +     780 +     793) +        1090 |
common_memory_breakdown_print: |   - Host                |                 2037 =  1756 +       0 +     281                |
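
The CUDA0 row of the breakdown above can be checked with a bit of arithmetic. This is only a sketch with the MiB values copied by hand from that row. The reading that total = free + self + unaccounted, and self = model + context + compute, is inferred from the layout of the table.

```python
# MiB values copied from the CUDA0 (GTX 1080 Ti) row of the memory breakdown.
total, free, self_mib, unaccounted = 11263, 7192, 2981, 1090
model, context, compute = 1407, 780, 793

# The top-level figures add up exactly.
top_level = free + self_mib + unaccounted
print(f"free + self + unaccounted = {top_level} MiB (reported total: {total})")

# The parts of "self" come out 1 MiB short, presumably because each
# term is rounded to a whole MiB before printing (assumption).
parts = model + context + compute
print(f"model + context + compute = {parts} MiB (reported self: {self_mib})")
```

So the llama-cli process itself accounts for roughly 2.9 GiB of the card's VRAM, and the rest is either free or used by other processes.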

D:\study\llm\llama-b8916-bin-win-vulkan-x64>llama-cli.exe -m ..\gemma-4-E2B-it-Q4_K_M.gguf
load_backend: loaded RPC backend from D:\study\llm\llama-b8916-bin-win-vulkan-x64\ggml-rpc.dll
load_backend: loaded Vulkan backend from D:\study\llm\llama-b8916-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from D:\study\llm\llama-b8916-bin-win-vulkan-x64\ggml-cpu-haswell.dll

Loading model...

▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8916-ffdd983fb
model      : gemma-4-E2B-it-Q4_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern

> 안녕?

[Start thinking]
Thinking Process:

1.  **Analyze the input:** The input is "안녕?" (Annyeong?), which means "Hello?" or "Hi?". It is a casual greeting in Korean.
2.  **Determine the user's intent:** The user is initiating a conversation or simply saying hello.
3.  **Formulate an appropriate response (in Korean):**
    *   Acknowledge the greeting.
    *   Return the greeting.
    *   (Optional but good) Ask how the user is doing or what they need.
4.  **Draft potential responses:**
    *   *Simple:* 안녕하세요. (Annyeonghaseyo.) - Slightly formal.
    *   *Casual (matching the user's tone):* 안녕! / 안녕. (Annyeong!)
    *   *More engaging:* 안녕하세요! 저는 잘 지내요. (Hello! I am doing well.)
    *   *Standard AI response:* 안녕하세요! 무엇을 도와드릴까요? (Hello! What can I help you with?)
5.  **Select the best response:** Since the user used a casual greeting ("안녕?"), a friendly and equally approachable response is best.

6.  **Final Korean output generation:** 안녕하세요! 저는 준비되어 있습니다. (Hello! I am ready.) or simply 안녕하세요! (Hello!) combined with a question.

*(Self-Correction: Keep it simple and friendly.)*

*Selected response:* 안녕하세요! 😊 (Hello! 😊) or 안녕하세요! 반갑습니다. (Hello! Nice to meet you.)

*(Final output generation based on standard conversational flow.)*
[End thinking]

안녕하세요! 😊

[ Prompt: 132.7 t/s | Generation: 85.2 t/s ]

> 헬로?

[Start thinking]
Thinking Process:

1.  **Analyze the input:** The user input is "헬로?" (Hello?). This is a casual greeting, often used in a slightly informal or friendly manner.
2.  **Determine the user's intent:** The user is initiating a conversation or simply saying hello.
3.  **Formulate an appropriate response:**
    *   Acknowledge the greeting.
    *   Return a friendly greeting.
    *   (Optional but good practice) Ask how the user is doing or invite them to state their purpose.
4.  **Draft the response (in Korean):**
    *   A simple greeting is best. (e.g., "안녕하세요.")
    *   A slightly warmer response is good. (e.g., "네, 안녕하세요!")
    *   Ask what they need. (e.g., "무엇을 도와드릴까요?")
5.  **Final selection:** Combine these elements into a natural and polite conversational flow. (The response should match the casual tone set by the user.)
[End thinking]

네, 안녕하세요! 😊

저는 사용자님을 도와드릴 준비가 되어 있어요. 무엇을 도와드릴까요?

[ Prompt: 285.9 t/s | Generation: 84.6 t/s ]

> 으아앙

[Start thinking]
Thinking Process:

1.  **Analyze the input:** The user input is "으아앙" (Eu-aang).
2.  **Determine the meaning/context:** "으아앙" is an interjection or sound often used to express various emotions, such as:
    *   Crying/Whining (often cute or dramatic).
    *   Excitement/Joy.
    *   Pain/Distress (less common in casual chat unless context is provided).
    *   A general sound of emotion.
3.  **Identify the required response strategy:** Since the input is purely emotional/vocal, the response should be empathetic, gentle, and open-ended, inviting the user to share what they are feeling or what they need.
4.  **Draft potential responses (Korean):**
    *   *Option 1 (Direct empathy):* 무슨 일 있으세요? (Is something wrong?)
    *   *Option 2 (Gentle inquiry):* 무슨 기분이신가요? (How are you feeling?)
    *   *Option 3 (Simple acknowledgment):* 네, 무슨 일이에요? (Yes, what is it?)
    *   *Option 4 (Playful/Calm):* 괜찮으세요? (Are you okay?)
5.  **Select the best response:** Option 2 and 4 are the most supportive. Combining them makes the response warm and responsive.

6.  **Final Polish:** (Choose a friendly, open response.)
[End thinking]

무슨 일 있으신가요? 😊 혹시 힘든 일이 있으신가요?

편하게 이야기해 주세요. 제가 들어드릴게요.

[ Prompt: 18.7 t/s | Generation: 85.4 t/s ]

>
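
To put a number on "within the margin of error", here is a small sketch that pulls the generation speeds out of the `[ Prompt: ... | Generation: ... ]` lines in the two transcripts above and compares the averages. The helper name `generation_speeds` is made up for illustration. Only the three turns per backend shown above are used, so this is a rough check rather than a proper benchmark.

```python
import re
from statistics import mean

# Matches the per-turn speed summary printed by llama-cli in the logs above.
SPEED_RE = re.compile(r"\[ Prompt: ([\d.]+) t/s \| Generation: ([\d.]+) t/s \]")


def generation_speeds(log_text):
    """Return the generation speeds (t/s) found in a llama-cli transcript."""
    return [float(gen) for _prompt, gen in SPEED_RE.findall(log_text)]


# Speed lines copied from the CUDA 12.4 run (build b8918).
cuda_log = """
[ Prompt: 182.3 t/s | Generation: 84.7 t/s ]
[ Prompt: 321.5 t/s | Generation: 84.8 t/s ]
[ Prompt: 1014.9 t/s | Generation: 84.9 t/s ]
"""

# Speed lines copied from the Vulkan run (build b8916).
vulkan_log = """
[ Prompt: 132.7 t/s | Generation: 85.2 t/s ]
[ Prompt: 285.9 t/s | Generation: 84.6 t/s ]
[ Prompt: 18.7 t/s | Generation: 85.4 t/s ]
"""

cuda_mean = mean(generation_speeds(cuda_log))
vulkan_mean = mean(generation_speeds(vulkan_log))
diff_pct = (vulkan_mean - cuda_mean) / cuda_mean * 100

print(f"CUDA 12.4: {cuda_mean:.2f} t/s, Vulkan: {vulkan_mean:.2f} t/s, diff: {diff_pct:+.2f}%")
```

The averages come out at about 84.8 t/s for CUDA and 85.1 t/s for Vulkan, a gap well under 1%. The prompt speeds are left out of the comparison because they swing widely (18.7 to 1014.9 t/s) on prompts this short.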
