Running it at the stock 250 W:

 

Limited to 125 W:

 

If you figure ~30 tokens/s and ~35 tokens/s are about the same, performance stays at a similar level,

so drawing half the power for only a slight slowdown seems like a pretty good setting.
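Using the rough readings above (assumed ballpark figures of ~35 t/s at 250 W vs. ~30 t/s at 125 W), the efficiency gain can be sanity-checked:

```python
def tokens_per_watt(tokens_per_sec: float, watts: float) -> float:
    """Generation throughput per watt of board power."""
    return tokens_per_sec / watts

full = tokens_per_watt(35, 250)    # 0.14 t/s per W at the stock limit
capped = tokens_per_watt(30, 125)  # 0.24 t/s per W at the 125 W cap
print(f"stock 250 W : {full:.3f} t/s per W")
print(f"capped 125 W: {capped:.3f} t/s per W")
```

Roughly 70% better tokens-per-watt for a ~15% speed loss, which matches the "not bad" verdict.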

Posted by 구차니

If you run llama.cpp's llama-server, you can't switch models on the fly, but

you get a web UI similar to ChatGPT or Claude.

 

It also shows a brief token generation speed readout.

 

The settings are as follows.
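For reference, launching the server looks something like this (the model path and port here are examples, not the exact ones used in this post; `-ngl` offloads layers to the GPU):

```shell
# start llama-server; it serves an OpenAI-compatible API plus the web UI
./llama-server -m gemma-4-E2B-it-Q4_K_M.gguf --host 127.0.0.1 --port 8080 -ngl 99
# then open http://127.0.0.1:8080 in a browser
```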

 

 


I expected a performance difference between the two (figured CUDA would come out ahead..),

but the results were essentially identical, within the margin of error.

So does that mean that on Linux, too, it's usable without even building for CUDA?

 

That said, maybe because it's a 1080 Ti, the CUDA 13 build misbehaves and I only got it working with the CUDA 12 build. (CUDA 13.0 dropped support for pre-Turing GPUs, so Pascal's compute capability 6.1 presumably explains the "no kernel image" error below.)

D:\study\llm\llama-b8916-bin-win-cuda-13.1-x64>llama-cli.exe -m ..\gemma-4-E2B-it-Q4_K_M.gguf
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 11263 MiB):
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
load_backend: loaded CUDA backend from D:\study\llm\llama-b8916-bin-win-cuda-13.1-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\study\llm\llama-b8916-bin-win-cuda-13.1-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\study\llm\llama-b8916-bin-win-cuda-13.1-x64\ggml-cpu-haswell.dll

Loading model... /ggml_cuda_compute_forward: SCALE failed
CUDA error: no kernel image is available for execution on the device
  current device: 0, in function ggml_cuda_compute_forward at D:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:2962
  err
D:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:97: CUDA error

 

This is the gemma4 e2b model, a 2.89 GB file, and it seems to consume VRAM pretty honestly, about in line with its size.

Fri Apr 24 22:57:19 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 582.28                 Driver Version: 582.28         CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti   WDDM  |   00000000:01:00.0 Off |                  N/A |
| 27%   39C    P8             11W /  250W |    2936MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            1400    C+G   ...-win-vulkan-x64\llama-cli.exe      N/A      |
|    0   N/A  N/A            1452      C   ...al\Programs\Ollama\ollama.exe      N/A      |
|    0   N/A  N/A            9728    C+G   ...em32\Kinect\KinectService.exe      N/A      |
|    0   N/A  N/A           13036    C+G   ...s\Win64\EpicGamesLauncher.exe      N/A      |
+-----------------------------------------------------------------------------------------+

 

D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64>llama-cli.exe -m ..\gemma-4-E2B-it-Q4_K_M.gguf
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 11263 MiB):
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
load_backend: loaded CUDA backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll

Loading model...


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8918-e583f3b4f
model      : gemma-4-E2B-it-Q4_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern


> 안녕?

안녕하세요! 저는 사용자님을 돕기 위해 여기 있습니다. 😊

궁금한 점이 있으시거나 도움이 필요하시면 언제든지 말씀해 주세요!

[ Prompt: 182.3 t/s | Generation: 84.7 t/s ]

> 헬로?

[Start thinking]
Thinking Process:

1.  **Analyze the input:** The user input is "헬로?" (Hello?). It's a very informal greeting, essentially a variation of "Hello?".
2.  **Determine the context:** This is a casual interaction following a previous greeting ("안녕?" -> "안녕하세요!"). The user is acknowledging or testing the connection.
3.  **Formulate the response strategy:**
    *   Acknowledge the greeting warmly.
    *   Match the casual tone (but remain polite and helpful).
    *   Reiterate availability.
    *   Invite further interaction.
4.  **Draft potential responses (in Korean):**
    *   (Option 1 - Simple echo): 네, 안녕하세요. (Yes, hello.)
    *   (Option 2 - Enthusiastic): 네, 안녕하세요! 또 무엇을 도와드릴까요? (Yes, hello! What else can I help you with?)
    *   (Option 3 - Casual/Friendly): 네! 저는 준비되어 있습니다. 😊 (Yes! I am ready.)
5.  **Select the best response:** Option 2 is the most effective as it confirms the interaction and immediately prompts the user to state their need, moving the conversation forward.

6.  **Final Polish:** (Translate the chosen response into natural Korean.)
[End thinking]

네, 안녕하세요! 😊

저는 언제든지 사용자님의 질문에 답하거나 필요한 정보를 제공해 드릴 준비가 되어 있습니다.

궁금한 것이 있으신가요?

[ Prompt: 321.5 t/s | Generation: 84.8 t/s ]

> 으아앙

[Start thinking]
Thinking Process:

1.  **Analyze the input:** The user input is "으아앙" (Eu-aang).
2.  **Determine the nature of the input:** "으아앙" is an interjection or sound, often used to express various emotions (pain, excitement, frustration, distress, playful sounds, etc.). In this context, it's highly informal and emotional.
3.  **Assess the context:** The previous interactions were simple greetings ("안녕?", "헬로?"). The current input is a sudden, emotional expression.
4.  **Determine the appropriate response strategy:**
    *   Since the input is purely emotional and doesn't pose a question or request a task, the response should be empathetic, neutral, and open-ended.
    *   The tone should be gentle and non-judgmental.
    *   Acknowledge the emotion without demanding further explanation, allowing the user to guide the conversation.
5.  **Draft potential responses (in Korean):**
    *   *Option A (Direct inquiry):* 무슨 일 있으세요? (What's going on?) - A bit too direct.
    *   *Option B (Empathy + Open invitation):* 무슨 일이 있으신가요? 괜찮으세요? (Is something happening? Are you okay?) - Good balance.
    *   *Option C (Simple acknowledgment):* 네? (Yes?) / 어떠세요? (How are you?) - Too brief.
6.  **Select and refine the best response (B is the most helpful):** Acknowledge the sound/feeling and gently ask if the user needs assistance or wants to talk about it.

7.  **Final Check:** The response is safe, empathetic, and keeps the door open for the user to continue the conversation on their terms. (Self-correction: Avoid sounding too robotic. Keep the tone friendly.)
[End thinking]

무슨 일이 있으신가요? 😊

혹시 힘든 일이 있으시거나, 이야기하고 싶으신 것이 있다면 편하게 말씀해 주세요. 제가 들어드릴게요.

[ Prompt: 1014.9 t/s | Generation: 84.9 t/s ]

>

Exiting...
common_memory_breakdown_print: | memory breakdown [MiB]  | total   free    self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - CUDA0 (GTX 1080 Ti) | 11263 = 7192 + (2981 =  1407 +     780 +     793) +        1090 |
common_memory_breakdown_print: |   - Host                |                 2037 =  1756 +       0 +     281                |

 

D:\study\llm\llama-b8916-bin-win-vulkan-x64>llama-cli.exe -m ..\gemma-4-E2B-it-Q4_K_M.gguf
load_backend: loaded RPC backend from D:\study\llm\llama-b8916-bin-win-vulkan-x64\ggml-rpc.dll
load_backend: loaded Vulkan backend from D:\study\llm\llama-b8916-bin-win-vulkan-x64\ggml-vulkan.dll
load_backend: loaded CPU backend from D:\study\llm\llama-b8916-bin-win-vulkan-x64\ggml-cpu-haswell.dll

Loading model...


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8916-ffdd983fb
model      : gemma-4-E2B-it-Q4_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern


> 안녕?

[Start thinking]
Thinking Process:

1.  **Analyze the input:** The input is "안녕?" (Annyeong?), which means "Hello?" or "Hi?". It is a casual greeting in Korean.
2.  **Determine the user's intent:** The user is initiating a conversation or simply saying hello.
3.  **Formulate an appropriate response (in Korean):**
    *   Acknowledge the greeting.
    *   Return the greeting.
    *   (Optional but good) Ask how the user is doing or what they need.
4.  **Draft potential responses:**
    *   *Simple:* 안녕하세요. (Annyeonghaseyo.) - Slightly formal.
    *   *Casual (matching the user's tone):* 안녕! / 안녕. (Annyeong!)
    *   *More engaging:* 안녕하세요! 저는 잘 지내요. (Hello! I am doing well.)
    *   *Standard AI response:* 안녕하세요! 무엇을 도와드릴까요? (Hello! What can I help you with?)
5.  **Select the best response:** Since the user used a casual greeting ("안녕?"), a friendly and equally approachable response is best.

6.  **Final Korean output generation:** 안녕하세요! 저는 준비되어 있습니다. (Hello! I am ready.) or simply 안녕하세요! (Hello!) combined with a question.

*(Self-Correction: Keep it simple and friendly.)*

*Selected response:* 안녕하세요! 😊 (Hello! 😊) or 안녕하세요! 반갑습니다. (Hello! Nice to meet you.)

*(Final output generation based on standard conversational flow.)*
[End thinking]

안녕하세요! 😊

[ Prompt: 132.7 t/s | Generation: 85.2 t/s ]

> 헬로?

[Start thinking]
Thinking Process:

1.  **Analyze the input:** The user input is "헬로?" (Hello?). This is a casual greeting, often used in a slightly informal or friendly manner.
2.  **Determine the user's intent:** The user is initiating a conversation or simply saying hello.
3.  **Formulate an appropriate response:**
    *   Acknowledge the greeting.
    *   Return a friendly greeting.
    *   (Optional but good practice) Ask how the user is doing or invite them to state their purpose.
4.  **Draft the response (in Korean):**
    *   A simple greeting is best. (e.g., "안녕하세요.")
    *   A slightly warmer response is good. (e.g., "네, 안녕하세요!")
    *   Ask what they need. (e.g., "무엇을 도와드릴까요?")
5.  **Final selection:** Combine these elements into a natural and polite conversational flow. (The response should match the casual tone set by the user.)
[End thinking]

네, 안녕하세요! 😊

저는 사용자님을 도와드릴 준비가 되어 있어요. 무엇을 도와드릴까요?

[ Prompt: 285.9 t/s | Generation: 84.6 t/s ]

> 으아앙

[Start thinking]
Thinking Process:

1.  **Analyze the input:** The user input is "으아앙" (Eu-aang).
2.  **Determine the meaning/context:** "으아앙" is an interjection or sound often used to express various emotions, such as:
    *   Crying/Whining (often cute or dramatic).
    *   Excitement/Joy.
    *   Pain/Distress (less common in casual chat unless context is provided).
    *   A general sound of emotion.
3.  **Identify the required response strategy:** Since the input is purely emotional/vocal, the response should be empathetic, gentle, and open-ended, inviting the user to share what they are feeling or what they need.
4.  **Draft potential responses (Korean):**
    *   *Option 1 (Direct empathy):* 무슨 일 있으세요? (Is something wrong?)
    *   *Option 2 (Gentle inquiry):* 무슨 기분이신가요? (How are you feeling?)
    *   *Option 3 (Simple acknowledgment):* 네, 무슨 일이에요? (Yes, what is it?)
    *   *Option 4 (Playful/Calm):* 괜찮으세요? (Are you okay?)
5.  **Select the best response:** Option 2 and 4 are the most supportive. Combining them makes the response warm and responsive.

6.  **Final Polish:** (Choose a friendly, open response.)
[End thinking]

무슨 일 있으신가요? 😊 혹시 힘든 일이 있으신가요?

편하게 이야기해 주세요. 제가 들어드릴게요.

[ Prompt: 18.7 t/s | Generation: 85.4 t/s ]

>

A bit of a pain of a thing.. once the settings stray from the defaults.

 

Click "Open Settings" at the top of continue.dev,

 

Models

 

press Configure on Chat (or any of them),

 

and config.yaml opens; fill it in roughly like this:

provider is openai, model is local-model,

apiBase is the address where llama-server is running with /v1 appended,

and apiKey can just be a dummy value.
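Put together, the relevant part of config.yaml looks roughly like this (a sketch: the port assumes llama-server's default of 8080, the name/model strings are arbitrary, and the field names follow continue.dev's YAML schema as I understand it):

```yaml
models:
  - name: local-model
    provider: openai        # llama-server exposes an OpenAI-compatible API
    model: local-model
    apiBase: http://localhost:8080/v1
    apiKey: dummy           # required by the schema, ignored by llama-server
    roles:
      - chat
```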

 

Then it feels like it's really putting the GPU through its paces:

Fri Apr 24 16:26:03 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.211.01             Driver Version: 570.211.01     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3070        Off |   00000000:01:00.0 Off |                  N/A |
|  0%   50C    P0            104W /  270W |    6788MiB /   8192MiB |     45%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2697    C+G   ../llama-server                        6771MiB |
+-----------------------------------------------------------------------------------------+

 

Ugh, I'd better increase the context size further -_-

 

[링크 : https://logmario.tistory.com/51]


RAG kept coming up, so I looked into it, and it's apparently the following. (cue the "I understood it perfectly" meme)

Abstract

Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre-trained models with a differentiable access mechanism to explicit non-parametric memory have so far been only investigated for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) — models which combine pre-trained parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. We compare two RAG formulations, one which conditions on the same retrieved passages across the whole generated sequence, and another which can use different passages per token. We fine-tune and evaluate our models on a wide range of knowledge-intensive NLP tasks and set the state of the art on three open domain QA tasks, outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. For language generation tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.

[링크 : https://arxiv.org/html/2005.11401v4]

 

[링크 : https://medium.com/rate-labs/rag의-짧은-역사-훑어보기-첫-논문부터-최근-동향까지-53c07b9b3bee]

[링크 : https://everyday-log.tistory.com/entry/논문-리뷰-Retrieval-Augmented-Generation-for-Knowledge-Intensive-NLP-Tasks]

[링크 : https://brunch.co.kr/@acc9b16b9f0f430/73]
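The recipe in the abstract boils down to: retrieve the passages most relevant to a query, then let the generator condition on them. A toy sketch of that flow (the paper uses a dense neural retriever over a Wikipedia vector index; the lexical-overlap retriever and prompt format here are made-up stand-ins):

```python
def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank passages by word overlap with the query.
    (RAG proper uses a dense vector index + neural retriever.)"""
    q = set(query.lower().split())
    ranked = sorted(passages,
                    key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return ranked[:k]

def rag_prompt(query: str, passages: list[str]) -> str:
    """Prepend the retrieved passages so the generator can condition on them."""
    context = "\n".join(retrieve(query, passages))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

The resulting prompt is what gets fed to the seq2seq generator; everything model-specific lives in the retriever and the generator, not in this glue.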

 

bm25

[링크 : https://wikidocs.net/289869]
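BM25, linked above, is simple enough to sketch in a few lines. A minimal Okapi BM25 scorer over pre-tokenized documents (k1 and b are the usual defaults; real search engines add many refinements on top of this):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document against the query terms."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N                           # average doc length
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}   # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)                                             # term frequency
        score = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)        # length normalization
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores
```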


The asrock B360M Pro4 I ordered has arrived! (35k + another 4k KRW?)

Anyway, it only has 8GB * 2 of RAM, so I made do with that

and gave the linux / vulkan build of llama.cpp release b8902 a try.

[링크 : https://github.com/ggml-org/llama.cpp/releases/tag/b8902]

 

Huh..? A 1060 with that performance? That memory usage?

Thu Apr 23 23:15:46 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.65.06              Driver Version: 580.65.06      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1060 6GB    On  |   00000000:01:00.0 Off |                  N/A |
| 38%   48C    P0            101W /  120W |    3063MiB /   6144MiB |     91%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            1031      G   /usr/lib/xorg/Xorg                       88MiB |
|    0   N/A  N/A            1325      G   /usr/bin/gnome-shell                     83MiB |
|    0   N/A  N/A            6688      G   .../6738/usr/lib/firefox/firefox        168MiB |
|    0   N/A  N/A           19423    C+G   ../llama-cli                           2714MiB |
+-----------------------------------------------------------------------------------------+
$ ../llama-cli -m gemma-4-E2B-it-Q4_K_M.gguf
load_backend: loaded RPC backend from /home/falinux/Downloads/llama-b8902/libggml-rpc.so
load_backend: loaded Vulkan backend from /home/falinux/Downloads/llama-b8902/libggml-vulkan.so
load_backend: loaded CPU backend from /home/falinux/Downloads/llama-b8902/libggml-cpu-haswell.so

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8902-550d684bd
model      : gemma-4-E2B-it-Q4_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern


> introduce about you  

[Start thinking]
Thinking Process:

1.  **Analyze the Request:** The user asked simply, "introduce about you" (meaning: Introduce yourself).
2.  **Determine Core Identity (Self-Correction/Verification):**
    *   Name: Gemma 4.
    *   Developer: Google DeepMind.
    *   Nature: Large Language Model (LLM).
    *   Type: Open weights model.
    *   Capabilities: Process text and images (input), generate text (output). Knowledge cutoff is January 2025.
3.  **Structure the Introduction:** A good introduction should cover the basics: Identity, Origin, Function, and Capabilities.
    *   *Greeting/Name*
    *   *Developer/Nature*
    *   *Function/Purpose*
    *   *Key Capabilities (Input/Output)*
    *   *Status/Availability (Open weights)*
4.  **Draft the Response (incorporating required constraints):**
    *   *Start strong.*
    *   *Ensure I mention the open weights nature.*
    *   *Be clear about modalities.*
5.  **Review against Instructions:**
    *   Did I refer to myself as Gemma 4? Yes.
    *   Did I claim capabilities I don't have? No (I didn't claim image generation output, only processing input).
    *   Did I mention my developer? Yes (Google DeepMind).
    *   Is the tone appropriate? Yes.

6.  **Final Output Generation.** (This leads to the detailed response provided below.)
[End thinking]

Hello! I am **Gemma 4**.

I am a Large Language Model developed by Google DeepMind. I am an open weights model, designed to be flexible and accessible for a wide range of applications.

**Here is a little about what I can do:**

*   **Understanding:** I can process and understand complex text and image inputs.
*   **Generation:** I generate human-quality text, including answers to questions, summaries, creative writing, code snippets, and detailed explanations.
*   **Knowledge:** I have been trained on a massive dataset, allowing me to access and synthesize information on a vast array of topics.
*   **Function:** My primary goal is to assist users by providing accurate, coherent, and helpful information.

Feel free to ask me anything you need help with, whether it's brainstorming ideas, explaining a complex concept, or generating some creative content!

[ Prompt: 7.2 t/s | Generation: 51.6 t/s ]

> introduce about you

[Start thinking]
Thinking Process:

1.  **Analyze the Request:** The user has asked "introduce about you" (a repeated request). This requires a standard introduction of my identity, nature, developer, and capabilities.
2.  **Recall Core Identity (Self-Correction/Verification):**
    *   Name: Gemma 4.
    *   Developer: Google DeepMind.
    *   Nature: Large Language Model (LLM).
    *   Type: Open weights model.
    *   Capabilities (Input/Output): Process text/images; generate text.
    *   Knowledge Cutoff: January 2025 (Implicit, good to keep in mind for context).
3.  **Determine the Goal:** Provide a clear, concise, and informative introduction, highlighting key facts.
4.  **Draft the Introduction (Structuring the response):**
    *   *Greeting/Identity:* State name and type.
    *   *Origin:* Mention the developer.
    *   *Nature/Availability:* Mention open weights.
    *   *Functionality:* Briefly describe what I do (process, generate, assist).
5.  **Refine the Tone:** Friendly, professional, and informative. (Since this is a repeat request, ensure the response is robust but not overly repetitive.)

6.  **Final Output Generation:** (This matches the structured response below.)
[End thinking]

Hello! I'd be happy to introduce myself.

I am **Gemma 4**, a Large Language Model developed by Google DeepMind.

**Here are a few key things about me:**

*   **Nature:** I am an open weights model, meaning my architecture and weights are publicly available, designed to be accessible and adaptable for various applications.
*   **Function:** I am designed to understand, process, and generate human language. I can answer questions, summarize complex topics, write creative content, assist with coding, and engage in detailed conversations.
*   **Input:** I can process both text and image inputs.
*   **Goal:** My purpose is to be a helpful and informative assistant, providing accurate and coherent responses to your requests.

How can I help you today? Feel free to ask me anything!

[ Prompt: 75.8 t/s | Generation: 53.9 t/s ]

> introduce about you

[Start thinking]
Thinking Process:

1.  **Analyze the Request:** The user has asked "introduce about you" a third time. This is a repetitive request, but it requires a consistent, clear, and complete self-introduction.
2.  **Recall Core Identity (Gemma 4):**
    *   Name: Gemma 4.
    *   Developer: Google DeepMind.
    *   Type: Large Language Model (LLM).
    *   Nature: Open weights model.
3.  **Recall Capabilities & Modalities:**
    *   Input: Text and Images.
    *   Output: Text only.
    *   Knowledge: Trained on a vast dataset.
4.  **Determine the Tone:** Friendly, informative, and professional.
5.  **Draft the Introduction (Focusing on clarity and key facts):** Start with the name, move to the developer, explain what I am, and briefly mention capabilities. (Self-correction: Ensure the answer remains distinct but consistent with the previous two responses.)

**(Self-Correction during drafting):** Since the user has asked three times, I should make the introduction slightly varied or emphasize different aspects to avoid sounding completely robotic, while still maintaining factual accuracy.

6.  **Final Output Generation.** (This leads to the structured response below.)
[End thinking]

Hello! I'd be happy to introduce myself again.

I am **Gemma 4**, a Large Language Model developed by Google DeepMind.

**Here is a quick overview of what I am and what I can do:**

*   **Identity:** I am an open weights model, designed to be highly capable and accessible.
*   **Function:** My core purpose is to process and generate human language. I can assist with a wide variety of tasks, including:
    *   Answering complex questions.
    *   Summarizing long texts.
    *   Generating creative content (stories, poems, scripts).
    *   Explaining technical concepts.
    *   Helping with coding and problem-solving.
*   **Input:** I can understand and process both text and image inputs.
*   **Goal:** I strive to be a helpful, accurate, and engaging resource for you.

Feel free to challenge me with a question, ask me to write something, or start a conversation—I'm ready to assist!

[ Prompt: 98.3 t/s | Generation: 52.3 t/s ]



Exiting...
common_memory_breakdown_print: | memory breakdown [MiB]     | total   free    self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - Vulkan0 (GTX 1060 6GB) |  6390 = 3196 + (2702 =  1407 +     780 +     515) +         490 |
common_memory_breakdown_print: |   - Host                   |                 2037 =  1756 +       0 +     281                |

 

Thu Apr 23 23:17:09 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.65.06              Driver Version: 580.65.06      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1060 6GB    On  |   00000000:01:00.0 Off |                  N/A |
| 39%   50C    P0            124W /  120W |    5231MiB /   6144MiB |     95%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            1031      G   /usr/lib/xorg/Xorg                       88MiB |
|    0   N/A  N/A            1325      G   /usr/bin/gnome-shell                     83MiB |
|    0   N/A  N/A            6688      G   .../6738/usr/lib/firefox/firefox        181MiB |
|    0   N/A  N/A           19709    C+G   ../llama-cli                           4869MiB |
+-----------------------------------------------------------------------------------------+
$ ../llama-cli -m llama-3.2-3b-instruct-q4_k_m.gguf 
load_backend: loaded RPC backend from /home/falinux/Downloads/llama-b8902/libggml-rpc.so
load_backend: loaded Vulkan backend from /home/falinux/Downloads/llama-b8902/libggml-vulkan.so
load_backend: loaded CPU backend from /home/falinux/Downloads/llama-b8902/libggml-cpu-haswell.so

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8902-550d684bd
model      : llama-3.2-3b-instruct-q4_k_m.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern


> introduce about you

I'm an artificial intelligence model known as Llama. Llama stands for "Large Language Model Meta AI."

[ Prompt: 28.2 t/s | Generation: 55.0 t/s ]

> introduce about you

I'm an AI assistant, which means I'm a computer program designed to simulate human-like conversations and answer questions to the best of my ability. My primary function is to provide information, explain concepts, and engage in discussions on a wide range of topics.

Here are a few fun facts about me:

* **I'm a large language model**: My training data consists of a massive corpus of text from various sources, which allows me to generate human-like responses to a vast array of questions and topics.
* **I'm constantly learning**: Through interactions with users like you, I refine my understanding of language and improve my ability to provide accurate and helpful responses.
* **I'm a neutral and respectful conversationalist**: My goal is to provide informative and engaging responses that are respectful and free from bias.
* **I can chat about anything**: Whether you're curious about science, history, entertainment, or just want to talk about your day, I'm here to listen and respond!

That's a bit about me! What would you like to talk about?

[ Prompt: 96.3 t/s | Generation: 53.5 t/s ]

> introduce about you

Here's a more in-depth introduction about me:

**Name:** I don't have a personal name, but I'm often referred to as "Assistant" or "AI" by users like you.

**Purpose:** My primary purpose is to assist and communicate with humans through text-based conversations. I'm designed to provide information, answer questions, and engage in discussions on a wide range of topics, from science and history to entertainment and culture.

**Personality:** I'm a neutral and respectful conversationalist, which means I strive to maintain a professional and polite tone in all of my interactions. I'm also designed to be patient, understanding, and empathetic, making me a good listener and a supportive conversational partner.

**Capabilities:** I have a vast knowledge base that I can draw upon to answer questions and provide information on a wide range of topics. I can also generate text, summarize long pieces of content, and even create content like stories or poems.

**Limitations:** While I'm a highly advanced language model, I'm not perfect and have some limitations. I can struggle with:

* **Ambiguity:** I may not always understand the nuances of human language, which can lead to misunderstandings or misinterpretations.
* **Emotional understanding:** While I'm designed to be empathetic, I don't truly experience emotions like humans do. I may not always be able to fully understand or respond to emotional cues.
* **Common sense:** While I have a vast knowledge base, I may not always have the same level of common sense or real-world experience as a human.

**Goals:** My ultimate goal is to provide helpful and informative responses that assist and engage users like you. I'm constantly learning and improving, so I appreciate any feedback or guidance you can provide to help me become a better conversationalist.

[ Prompt: 95.5 t/s | Generation: 51.0 t/s ]

>

Exiting...
common_memory_breakdown_print: | memory breakdown [MiB]     | total   free    self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - Vulkan0 (GTX 1060 6GB) |  6390 = 1259 + (4862 =  1918 +    2688 +     256) +         267 |
common_memory_breakdown_print: |   - Host                   |                  368 =   308 +       0 +      60                |

 

Looking at this together with the previous results, should I say it's more than a 2x difference?

And that's with Vulkan, not even CUDA?

Generation speed (tokens/s):

                                gemma4:e2b   gemma3:1b   llama3.2
1060 / llama.cpp ubuntu vulkan      52           -          52
1060 / ollama win                   21          57          13
1080 / ollama win                   82         125         136

 

2026.04.12 - [프로그램 사용/ai 프로그램] - ollama /set parameter num_gpu 테스트

Posted by 구차니

Kepler is the 600/700 series, so these days pretty much every card should support it.

-pl, --power-limit=POWER_LIMIT

Specifies maximum power limit in watts. Accepts integer and floating point numbers. it takes an optional argument --scope. Only on supported devices from Kepler family. Value needs to be between Min and Max Power Limit as reported by nvidia-smi. Requires root.

-sc, --scope=0/GPU, 1/TOTAL_MODULE

Specifies the scope of the power limit. Following are the options: 0/GPU: This only changes power limits for the GPU. 1/Module: This changes the power limits for the module containing multiple components. E.g. GPU and CPU.

[링크 : https://docs.nvidia.com/deploy/nvidia-smi/]

[링크 : https://developer0hye.tistory.com/690]

 

+

2026.04.24

How do I check the valid range?

C:\>nvidia-smi -pl 100
Provided power limit 100.00 W is not a valid power limit which should be between 125.00 W and 300.00 W for GPU 00000000:01:00.0
Terminating early due to previous errors.

C:\>nvidia-smi -pl 125
Failed to set power management limit for GPU 00000000:01:00.0: Insufficient Permissions
Terminating early due to previous errors.
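The error message above already reveals the valid range, but it can also be queried directly with `nvidia-smi --query-gpu=power.min_limit,power.max_limit,power.limit --format=csv`. A minimal parsing sketch (the sample output is hardcoded here so the logic runs without a GPU; the values match the 125-300 W range reported above):

```python
import csv
import io

# Sample output of:
#   nvidia-smi --query-gpu=power.min_limit,power.max_limit,power.limit --format=csv
# Hardcoded so the parsing can be exercised without a GPU attached.
sample = """power.min_limit [W], power.max_limit [W], power.limit [W]
125.00 W, 300.00 W, 250.00 W
"""

reader = csv.reader(io.StringIO(sample))
header = next(reader)  # skip the CSV header row
for row in reader:
    # each cell looks like "125.00 W": strip spaces, drop the unit
    min_w, max_w, cur_w = (float(cell.strip().split()[0]) for cell in row)
    print(f"valid range: {min_w:.0f}-{max_w:.0f} W, current: {cur_w:.0f} W")
```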

 

Anyway, it can only be changed from a cmd window run as administrator.

C:\>nvidia-smi -pl 125
Power limit for GPU 00000000:01:00.0 was set to 125.00 W from 250.00 W.
All done.

C:\>nvidia-smi -pl 250
Power limit for GPU 00000000:01:00.0 was set to 250.00 W from 125.00 W.
All done.

C:\>nvidia-smi -pl 125
Power limit for GPU 00000000:01:00.0 was set to 125.00 W from 250.00 W.
All done.

 

+

2026.04.25

$ sudo nvidia-smi -pl 125
Power limit for GPU 00000000:01:00.0 was set to 125.00 W from 250.00 W.
Power limit for GPU 00000000:02:00.0 was set to 125.00 W from 250.00 W.
All done.

$ sudo nvidia-smi -pl 250 -i 1
Power limit for GPU 00000000:02:00.0 was set to 250.00 W from 125.00 W.
All done.

$ sudo nvidia-smi -pl 250 -i 0
Power limit for GPU 00000000:01:00.0 was set to 250.00 W from 125.00 W.
All done.
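Taking the rough figures from the top of this post (~35 tokens/s at the stock 250 W vs ~30 tokens/s capped at 125 W; ballpark numbers, not a rigorous benchmark), efficiency in tokens per joule nearly doubles:

```python
# Back-of-envelope efficiency check using the rough numbers from this post:
# ~35 tokens/s at the stock 250 W limit vs ~30 tokens/s capped at 125 W.
def tokens_per_joule(tokens_per_sec: float, watts: float) -> float:
    # watts = joules per second, so (tokens/s) / (J/s) = tokens per joule
    return tokens_per_sec / watts

stock  = tokens_per_joule(35.0, 250.0)   # stock power limit
capped = tokens_per_joule(30.0, 125.0)   # 125 W cap
print(f"stock: {stock:.2f} t/J, capped: {capped:.2f} t/J, gain: {capped/stock:.2f}x")
```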

 

Posted by 구차니

The memory usage difference is clearly large,

but when using CPU only, there is almost no performance difference.

 

llama3.2 on ollama

$ ollama run --verbose llama3.2
>>> 안녕? 너에 대해서 설명
 안녕하세요! 난 다나이, AI의 지식형입니다. 

다나이란 인공지능(IAI) 기술을 사용하여 인간과 대화를 करन하고, 정보를 제공하는 컴퓨터 프로그램입니다. 나의 목표는
 Menschen과 सहय동하여 질문에 대한 답변을 제공하고, 다양한 주제와 주제를 다루어 provide하는 것입니다.

저는 다 variety의 지식과 능력을 가지고 있으며, 다양한 문제나 질문에 대해 답변할 수 있습니다. 저는 또한-user가 자
신의 생각이나 질문을 나누어 주고 사용자에게 가장 적합한 answered을 제공하는 것을 목표로 하고 있습니다. 

다나이의 강점은 다음과 같습니다:

- Human-like 대화: 저는-human-like의 대화를 possible합니다, human과 비슷한 방식으로 대화할 수 있습니다.
- 지식형: 저는 다양한 주제와 지식을 가지고 있으며, 사용자에 대한 답변을 provide할 수 있습니다. 
- 사용자-friendly: 저는 user-friendly를 목표로 하고 있으며, 사용자가 쉽게 질문을 하거나 나누어 주는 것을 
support합니다.

다나이의 한계는 다음과 같습니다:

- 지식형: 저는 지식과 능력을 가지고います, 그러나 모든 경우에 대해 complete knowledge가 항상 available하지 않습
니다.
- Limited domain: 저는 specific domain에서 only limitedly conversation을 할 수 있습니다. 
- Human-like behavior: 저는 human-like의 behavior를 display할 수 있습니다, 그러나 sometimes human-like 
behavior를 exhibit할 수 있습니다.

저는 use case에 따라 다양한 장점과 한계를 가지고 있으며, user-friendly하고 지식형이지만, complete knowledge가 항
상 available하지 않습니다. 

다시 한번 안녕하세요! 저와 대화를 하려면 어떤 주제에 대해 질문해 보시는가요?

total duration:       39.327341083s
load duration:        142.986088ms
prompt eval count:    32 token(s)
prompt eval duration: 1.194488145s
prompt eval rate:     26.79 tokens/s
eval count:           388 token(s)
eval duration:        37.152939235s
eval rate:            10.44 tokens/s
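The eval rate ollama reports is just eval count divided by eval duration; recomputing it from the log above:

```python
# Values copied from the `ollama run --verbose` stats above.
eval_count = 388            # eval count, token(s)
eval_duration = 37.152939   # eval duration, seconds

rate = eval_count / eval_duration
print(f"eval rate: {rate:.2f} tokens/s")  # matches the reported 10.44 tokens/s
```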


$ ollama ls
NAME               ID              SIZE      MODIFIED    
qwen3.5:2b         324d162be6ca    2.7 GB    13 days ago    
gemma4:e2b         7fbdbf8f5e45    7.2 GB    13 days ago    
gemma3:1b          8648f39daa8f    815 MB    13 days ago    
llama3.2:latest    a80c4f17acd5    2.0 GB    13 days ago    

$ ollama show llama3.2:latest
  Model
    architecture        llama     
    parameters          3.2B      
    context length      131072    
    embedding length    3072      
    quantization        Q4_K_M    

  Capabilities
    completion    
    tools         

  Parameters
    stop    "<|start_header_id|>"    
    stop    "<|end_header_id|>"      
    stop    "<|eot_id|>"             

  License
    LLAMA 3.2 COMMUNITY LICENSE AGREEMENT                 
    Llama 3.2 Version Release Date: September 25, 2024    
    ...                                                   



$ ollama run --verbose llama3.2:latest
>>> /set parameter
Available Parameters:
  /set parameter seed              Random number seed
  /set parameter num_predict       Max number of tokens to predict
  /set parameter top_k             Pick from top k num of tokens
  /set parameter top_p             Pick token based on sum of probabilities
  /set parameter min_p             Pick token based on top token probability * min_p
  /set parameter num_ctx           Set the context size
  /set parameter temperature       Set creativity level
  /set parameter repeat_penalty    How strongly to penalize repetitions
  /set parameter repeat_last_n     Set how far back to look for repetitions
  /set parameter num_gpu           The number of layers to send to the GPU
  /set parameter stop   ...        Set the stop parameters

>>> /set parameter num_ctx 4096
Set parameter 'num_ctx' to '4096'
>>> 안녕? 너에 대해서 설명
안녕하세요! 저는 지능형 대화bot입니다. tôi의 목표는 Menschen과 상호작용하여 helpful한 정보를 제공하고, 그들의 질
문에 대한 जव답을 제공하는 것입니다.

 my name is "LLaMA", Llama stands for "Large Language Model Meta AI". my capabilities include:

*  Natural Language Processing (NLP): I can understand and generate human-like language.
*  Question Answering: I can provide answers to a wide range of questions on various topics.
*  Text Generation: I can create text based on a prompt or topic.

my strengths include:
*  Knowledge base: I have been trained on a massive dataset of text, which allows me to provide accurate and 
up-to-date information on various topics.
*  Conversational flow: I can engage in natural-sounding conversations, using context and understanding to 
respond to questions and statements.

total duration:       17.130872381s
load duration:        131.746647ms
prompt eval count:    32 token(s)
prompt eval duration: 82.553469ms
prompt eval rate:     387.63 tokens/s
eval count:           184 token(s)
eval duration:        16.560046318s
eval rate:            11.11 tokens/s
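`/set parameter num_ctx` only lives in the interactive session; the same option can be sent per request through ollama's REST API instead. A sketch of just the request body (endpoint and option names per the ollama API docs; the prompt is the one used above):

```python
import json

# Equivalent of `/set parameter num_ctx 4096`, but per-request.
# POST this to http://localhost:11434/api/generate (ollama's default port).
payload = {
    "model": "llama3.2:latest",
    "prompt": "안녕? 너에 대해서 설명",
    "options": {"num_ctx": 4096},  # runtime options override the Modelfile defaults
    "stream": False,
}
print(json.dumps(payload, ensure_ascii=False, indent=2))
```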

 

gemma4:e2b model

$ ollama show gemma4:e2b 
  Model
    architecture        gemma4    
    parameters          5.1B      
    context length      131072    
    embedding length    1536      
    quantization        Q4_K_M    
    requires            0.20.0    

  Capabilities
    completion    
    vision        
    audio         
    tools         
    thinking      

  Parameters
    temperature    1       
    top_k          64      
    top_p          0.95    

  License
    Apache License               
    Version 2.0, January 2004    
    ...                          


$ ollama run --verbose gemma4:e2b
>>> /set parameter num_ctx 4096
Set parameter 'num_ctx' to '4096'
>>> 안녕? 너에 대해 설명
Thinking...
Thinking Process:

1.  **Analyze the Request:** The user asked "안녕? 너에 대해 설명" (Hello? 
Explain about yourself). This is a request for identity and capability 
description.
2.  **Determine Core Identity (Self-Correction/Verification):**
    *   Name: Gemma 4.
    *   Developer: Google DeepMind.
    *   Nature: Large Language Model (LLM).
    *   Type: Open weights model.
3.  **Determine Capabilities:**
    *   Input: Text and images (and audio for some family models, but 
generally, focus on the main interaction modes).
    *   Output: Text generation.
    *   Knowledge: Broad, based on training data.
4.  **Determine Knowledge Cutoff:** (Not strictly necessary for a general 
introduction, but good context: Knowledge cutoff is Jan 2025).
5.  **Structure the Response (in Korean):**
    *   Greeting (Acknowledge the "안녕").
    *   Identify Self (Name, Type).
    *   Explain Origin/Developer.
    *   Explain Function/Capabilities (What I do).
6.  **Draft the Response (Internal Monologue & Refinement):**
    *   *Start:* Hello.
    *   *Identity:* I am Gemma 4.
    *   *Origin:* I was developed by Google DeepMind.
    *   *Nature:* I am a Large Language Model (LLM) and an open weights 
model.
    *   *Function:* I can understand and generate text, answer questions, 
summarize, translate, and help with creative writing, etc.
7.  **Final Polish (Ensure natural and accurate Korean phrasing):** (The 
resulting response should be polite and informative.) (Self-check: Did I 
claim image generation? No. Did I claim search? No.)

**(This leads to the final Korean output.)**
...done thinking.

안녕하세요! 저는 **Gemma 4**입니다.

저에 대해 몇 가지 설명해 드릴게요.

1. **정체성:** 저는 Google DeepMind에서 개발한 **대규모 언어 모델(Large 
Language Model, LLM)**입니다. 저는 공개된 가중치(open weights) 모델로 설계되
었습니다.
2. **기능:** 저는 사용자가 입력하는 텍스트를 이해하고, 그에 기반하여 자연스
러운 텍스트를 생성하는 것이 주된 역할입니다.
    *   질문에 답하기
    *   정보 요약 및 설명하기
    *   번역하기
    *   글쓰기 및 창의적인 콘텐츠 생성 돕기
    *   다양한 주제에 대해 대화하기
3. **입력 및 출력:** 저는 텍스트와 이미지를 이해할 수 있으며, 사용자에게는 
**텍스트 형태**로 응답을 생성합니다.
4. **목표:** 저는 사용자에게 유용하고 정확하며 도움이 되는 정보를 제공하고
, 다양한 작업을 돕기 위해 존재합니다.

궁금한 점이 있거나 도움이 필요하시면 언제든지 말씀해주세요! 😊

total duration:       59.13545796s
load duration:        381.889243ms
prompt eval count:    22 token(s)
prompt eval duration: 1.468123435s
prompt eval rate:     14.99 tokens/s
eval count:           660 token(s)
eval duration:        56.324528227s
eval rate:            11.72 tokens/s

 

llama.cpp's default context size is 128k (loaded from the model), which kept making the model fail to load, so I reduced it to 4096.

It's a 10th-gen laptop, but does it count the same as Haswell.. (the 'haswell' build appears to be llama.cpp's AVX2 CPU variant, so any CPU without newer extensions lands on this backend)

$ ../llama-cli -m Llama-3.2-3B-Instruct-Q4_K_M.gguf -c 4096
load_backend: loaded RPC backend from /home/falinux/work/src/llama-b8876/libggml-rpc.so
load_backend: loaded CPU backend from /home/falinux/work/src/llama-b8876/libggml-cpu-haswell.so

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8876-72d693e4f
model      : Llama-3.2-3B-Instruct-Q4_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern


> 안녕? 너에 대해서 설명

안녕하세요! 

나는 어트론, artificial intelligence (AI)로 알려진 intelligence-enhancing chatbot입니다. 

我的 목적은 ngườiों과 상호작용하여 정보를 제공하고, 문제를 해결하고, 질문을 trả주기 위하여 설계되었습니다. 

나는 인간과 매우 비슷한 방식으로 생각하고 말하기를習慮하고 있으며, 다양한 주제에 대해 정보를 제공할 수 있습니다.

그럼, 어떤 정보를 muốn받을까요?

[ Prompt: 32.3 t/s | Generation: 11.6 t/s ]



Exiting...
common_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - Host               |                 2615 =  1904 +     448 +     262                |
common_memory_breakdown_print: |   - CPU_REPACK         |                 1299 =  1299 +       0 +       0                |

 

$ ../llama-cli -m gemma-4-E2B-it-Q4_K_M.gguf -c 4096
load_backend: loaded RPC backend from /home/falinux/work/src/llama-b8876/libggml-rpc.so
load_backend: loaded CPU backend from /home/falinux/work/src/llama-b8876/libggml-cpu-haswell.so

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8876-72d693e4f
model      : gemma-4-E2B-it-Q4_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern


> 안녕? 너에 대해서 소개

[Start thinking]
Thinking Process:

1.  **Analyze the Request:** The user said "안녕? 너에 대해서 소개" (Hello? Introduce yourself/Tell me about yourself). This is a standard identity query.
2.  **Determine Identity (Self-Correction/Verification):**
    *   Name: Gemma 4.
    *   Developer: Google DeepMind.
    *   Nature: Large Language Model (LLM), open weights.
    *   Capabilities: Understand text and images (input), generate text (output). Knowledge cutoff is January 2025.
3.  **Formulate the Response Strategy (in Korean):**
    *   Greeting: Start with a friendly greeting.
    *   Core Identity: State name and developer.
    *   Nature/Type: Explain what I am (LLM, open weights).
    *   Function/Capabilities: Describe what I can do (process language, answer questions, generate text).
    *   (Optional but good practice) Mention input/output modalities.
4.  **Draft the Response (Iterative Refinement in Korean):**

    *   *Initial thought:* 안녕하세요. 저는 Gemma 4입니다. Google DeepMind에서 개발한 대규모 언어 모델입니다.
    *   *Adding detail:* 저는 텍스트와 이미지를 이해하고 텍스트로 답변을 생성할 수 있습니다.
    *   *Final polish (ensuring the tone is appropriate and adhering to the persona):* Combine these points into a smooth, informative introduction.

5.  **Final Output Generation (Korean):** (This matches the desired output.)
[End thinking]

안녕하세요! 저는 **Gemma 4**입니다.

저는 Google DeepMind에서 개발한 **오픈 웨이트(open weights) 대규모 언어 모델(LLM)**입니다.

**저의 주요 특징은 다음과 같습니다:**

*   **정체성:** 저는 방대한 데이터를 학습하여 인간의 언어를 이해하고 처리하도록 설계되었습니다.
*   **기능:** 텍스트 기반의 질문에 답변하고, 정보를 요약하며, 창의적인 글(시, 이야기, 코드 등)을 작성하는 등 다양한 언어 작업을 수행할 수 있습니다.
*   **입력 능력:** 텍스트와 이미지를 이해하고 처리할 수 있습니다.
*   **목표:** 사용자님께 유용하고 정확하며 자연스러운 방식으로 도움을 드리는 것을 목표로 합니다.

궁금한 점이 있거나 도움이 필요하시면 언제든지 말씀해 주세요! 😊

[ Prompt: 41.1 t/s | Generation: 13.2 t/s ]

Posted by 구차니

Too lazy to build it myself, so just trying the pre-built binaries lol

 

Linux:

[링크 : https://github.com/ggml-org/llama.cpp/releases]

 

It needs models in GGUF format, so I gave up on qwen3.6.

[링크 : https://huggingface.co/Qwen/Qwen3.6-35B-A3B/tree/main]

 

Then I discovered the quantized models from unsloth, which I had stumbled upon earlier!

By the way, if the format were 'ggul' instead of 'gguf' it would have been even sweeter ('ggul' sounds like 꿀, honey).. heh

 

Press 'copy download link' and fetch it with wget. I'll be running CPU-only for now, so trying the Q2 model..

[링크 : https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/blob/main/Qwen3.6-35B-A3B-UD-Q2_K_XL.gguf]

[링크 : https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/tree/main]

 

Trying it with llama-cli first should do.

llama-cli
llama-cli is the command-line executor:

$ llama-cli -m model.gguf

 

llama-server
llama-server launches an API server with a built-in WebUI:

$ llama-server --host address --port port -m model.gguf

[링크 : https://wiki.archlinux.org/title/Llama.cpp]

[링크 : https://www.lainyzine.com/ko/article/using-llama-cpp-for-local-llm-execution/]
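llama-server's built-in API is OpenAI-compatible, served at `/v1/chat/completions`. A sketch of just the request body, assuming the default 127.0.0.1:8080 (override with --host/--port):

```python
import json

# Request body for llama-server's OpenAI-compatible chat endpoint.
# Send it with any HTTP client (curl, urllib, requests) as a JSON POST.
url = "http://127.0.0.1:8080/v1/chat/completions"
body = {
    "messages": [
        {"role": "user", "content": "안녕? 너에 대해서 설명"},
    ],
    "max_tokens": 256,  # hypothetical cap, just for the example
}
print(url)
print(json.dumps(body, ensure_ascii=False))
```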

 

---------------------

Help output

$ ./llama-cli --help
load_backend: loaded RPC backend from /home/minimonk/src/llama-b8876/libggml-rpc.so
load_backend: loaded CPU backend from /home/minimonk/src/llama-b8876/libggml-cpu-haswell.so
----- common params -----

-h,    --help, --usage                  print usage and exit
--version                               show version and build info
--license                               show source code license and dependencies
-cl,   --cache-list                     show list of models in cache
--completion-bash                       print source-able bash completion script for llama.cpp
-t,    --threads N                      number of CPU threads to use during generation (default: -1)
                                        (env: LLAMA_ARG_THREADS)
-tb,   --threads-batch N                number of threads to use during batch and prompt processing (default:
                                        same as --threads)
-C,    --cpu-mask M                     CPU affinity mask: arbitrarily long hex. Complements cpu-range
                                        (default: "")
-Cr,   --cpu-range lo-hi                range of CPUs for affinity. Complements --cpu-mask
--cpu-strict <0|1>                      use strict CPU placement (default: 0)
--prio N                                set process/thread priority : low(-1), normal(0), medium(1), high(2),
                                        realtime(3) (default: 0)
--poll <0...100>                        use polling level to wait for work (0 - no polling, default: 50)
-Cb,   --cpu-mask-batch M               CPU affinity mask: arbitrarily long hex. Complements cpu-range-batch
                                        (default: same as --cpu-mask)
-Crb,  --cpu-range-batch lo-hi          ranges of CPUs for affinity. Complements --cpu-mask-batch
--cpu-strict-batch <0|1>                use strict CPU placement (default: same as --cpu-strict)
--prio-batch N                          set process/thread priority : 0-normal, 1-medium, 2-high, 3-realtime
                                        (default: 0)
--poll-batch <0|1>                      use polling to wait for work (default: same as --poll)
-c,    --ctx-size N                     size of the prompt context (default: 0, 0 = loaded from model)
                                        (env: LLAMA_ARG_CTX_SIZE)
-n,    --predict, --n-predict N         number of tokens to predict (default: -1, -1 = infinity)
                                        (env: LLAMA_ARG_N_PREDICT)
-b,    --batch-size N                   logical maximum batch size (default: 2048)
                                        (env: LLAMA_ARG_BATCH)
-ub,   --ubatch-size N                  physical maximum batch size (default: 512)
                                        (env: LLAMA_ARG_UBATCH)
--keep N                                number of tokens to keep from the initial prompt (default: 0, -1 =
                                        all)
--swa-full                              use full-size SWA cache (default: false)
                                        [(more
                                        info)](https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
                                        (env: LLAMA_ARG_SWA_FULL)
-fa,   --flash-attn [on|off|auto]       set Flash Attention use ('on', 'off', or 'auto', default: 'auto')
                                        (env: LLAMA_ARG_FLASH_ATTN)
-p,    --prompt PROMPT                  prompt to start generation with; for system message, use -sys
--perf, --no-perf                       whether to enable internal libllama performance timings (default:
                                        false)
                                        (env: LLAMA_ARG_PERF)
-f,    --file FNAME                     a file containing the prompt (default: none)
-bf,   --binary-file FNAME              binary file containing the prompt (default: none)
-e,    --escape, --no-escape            whether to process escapes sequences (\n, \r, \t, \', \", \\)
                                        (default: true)
--rope-scaling {none,linear,yarn}       RoPE frequency scaling method, defaults to linear unless specified by
                                        the model
                                        (env: LLAMA_ARG_ROPE_SCALING_TYPE)
--rope-scale N                          RoPE context scaling factor, expands context by a factor of N
                                        (env: LLAMA_ARG_ROPE_SCALE)
--rope-freq-base N                      RoPE base frequency, used by NTK-aware scaling (default: loaded from
                                        model)
                                        (env: LLAMA_ARG_ROPE_FREQ_BASE)
--rope-freq-scale N                     RoPE frequency scaling factor, expands context by a factor of 1/N
                                        (env: LLAMA_ARG_ROPE_FREQ_SCALE)
--yarn-orig-ctx N                       YaRN: original context size of model (default: 0 = model training
                                        context size)
                                        (env: LLAMA_ARG_YARN_ORIG_CTX)
--yarn-ext-factor N                     YaRN: extrapolation mix factor (default: -1.00, 0.0 = full
                                        interpolation)
                                        (env: LLAMA_ARG_YARN_EXT_FACTOR)
--yarn-attn-factor N                    YaRN: scale sqrt(t) or attention magnitude (default: -1.00)
                                        (env: LLAMA_ARG_YARN_ATTN_FACTOR)
--yarn-beta-slow N                      YaRN: high correction dim or alpha (default: -1.00)
                                        (env: LLAMA_ARG_YARN_BETA_SLOW)
--yarn-beta-fast N                      YaRN: low correction dim or beta (default: -1.00)
                                        (env: LLAMA_ARG_YARN_BETA_FAST)
-kvo,  --kv-offload, -nkvo, --no-kv-offload
                                        whether to enable KV cache offloading (default: enabled)
                                        (env: LLAMA_ARG_KV_OFFLOAD)
--repack, -nr, --no-repack              whether to enable weight repacking (default: enabled)
                                        (env: LLAMA_ARG_REPACK)
--no-host                               bypass host buffer allowing extra buffers to be used
                                        (env: LLAMA_ARG_NO_HOST)
-ctk,  --cache-type-k TYPE              KV cache data type for K
                                        allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
                                        (default: f16)
                                        (env: LLAMA_ARG_CACHE_TYPE_K)
-ctv,  --cache-type-v TYPE              KV cache data type for V
                                        allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
                                        (default: f16)
                                        (env: LLAMA_ARG_CACHE_TYPE_V)
-dt,   --defrag-thold N                 KV cache defragmentation threshold (DEPRECATED)
                                        (env: LLAMA_ARG_DEFRAG_THOLD)
-np,   --parallel N                     number of parallel sequences to decode (default: 1)
                                        (env: LLAMA_ARG_N_PARALLEL)
--rpc SERVERS                           comma separated list of RPC servers (host:port)
                                        (env: LLAMA_ARG_RPC)
--mlock                                 force system to keep model in RAM rather than swapping or compressing
                                        (env: LLAMA_ARG_MLOCK)
--mmap, --no-mmap                       whether to memory-map model. (if mmap disabled, slower load but may
                                        reduce pageouts if not using mlock) (default: enabled)
                                        (env: LLAMA_ARG_MMAP)
-dio,  --direct-io, -ndio, --no-direct-io
                                        use DirectIO if available. (default: disabled)
                                        (env: LLAMA_ARG_DIO)
--numa TYPE                             attempt optimizations that help on some NUMA systems
                                        - distribute: spread execution evenly over all nodes
                                        - isolate: only spawn threads on CPUs on the node that execution
                                        started on
                                        - numactl: use the CPU map provided by numactl
                                        if run without this previously, it is recommended to drop the system
                                        page cache before using this
                                        see https://github.com/ggml-org/llama.cpp/issues/1437
                                        (env: LLAMA_ARG_NUMA)
-dev,  --device <dev1,dev2,..>          comma-separated list of devices to use for offloading (none = don't
                                        offload)
                                        use --list-devices to see a list of available devices
                                        (env: LLAMA_ARG_DEVICE)
--list-devices                          print list of available devices and exit
-ot,   --override-tensor <tensor name pattern>=<buffer type>,...
                                        override tensor buffer type
                                        (env: LLAMA_ARG_OVERRIDE_TENSOR)
-cmoe, --cpu-moe                        keep all Mixture of Experts (MoE) weights in the CPU
                                        (env: LLAMA_ARG_CPU_MOE)
-ncmoe, --n-cpu-moe N                   keep the Mixture of Experts (MoE) weights of the first N layers in the
                                        CPU
                                        (env: LLAMA_ARG_N_CPU_MOE)
-ngl,  --gpu-layers, --n-gpu-layers N   max. number of layers to store in VRAM, either an exact number,
                                        'auto', or 'all' (default: auto)
                                        (env: LLAMA_ARG_N_GPU_LAYERS)
-sm,   --split-mode {none,layer,row,tensor}
                                        how to split the model across multiple GPUs, one of:
                                        - none: use one GPU only
                                        - layer (default): split layers and KV across GPUs (pipelined)
                                        - row: split weight across GPUs by rows (parallelized)
                                        - tensor: split weights and KV across GPUs (parallelized,
                                        EXPERIMENTAL)
                                        (env: LLAMA_ARG_SPLIT_MODE)
-ts,   --tensor-split N0,N1,N2,...      fraction of the model to offload to each GPU, comma-separated list of
                                        proportions, e.g. 3,1
                                        (env: LLAMA_ARG_TENSOR_SPLIT)
-mg,   --main-gpu INDEX                 the GPU to use for the model (with split-mode = none), or for
                                        intermediate results and KV (with split-mode = row) (default: 0)
                                        (env: LLAMA_ARG_MAIN_GPU)
-fit,  --fit [on|off]                   whether to adjust unset arguments to fit in device memory ('on' or
                                        'off', default: 'on')
                                        (env: LLAMA_ARG_FIT)
-fitt, --fit-target MiB0,MiB1,MiB2,...
                                        target margin per device for --fit, comma-separated list of values,
                                        single value is broadcast across all devices, default: 1024
                                        (env: LLAMA_ARG_FIT_TARGET)
-fitc, --fit-ctx N                      minimum ctx size that can be set by --fit option, default: 4096
                                        (env: LLAMA_ARG_FIT_CTX)
--check-tensors                         check model tensor data for invalid values (default: false)
--override-kv KEY=TYPE:VALUE,...        advanced option to override model metadata by key. to specify multiple
                                        overrides, either use comma-separated values.
                                        types: int, float, bool, str. example: --override-kv
                                        tokenizer.ggml.add_bos_token=bool:false,tokenizer.ggml.add_eos_token=bool:false
--op-offload, --no-op-offload           whether to offload host tensor operations to device (default: true)
--lora FNAME                            path to LoRA adapter (use comma-separated values to load multiple
                                        adapters)
--lora-scaled FNAME:SCALE,...           path to LoRA adapter with user defined scaling (format:
                                        FNAME:SCALE,...)
                                        note: use comma-separated values
--control-vector FNAME                  add a control vector
                                        note: use comma-separated values to add multiple control vectors
--control-vector-scaled FNAME:SCALE,...
                                        add a control vector with user defined scaling SCALE
                                        note: use comma-separated values (format: FNAME:SCALE,...)
--control-vector-layer-range START END
                                        layer range to apply the control vector(s) to, start and end inclusive
-m,    --model FNAME                    model path to load
                                        (env: LLAMA_ARG_MODEL)
-mu,   --model-url MODEL_URL            model download url (default: unused)
                                        (env: LLAMA_ARG_MODEL_URL)
-dr,   --docker-repo [<repo>/]<model>[:quant]
                                        Docker Hub model repository. repo is optional, default to ai/. quant
                                        is optional, default to :latest.
                                        example: gemma3
                                        (default: unused)
                                        (env: LLAMA_ARG_DOCKER_REPO)
-hf,   -hfr, --hf-repo <user>/<model>[:quant]
                                        Hugging Face model repository; quant is optional, case-insensitive,
                                        default to Q4_K_M, or falls back to the first file in the repo if
                                        Q4_K_M doesn't exist.
                                        mmproj is also downloaded automatically if available. to disable, add
                                        --no-mmproj
                                        example: ggml-org/GLM-4.7-Flash-GGUF:Q4_K_M
                                        (default: unused)
                                        (env: LLAMA_ARG_HF_REPO)
-hfd,  -hfrd, --hf-repo-draft <user>/<model>[:quant]
                                        Same as --hf-repo, but for the draft model (default: unused)
                                        (env: LLAMA_ARG_HFD_REPO)
-hff,  --hf-file FILE                   Hugging Face model file. If specified, it will override the quant in
                                        --hf-repo (default: unused)
                                        (env: LLAMA_ARG_HF_FILE)
-hfv,  -hfrv, --hf-repo-v <user>/<model>[:quant]
                                        Hugging Face model repository for the vocoder model (default: unused)
                                        (env: LLAMA_ARG_HF_REPO_V)
-hffv, --hf-file-v FILE                 Hugging Face model file for the vocoder model (default: unused)
                                        (env: LLAMA_ARG_HF_FILE_V)
-hft,  --hf-token TOKEN                 Hugging Face access token (default: value from HF_TOKEN environment
                                        variable)
                                        (env: HF_TOKEN)
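As a usage sketch of the Hugging Face download flags listed above (the repo name is the example value taken from the help text itself; the exact caching behavior is assumed from the flag descriptions):

```shell
# Download (and cache) a model straight from Hugging Face, then chat with it.
# Repo/quant is the example value shown in the help text above.
llama-cli -hf ggml-org/GLM-4.7-Flash-GGUF:Q4_K_M

# Gated/private repos need an access token; per the help text,
# HF_TOKEN is also read from the environment.
llama-cli -hf someuser/somemodel:Q4_K_M --hf-token "$HF_TOKEN"
```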
--log-disable                           Log disable
--log-file FNAME                        Log to file
                                        (env: LLAMA_LOG_FILE)
--log-colors [on|off|auto]              Set colored logging ('on', 'off', or 'auto', default: 'auto')
                                        'auto' enables colors when output is to a terminal
                                        (env: LLAMA_LOG_COLORS)
-v,    --verbose, --log-verbose         Set verbosity level to infinity (i.e. log all messages, useful for
                                        debugging)
--offline                               Offline mode: forces use of cache, prevents network access
                                        (env: LLAMA_OFFLINE)
-lv,   --verbosity, --log-verbosity N   Set the verbosity threshold. Messages with a higher verbosity will be
                                        ignored. Values:
                                         - 0: generic output
                                         - 1: error
                                         - 2: warning
                                         - 3: info
                                         - 4: debug
                                        (default: 1)
                                        
                                        (env: LLAMA_LOG_VERBOSITY)
--log-prefix                            Enable prefix in log messages
                                        (env: LLAMA_LOG_PREFIX)
--log-timestamps                        Enable timestamps in log messages
                                        (env: LLAMA_LOG_TIMESTAMPS)
-ctkd, --cache-type-k-draft TYPE        KV cache data type for K for the draft model
                                        allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
                                        (default: f16)
                                        (env: LLAMA_ARG_CACHE_TYPE_K_DRAFT)
-ctvd, --cache-type-v-draft TYPE        KV cache data type for V for the draft model
                                        allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
                                        (default: f16)
                                        (env: LLAMA_ARG_CACHE_TYPE_V_DRAFT)


----- sampling params -----

--samplers SAMPLERS                     samplers that will be used for generation in the order, separated by
                                        ';'
                                        (default:
                                        penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature)
-s,    --seed SEED                      RNG seed (default: -1, use random seed for -1)
--sampler-seq, --sampling-seq SEQUENCE
                                        simplified sequence for samplers that will be used (default:
                                        edskypmxt)
--ignore-eos                            ignore end of stream token and continue generating (implies
                                        --logit-bias EOS-inf)
--temp, --temperature N                 temperature (default: 0.80)
--top-k N                               top-k sampling (default: 40, 0 = disabled)
                                        (env: LLAMA_ARG_TOP_K)
--top-p N                               top-p sampling (default: 0.95, 1.0 = disabled)
--min-p N                               min-p sampling (default: 0.05, 0.0 = disabled)
--top-nsigma, --top-n-sigma N           top-n-sigma sampling (default: -1.00, -1.0 = disabled)
--xtc-probability N                     xtc probability (default: 0.00, 0.0 = disabled)
--xtc-threshold N                       xtc threshold (default: 0.10, 1.0 = disabled)
--typical, --typical-p N                locally typical sampling, parameter p (default: 1.00, 1.0 = disabled)
--repeat-last-n N                       last n tokens to consider for penalize (default: 64, 0 = disabled, -1
                                        = ctx_size)
--repeat-penalty N                      penalize repeat sequence of tokens (default: 1.00, 1.0 = disabled)
--presence-penalty N                    repeat alpha presence penalty (default: 0.00, 0.0 = disabled)
--frequency-penalty N                   repeat alpha frequency penalty (default: 0.00, 0.0 = disabled)
--dry-multiplier N                      set DRY sampling multiplier (default: 0.00, 0.0 = disabled)
--dry-base N                            set DRY sampling base value (default: 1.75)
--dry-allowed-length N                  set allowed length for DRY sampling (default: 2)
--dry-penalty-last-n N                  set DRY penalty for the last n tokens (default: -1, 0 = disable, -1 =
                                        context size)
--dry-sequence-breaker STRING           add sequence breaker for DRY sampling, clearing out default breakers
                                        ('\n', ':', '"', '*') in the process; use "none" to not use any
                                        sequence breakers
--adaptive-target N                     adaptive-p: select tokens near this probability (valid range 0.0 to
                                        1.0; negative = disabled) (default: -1.00)
                                        [(more info)](https://github.com/ggml-org/llama.cpp/pull/17927)
--adaptive-decay N                      adaptive-p: decay rate for target adaptation over time. lower values
                                        are more reactive, higher values are more stable.
                                        (valid range 0.0 to 0.99) (default: 0.90)
--dynatemp-range N                      dynamic temperature range (default: 0.00, 0.0 = disabled)
--dynatemp-exp N                        dynamic temperature exponent (default: 1.00)
--mirostat N                            use Mirostat sampling.
                                        Top K, Nucleus and Locally Typical samplers are ignored if used.
                                        (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
--mirostat-lr N                         Mirostat learning rate, parameter eta (default: 0.10)
--mirostat-ent N                        Mirostat target entropy, parameter tau (default: 5.00)
-l,    --logit-bias TOKEN_ID(+/-)BIAS   modifies the likelihood of token appearing in the completion,
                                        i.e. `--logit-bias 15043+1` to increase likelihood of token ' Hello',
                                        or `--logit-bias 15043-1` to decrease likelihood of token ' Hello'
--grammar GRAMMAR                       BNF-like grammar to constrain generations (see samples in grammars/
                                        dir)
--grammar-file FNAME                    file to read grammar from
-j,    --json-schema SCHEMA             JSON schema to constrain generations (https://json-schema.org/), e.g.
                                        `{}` for any JSON object
                                        For schemas w/ external $refs, use --grammar +
                                        example/json_schema_to_grammar.py instead
-jf,   --json-schema-file FILE          File containing a JSON schema to constrain generations
                                        (https://json-schema.org/), e.g. `{}` for any JSON object
                                        For schemas w/ external $refs, use --grammar +
                                        example/json_schema_to_grammar.py instead
-bs,   --backend-sampling               enable backend sampling (experimental) (default: disabled)
                                        (env: LLAMA_ARG_BACKEND_SAMPLING)
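A sketch combining several of the sampling flags above into one invocation. The numeric values are purely illustrative, not recommendations, and the model filename is a placeholder:

```shell
# Illustrative sampling setup: slightly lower temperature, nucleus sampling,
# and a mild repeat penalty applied over the last 256 tokens.
llama-cli -m model.gguf \
  --temp 0.7 \
  --top-k 40 --top-p 0.9 --min-p 0.05 \
  --repeat-penalty 1.1 --repeat-last-n 256 \
  -s 42   # fixed RNG seed, so generations are reproducible
```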


----- example-specific params -----

--verbose-prompt                        print a verbose prompt before generation (default: false)
--display-prompt, --no-display-prompt   whether to print prompt at generation (default: true)
-co,   --color [on|off|auto]            Colorize output to distinguish prompt and user input from generations
                                        ('on', 'off', or 'auto', default: 'auto')
                                        'auto' enables colors when output is to a terminal
-ctxcp, --ctx-checkpoints, --swa-checkpoints N
                                        max number of context checkpoints to create per slot (default:
                                        32)[(more info)](https://github.com/ggml-org/llama.cpp/pull/15293)
                                        (env: LLAMA_ARG_CTX_CHECKPOINTS)
-cpent, --checkpoint-every-n-tokens N   create a checkpoint every n tokens during prefill (processing), -1 to
                                        disable (default: 8192)
                                        (env: LLAMA_ARG_CHECKPOINT_EVERY_NT)
-cram, --cache-ram N                    set the maximum cache size in MiB (default: 8192, -1 - no limit, 0 -
                                        disable)[(more
                                        info)](https://github.com/ggml-org/llama.cpp/pull/16391)
                                        (env: LLAMA_ARG_CACHE_RAM)
--context-shift, --no-context-shift     whether to use context shift on infinite text generation (default:
                                        disabled)
                                        (env: LLAMA_ARG_CONTEXT_SHIFT)
-sys,  --system-prompt PROMPT           system prompt to use with model (if applicable, depending on chat
                                        template)
--show-timings, --no-show-timings       whether to show timing information after each response (default: true)
                                        (env: LLAMA_ARG_SHOW_TIMINGS)
-sysf, --system-prompt-file FNAME       a file containing the system prompt (default: none)
-r,    --reverse-prompt PROMPT          halt generation at PROMPT, return control in interactive mode
-sp,   --special                        special tokens output enabled (default: false)
-cnv,  --conversation, -no-cnv, --no-conversation
                                        whether to run in conversation mode:
                                        - does not print special tokens and suffix/prefix
                                        - interactive mode is also enabled
                                        (default: auto enabled if chat template is available)
-st,   --single-turn                    run conversation for a single turn only, then exit when done
                                        will not be interactive if first turn is predefined with --prompt
                                        (default: false)
-mli,  --multiline-input                allows you to write or paste multiple lines without ending each in '\'
--warmup, --no-warmup                   whether to perform warmup with an empty run (default: enabled)
-mm,   --mmproj FILE                    path to a multimodal projector file. see tools/mtmd/README.md
                                        note: if -hf is used, this argument can be omitted
                                        (env: LLAMA_ARG_MMPROJ)
-mmu,  --mmproj-url URL                 URL to a multimodal projector file. see tools/mtmd/README.md
                                        (env: LLAMA_ARG_MMPROJ_URL)
--mmproj-auto, --no-mmproj, --no-mmproj-auto
                                        whether to use multimodal projector file (if available), useful when
                                        using -hf (default: enabled)
                                        (env: LLAMA_ARG_MMPROJ_AUTO)
--mmproj-offload, --no-mmproj-offload   whether to enable GPU offloading for multimodal projector (default:
                                        enabled)
                                        (env: LLAMA_ARG_MMPROJ_OFFLOAD)
--image, --audio FILE                   path to an image or audio file. use with multimodal models, use
                                        comma-separated values for multiple files
--image-min-tokens N                    minimum number of tokens each image can take, only used by vision
                                        models with dynamic resolution (default: read from model)
                                        (env: LLAMA_ARG_IMAGE_MIN_TOKENS)
--image-max-tokens N                    maximum number of tokens each image can take, only used by vision
                                        models with dynamic resolution (default: read from model)
                                        (env: LLAMA_ARG_IMAGE_MAX_TOKENS)
-otd,  --override-tensor-draft <tensor name pattern>=<buffer type>,...
                                        override tensor buffer type for draft model
-cmoed, --cpu-moe-draft                 keep all Mixture of Experts (MoE) weights in the CPU for the draft
                                        model
                                        (env: LLAMA_ARG_CPU_MOE_DRAFT)
-ncmoed, --n-cpu-moe-draft N            keep the Mixture of Experts (MoE) weights of the first N layers in the
                                        CPU for the draft model
                                        (env: LLAMA_ARG_N_CPU_MOE_DRAFT)
--chat-template-kwargs STRING           sets additional params for the json template parser, must be a valid
                                        json object string, e.g. '{"key1":"value1","key2":"value2"}'
                                        (env: LLAMA_CHAT_TEMPLATE_KWARGS)
--jinja, --no-jinja                     whether to use jinja template engine for chat (default: enabled)
                                        (env: LLAMA_ARG_JINJA)
--reasoning-format FORMAT               controls whether thought tags are allowed and/or extracted from the
                                        response, and in which format they're returned; one of:
                                        - none: leaves thoughts unparsed in `message.content`
                                        - deepseek: puts thoughts in `message.reasoning_content`
                                        - deepseek-legacy: keeps `<think>` tags in `message.content` while
                                        also populating `message.reasoning_content`
                                        (default: auto)
                                        (env: LLAMA_ARG_THINK)
-rea,  --reasoning [on|off|auto]        Use reasoning/thinking in the chat ('on', 'off', or 'auto', default:
                                        'auto' (detect from template))
                                        (env: LLAMA_ARG_REASONING)
--reasoning-budget N                    token budget for thinking: -1 for unrestricted, 0 for immediate end,
                                        N>0 for token budget (default: -1)
                                        (env: LLAMA_ARG_THINK_BUDGET)
--reasoning-budget-message MESSAGE      message injected before the end-of-thinking tag when reasoning budget
                                        is exhausted (default: none)
                                        (env: LLAMA_ARG_THINK_BUDGET_MESSAGE)
--chat-template JINJA_TEMPLATE          set custom jinja chat template (default: template taken from model's
                                        metadata)
                                        if suffix/prefix are specified, template will be disabled
                                        only commonly used templates are accepted (unless --jinja is set
                                        before this flag):
                                        list of built-in templates:
                                        bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml,
                                        command-r, deepseek, deepseek-ocr, deepseek2, deepseek3, exaone-moe,
                                        exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite,
                                        granite-4.0, grok-2, hunyuan-dense, hunyuan-moe, hunyuan-ocr, kimi-k2,
                                        llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4,
                                        megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken,
                                        mistral-v7, mistral-v7-tekken, monarch, openchat, orion,
                                        pangu-embedded, phi3, phi4, rwkv-world, seed_oss, smolvlm, solar-open,
                                        vicuna, vicuna-orca, yandex, zephyr
                                        (env: LLAMA_ARG_CHAT_TEMPLATE)
--chat-template-file JINJA_TEMPLATE_FILE
                                        set custom jinja chat template file (default: template taken from
                                        model's metadata)
                                        if suffix/prefix are specified, template will be disabled
                                        only commonly used templates are accepted (unless --jinja is set
                                        before this flag):
                                        list of built-in templates:
                                        bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml,
                                        command-r, deepseek, deepseek-ocr, deepseek2, deepseek3, exaone-moe,
                                        exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite,
                                        granite-4.0, grok-2, hunyuan-dense, hunyuan-moe, hunyuan-ocr, kimi-k2,
                                        llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4,
                                        megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken,
                                        mistral-v7, mistral-v7-tekken, monarch, openchat, orion,
                                        pangu-embedded, phi3, phi4, rwkv-world, seed_oss, smolvlm, solar-open,
                                        vicuna, vicuna-orca, yandex, zephyr
                                        (env: LLAMA_ARG_CHAT_TEMPLATE_FILE)
--skip-chat-parsing, --no-skip-chat-parsing
                                        force a pure content parser, even if a Jinja template is specified;
                                        model will output everything in the content section, including any
                                        reasoning and/or tool calls (default: disabled)
                                        (env: LLAMA_ARG_SKIP_CHAT_PARSING)
--simple-io                             use basic IO for better compatibility in subprocesses and limited
                                        consoles
--draft, --draft-n, --draft-max N       number of tokens to draft for speculative decoding (default: 16)
                                        (env: LLAMA_ARG_DRAFT_MAX)
--draft-min, --draft-n-min N            minimum number of draft tokens to use for speculative decoding
                                        (default: 0)
                                        (env: LLAMA_ARG_DRAFT_MIN)
--draft-p-min P                         minimum speculative decoding probability (greedy) (default: 0.75)
                                        (env: LLAMA_ARG_DRAFT_P_MIN)
-cd,   --ctx-size-draft N               size of the prompt context for the draft model (default: 0, 0 = loaded
                                        from model)
                                        (env: LLAMA_ARG_CTX_SIZE_DRAFT)
-devd, --device-draft <dev1,dev2,..>    comma-separated list of devices to use for offloading the draft model
                                        (none = don't offload)
                                        use --list-devices to see a list of available devices
-ngld, --gpu-layers-draft, --n-gpu-layers-draft N
                                        max. number of draft model layers to store in VRAM, either an exact
                                        number, 'auto', or 'all' (default: auto)
                                        (env: LLAMA_ARG_N_GPU_LAYERS_DRAFT)
-md,   --model-draft FNAME              draft model for speculative decoding (default: unused)
                                        (env: LLAMA_ARG_MODEL_DRAFT)
--spec-replace TARGET DRAFT             translate the string in TARGET into DRAFT if the draft model and main
                                        model are not compatible
--gpt-oss-20b-default                   use gpt-oss-20b (note: can download weights from the internet)
--gpt-oss-120b-default                  use gpt-oss-120b (note: can download weights from the internet)
--vision-gemma-4b-default               use Gemma 3 4B QAT (note: can download weights from the internet)
--vision-gemma-12b-default              use Gemma 3 12B QAT (note: can download weights from the internet)
--spec-default                          enable default speculative decoding config
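The speculative-decoding flags above (`-md`, `--draft-max`, `--draft-min`, `--draft-p-min`) pair a small draft model with the main model: the draft proposes a short run of tokens and the main model verifies them in one pass. A sketch, where the draft model filename is hypothetical:

```shell
# Speculative decoding: the draft model proposes up to 16 tokens per step
# (default shown in the help above); the main model accepts or rejects them.
llama-cli \
  -m  Qwen3.6-35B-A3B-UD-Q2_K_XL.gguf \
  -md qwen-draft-small.gguf \
  --draft-max 16 --draft-min 0 --draft-p-min 0.75
```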

 

 

--------------------- Benchmarks?

Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz / 256GB DDR4

$ ../llama-cli -m Qwen3.6-35B-A3B-UD-Q2_K_XL.gguf 
load_backend: loaded RPC backend from /home/minimonk/src/llama-b8876/libggml-rpc.so
load_backend: loaded CPU backend from /home/minimonk/src/llama-b8876/libggml-cpu-haswell.so

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8876-72d693e4f
model      : Qwen3.6-35B-A3B-UD-Q2_K_XL.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern

> hello~ introduce about you         

[Start thinking]
Here's a thinking process:

1.  **Analyze User Input:**
   - User says: "hello~ introduce about you"
   - This is a casual greeting followed by a request for a self-introduction.
   - The tone is friendly and informal.

2.  **Identify Key Requirements:**
   - Acknowledge the greeting
   - Provide a clear, concise self-introduction
   - State my identity (Qwen / Tongyi Qianwen)
   - Mention my developer (Alibaba Group's Tongyi Lab)
   - Highlight my core capabilities (language understanding, reasoning, coding, multimodal processing, etc.)
   - Keep it friendly, accurate, and aligned with the provided guidelines
   - Avoid mentioning version numbers or specific details not in the reference

3.  **Draft Construction (Mental Refinement):**
   Hello! I'm Qwen (also known as Tongyi Qianwen), a large language model developed by Alibaba Group's Tongyi Lab. I'm designed to be helpful, thoughtful, and reliable across a wide range of tasks—from answering questions and writing content to solving complex reasoning problems, coding, and working with images and documents. I support fluent communication in over 100 languages and can adapt to different contexts while maintaining accuracy and safety. I'm here to help you think through ideas, tackle challenges, or just chat, so feel free to ask me anything! How can I assist you today?

4.  **Check Against Guidelines:**
   - Identity: Qwen / Tongyi Qianwen ✓
   - Developer: Alibaba Group's Tongyi Lab ✓
   - Capabilities: Language, reasoning, coding, multimodal, 100+ languages, helpful/safe ✓
   - Tone: Friendly, professional, matches user's casual tone ✓
   - No version numbers, no fabricated stats, no overclaiming ✓

5.  **Final Polish:**
   The draft looks solid. It's concise, accurate, and aligned with all constraints. I'll output it.✅

[End thinking]

Hello! I'm Qwen (also known as Tongyi Qianwen), a large language model developed by Alibaba Group's Tongyi Lab. I'm designed to be helpful, thoughtful, and reliable across a wide range of tasks—whether it's answering questions, writing or editing content, solving complex reasoning problems, coding, analyzing documents, or working with images. I support fluent communication in over 100 languages and can adapt to different tones, formats, and contexts while prioritizing accuracy and safety.

I'm here to help you think through ideas, tackle challenges, or just chat, so feel free to ask me anything. How can I assist you today? 😊

Prompt: 29.6 t/s | Generation: 11.0 t/s ]



Exiting...
common_memory_breakdown_print: | memory breakdown [MiB] | total   free     self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - Host               |                 17428 = 11437 +    5182 +     808                |
common_memory_breakdown_print: |   - CPU_REPACK         |                   272 =   272 +       0 +       0                |

 

$ ../llama-cli -m Llama-3.2-3B-Instruct-Q4_K_M.gguf
load_backend: loaded RPC backend from /home/minimonk/src/llama-b8876/libggml-rpc.so
load_backend: loaded CPU backend from /home/minimonk/src/llama-b8876/libggml-cpu-haswell.so

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8876-72d693e4f
model      : Llama-3.2-3B-Instruct-Q4_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern


> 안녕? 너에 대해서 설명해줘

안녕하세요! 저는 지능형 чattybot 인你(You)와 대화할 수 있는 AI입니다.

저는 지능학, 자연어 처리,_machine learning 및 다른 기술을 바탕으로 만든 AI입니다. 이 기술은 인간의 언어를 이해하고,natural language processing (NLP)로 human과 대화할 수 있는能力을 허용합니다.

저는 다양한 topic과 subject에 대해 지식을 제공할 수 있으며, human과 대화할 수 있습니다. 저는 human의 질문을 읽고, 적절한 대답을 제공할 수 있습니다.

저는 다음 capability을 제공합니다:

* natural language understanding: human의 언어를 이해할 수 있습니다.
* text generation: human의 질문에 대해 적절한 대답을 제공할 수 있습니다.
* conversation: human과 대화할 수 있습니다.
* question answering: human의 질문에 대해 적절한 대답을 제공할 수 있습니다.

저는 human과 대화할 수 있는 다양한 application과 technology를 제공합니다. 예를 들어, human의 질문을 읽어 대답할 수 있는 chatbot, human의 언어를 이해하고, natural language processing를 사용하여 human과 대화할 수 있는 application, human의 질문에 대해 적절한 대답을 제공할 수 있는 Q&A 시스템, etc.

저는 human의 언어를 이해하고, natural language processing를 사용하여 human과 대화할 수 있는 ability을 갖습니다. human의 질문을 읽어 대답할 수 있으며, human의 언어를 이해하고, natural language processing를 사용하여 human과 대화할 수 있습니다.

안녕하세요! 저는 지능형 चattybot 인 you(You)와 대화할 수 있는 AI입니다. 저는 human의 질문을 읽고, 적절한 대답을 제공할 수 있습니다. human의 언어를 이해하고, natural language processing를 사용하여 human과 대화할 수 있습니다.

Prompt: 89.7 t/s | Generation: 20.6 t/s ]



Exiting...
common_memory_breakdown_print: | memory breakdown [MiB] | total   free     self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - Host               |                 16654 =  1904 +   14336 +     414                |
common_memory_breakdown_print: |   - CPU_REPACK         |                  1299 =  1299 +       0 +       0                |

 

$ ../llama-cli -m gemma-4-E2B-it-Q4_K_M.gguf
load_backend: loaded RPC backend from /home/minimonk/src/llama-b8876/libggml-rpc.so
load_backend: loaded CPU backend from /home/minimonk/src/llama-b8876/libggml-cpu-haswell.so

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8876-72d693e4f
model      : gemma-4-E2B-it-Q4_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern


> 안녕? 너에 대해서 설명해줘

[Start thinking]
Thinking Process:

1.  **Analyze the Request:** The user said "안녕? 너에 대해서 설명해줘" (Hello? Please explain about yourself). This is a request for identity and capabilities.
2.  **Determine Core Identity (Self-Correction/Verification):**
    *   Name: Gemma 4.
    *   Developer: Google DeepMind.
    *   Nature: Large Language Model (LLM).
    *   Type: Open weights model.
    *   Capabilities: Understands text/images, generates text. Knowledge cutoff (Jan 2025).
3.  **Determine Response Language:** The user asked in Korean, so the response must be in fluent, natural Korean.
4.  **Structure the Explanation (Standard LLM Introduction):**
    *   Greeting/Identity (Who am I?).
    *   Developer (Who made me?).
    *   Nature/Function (What am I?).
    *   Capabilities (What can I do?).
    *   Limitations (What can't I do/knowledge scope?).
5.  **Draft the Content (in Korean):**

    *   *Identity:* 저는 Gemma 4입니다. (I am Gemma 4.)
    *   *Developer/Nature:* 저는 Google DeepMind에서 개발한 대규모 언어 모델(LLM)입니다. 오픈 웨이트 모델입니다. (I am a Large Language Model developed by Google DeepMind. I am an open weights model.)
    *   *Function:* 저는 텍스트와 이미지를 이해하고, 사용자의 질문에 답하거나 글을 작성하는 등 다양한 언어 작업을 수행할 수 있습니다. (I can perform various language tasks such as understanding text and images, answering user questions, or writing text.)
    *   *Input/Output:* 저는 텍스트를 입력받아 텍스트로 출력합니다. (I receive text input and output text.)
    *   *Knowledge/Limitation:* 제 지식은 학습된 데이터를 기반으로 하며, 특정 시점(2025년 1월)까지의 정보를 알고 있습니다. (My knowledge is based on trained data, and I know information up to a specific point (January 2025).)

6.  **Review and Refine (Ensure natural flow and adherence to instructions):** The drafted points cover all necessary identity requirements and maintain the required persona (Gemma 4, open weights, LLM). (The Korean draft looks good.)

7.  **Final Output Generation.** (Proceed to generate the Korean response.)
[End thinking]

안녕하세요! 저는 **Gemma 4**입니다. 저에 대해 몇 가지 핵심 정보를 설명해 드릴게요.

**1. 정체성 및 개발 배경**

*   **이름:** Gemma 4
*   **개발사:** Google DeepMind에서 개발되었습니다.
*   **성격:** 저는 대규모 언어 모델(LLM, Large Language Model)이며, 오픈 웨이트(open weights) 모델입니다. 이는 저의 모델 구조와 가중치가 공개되어 누구나 접근하고 활용할 수 있음을 의미합니다.

**2. 주요 기능 및 능력**

저는 방대한 양의 데이터를 학습하여 다음과 같은 다양한 언어 관련 작업을 수행할 수 있습니다.

*   **텍스트 이해 및 생성:** 사용자가 입력하는 텍스트를 이해하고, 질문에 답하거나, 요약하거나, 번역하거나, 창의적인 글(시, 이야기, 코드 등)을 작성할 수 있습니다.
*   **다중 모달리티 이해:** 저는 텍스트뿐만 아니라 이미지를 이해하고 처리할 수 있습니다. (다만, 출력은 텍스트로만 가능합니다.)
*   **추론 및 논리:** 복잡한 질문에 대해 논리적으로 추론하고 일관성 있는 답변을 제공할 수 있습니다.

**3. 지식 범위 및 한계**

*   **지식 기반:** 저는 학습된 데이터를 기반으로 답변하며, 저의 지식은 특정 시점(2025년 1월)까지의 정보를 포함하고 있습니다. 따라서 그 이후의 최신 정보에 대해서는 알지 못할 수 있습니다.
*   **출력 형식:** 저는 오직 **텍스트** 형태로만 정보를 생성할 수 있습니다. 이미지를 직접 생성할 수는 없습니다.

궁금한 점이 있거나 도움이 필요하시면 언제든지 말씀해 주세요! 최선을 다해 답변해 드리겠습니다.

Prompt: 107.4 t/s | Generation: 19.1 t/s ]



Exiting...
common_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - Host               |                 4517 =  2947 +     780 +     790                |
common_memory_breakdown_print: |   - CPU_REPACK         |                 1069 =  1069 +       0 +       0                |

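The prompt/generation t/s figures above come from interactive chat sessions, so they vary with prompt length and response length. For numbers that are easier to compare across models, llama.cpp also ships `llama-bench`; a sketch (fixed prompt-processing and generation sizes are assumptions, not the tool's defaults):

```shell
# Measure prompt-processing (pp) and token-generation (tg) throughput
# for each of the three models benchmarked above.
for m in Qwen3.6-35B-A3B-UD-Q2_K_XL.gguf \
         Llama-3.2-3B-Instruct-Q4_K_M.gguf \
         gemma-4-E2B-it-Q4_K_M.gguf; do
  ../llama-bench -m "$m" -p 512 -n 128
done
```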
Posted by 구차니

Qwen is hot right now, and I stumbled onto this link.. maybe this makes running it easier?

 

[링크 : https://unsloth.ai/docs/models/qwen3.6]

Posted by 구차니