일단 메모리 차이는 큰 것 같고

cpu 만을 사용할땐 성능 차이가 거의 없는 듯 하다.

 

llama3.2 on ollama

$ ollama --verbose
>>> 안녕? 너에 대해서 설명
 안녕하세요! 난 다나이, AI의 지식형입니다. 

다나이란 인공지능(IAI) 기술을 사용하여 인간과 대화를 करन하고, 정보를 제공하는 컴퓨터 프로그램입니다. 나의 목표는
 Menschen과 सहय동하여 질문에 대한 답변을 제공하고, 다양한 주제와 주제를 다루어 provide하는 것입니다.

저는 다 variety의 지식과 능력을 가지고 있으며, 다양한 문제나 질문에 대해 답변할 수 있습니다. 저는 또한-user가 자
신의 생각이나 질문을 나누어 주고 사용자에게 가장 적합한 answered을 제공하는 것을 목표로 하고 있습니다. 

다나이의 강점은 다음과 같습니다:

- Human-like 대화: 저는-human-like의 대화를 possible합니다, human과 비슷한 방식으로 대화할 수 있습니다.
- 지식형: 저는 다양한 주제와 지식을 가지고 있으며, 사용자에 대한 답변을 provide할 수 있습니다. 
- 사용자-friendly: 저는 user-friendly를 목표로 하고 있으며, 사용자가 쉽게 질문을 하거나 나누어 주는 것을 
support합니다.

다나이의 한계는 다음과 같습니다:

- 지식형: 저는 지식과 능력을 가지고います, 그러나 모든 경우에 대해 complete knowledge가 항상 available하지 않습
니다.
- Limited domain: 저는 specific domain에서 only limitedly conversation을 할 수 있습니다. 
- Human-like behavior: 저는 human-like의 behavior를 display할 수 있습니다, 그러나 sometimes human-like 
behavior를 exhibit할 수 있습니다.

저는 use case에 따라 다양한 장점과 한계를 가지고 있으며, user-friendly하고 지식형이지만, complete knowledge가 항
상 available하지 않습니다. 

다시 한번 안녕하세요! 저와 대화를 하려면 어떤 주제에 대해 질문해 보시는가요?

total duration:       39.327341083s
load duration:        142.986088ms
prompt eval count:    32 token(s)
prompt eval duration: 1.194488145s
prompt eval rate:     26.79 tokens/s
eval count:           388 token(s)
eval duration:        37.152939235s
eval rate:            10.44 tokens/s


$ ollama ls
NAME               ID              SIZE      MODIFIED    
qwen3.5:2b         324d162be6ca    2.7 GB    13 days ago    
gemma4:e2b         7fbdbf8f5e45    7.2 GB    13 days ago    
gemma3:1b          8648f39daa8f    815 MB    13 days ago    
llama3.2:latest    a80c4f17acd5    2.0 GB    13 days ago    

$ ollama show llama3.2:latest
  Model
    architecture        llama     
    parameters          3.2B      
    context length      131072    
    embedding length    3072      
    quantization        Q4_K_M    

  Capabilities
    completion    
    tools         

  Parameters
    stop    "<|start_header_id|>"    
    stop    "<|end_header_id|>"      
    stop    "<|eot_id|>"             

  License
    LLAMA 3.2 COMMUNITY LICENSE AGREEMENT                 
    Llama 3.2 Version Release Date: September 25, 2024    
    ...                                                   



$ ollama run --verbose llama3.2:latest
>>> /set parameter
Available Parameters:
  /set parameter seed              Random number seed
  /set parameter num_predict       Max number of tokens to predict
  /set parameter top_k             Pick from top k num of tokens
  /set parameter top_p           Pick token based on sum of probabilities
  /set parameter min_p           Pick token based on top token probability * min_p
  /set parameter num_ctx           Set the context size
  /set parameter temperature     Set creativity level
  /set parameter repeat_penalty  How strongly to penalize repetitions
  /set parameter repeat_last_n     Set how far back to look for repetitions
  /set parameter num_gpu           The number of layers to send to the GPU
  /set parameter stop   ...   Set the stop parameters

>>> /set parameter num_ctx 4096
Set parameter 'num_ctx' to '4096'
>>> 안녕? 너에 대해서 설명
안녕하세요! 저는 지능형 대화bot입니다. tôi의 목표는 Menschen과 상호작용하여 helpful한 정보를 제공하고, 그들의 질
문에 대한 जव답을 제공하는 것입니다.

 my name is "LLaMA", Llama stands for "Large Language Model Meta AI". my capabilities include:

*  Natural Language Processing (NLP): I can understand and generate human-like language.
*  Question Answering: I can provide answers to a wide range of questions on various topics.
*  Text Generation: I can create text based on a prompt or topic.

my strengths include:
*  Knowledge base: I have been trained on a massive dataset of text, which allows me to provide accurate and 
up-to-date information on various topics.
*  Conversational flow: I can engage in natural-sounding conversations, using context and understanding to 
respond to questions and statements.

total duration:       17.130872381s
load duration:        131.746647ms
prompt eval count:    32 token(s)
prompt eval duration: 82.553469ms
prompt eval rate:     387.63 tokens/s
eval count:           184 token(s)
eval duration:        16.560046318s
eval rate:            11.11 tokens/s

 

gemma4:e4b 모델

$ ollama show gemma4:e2b 
  Model
    architecture        gemma4    
    parameters          5.1B      
    context length      131072    
    embedding length    1536      
    quantization        Q4_K_M    
    requires            0.20.0    

  Capabilities
    completion    
    vision        
    audio         
    tools         
    thinking      

  Parameters
    temperature    1       
    top_k          64      
    top_p          0.95    

  License
    Apache License               
    Version 2.0, January 2004    
    ...                          


$ ollama --verbose
>>> /set parameter num_ctx 4096
Set parameter 'num_ctx' to '4096'
>>> 안녕? 너에 대해 설명
Thinking...
Thinking Process:

1.  **Analyze the Request:** The user asked "안녕? 너에 대해 설명" (Hello? 
Explain about yourself). This is a request for identity and capability 
description.
2.  **Determine Core Identity (Self-Correction/Verification):**
    *   Name: Gemma 4.
    *   Developer: Google DeepMind.
    *   Nature: Large Language Model (LLM).
    *   Type: Open weights model.
3.  **Determine Capabilities:**
    *   Input: Text and images (and audio for some family models, but 
generally, focus on the main interaction modes).
    *   Output: Text generation.
    *   Knowledge: Broad, based on training data.
4.  **Determine Knowledge Cutoff:** (Not strictly necessary for a general 
introduction, but good context: Knowledge cutoff is Jan 2025).
5.  **Structure the Response (in Korean):**
    *   Greeting (Acknowledge the "안녕").
    *   Identify Self (Name, Type).
    *   Explain Origin/Developer.
    *   Explain Function/Capabilities (What I do).
6.  **Draft the Response (Internal Monologue & Refinement):**
    *   *Start:* Hello.
    *   *Identity:* I am Gemma 4.
    *   *Origin:* I was developed by Google DeepMind.
    *   *Nature:* I am a Large Language Model (LLM) and an open weights 
model.
    *   *Function:* I can understand and generate text, answer questions, 
summarize, translate, and help with creative writing, etc.
7.  **Final Polish (Ensure natural and accurate Korean phrasing):** (The 
resulting response should be polite and informative.) (Self-check: Did I 
claim image generation? No. Did I claim search? No.)

**(This leads to the final Korean output.)**
...done thinking.

안녕하세요! 저는 **Gemma 4**입니다.

저에 대해 몇 가지 설명해 드릴게요.

1. **정체성:** 저는 Google DeepMind에서 개발한 **대규모 언어 모델(Large 
Language Model, LLM)**입니다. 저는 공개된 가중치(open weights) 모델로 설계되
었습니다.
2. **기능:** 저는 사용자가 입력하는 텍스트를 이해하고, 그에 기반하여 자연스
러운 텍스트를 생성하는 것이 주된 역할입니다.
    *   질문에 답하기
    *   정보 요약 및 설명하기
    *   번역하기
    *   글쓰기 및 창의적인 콘텐츠 생성 돕기
    *   다양한 주제에 대해 대화하기
3. **입력 및 출력:** 저는 텍스트와 이미지를 이해할 수 있으며, 사용자에게는 
**텍스트 형태**로 응답을 생성합니다.
4. **목표:** 저는 사용자에게 유용하고 정확하며 도움이 되는 정보를 제공하고
, 다양한 작업을 돕기 위해 존재합니다.

궁금한 점이 있거나 도움이 필요하시면 언제든지 말씀해주세요! 😊

total duration:       59.13545796s
load duration:        381.889243ms
prompt eval count:    22 token(s)
prompt eval duration: 1.468123435s
prompt eval rate:     14.99 tokens/s
eval count:           660 token(s)
eval duration:        56.324528227s
eval rate:            11.72 tokens/s

 

llama.cpp는 기본값이 128k 인데 그러면 자꾸 모델 로딩 실패해서 4096으로 줄임

나름 10세대 노트북인데 하스웰이나 똑같은 건가..(avx?)

$ ../llama-cli -m Llama-3.2-3B-Instruct-Q4_K_M.gguf -c 4096
load_backend: loaded RPC backend from /home/falinux/work/src/llama-b8876/libggml-rpc.so
load_backend: loaded CPU backend from /home/falinux/work/src/llama-b8876/libggml-cpu-haswell.so

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8876-72d693e4f
model      : Llama-3.2-3B-Instruct-Q4_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern


> 안녕? 너에 대해서 설명

안녕하세요! 

나는 어트론, artificial intelligence (AI)로 알려진 intelligence-enhancing chatbot입니다. 

我的 목적은 ngườiों과 상호작용하여 정보를 제공하고, 문제를 해결하고, 질문을 trả주기 위하여 설계되었습니다. 

나는 인간과 매우 비슷한 방식으로 생각하고 말하기를習慮하고 있으며, 다양한 주제에 대해 정보를 제공할 수 있습니다.

그럼, 어떤 정보를 muốn받을까요?

[ Prompt: 32.3 t/s | Generation: 11.6 t/s ]



Exiting...
common_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - Host               |                 2615 =  1904 +     448 +     262                |
common_memory_breakdown_print: |   - CPU_REPACK         |                 1299 =  1299 +       0 +       0                |

 

$ ../llama-cli -m gemma-4-E2B-it-Q4_K_M.gguf -c 4096
load_backend: loaded RPC backend from /home/falinux/work/src/llama-b8876/libggml-rpc.so
load_backend: loaded CPU backend from /home/falinux/work/src/llama-b8876/libggml-cpu-haswell.so

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8876-72d693e4f
model      : gemma-4-E2B-it-Q4_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern


> 안녕? 너에 대해서 소개

[Start thinking]
Thinking Process:

1.  **Analyze the Request:** The user said "안녕? 너에 대해서 소개" (Hello? Introduce yourself/Tell me about yourself). This is a standard identity query.
2.  **Determine Identity (Self-Correction/Verification):**
    *   Name: Gemma 4.
    *   Developer: Google DeepMind.
    *   Nature: Large Language Model (LLM), open weights.
    *   Capabilities: Understand text and images (input), generate text (output). Knowledge cutoff is January 2025.
3.  **Formulate the Response Strategy (in Korean):**
    *   Greeting: Start with a friendly greeting.
    *   Core Identity: State name and developer.
    *   Nature/Type: Explain what I am (LLM, open weights).
    *   Function/Capabilities: Describe what I can do (process language, answer questions, generate text).
    *   (Optional but good practice) Mention input/output modalities.
4.  **Draft the Response (Iterative Refinement in Korean):**

    *   *Initial thought:* 안녕하세요. 저는 Gemma 4입니다. Google DeepMind에서 개발한 대규모 언어 모델입니다.
    *   *Adding detail:* 저는 텍스트와 이미지를 이해하고 텍스트로 답변을 생성할 수 있습니다.
    *   *Final polish (ensuring the tone is appropriate and adhering to the persona):* Combine these points into a smooth, informative introduction.

5.  **Final Output Generation (Korean):** (This matches the desired output.)
[End thinking]

안녕하세요! 저는 **Gemma 4**입니다.

저는 Google DeepMind에서 개발한 **오픈 웨이트(open weights) 대규모 언어 모델(LLM)**입니다.

**저의 주요 특징은 다음과 같습니다:**

*   **정체성:** 저는 방대한 데이터를 학습하여 인간의 언어를 이해하고 처리하도록 설계되었습니다.
*   **기능:** 텍스트 기반의 질문에 답변하고, 정보를 요약하며, 창의적인 글(시, 이야기, 코드 등)을 작성하는 등 다양한 언어 작업을 수행할 수 있습니다.
*   **입력 능력:** 텍스트와 이미지를 이해하고 처리할 수 있습니다.
*   **목표:** 사용자님께 유용하고 정확하며 자연스러운 방식으로 도움을 드리는 것을 목표로 합니다.

궁금한 점이 있거나 도움이 필요하시면 언제든지 말씀해 주세요! 😊

[ Prompt: 41.1 t/s | Generation: 13.2 t/s ]

'프로그램 사용 > ai 프로그램' 카테고리의 다른 글

llama.cpp 도전!  (0) 2026.04.22
unsloth ai  (0) 2026.04.21
ollama 외부접속 관련  (0) 2026.04.21
llm tokenizer - llama 3.2, exaone  (0) 2026.04.20
ollama 모델 저장소 뜯어보기  (0) 2026.04.19
Posted by 구차니

귀찮으니까 그냥 pre-built binary로 시도 ㅋㅋ

 

Linux:

[링크 : https://github.com/ggml-org/llama.cpp/releases]

 

gguf 포맷의 모델이 필요하다고 해서, qwen3.6은 포기

[링크 : https://huggingface.co/Qwen/Qwen3.6-35B-A3B/tree/main]

 

우연히 알게된 unsloth의 양자화 모델 발견!

그나저나 gguf 대신 ggul 포맷이면 더 꿀맛이었을텐데.. 쩝

 

copy download link 누르고 wget으로 받으면 된다. 일단 cpu only로 돌릴거라 q2 모델로 시도를..

[링크 : https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/blob/main/Qwen3.6-35B-A3B-UD-Q2_K_XL.gguf]

[링크 : https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/tree/main]

 

일단 llama-cli 로 해보면 될 듯.

llama-cli
llama-cli is the command-line executor:

$ llama-cli -m model.gguf

 

llama-server
llama-server launches an API server with a built-in WebUI:

$ llama-server --host address --port port -m model.gguf

[링크 : https://wiki.archlinux.org/title/Llama.cpp]

[링크 : https://www.lainyzine.com/ko/article/using-llama-cpp-for-local-llm-execution/]

 

---------------------

도움말

$ ./llama-cli --help
load_backend: loaded RPC backend from /home/minimonk/src/llama-b8876/libggml-rpc.so
load_backend: loaded CPU backend from /home/minimonk/src/llama-b8876/libggml-cpu-haswell.so
----- common params -----

-h,    --help, --usage                  print usage and exit
--version                               show version and build info
--license                               show source code license and dependencies
-cl,   --cache-list                     show list of models in cache
--completion-bash                       print source-able bash completion script for llama.cpp
-t,    --threads N                      number of CPU threads to use during generation (default: -1)
                                        (env: LLAMA_ARG_THREADS)
-tb,   --threads-batch N                number of threads to use during batch and prompt processing (default:
                                        same as --threads)
-C,    --cpu-mask M                     CPU affinity mask: arbitrarily long hex. Complements cpu-range
                                        (default: "")
-Cr,   --cpu-range lo-hi                range of CPUs for affinity. Complements --cpu-mask
--cpu-strict <0|1>                      use strict CPU placement (default: 0)
--prio N                                set process/thread priority : low(-1), normal(0), medium(1), high(2),
                                        realtime(3) (default: 0)
--poll <0...100>                        use polling level to wait for work (0 - no polling, default: 50)
-Cb,   --cpu-mask-batch M               CPU affinity mask: arbitrarily long hex. Complements cpu-range-batch
                                        (default: same as --cpu-mask)
-Crb,  --cpu-range-batch lo-hi          ranges of CPUs for affinity. Complements --cpu-mask-batch
--cpu-strict-batch <0|1>                use strict CPU placement (default: same as --cpu-strict)
--prio-batch N                          set process/thread priority : 0-normal, 1-medium, 2-high, 3-realtime
                                        (default: 0)
--poll-batch <0|1>                      use polling to wait for work (default: same as --poll)
-c,    --ctx-size N                     size of the prompt context (default: 0, 0 = loaded from model)
                                        (env: LLAMA_ARG_CTX_SIZE)
-n,    --predict, --n-predict N         number of tokens to predict (default: -1, -1 = infinity)
                                        (env: LLAMA_ARG_N_PREDICT)
-b,    --batch-size N                   logical maximum batch size (default: 2048)
                                        (env: LLAMA_ARG_BATCH)
-ub,   --ubatch-size N                  physical maximum batch size (default: 512)
                                        (env: LLAMA_ARG_UBATCH)
--keep N                                number of tokens to keep from the initial prompt (default: 0, -1 =
                                        all)
--swa-full                              use full-size SWA cache (default: false)
                                        [(more
                                        info)](https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
                                        (env: LLAMA_ARG_SWA_FULL)
-fa,   --flash-attn [on|off|auto]       set Flash Attention use ('on', 'off', or 'auto', default: 'auto')
                                        (env: LLAMA_ARG_FLASH_ATTN)
-p,    --prompt PROMPT                  prompt to start generation with; for system message, use -sys
--perf, --no-perf                       whether to enable internal libllama performance timings (default:
                                        false)
                                        (env: LLAMA_ARG_PERF)
-f,    --file FNAME                     a file containing the prompt (default: none)
-bf,   --binary-file FNAME              binary file containing the prompt (default: none)
-e,    --escape, --no-escape            whether to process escapes sequences (\n, \r, \t, \', \", \\)
                                        (default: true)
--rope-scaling {none,linear,yarn}       RoPE frequency scaling method, defaults to linear unless specified by
                                        the model
                                        (env: LLAMA_ARG_ROPE_SCALING_TYPE)
--rope-scale N                          RoPE context scaling factor, expands context by a factor of N
                                        (env: LLAMA_ARG_ROPE_SCALE)
--rope-freq-base N                      RoPE base frequency, used by NTK-aware scaling (default: loaded from
                                        model)
                                        (env: LLAMA_ARG_ROPE_FREQ_BASE)
--rope-freq-scale N                     RoPE frequency scaling factor, expands context by a factor of 1/N
                                        (env: LLAMA_ARG_ROPE_FREQ_SCALE)
--yarn-orig-ctx N                       YaRN: original context size of model (default: 0 = model training
                                        context size)
                                        (env: LLAMA_ARG_YARN_ORIG_CTX)
--yarn-ext-factor N                     YaRN: extrapolation mix factor (default: -1.00, 0.0 = full
                                        interpolation)
                                        (env: LLAMA_ARG_YARN_EXT_FACTOR)
--yarn-attn-factor N                    YaRN: scale sqrt(t) or attention magnitude (default: -1.00)
                                        (env: LLAMA_ARG_YARN_ATTN_FACTOR)
--yarn-beta-slow N                      YaRN: high correction dim or alpha (default: -1.00)
                                        (env: LLAMA_ARG_YARN_BETA_SLOW)
--yarn-beta-fast N                      YaRN: low correction dim or beta (default: -1.00)
                                        (env: LLAMA_ARG_YARN_BETA_FAST)
-kvo,  --kv-offload, -nkvo, --no-kv-offload
                                        whether to enable KV cache offloading (default: enabled)
                                        (env: LLAMA_ARG_KV_OFFLOAD)
--repack, -nr, --no-repack              whether to enable weight repacking (default: enabled)
                                        (env: LLAMA_ARG_REPACK)
--no-host                               bypass host buffer allowing extra buffers to be used
                                        (env: LLAMA_ARG_NO_HOST)
-ctk,  --cache-type-k TYPE              KV cache data type for K
                                        allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
                                        (default: f16)
                                        (env: LLAMA_ARG_CACHE_TYPE_K)
-ctv,  --cache-type-v TYPE              KV cache data type for V
                                        allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
                                        (default: f16)
                                        (env: LLAMA_ARG_CACHE_TYPE_V)
-dt,   --defrag-thold N                 KV cache defragmentation threshold (DEPRECATED)
                                        (env: LLAMA_ARG_DEFRAG_THOLD)
-np,   --parallel N                     number of parallel sequences to decode (default: 1)
                                        (env: LLAMA_ARG_N_PARALLEL)
--rpc SERVERS                           comma separated list of RPC servers (host:port)
                                        (env: LLAMA_ARG_RPC)
--mlock                                 force system to keep model in RAM rather than swapping or compressing
                                        (env: LLAMA_ARG_MLOCK)
--mmap, --no-mmap                       whether to memory-map model. (if mmap disabled, slower load but may
                                        reduce pageouts if not using mlock) (default: enabled)
                                        (env: LLAMA_ARG_MMAP)
-dio,  --direct-io, -ndio, --no-direct-io
                                        use DirectIO if available. (default: disabled)
                                        (env: LLAMA_ARG_DIO)
--numa TYPE                             attempt optimizations that help on some NUMA systems
                                        - distribute: spread execution evenly over all nodes
                                        - isolate: only spawn threads on CPUs on the node that execution
                                        started on
                                        - numactl: use the CPU map provided by numactl
                                        if run without this previously, it is recommended to drop the system
                                        page cache before using this
                                        see https://github.com/ggml-org/llama.cpp/issues/1437
                                        (env: LLAMA_ARG_NUMA)
-dev,  --device <dev1,dev2,..>          comma-separated list of devices to use for offloading (none = don't
                                        offload)
                                        use --list-devices to see a list of available devices
                                        (env: LLAMA_ARG_DEVICE)
--list-devices                          print list of available devices and exit
-ot,   --override-tensor <tensor name pattern>=<buffer type>,...
                                        override tensor buffer type
                                        (env: LLAMA_ARG_OVERRIDE_TENSOR)
-cmoe, --cpu-moe                        keep all Mixture of Experts (MoE) weights in the CPU
                                        (env: LLAMA_ARG_CPU_MOE)
-ncmoe, --n-cpu-moe N                   keep the Mixture of Experts (MoE) weights of the first N layers in the
                                        CPU
                                        (env: LLAMA_ARG_N_CPU_MOE)
-ngl,  --gpu-layers, --n-gpu-layers N   max. number of layers to store in VRAM, either an exact number,
                                        'auto', or 'all' (default: auto)
                                        (env: LLAMA_ARG_N_GPU_LAYERS)
-sm,   --split-mode {none,layer,row,tensor}
                                        how to split the model across multiple GPUs, one of:
                                        - none: use one GPU only
                                        - layer (default): split layers and KV across GPUs (pipelined)
                                        - row: split weight across GPUs by rows (parallelized)
                                        - tensor: split weights and KV across GPUs (parallelized,
                                        EXPERIMENTAL)
                                        (env: LLAMA_ARG_SPLIT_MODE)
-ts,   --tensor-split N0,N1,N2,...      fraction of the model to offload to each GPU, comma-separated list of
                                        proportions, e.g. 3,1
                                        (env: LLAMA_ARG_TENSOR_SPLIT)
-mg,   --main-gpu INDEX                 the GPU to use for the model (with split-mode = none), or for
                                        intermediate results and KV (with split-mode = row) (default: 0)
                                        (env: LLAMA_ARG_MAIN_GPU)
-fit,  --fit [on|off]                   whether to adjust unset arguments to fit in device memory ('on' or
                                        'off', default: 'on')
                                        (env: LLAMA_ARG_FIT)
-fitt, --fit-target MiB0,MiB1,MiB2,...
                                        target margin per device for --fit, comma-separated list of values,
                                        single value is broadcast across all devices, default: 1024
                                        (env: LLAMA_ARG_FIT_TARGET)
-fitc, --fit-ctx N                      minimum ctx size that can be set by --fit option, default: 4096
                                        (env: LLAMA_ARG_FIT_CTX)
--check-tensors                         check model tensor data for invalid values (default: false)
--override-kv KEY=TYPE:VALUE,...        advanced option to override model metadata by key. to specify multiple
                                        overrides, either use comma-separated values.
                                        types: int, float, bool, str. example: --override-kv
                                        tokenizer.ggml.add_bos_token=bool:false,tokenizer.ggml.add_eos_token=bool:false
--op-offload, --no-op-offload           whether to offload host tensor operations to device (default: true)
--lora FNAME                            path to LoRA adapter (use comma-separated values to load multiple
                                        adapters)
--lora-scaled FNAME:SCALE,...           path to LoRA adapter with user defined scaling (format:
                                        FNAME:SCALE,...)
                                        note: use comma-separated values
--control-vector FNAME                  add a control vector
                                        note: use comma-separated values to add multiple control vectors
--control-vector-scaled FNAME:SCALE,...
                                        add a control vector with user defined scaling SCALE
                                        note: use comma-separated values (format: FNAME:SCALE,...)
--control-vector-layer-range START END
                                        layer range to apply the control vector(s) to, start and end inclusive
-m,    --model FNAME                    model path to load
                                        (env: LLAMA_ARG_MODEL)
-mu,   --model-url MODEL_URL            model download url (default: unused)
                                        (env: LLAMA_ARG_MODEL_URL)
-dr,   --docker-repo [<repo>/]<model>[:quant]
                                        Docker Hub model repository. repo is optional, default to ai/. quant
                                        is optional, default to :latest.
                                        example: gemma3
                                        (default: unused)
                                        (env: LLAMA_ARG_DOCKER_REPO)
-hf,   -hfr, --hf-repo <user>/<model>[:quant]
                                        Hugging Face model repository; quant is optional, case-insensitive,
                                        default to Q4_K_M, or falls back to the first file in the repo if
                                        Q4_K_M doesn't exist.
                                        mmproj is also downloaded automatically if available. to disable, add
                                        --no-mmproj
                                        example: ggml-org/GLM-4.7-Flash-GGUF:Q4_K_M
                                        (default: unused)
                                        (env: LLAMA_ARG_HF_REPO)
-hfd,  -hfrd, --hf-repo-draft <user>/<model>[:quant]
                                        Same as --hf-repo, but for the draft model (default: unused)
                                        (env: LLAMA_ARG_HFD_REPO)
-hff,  --hf-file FILE                   Hugging Face model file. If specified, it will override the quant in
                                        --hf-repo (default: unused)
                                        (env: LLAMA_ARG_HF_FILE)
-hfv,  -hfrv, --hf-repo-v <user>/<model>[:quant]
                                        Hugging Face model repository for the vocoder model (default: unused)
                                        (env: LLAMA_ARG_HF_REPO_V)
-hffv, --hf-file-v FILE                 Hugging Face model file for the vocoder model (default: unused)
                                        (env: LLAMA_ARG_HF_FILE_V)
-hft,  --hf-token TOKEN                 Hugging Face access token (default: value from HF_TOKEN environment
                                        variable)
                                        (env: HF_TOKEN)
--log-disable                           Log disable
--log-file FNAME                        Log to file
                                        (env: LLAMA_LOG_FILE)
--log-colors [on|off|auto]              Set colored logging ('on', 'off', or 'auto', default: 'auto')
                                        'auto' enables colors when output is to a terminal
                                        (env: LLAMA_LOG_COLORS)
-v,    --verbose, --log-verbose         Set verbosity level to infinity (i.e. log all messages, useful for
                                        debugging)
--offline                               Offline mode: forces use of cache, prevents network access
                                        (env: LLAMA_OFFLINE)
-lv,   --verbosity, --log-verbosity N   Set the verbosity threshold. Messages with a higher verbosity will be
                                        ignored. Values:
                                         - 0: generic output
                                         - 1: error
                                         - 2: warning
                                         - 3: info
                                         - 4: debug
                                        (default: 1)
                                        
                                        (env: LLAMA_LOG_VERBOSITY)
--log-prefix                            Enable prefix in log messages
                                        (env: LLAMA_LOG_PREFIX)
--log-timestamps                        Enable timestamps in log messages
                                        (env: LLAMA_LOG_TIMESTAMPS)
-ctkd, --cache-type-k-draft TYPE        KV cache data type for K for the draft model
                                        allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
                                        (default: f16)
                                        (env: LLAMA_ARG_CACHE_TYPE_K_DRAFT)
-ctvd, --cache-type-v-draft TYPE        KV cache data type for V for the draft model
                                        allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
                                        (default: f16)
                                        (env: LLAMA_ARG_CACHE_TYPE_V_DRAFT)


----- sampling params -----

--samplers SAMPLERS                     samplers that will be used for generation in the order, separated by
                                        ';'
                                        (default:
                                        penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature)
-s,    --seed SEED                      RNG seed (default: -1, use random seed for -1)
--sampler-seq, --sampling-seq SEQUENCE
                                        simplified sequence for samplers that will be used (default:
                                        edskypmxt)
--ignore-eos                            ignore end of stream token and continue generating (implies
                                        --logit-bias EOS-inf)
--temp, --temperature N                 temperature (default: 0.80)
--top-k N                               top-k sampling (default: 40, 0 = disabled)
                                        (env: LLAMA_ARG_TOP_K)
--top-p N                               top-p sampling (default: 0.95, 1.0 = disabled)
--min-p N                               min-p sampling (default: 0.05, 0.0 = disabled)
--top-nsigma, --top-n-sigma N           top-n-sigma sampling (default: -1.00, -1.0 = disabled)
--xtc-probability N                     xtc probability (default: 0.00, 0.0 = disabled)
--xtc-threshold N                       xtc threshold (default: 0.10, 1.0 = disabled)
--typical, --typical-p N                locally typical sampling, parameter p (default: 1.00, 1.0 = disabled)
--repeat-last-n N                       last n tokens to consider for penalize (default: 64, 0 = disabled, -1
                                        = ctx_size)
--repeat-penalty N                      penalize repeat sequence of tokens (default: 1.00, 1.0 = disabled)
--presence-penalty N                    repeat alpha presence penalty (default: 0.00, 0.0 = disabled)
--frequency-penalty N                   repeat alpha frequency penalty (default: 0.00, 0.0 = disabled)
--dry-multiplier N                      set DRY sampling multiplier (default: 0.00, 0.0 = disabled)
--dry-base N                            set DRY sampling base value (default: 1.75)
--dry-allowed-length N                  set allowed length for DRY sampling (default: 2)
--dry-penalty-last-n N                  set DRY penalty for the last n tokens (default: -1, 0 = disable, -1 =
                                        context size)
--dry-sequence-breaker STRING           add sequence breaker for DRY sampling, clearing out default breakers
                                        ('\n', ':', '"', '*') in the process; use "none" to not use any
                                        sequence breakers
--adaptive-target N                     adaptive-p: select tokens near this probability (valid range 0.0 to
                                        1.0; negative = disabled) (default: -1.00)
                                        [(more info)](https://github.com/ggml-org/llama.cpp/pull/17927)
--adaptive-decay N                      adaptive-p: decay rate for target adaptation over time. lower values
                                        are more reactive, higher values are more stable.
                                        (valid range 0.0 to 0.99) (default: 0.90)
--dynatemp-range N                      dynamic temperature range (default: 0.00, 0.0 = disabled)
--dynatemp-exp N                        dynamic temperature exponent (default: 1.00)
--mirostat N                            use Mirostat sampling.
                                        Top K, Nucleus and Locally Typical samplers are ignored if used.
                                        (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
--mirostat-lr N                         Mirostat learning rate, parameter eta (default: 0.10)
--mirostat-ent N                        Mirostat target entropy, parameter tau (default: 5.00)
-l,    --logit-bias TOKEN_ID(+/-)BIAS   modifies the likelihood of token appearing in the completion,
                                        i.e. `--logit-bias 15043+1` to increase likelihood of token ' Hello',
                                        or `--logit-bias 15043-1` to decrease likelihood of token ' Hello'
--grammar GRAMMAR                       BNF-like grammar to constrain generations (see samples in grammars/
                                        dir)
--grammar-file FNAME                    file to read grammar from
-j,    --json-schema SCHEMA             JSON schema to constrain generations (https://json-schema.org/), e.g.
                                        `{}` for any JSON object
                                        For schemas w/ external $refs, use --grammar +
                                        example/json_schema_to_grammar.py instead
-jf,   --json-schema-file FILE          File containing a JSON schema to constrain generations
                                        (https://json-schema.org/), e.g. `{}` for any JSON object
                                        For schemas w/ external $refs, use --grammar +
                                        example/json_schema_to_grammar.py instead
-bs,   --backend-sampling               enable backend sampling (experimental) (default: disabled)
                                        (env: LLAMA_ARG_BACKEND_SAMPLING)


----- example-specific params -----

--verbose-prompt                        print a verbose prompt before generation (default: false)
--display-prompt, --no-display-prompt   whether to print prompt at generation (default: true)
-co,   --color [on|off|auto]            Colorize output to distinguish prompt and user input from generations
                                        ('on', 'off', or 'auto', default: 'auto')
                                        'auto' enables colors when output is to a terminal
-ctxcp, --ctx-checkpoints, --swa-checkpoints N
                                        max number of context checkpoints to create per slot (default:
                                        32)[(more info)](https://github.com/ggml-org/llama.cpp/pull/15293)
                                        (env: LLAMA_ARG_CTX_CHECKPOINTS)
-cpent, --checkpoint-every-n-tokens N   create a checkpoint every n tokens during prefill (processing), -1 to
                                        disable (default: 8192)
                                        (env: LLAMA_ARG_CHECKPOINT_EVERY_NT)
-cram, --cache-ram N                    set the maximum cache size in MiB (default: 8192, -1 - no limit, 0 -
                                        disable)[(more
                                        info)](https://github.com/ggml-org/llama.cpp/pull/16391)
                                        (env: LLAMA_ARG_CACHE_RAM)
--context-shift, --no-context-shift     whether to use context shift on infinite text generation (default:
                                        disabled)
                                        (env: LLAMA_ARG_CONTEXT_SHIFT)
-sys,  --system-prompt PROMPT           system prompt to use with model (if applicable, depending on chat
                                        template)
--show-timings, --no-show-timings       whether to show timing information after each response (default: true)
                                        (env: LLAMA_ARG_SHOW_TIMINGS)
-sysf, --system-prompt-file FNAME       a file containing the system prompt (default: none)
-r,    --reverse-prompt PROMPT          halt generation at PROMPT, return control in interactive mode
-sp,   --special                        special tokens output enabled (default: false)
-cnv,  --conversation, -no-cnv, --no-conversation
                                        whether to run in conversation mode:
                                        - does not print special tokens and suffix/prefix
                                        - interactive mode is also enabled
                                        (default: auto enabled if chat template is available)
-st,   --single-turn                    run conversation for a single turn only, then exit when done
                                        will not be interactive if first turn is predefined with --prompt
                                        (default: false)
-mli,  --multiline-input                allows you to write or paste multiple lines without ending each in '\'
--warmup, --no-warmup                   whether to perform warmup with an empty run (default: enabled)
-mm,   --mmproj FILE                    path to a multimodal projector file. see tools/mtmd/README.md
                                        note: if -hf is used, this argument can be omitted
                                        (env: LLAMA_ARG_MMPROJ)
-mmu,  --mmproj-url URL                 URL to a multimodal projector file. see tools/mtmd/README.md
                                        (env: LLAMA_ARG_MMPROJ_URL)
--mmproj-auto, --no-mmproj, --no-mmproj-auto
                                        whether to use multimodal projector file (if available), useful when
                                        using -hf (default: enabled)
                                        (env: LLAMA_ARG_MMPROJ_AUTO)
--mmproj-offload, --no-mmproj-offload   whether to enable GPU offloading for multimodal projector (default:
                                        enabled)
                                        (env: LLAMA_ARG_MMPROJ_OFFLOAD)
--image, --audio FILE                   path to an image or audio file. use with multimodal models, use
                                        comma-separated values for multiple files
--image-min-tokens N                    minimum number of tokens each image can take, only used by vision
                                        models with dynamic resolution (default: read from model)
                                        (env: LLAMA_ARG_IMAGE_MIN_TOKENS)
--image-max-tokens N                    maximum number of tokens each image can take, only used by vision
                                        models with dynamic resolution (default: read from model)
                                        (env: LLAMA_ARG_IMAGE_MAX_TOKENS)
-otd,  --override-tensor-draft <tensor name pattern>=<buffer type>,...
                                        override tensor buffer type for draft model
-cmoed, --cpu-moe-draft                 keep all Mixture of Experts (MoE) weights in the CPU for the draft
                                        model
                                        (env: LLAMA_ARG_CPU_MOE_DRAFT)
-ncmoed, --n-cpu-moe-draft N            keep the Mixture of Experts (MoE) weights of the first N layers in the
                                        CPU for the draft model
                                        (env: LLAMA_ARG_N_CPU_MOE_DRAFT)
--chat-template-kwargs STRING           sets additional params for the json template parser, must be a valid
                                        json object string, e.g. '{"key1":"value1","key2":"value2"}'
                                        (env: LLAMA_CHAT_TEMPLATE_KWARGS)
--jinja, --no-jinja                     whether to use jinja template engine for chat (default: enabled)
                                        (env: LLAMA_ARG_JINJA)
--reasoning-format FORMAT               controls whether thought tags are allowed and/or extracted from the
                                        response, and in which format they're returned; one of:
                                        - none: leaves thoughts unparsed in `message.content`
                                        - deepseek: puts thoughts in `message.reasoning_content`
                                        - deepseek-legacy: keeps `<think>` tags in `message.content` while
                                        also populating `message.reasoning_content`
                                        (default: auto)
                                        (env: LLAMA_ARG_THINK)
-rea,  --reasoning [on|off|auto]        Use reasoning/thinking in the chat ('on', 'off', or 'auto', default:
                                        'auto' (detect from template))
                                        (env: LLAMA_ARG_REASONING)
--reasoning-budget N                    token budget for thinking: -1 for unrestricted, 0 for immediate end,
                                        N>0 for token budget (default: -1)
                                        (env: LLAMA_ARG_THINK_BUDGET)
--reasoning-budget-message MESSAGE      message injected before the end-of-thinking tag when reasoning budget
                                        is exhausted (default: none)
                                        (env: LLAMA_ARG_THINK_BUDGET_MESSAGE)
--chat-template JINJA_TEMPLATE          set custom jinja chat template (default: template taken from model's
                                        metadata)
                                        if suffix/prefix are specified, template will be disabled
                                        only commonly used templates are accepted (unless --jinja is set
                                        before this flag):
                                        list of built-in templates:
                                        bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml,
                                        command-r, deepseek, deepseek-ocr, deepseek2, deepseek3, exaone-moe,
                                        exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite,
                                        granite-4.0, grok-2, hunyuan-dense, hunyuan-moe, hunyuan-ocr, kimi-k2,
                                        llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4,
                                        megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken,
                                        mistral-v7, mistral-v7-tekken, monarch, openchat, orion,
                                        pangu-embedded, phi3, phi4, rwkv-world, seed_oss, smolvlm, solar-open,
                                        vicuna, vicuna-orca, yandex, zephyr
                                        (env: LLAMA_ARG_CHAT_TEMPLATE)
--chat-template-file JINJA_TEMPLATE_FILE
                                        set custom jinja chat template file (default: template taken from
                                        model's metadata)
                                        if suffix/prefix are specified, template will be disabled
                                        only commonly used templates are accepted (unless --jinja is set
                                        before this flag):
                                        list of built-in templates:
                                        bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml,
                                        command-r, deepseek, deepseek-ocr, deepseek2, deepseek3, exaone-moe,
                                        exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite,
                                        granite-4.0, grok-2, hunyuan-dense, hunyuan-moe, hunyuan-ocr, kimi-k2,
                                        llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4,
                                        megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken,
                                        mistral-v7, mistral-v7-tekken, monarch, openchat, orion,
                                        pangu-embedded, phi3, phi4, rwkv-world, seed_oss, smolvlm, solar-open,
                                        vicuna, vicuna-orca, yandex, zephyr
                                        (env: LLAMA_ARG_CHAT_TEMPLATE_FILE)
--skip-chat-parsing, --no-skip-chat-parsing
                                        force a pure content parser, even if a Jinja template is specified;
                                        model will output everything in the content section, including any
                                        reasoning and/or tool calls (default: disabled)
                                        (env: LLAMA_ARG_SKIP_CHAT_PARSING)
--simple-io                             use basic IO for better compatibility in subprocesses and limited
                                        consoles
--draft, --draft-n, --draft-max N       number of tokens to draft for speculative decoding (default: 16)
                                        (env: LLAMA_ARG_DRAFT_MAX)
--draft-min, --draft-n-min N            minimum number of draft tokens to use for speculative decoding
                                        (default: 0)
                                        (env: LLAMA_ARG_DRAFT_MIN)
--draft-p-min P                         minimum speculative decoding probability (greedy) (default: 0.75)
                                        (env: LLAMA_ARG_DRAFT_P_MIN)
-cd,   --ctx-size-draft N               size of the prompt context for the draft model (default: 0, 0 = loaded
                                        from model)
                                        (env: LLAMA_ARG_CTX_SIZE_DRAFT)
-devd, --device-draft <dev1,dev2,..>    comma-separated list of devices to use for offloading the draft model
                                        (none = don't offload)
                                        use --list-devices to see a list of available devices
-ngld, --gpu-layers-draft, --n-gpu-layers-draft N
                                        max. number of draft model layers to store in VRAM, either an exact
                                        number, 'auto', or 'all' (default: auto)
                                        (env: LLAMA_ARG_N_GPU_LAYERS_DRAFT)
-md,   --model-draft FNAME              draft model for speculative decoding (default: unused)
                                        (env: LLAMA_ARG_MODEL_DRAFT)
--spec-replace TARGET DRAFT             translate the string in TARGET into DRAFT if the draft model and main
                                        model are not compatible
--gpt-oss-20b-default                   use gpt-oss-20b (note: can download weights from the internet)
--gpt-oss-120b-default                  use gpt-oss-120b (note: can download weights from the internet)
--vision-gemma-4b-default               use Gemma 3 4B QAT (note: can download weights from the internet)
--vision-gemma-12b-default              use Gemma 3 12B QAT (note: can download weights from the internet)
--spec-default                          enable default speculative decoding config

 

 

--------------------- 벤치마크?

Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz / 256GB DDR4

$ ../llama-cli -m Qwen3.6-35B-A3B-UD-Q2_K_XL.gguf 
load_backend: loaded RPC backend from /home/minimonk/src/llama-b8876/libggml-rpc.so
load_backend: loaded CPU backend from /home/minimonk/src/llama-b8876/libggml-cpu-haswell.so

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8876-72d693e4f
model      : Qwen3.6-35B-A3B-UD-Q2_K_XL.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern

> hello~ introduce about you         

[Start thinking]
Here's a thinking process:

1.  **Analyze User Input:**
   - User says: "hello~ introduce about you"
   - This is a casual greeting followed by a request for a self-introduction.
   - The tone is friendly and informal.

2.  **Identify Key Requirements:**
   - Acknowledge the greeting
   - Provide a clear, concise self-introduction
   - State my identity (Qwen / Tongyi Qianwen)
   - Mention my developer (Alibaba Group's Tongyi Lab)
   - Highlight my core capabilities (language understanding, reasoning, coding, multimodal processing, etc.)
   - Keep it friendly, accurate, and aligned with the provided guidelines
   - Avoid mentioning version numbers or specific details not in the reference

3.  **Draft Construction (Mental Refinement):**
   Hello! I'm Qwen (also known as Tongyi Qianwen), a large language model developed by Alibaba Group's Tongyi Lab. I'm designed to be helpful, thoughtful, and reliable across a wide range of tasks—from answering questions and writing content to solving complex reasoning problems, coding, and working with images and documents. I support fluent communication in over 100 languages and can adapt to different contexts while maintaining accuracy and safety. I'm here to help you think through ideas, tackle challenges, or just chat, so feel free to ask me anything! How can I assist you today?

4.  **Check Against Guidelines:**
   - Identity: Qwen / Tongyi Qianwen ✓
   - Developer: Alibaba Group's Tongyi Lab ✓
   - Capabilities: Language, reasoning, coding, multimodal, 100+ languages, helpful/safe ✓
   - Tone: Friendly, professional, matches user's casual tone ✓
   - No version numbers, no fabricated stats, no overclaiming ✓

5.  **Final Polish:**
   The draft looks solid. It's concise, accurate, and aligned with all constraints. I'll output it.✅

[End thinking]

Hello! I'm Qwen (also known as Tongyi Qianwen), a large language model developed by Alibaba Group's Tongyi Lab. I'm designed to be helpful, thoughtful, and reliable across a wide range of tasks—whether it's answering questions, writing or editing content, solving complex reasoning problems, coding, analyzing documents, or working with images. I support fluent communication in over 100 languages and can adapt to different tones, formats, and contexts while prioritizing accuracy and safety.

I'm here to help you think through ideas, tackle challenges, or just chat, so feel free to ask me anything. How can I assist you today? 😊

Prompt: 29.6 t/s | Generation: 11.0 t/s ]



Exiting...
common_memory_breakdown_print: | memory breakdown [MiB] | total   free     self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - Host               |                 17428 = 11437 +    5182 +     808                |
common_memory_breakdown_print: |   - CPU_REPACK         |                   272 =   272 +       0 +       0                |

 

$ ../llama-cli -m Llama-3.2-3B-Instruct-Q4_K_M.gguf
load_backend: loaded RPC backend from /home/minimonk/src/llama-b8876/libggml-rpc.so
load_backend: loaded CPU backend from /home/minimonk/src/llama-b8876/libggml-cpu-haswell.so

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8876-72d693e4f
model      : Llama-3.2-3B-Instruct-Q4_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern


> 안녕? 너에 대해서 설명해줘

안녕하세요! 저는 지능형 чattybot 인你(You)와 대화할 수 있는 AI입니다.

저는 지능학, 자연어 처리,_machine learning 및 다른 기술을 바탕으로 만든 AI입니다. 이 기술은 인간의 언어를 이해하고,natural language processing (NLP)로 human과 대화할 수 있는能力을 허용합니다.

저는 다양한 topic과 subject에 대해 지식을 제공할 수 있으며, human과 대화할 수 있습니다. 저는 human의 질문을 읽고, 적절한 대답을 제공할 수 있습니다.

저는 다음 capability을 제공합니다:

* natural language understanding: human의 언어를 이해할 수 있습니다.
* text generation: human의 질문에 대해 적절한 대답을 제공할 수 있습니다.
* conversation: human과 대화할 수 있습니다.
* question answering: human의 질문에 대해 적절한 대답을 제공할 수 있습니다.

저는 human과 대화할 수 있는 다양한 application과 technology를 제공합니다. 예를 들어, human의 질문을 읽어 대답할 수 있는 chatbot, human의 언어를 이해하고, natural language processing를 사용하여 human과 대화할 수 있는 application, human의 질문에 대해 적절한 대답을 제공할 수 있는 Q&A 시스템, etc.

저는 human의 언어를 이해하고, natural language processing를 사용하여 human과 대화할 수 있는 ability을 갖습니다. human의 질문을 읽어 대답할 수 있으며, human의 언어를 이해하고, natural language processing를 사용하여 human과 대화할 수 있습니다.

안녕하세요! 저는 지능형 चattybot 인 you(You)와 대화할 수 있는 AI입니다. 저는 human의 질문을 읽고, 적절한 대답을 제공할 수 있습니다. human의 언어를 이해하고, natural language processing를 사용하여 human과 대화할 수 있습니다.

Prompt: 89.7 t/s | Generation: 20.6 t/s ]



Exiting...
common_memory_breakdown_print: | memory breakdown [MiB] | total   free     self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - Host               |                 16654 =  1904 +   14336 +     414                |
common_memory_breakdown_print: |   - CPU_REPACK         |                  1299 =  1299 +       0 +       0                |

 

$ ../llama-cli -m gemma-4-E2B-it-Q4_K_M.gguf
load_backend: loaded RPC backend from /home/minimonk/src/llama-b8876/libggml-rpc.so
load_backend: loaded CPU backend from /home/minimonk/src/llama-b8876/libggml-cpu-haswell.so

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8876-72d693e4f
model      : gemma-4-E2B-it-Q4_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern


> 안녕? 너에 대해서 설명해줘

[Start thinking]
Thinking Process:

1.  **Analyze the Request:** The user said "안녕? 너에 대해서 설명해줘" (Hello? Please explain about yourself). This is a request for identity and capabilities.
2.  **Determine Core Identity (Self-Correction/Verification):**
    *   Name: Gemma 4.
    *   Developer: Google DeepMind.
    *   Nature: Large Language Model (LLM).
    *   Type: Open weights model.
    *   Capabilities: Understands text/images, generates text. Knowledge cutoff (Jan 2025).
3.  **Determine Response Language:** The user asked in Korean, so the response must be in fluent, natural Korean.
4.  **Structure the Explanation (Standard LLM Introduction):**
    *   Greeting/Identity (Who am I?).
    *   Developer (Who made me?).
    *   Nature/Function (What am I?).
    *   Capabilities (What can I do?).
    *   Limitations (What can't I do/knowledge scope?).
5.  **Draft the Content (in Korean):**

    *   *Identity:* 저는 Gemma 4입니다. (I am Gemma 4.)
    *   *Developer/Nature:* 저는 Google DeepMind에서 개발한 대규모 언어 모델(LLM)입니다. 오픈 웨이트 모델입니다. (I am a Large Language Model developed by Google DeepMind. I am an open weights model.)
    *   *Function:* 저는 텍스트와 이미지를 이해하고, 사용자의 질문에 답하거나 글을 작성하는 등 다양한 언어 작업을 수행할 수 있습니다. (I can perform various language tasks such as understanding text and images, answering user questions, or writing text.)
    *   *Input/Output:* 저는 텍스트를 입력받아 텍스트로 출력합니다. (I receive text input and output text.)
    *   *Knowledge/Limitation:* 제 지식은 학습된 데이터를 기반으로 하며, 특정 시점(2025년 1월)까지의 정보를 알고 있습니다. (My knowledge is based on trained data, and I know information up to a specific point (January 2025).)

6.  **Review and Refine (Ensure natural flow and adherence to instructions):** The drafted points cover all necessary identity requirements and maintain the required persona (Gemma 4, open weights, LLM). (The Korean draft looks good.)

7.  **Final Output Generation.** (Proceed to generate the Korean response.)
[End thinking]

안녕하세요! 저는 **Gemma 4**입니다. 저에 대해 몇 가지 핵심 정보를 설명해 드릴게요.

**1. 정체성 및 개발 배경**

*   **이름:** Gemma 4
*   **개발사:** Google DeepMind에서 개발되었습니다.
*   **성격:** 저는 대규모 언어 모델(LLM, Large Language Model)이며, 오픈 웨이트(open weights) 모델입니다. 이는 저의 모델 구조와 가중치가 공개되어 누구나 접근하고 활용할 수 있음을 의미합니다.

**2. 주요 기능 및 능력**

저는 방대한 양의 데이터를 학습하여 다음과 같은 다양한 언어 관련 작업을 수행할 수 있습니다.

*   **텍스트 이해 및 생성:** 사용자가 입력하는 텍스트를 이해하고, 질문에 답하거나, 요약하거나, 번역하거나, 창의적인 글(시, 이야기, 코드 등)을 작성할 수 있습니다.
*   **다중 모달리티 이해:** 저는 텍스트뿐만 아니라 이미지를 이해하고 처리할 수 있습니다. (다만, 출력은 텍스트로만 가능합니다.)
*   **추론 및 논리:** 복잡한 질문에 대해 논리적으로 추론하고 일관성 있는 답변을 제공할 수 있습니다.

**3. 지식 범위 및 한계**

*   **지식 기반:** 저는 학습된 데이터를 기반으로 답변하며, 저의 지식은 특정 시점(2025년 1월)까지의 정보를 포함하고 있습니다. 따라서 그 이후의 최신 정보에 대해서는 알지 못할 수 있습니다.
*   **출력 형식:** 저는 오직 **텍스트** 형태로만 정보를 생성할 수 있습니다. 이미지를 직접 생성할 수는 없습니다.

궁금한 점이 있거나 도움이 필요하시면 언제든지 말씀해 주세요! 최선을 다해 답변해 드리겠습니다.

Prompt: 107.4 t/s | Generation: 19.1 t/s ]



Exiting...
common_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - Host               |                 4517 =  2947 +     780 +     790                |
common_memory_breakdown_print: |   - CPU_REPACK         |                 1069 =  1069 +       0 +       0                |
Posted by 구차니

qwen이 핫해서 어쩌다 얻은 링크.. 이걸로 실행이 더 쉬우려나?

 

[링크 : https://unsloth.ai/docs/models/qwen3.6]

Posted by 구차니

리눅스에서 그냥 설치만 하고 딱히 설정한건 없는데, 기본이 모든 ip 접속 허용인 것 같고

$ netstat -tnlp
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 127.0.0.1:11434         0.0.0.0:*               LISTEN      -  

 

윈도우의 경우 gui 클라이언트에서 설정하는게 보이던데.. 막상 포트는 확인을 안해본듯..

[링크 : http://practical.kr/?p=809]

'프로그램 사용 > ai 프로그램' 카테고리의 다른 글

llama.cpp 도전!  (0) 2026.04.22
unsloth ai  (0) 2026.04.21
llm tokenizer - llama 3.2, exaone  (0) 2026.04.20
ollama 모델 저장소 뜯어보기  (0) 2026.04.19
llm tokenizer - phi3  (0) 2026.04.19
Posted by 구차니
프로그램 사용/vi2026. 4. 20. 10:08

와우 이런 개꾸르

[링크 : https://wikidocs.net/302522#_15]

 

어쩌면 이것도 검색해놓고 까먹었을지도...

'프로그램 사용 > vi' 카테고리의 다른 글

vi 이전 위치 다음 위치로 이동하기  (0) 2022.08.04
vi가 늦게 켜지는 이유  (0) 2022.07.28
vim 색상 바꾸기(colorscheme)  (0) 2021.01.20
vi 에서 매칭되는 갯수 확인하기  (0) 2019.12.18
vi gg=G와 set ts  (0) 2019.07.04
Posted by 구차니

음.. tokenizer.json에 대한 접근이 신청 20분 만에 떨어졌었나 보다. (아몰라 잘래 하고 가버렸...)

llama가 한글에 대한 토큰이 하나도 없는데 어떻게 인식을 하지 신기하네..?

정규표현식을 보면 ? 가 있는데 매칭안되면 그냥 한글자씩 뽑아 버리는듯.. ㅎㄷㄷ

{
  "version": "1.0",
  "truncation": null,
  "padding": null,
  "added_tokens": [
    {
      "id": 128000,
      "content": "<|begin_of_text|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 128001,
      "content": "<|end_of_text|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
  ],
  "normalizer": null,
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {
        "type": "Split",
        "pattern": {
          "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "ByteLevel",
        "add_prefix_space": false,
        "trim_offsets": true,
        "use_regex": false
      }
    ]
  },


      // ...

  "model": {
    "type": "BPE",
    "dropout": null,
    "unk_token": null,
    "continuing_subword_prefix": null,
    "end_of_word_suffix": null,
    "fuse_unk": false,
    "byte_fallback": false,
    "ignore_merges": true,
    "vocab": {
      "!": 0,
      "\"": 1,
      "#": 2,
      "$": 3,
      "%": 4,

      // ...

      "ÙĨب": 127996,
      "ĠвÑĭÑģокой": 127997,
      "ãĥ¼ãĥ¼": 127998,
      "éͦ": 127999
    },
    "merges": [
      "Ġ Ġ",
      "Ġ ĠĠĠ",

      // ...

      "ãĥ¼ ãĥ¼",
      "ãĥ¼ãĥ ¼",
      "éĶ ¦" 
    ]
  }
}

[링크 : https://huggingface.co/meta-llama/Llama-3.2-1B/tree/main]

 

엥.. exaone은 그래도 LG에서 만들어서 한글 토큰들이 있을 줄 알았는데 없네?

그럼.. 한글은 '자동차' 면은 자/동/차 총 3개 토큰을 먹는건가?

$ grep -P '\p{Hangul}' exa_tokenizer.json 
      "content": "리앙쿠르",
      "content": "훈민정음",
      "content": "애국가",
      "리앙쿠르": 94,
      "훈민정음": 99,
      "애국가": 100,

[링크 : https://huggingface.co/LGAI-EXAONE/EXAONE-4.5-33B/tree/main]

'프로그램 사용 > ai 프로그램' 카테고리의 다른 글

unsloth ai  (0) 2026.04.21
ollama 외부접속 관련  (0) 2026.04.21
ollama 모델 저장소 뜯어보기  (0) 2026.04.19
llm tokenizer - phi3  (0) 2026.04.19
llm tokenizer  (0) 2026.04.17
Posted by 구차니

blob 으로 해시가 파일 명으로 저장되는데 이래저래 궁금해서 분석

gemma4 e2b

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
  "config": {
    "mediaType": "application/vnd.docker.container.image.v1+json",
    "digest": "sha256:c6bc3775a3fa9935ce4a3ccd7abc59e936c3de9308d2cc090516012f43ed9c07",
    "size": 473
  },
  "layers": [
    {
      "mediaType": "application/vnd.ollama.image.model",
      "digest": "sha256:4e30e2665218745ef463f722c0bf86be0cab6ee676320f1cfadf91e989107448",
      "size": 7162394016
    },
    {
      "mediaType": "application/vnd.ollama.image.license",
      "digest": "sha256:7339fa418c9ad3e8e12e74ad0fd26a9cc4be8703f9c110728a992b193be85cb2",
      "size": 11355
    },
    {
      "mediaType": "application/vnd.ollama.image.params",
      "digest": "sha256:56380ca2ab89f1f68c283f4d50863c0bcab52ae3f1b9a88e4ab5617b176f71a3",
      "size": 42
    }
  ]
}

"sha256:c6bc3775a3fa9935ce4a3ccd7abc59e936c3de9308d2cc090516012f43ed9c07",
{
  "model_format": "gguf",
  "model_family": "gemma4",
  "model_families": [
    "gemma4"
  ],
  "model_type": "5.1B",
  "file_type": "Q4_K_M",
  "renderer": "gemma4",
  "parser": "gemma4",
  "requires": "0.20.0",
  "architecture": "amd64",
  "os": "linux",
  "rootfs": {
    "type": "layers",
    "diff_ids": [
      "sha256:4e30e2665218745ef463f722c0bf86be0cab6ee676320f1cfadf91e989107448",
      "sha256:7339fa418c9ad3e8e12e74ad0fd26a9cc4be8703f9c110728a992b193be85cb2",
      "sha256:56380ca2ab89f1f68c283f4d50863c0bcab52ae3f1b9a88e4ab5617b176f71a3"
    ]
  }
}

"sha256:4e30e2665218745ef463f722c0bf86be0cab6ee676320f1cfadf91e989107448"
GGUF


?





7













gemma4.attention.head_count

"sha256:7339fa418c9ad3e8e12e74ad0fd26a9cc4be8703f9c110728a992b193be85cb2",
                                Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/



"sha256:56380ca2ab89f1f68c283f4d50863c0bcab52ae3f1b9a88e4ab5617b176f71a3",
{
  "temperature": 1,
  "top_k": 64,
  "top_p": 0.95
}

 

출력단 손보는 건 temperature, top_k, top_p 군

[링크 : https://wikidocs.net/333750]

 

 

허깅페이스에서는 tokenzier.json이 존재했는데 그건 구버전(?) 인것 같고

신버전 GGUF 에서는 토크나이저를 다 포함하고 있나 본데.. 어떻게 추출하지?

[링크 : https://www.minzkn.com/vibecoding/pages/gguf-format.html]

[링크 : https://huggingface.co/docs/transformers/ko/gguf]

[링크 : https://bitwise-life.tistory.com/5] << 토큰 목록 나옴

'프로그램 사용 > ai 프로그램' 카테고리의 다른 글

ollama 외부접속 관련  (0) 2026.04.21
llm tokenizer - llama 3.2, exaone  (0) 2026.04.20
llm tokenizer - phi3  (0) 2026.04.19
llm tokenizer  (0) 2026.04.17
llama.cpp  (0) 2026.04.17
Posted by 구차니

llama나 gemma 받으려니 먼가 모르겠어서 만만한(?) ms의 phi3를 받아서 분석!

(gemma나 llama 는 저장소 접근권한 요청.. gate model 이라고 뜨는데 언제 승인되려나)

[링크 : https://huggingface.co/docs/transformers/model_doc/phi3]

 

~/.cache/huggingface/hub/models--microsoft--Phi-3-mini-4k-instruct/snapshots/f39ac1d28e925b323eae81227eaba4464caced4e$ ls -al
합계 12
drwxrwxr-x 2 minimonk minimonk 4096  4월 19 21:58 .
drwxrwxr-x 3 minimonk minimonk 4096  4월 19 21:58 ..
lrwxrwxrwx 1 minimonk minimonk   52  4월 19 21:58 added_tokens.json -> ../../blobs/178968dec606c790aa335e9142f6afec37288470
lrwxrwxrwx 1 minimonk minimonk   52  4월 19 21:58 config.json -> ../../blobs/b9b031fadda61a035b2e8ceb4362cbf604002b21
lrwxrwxrwx 1 minimonk minimonk   52  4월 19 21:58 special_tokens_map.json -> ../../blobs/c6a944b4d49ce5d79030250ed6bdcbb1a65dfda1
lrwxrwxrwx 1 minimonk minimonk   52  4월 19 21:58 tokenizer.json -> ../../blobs/88ec145f4e7684c009bc6d55df24bb82c7d3c379
lrwxrwxrwx 1 minimonk minimonk   76  4월 19 21:58 tokenizer.model -> ../../blobs/9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347
lrwxrwxrwx 1 minimonk minimonk   52  4월 19 21:58 tokenizer_config.json -> ../../blobs/67aa82cddb4d66391ddf31ff99f059239bd2d1e7

 

tokenizer.json 열어보니 아래처럼 토큰이 나오는데..

어우.. 이런 추세(?) 라면 한글은 한글짜 단위로 토큰이라 난리가 나겠는데?

gpt 도움으로 저런 희한한 문자열 코드 기반으로 검색이 되는걸 알았네 ㄷㄷ

$ grep -P '\p{Hangul}' tokenizer.json 
      "이": 30393,
      "의": 30708,
      "다": 30709,
      "스": 30784,
      "사": 30791,
      "지": 30811,
      "리": 30826,
      "기": 30827,
      "정": 30852,
      "아": 30860,
      "한": 30877,
      "시": 30889,
      "대": 30890,
      "가": 30903,
      "로": 30906,
      "인": 30918,
      "하": 30944,
      "수": 30970,
      "주": 30981,
      "동": 31000,
      "자": 31013,
      "에": 31054,
      "니": 31063,
      "는": 31081,
      "서": 31093,
      "김": 31102,
      "성": 31126,
      "어": 31129,
      "도": 31136,
      "고": 31137,
      "일": 31153,
      "상": 31158,
      "전": 31170,
      "트": 31177,
      "소": 31189,
      "라": 31197,
      "원": 31198,
      "보": 31199,
      "나": 31207,
      "화": 31225,
      "구": 31231,
      "신": 31262,
      "부": 31279,
      "연": 31285,
      "을": 31286,
      "영": 31288,
      "국": 31293,
      "장": 31299,
      "제": 31306,
      "우": 31327,
      "공": 31334,
      "선": 31345,
      "오": 31346,
      "은": 31354,
      "미": 31362,
      "경": 31378,
      "문": 31406,
      "조": 31408,
      "마": 31417,
      "해": 31435,
      "여": 31457,
      "산": 31458,
      "비": 31487,
      "드": 31493,
      "를": 31517,
      "요": 31527,
      "유": 31533,
      "진": 31536,
      "천": 31563,
      "년": 31571,
      "세": 31578,
      "민": 31582,
      "호": 31603,
      "그": 31607,
      "현": 31680,
      "군": 31699,
      "무": 31716,
      "위": 31724,
      "안": 31734,
      "박": 31736,
      "용": 31737,
      "단": 31746,
      "면": 31747,
      "남": 31754,
      "강": 31774,
      "씨": 31781,
      "개": 31789,
      "들": 31804,
      "차": 31817,
      "학": 31822,
      "만": 31826,
      "터": 31856,
      "식": 31895,
      "과": 31906,
      "타": 31925,
      "종": 31930,
      "내": 31940,
      "중": 31941,
      "방": 31945,
      "월": 31950,
      "회": 31953,
      "모": 31962,
      "바": 31963,
      "음": 31966,
      "교": 31972,
      "재": 31973,
      "명": 31976,
      "합": 31980,
      "역": 31987,
      "백": 31989,
      "왕": 31996,

 

'프로그램 사용 > ai 프로그램' 카테고리의 다른 글

llm tokenizer - llama 3.2, exaone  (0) 2026.04.20
ollama 모델 저장소 뜯어보기  (0) 2026.04.19
llm tokenizer  (0) 2026.04.17
llama.cpp  (0) 2026.04.17
lm studio  (0) 2026.04.17
Posted by 구차니
개소리 왈왈/컴퓨터2026. 4. 18. 00:37

회사에서 PCB 쓰레기통에 B360에 i5-8500이 있어서 낼름

주워오고 다음날 히트싱크랑 백패널이 있어서 가져왔는데 얘 짝이 맞는듯

그 와중에 먼가 히트싱크가 높이가 낮아서 좀 더 좋은걸로 갈아줘야 하나 싶고..

 

솔찍히 USB-C 달린 건 처음보고. 회사 노트북 10세대 제외하면, 데탑으로 8세대는 첨 보는 듯 

메뉴얼 찾아봐도 usb type-c 에 DP ALT mode 지원되는지 말이 없다. 고속 충전이라던가?

 

m.2 두개에 PCIe 3.0 x16 / x1 * 2

TPM 커넥터도 있긴한데 TPM을 사긴 좀 돈이 아쉽고..

램은 최대 64GB 4DIMM 이니까 16GB * 4 까지 구성가능.

 

묘하게 레거시가 많은 보드다(PS/2 라던가 COM / LPT..)

후면 입/출력 포트
2 x USB 3.1 Gen 2 (teal blue)Type-A, up to 10Gbps,1 x USB 3.1 Gen 1 USB Type-CTM up to 5Gbps,1 x PS/2 keyboard (purple),1 x PS/2 mouse (green),1 x DVI-D,1 x D-Sub,1 x 8-channel Audio I/O,1 x HDMI,1 x LAN (RJ45) port(s),2 x USB 2.0

내부 입출력 포트
2 x M.2 Socket 3 for M Key, type 2242/2260/2280 devices,1 x LPT header,1 x Chassis intrusion header,1 x Aura RGB Strip Header,1 x USB 3.1 Gen 1(up to 5Gbps) connector(s) support(s) additional 2 USB 3.1 Gen 1 port(s) (19-pin),1 x TPM header,1 x 8-pin EATX 12V Power connectors,1 x Speaker connector,2 x USB 2.0 connector(s) support(s) additional 4 USB 2.0 port(s),1 x COM port(s) connector(s),6 x SATA 6Gb/s connector(s),1 x CPU Fan connector(s),2 x Chassis Fan connector(s),1 x S/PDIF out header(s),1 x 24-pin EATX Power connector(s),1 x Front panel audio connector(s) (AAFP),1 x Clear CMOS jumper(s),1 x System panel connector

[링크 : https://www.asus.com/kr/motherboards-components/motherboards/prime/prime-b360m-a/]

 

bios는 현재 0803 이고 2018년 6월 버전인데

 

2021년까진 지원되고 있으니 업데이트는 해봐야 할 듯. 베타는 RTX3000 시리즈용 지원이니 나랑은 상관없네.

 

 

 

 

 

아무튼 지금쓰는게 i5-7500 인데, 4 core -> 6 core 되고

50% 상승이면.. 그냥 cpu 갯수 빨아닌가?

 

[링크 : https://technical.city/ko/cpu/Core-i5-7500-vs-Core-i5-8500]

'개소리 왈왈 > 컴퓨터' 카테고리의 다른 글

그래픽 카드 교체!  (0) 2026.04.16
지를까 말까  (0) 2026.04.13
iptime 오랫만에 업데이트  (0) 2026.02.24
amd radeon software(아드레날린) 설치  (0) 2026.02.08
dp to usb-c... ?  (0) 2026.02.08
Posted by 구차니

심심해서 단어별? 토큰 인덱스를 알고 싶은데 이건 볼 방법 없나? 싶어서 조사

 

[링크 : https://makenow90.tistory.com/59]

[링크 : https://medium.com/the-research-nest/explained-tokens-and-embeddings-in-llms-69a16ba5db33]

 

AutoTokenizer.from_pretrained() 를 실행하니 먼가 즉석에서 받을 줄이야 -ㅁ-

$ python3
Python 3.10.12 (main, Mar  3 2026, 11:56:32) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers import AutoTokenizer

Disabling PyTorch because PyTorch >= 2.4 is required but found 2.1.2
PyTorch was not found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
>>> 
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
config.json: 100%|██████████████████████████████| 665/665 [00:00<00:00, 819kB/s]
tokenizer_config.json: 100%|█████████████████| 26.0/26.0 [00:00<00:00, 37.5kB/s]
vocab.json: 1.04MB [00:00, 2.70MB/s]
merges.txt: 456kB [00:00, 1.15MB/s]
tokenizer.json: 1.36MB [00:00, 3.23MB/s]
>>> 
>>> tokenizer
GPT2Tokenizer(name_or_path='gpt2', vocab_size=50257, model_max_length=1024, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}, added_tokens_decoder={
50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
})
>>> text="안녕? hello?"
>>> tokens=tokenizer(text)
>>> tokens
{'input_ids': [168, 243, 230, 167, 227, 243, 30, 23748, 30], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

[링크 : https://data-scient2st.tistory.com/224]

[링크 : https://huggingface.co/docs/transformers/model_doc/gpt2]

 

$ find ./ -name tokenizer.json
./.cache/huggingface/hub/models--gpt2/snapshots/607a30d783dfa663caf39e06633721c8d4cfcd7e/tokenizer.json

 

$ tree ~/.cache/huggingface/hub/models--gpt2/snapshots/607a30d783dfa663caf39e06633721c8d4cfcd7e/
/home/minimonk/.cache/huggingface/hub/models--gpt2/snapshots/607a30d783dfa663caf39e06633721c8d4cfcd7e/
├── config.json -> ../../blobs/10c66461e4c109db5a2196bff4bb59be30396ed8
├── merges.txt -> ../../blobs/226b0752cac7789c48f0cb3ec53eda48b7be36cc
├── tokenizer.json -> ../../blobs/4b988bccc9dc5adacd403c00b4704976196548f8
├── tokenizer_config.json -> ../../blobs/be4d21d94f3b4687e5a54d84bf6ab46ed0f8defd
└── vocab.json -> ../../blobs/1f1d9aaca301414e7f6c9396df506798ff4eb9a6

 

+

2026.04.18

windows

 

beautifier + 일부 발췌

라틴어군에 대해서 있는 것 같고 토큰으로 잘리는걸 보면 단어 단위가 아닌

단어도 막 토막난 수준으로 잘릴 느낌?

tokenizer.json vocab.json
{
  "version": "1.0",
  "truncation": null,
  "padding": null,
  "added_tokens": [
    {
      "id": 50256,
      "special": true,
      "content": "<|endoftext|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true
    }
  ],
  "normalizer": null,
  "pre_tokenizer": {
    "type": "ByteLevel",
    "add_prefix_space": false,
    "trim_offsets": true
  },
  "post_processor": {
    "type": "ByteLevel",
    "add_prefix_space": true,
    "trim_offsets": false
  },
  "decoder": {
    "type": "ByteLevel",
    "add_prefix_space": true,
    "trim_offsets": true
  },
  "model": {
    "dropout": null,
    "unk_token": null,
    "continuing_subword_prefix": "",
    "end_of_word_suffix": "",
    "fuse_unk": false,
    "vocab": {
      "0": 15,
      "1": 16,
      "2": 17,
      "3": 18,
      "4": 19,
      "5": 20,
      "6": 21,
      "7": 22,
      "8": 23,
      "9": 24,
      "!": 0,
      "\"": 1,
      "#": 2,
      "$": 3,
      "%": 4,
      "&": 5,
      "'": 6,
      "(": 7,
      ")": 8,
      "*": 9,
      "+": 10,
      ",": 11,
      "-": 12,
      ".": 13,
      "/": 14,
      ":": 25,
      ";": 26,
      "<": 27,
      "=": 28,
      ">": 29,
      "?": 30,
      "@": 31,
      "A": 32,
      "B": 33,
      "C": 34,
      "D": 35,
      "E": 36,
      "F": 37,
      "G": 38,
      "H": 39,
      "I": 40,
      "J": 41,
      "K": 42,
      "L": 43,
      "M": 44,
      "N": 45,
      "O": 46,
      "P": 47,
      "Q": 48,
      "R": 49,
      "S": 50,
      "T": 51,
      "U": 52,
      "V": 53,
      "W": 54,
      "X": 55,
      "Y": 56,
      "Z": 57,
      "[": 58,
      "\\": 59,
      "]": 60,
      "^": 61,
      "_": 62,
      "`": 63,
      "a": 64,
      "b": 65,
      "c": 66,
      "d": 67,
      "e": 68,
      "f": 69,
      "g": 70,
      "h": 71,
      "i": 72,
      "j": 73,
      "k": 74,
      "l": 75,
      "m": 76,
      "n": 77,
      "o": 78,
      "p": 79,
      "q": 80,
      "r": 81,
      "s": 82,
      "t": 83,
      "u": 84,
      "v": 85,
      "w": 86,
      "x": 87,
      "y": 88,
      "z": 89,
      "{": 90,
      "|": 91,
      "}": 92,
      "~": 93,
      "¡": 94,
      "¢": 95,
      "£": 96,
      "¤": 97,
      "¥": 98,
      "¦": 99,
      "he": 258,
      "in": 259,
      "re": 260,
      "on": 261,
      "Ġthe": 262,
      "er": 263,
      "Ġs": 264,
      "at": 265,
      "Ġw": 266,
      "Ġo": 267,
      "en": 268,
      "Ġc": 269,
      "it": 270,
      "is": 271,
      "an": 272,
      "or": 273,
      "es": 274,
      "Ġb": 275,
      "ed": 276,
      "Ġf": 277,
      "ing": 278,
      "Ġp": 279,
      "ou": 280,
      "Ġan": 281,
      "al": 282,
      "ar": 283,
      "Ġto": 284,
      "Ġm": 285,
      "Ġof": 286,
      "Ġin": 287,
      "Ġd": 288,
      "Ġh": 289,
      "Ġand": 290,
      "ic": 291,
      "as": 292,
      "le": 293,
      "Ġth": 294,
      "ion": 295,
      "om": 296,
      "ll": 297,
      "ent": 298,
      "Ġn": 299,
      "Ġl": 300,
      "st": 301,
      "Ġre": 302,
      "ve": 303,
      "Ġe": 304,
      "ro": 305,
      "ly": 306,
      "Ġbe": 307,
      "Ġg": 308,
      "ĠT": 309,
      "ct": 310,
      "ĠS": 311,
      "id": 312,
      "ot": 313,
      "ĠI": 314,
      "ut": 315,
      "et": 316,
      "ĠA": 317,
      "Ġis": 318,
      "Ġon": 319,
      "im": 320,
      "am": 321,
      "ow": 322,
      "ay": 323,
      "ad": 324,
      "se": 325,
      "Ġthat": 326,
      "ĠC": 327,
      "ig": 328,
      "Ġfor": 329,
      "ac": 330,
      "Ġy": 331,
      "ver": 332,
      "ur": 333,
      "Ġu": 334,
      "ld": 335,
      "Ġst": 336,
      "ĠM": 337,
      "'s": 338,
      "Ġhe": 339,
      "Ġit": 340,
      "ation": 341,
      "ith": 342,
      "ir": 343,
      "ce": 344,
      "Ġyou": 345,
      "il": 346,
      "ĠB": 347,
      "Ġwh": 348,
      "ol": 349,
      "ĠP": 350,
      "Ġwith": 351,
      "Ġ1": 352,
      "ter": 353,
      "ch": 354,
      "Ġas": 355,
      "Ġwe": 356,
      "Ġ(": 357,
      "nd": 358,
      "ill": 359,
      "ĠD": 360,
      "if": 361,
      "Ġ2": 362,
      "ag": 363,
      "ers": 364,
      "ke": 365,
      "Ġ\"": 366,
      "ĠH": 367,
      "em": 368,
      "Ġcon": 369,
      "ĠW": 370,
      "ĠR": 371,
      "her": 372,
      "Ġwas": 373,
      "Ġr": 374,
      "od": 375,
      "ĠF": 376,
      "ul": 377,
      "ate": 378,
      "Ġat": 379,
      "ri": 380,
      "pp": 381,
      "ore": 382,
      "ĠThe": 383,
      "Ġse": 384,
      "us": 385,
      "Ġpro": 386,
      "Ġha": 387,
      "um": 388,
      "Ġare": 389,
      "Ġde": 390,
      "ain": 391,
      "and": 392,





































{
  "0": 15,
  "1": 16,
  "2": 17,
  "3": 18,
  "4": 19,
  "5": 20,
  "6": 21,
  "7": 22,
  "8": 23,
  "9": 24,
  "!": 0,
  "\"": 1,
  "#": 2,
  "$": 3,
  "%": 4,
  "&": 5,
  "'": 6,
  "(": 7,
  ")": 8,
  "*": 9,
  "+": 10,
  ",": 11,
  "-": 12,
  ".": 13,
  "/": 14,
  ":": 25,
  ";": 26,
  "<": 27,
  "=": 28,
  ">": 29,
  "?": 30,
  "@": 31,
  "A": 32,
  "B": 33,
  "C": 34,
  "D": 35,
  "E": 36,
  "F": 37,
  "G": 38,
  "H": 39,
  "I": 40,
  "J": 41,
  "K": 42,
  "L": 43,
  "M": 44,
  "N": 45,
  "O": 46,
  "P": 47,
  "Q": 48,
  "R": 49,
  "S": 50,
  "T": 51,
  "U": 52,
  "V": 53,
  "W": 54,
  "X": 55,
  "Y": 56,
  "Z": 57,
  "[": 58,
  "\\": 59,
  "]": 60,
  "^": 61,
  "_": 62,
  "`": 63,
  "a": 64,
  "b": 65,
  "c": 66,
  "d": 67,
  "e": 68,
  "f": 69,
  "g": 70,
  "h": 71,
  "i": 72,
  "j": 73,
  "k": 74,
  "l": 75,
  "m": 76,
  "n": 77,
  "o": 78,
  "p": 79,
  "q": 80,
  "r": 81,
  "s": 82,
  "t": 83,
  "u": 84,
  "v": 85,
  "w": 86,
  "x": 87,
  "y": 88,
  "z": 89,
  "{": 90,
  "|": 91,
  "}": 92,
  "~": 93,
  "¡": 94,
  "¢": 95,
  "£": 96,
  "¤": 97,
  "¥": 98,
  "¦": 99,
  "he": 258,
  "in": 259,
  "re": 260,
  "on": 261,
  "Ġthe": 262,
  "er": 263,
  "Ġs": 264,
  "at": 265,
  "Ġw": 266,
  "Ġo": 267,
  "en": 268,
  "Ġc": 269,
  "it": 270,
  "is": 271,
  "an": 272,
  "or": 273,
  "es": 274,
  "Ġb": 275,
  "ed": 276,
  "Ġf": 277,
  "ing": 278,
  "Ġp": 279,
  "ou": 280,
  "Ġan": 281,
  "al": 282,
  "ar": 283,
  "Ġto": 284,
  "Ġm": 285,
  "Ġof": 286,
  "Ġin": 287,
  "Ġd": 288,
  "Ġh": 289,
  "Ġand": 290,
  "ic": 291,
  "as": 292,
  "le": 293,
  "Ġth": 294,
  "ion": 295,
  "om": 296,
  "ll": 297,
  "ent": 298,
  "Ġn": 299,
  "Ġl": 300,
  "st": 301,
  "Ġre": 302,
  "ve": 303,
  "Ġe": 304,
  "ro": 305,
  "ly": 306,
  "Ġbe": 307,
  "Ġg": 308,
  "ĠT": 309,
  "ct": 310,
  "ĠS": 311,
  "id": 312,
  "ot": 313,
  "ĠI": 314,
  "ut": 315,
  "et": 316,
  "ĠA": 317,
  "Ġis": 318,
  "Ġon": 319,
  "im": 320,
  "am": 321,
  "ow": 322,
  "ay": 323,
  "ad": 324,
  "se": 325,
  "Ġthat": 326,
  "ĠC": 327,
  "ig": 328,
  "Ġfor": 329,
  "ac": 330,
  "Ġy": 331,
  "ver": 332,
  "ur": 333,
  "Ġu": 334,
  "ld": 335,
  "Ġst": 336,
  "ĠM": 337,
  "'s": 338,
  "Ġhe": 339,
  "Ġit": 340,
  "ation": 341,
  "ith": 342,
  "ir": 343,
  "ce": 344,
  "Ġyou": 345,
  "il": 346,
  "ĠB": 347,
  "Ġwh": 348,
  "ol": 349,
  "ĠP": 350,
  "Ġwith": 351,
  "Ġ1": 352,
  "ter": 353,
  "ch": 354,
  "Ġas": 355,
  "Ġwe": 356,
  "Ġ(": 357,
  "nd": 358,
  "ill": 359,
  "ĠD": 360,
  "if": 361,
  "Ġ2": 362,
  "ag": 363,
  "ers": 364,
  "ke": 365,
  "Ġ\"": 366,
  "ĠH": 367,
  "em": 368,
  "Ġcon": 369,
  "ĠW": 370,
  "ĠR": 371,
  "her": 372,
  "Ġwas": 373,
  "Ġr": 374,
  "od": 375,
  "ĠF": 376,
  "ul": 377,
  "ate": 378,
  "Ġat": 379,
  "ri": 380,
  "pp": 381,
  "ore": 382,
  "ĠThe": 383,
  "Ġse": 384,
  "us": 385,
  "Ġpro": 386,
  "Ġha": 387,
  "um": 388,
  "Ġare": 389,
  "Ġde": 390,
  "ain": 391,
  "and": 392,

 

+

>>> tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-E2B")
config.json: 4.91kB [00:00, 8.83MB/s]
C:\Users\minimonk\AppData\Local\Programs\Python\Python313\Lib\site-packages\huggingface_hub\file_download.py:138: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\minimonk\.cache\huggingface\hub\models--google--gemma-4-E2B. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
  warnings.warn(message)
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████| 906/906 [00:00<00:00, 2.57MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████| 32.2M/32.2M [00:02<00:00, 13.3MB/s]

 

 

'프로그램 사용 > ai 프로그램' 카테고리의 다른 글

ollama 모델 저장소 뜯어보기  (0) 2026.04.19
llm tokenizer - phi3  (0) 2026.04.19
llama.cpp  (0) 2026.04.17
lm studio  (0) 2026.04.17
사람의 욕심은 끝이없고 - ollama multiple GPU support  (0) 2026.04.17
Posted by 구차니