2026.04.22 Taking on llama.cpp!

Building it myself is a hassle, so I'm just going with the pre-built binary lol

 

Linux:

[Link: https://github.com/ggml-org/llama.cpp/releases]

 

llama.cpp needs a model in GGUF format, so I gave up on the original Qwen3.6 release.

[Link: https://huggingface.co/Qwen/Qwen3.6-35B-A3B/tree/main]

 

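Then I stumbled on unsloth's quantized models!

By the way, if the format were "ggul" (which sounds like the Korean word for honey) instead of gguf, it would have been even sweeter.. sigh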

 

Click "copy download link" and fetch the file with wget. I'm running CPU-only for now, so I'm trying the Q2 model..

[Link: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/blob/main/Qwen3.6-35B-A3B-UD-Q2_K_XL.gguf]

[Link: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/tree/main]
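
As a rough sketch of the download step: the page link above is a "blob" URL, while direct downloads on Hugging Face use a "resolve" URL for the same path. The helper name blob_to_resolve and the DOWNLOAD switch are my own assumptions for illustration, and the link the copy button gives you may carry extra query parameters.

```shell
#!/bin/sh
# Sketch only: blob_to_resolve and DOWNLOAD are hypothetical names, not part of any tool.

# Turn a Hugging Face "blob" page URL into a direct-download "resolve" URL.
blob_to_resolve() {
    printf '%s\n' "$1" | sed 's#/blob/#/resolve/#'
}

BLOB_URL="https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/blob/main/Qwen3.6-35B-A3B-UD-Q2_K_XL.gguf"
MODEL_URL="$(blob_to_resolve "$BLOB_URL")"
echo "$MODEL_URL"

# The file is several GB, so the actual fetch is opt-in.
# wget -c resumes a partially downloaded file instead of starting over.
if [ "${DOWNLOAD:-0}" = "1" ]; then
    wget -c "$MODEL_URL"
fi
```

Run it with DOWNLOAD=1 to actually pull the file; without it the script only prints the URL it would use.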

 

For now, llama-cli looks like the thing to try.

llama-cli
llama-cli is the command-line executor:

$ llama-cli -m model.gguf
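
A slightly fuller invocation might look like the sketch below. The flags (-m, -t, -c, -n, -st, -p) are the ones listed in the --help output further down; the model filename, the concrete values, and the run/DRY_RUN wrapper are assumptions on my part, not something I have tuned.

```shell
#!/bin/sh
# Sketch only: paths, values, and the run/DRY_RUN helper are assumptions.
MODEL="${MODEL:-Qwen3.6-35B-A3B-UD-Q2_K_XL.gguf}"
LLAMA_CLI="${LLAMA_CLI:-./llama-cli}"
THREADS="${THREADS:-$(nproc 2>/dev/null || echo 4)}"

# Print the command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# -m  model path            -t  CPU threads used for generation
# -c  context size          -n  maximum number of tokens to predict
# -st single turn, then exit (non-interactive when -p gives the first turn)
run "$LLAMA_CLI" -m "$MODEL" -t "$THREADS" -c 4096 -n 256 -st \
    -p "Introduce yourself in one sentence."
```

By default this only prints the command line; set DRY_RUN=0 once the binary and the model file are in place.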

 

llama-server
llama-server launches an API server with a built-in WebUI:

$ llama-server --host address --port port -m model.gguf

[Link: https://wiki.archlinux.org/title/Llama.cpp]

[Link: https://www.lainyzine.com/ko/article/using-llama-cpp-for-local-llm-execution/]
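
For the server route, something like this sketch should work, assuming llama-server's OpenAI-compatible /v1/chat/completions endpoint and port 8080. The MODE switch, the chat_url helper, and the model filename are made up for illustration; I have not run this against this particular build.

```shell
#!/bin/sh
# Sketch only: MODE, chat_url, and the default values are assumptions.
MODEL="${MODEL:-Qwen3.6-35B-A3B-UD-Q2_K_XL.gguf}"
HOST="${HOST:-127.0.0.1}"
PORT="${PORT:-8080}"
BODY='{"messages":[{"role":"user","content":"Hello"}]}'

# Build the chat endpoint URL from host and port.
chat_url() {
    printf 'http://%s:%s/v1/chat/completions\n' "$1" "$2"
}

case "${MODE:-print}" in
    server)
        # Runs in the foreground; keep this terminal open.
        ./llama-server --host "$HOST" --port "$PORT" -m "$MODEL"
        ;;
    client)
        # From a second terminal, send one chat request as JSON.
        curl -s "$(chat_url "$HOST" "$PORT")" \
            -H 'Content-Type: application/json' \
            -d "$BODY"
        ;;
    *)
        # Default: only show what would be run.
        echo "server: ./llama-server --host $HOST --port $PORT -m $MODEL"
        echo "client: curl -s $(chat_url "$HOST" "$PORT") -d '$BODY'"
        ;;
esac
```

Start it with MODE=server in one terminal and MODE=client in another. Binding to 127.0.0.1 keeps the server local; a different --host address would be needed to reach it from other machines.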

 

---------------------

Help output

$ ./llama-cli --help
load_backend: loaded RPC backend from /home/minimonk/src/llama-b8876/libggml-rpc.so
load_backend: loaded CPU backend from /home/minimonk/src/llama-b8876/libggml-cpu-haswell.so
----- common params -----

-h,    --help, --usage                  print usage and exit
--version                               show version and build info
--license                               show source code license and dependencies
-cl,   --cache-list                     show list of models in cache
--completion-bash                       print source-able bash completion script for llama.cpp
-t,    --threads N                      number of CPU threads to use during generation (default: -1)
                                        (env: LLAMA_ARG_THREADS)
-tb,   --threads-batch N                number of threads to use during batch and prompt processing (default:
                                        same as --threads)
-C,    --cpu-mask M                     CPU affinity mask: arbitrarily long hex. Complements cpu-range
                                        (default: "")
-Cr,   --cpu-range lo-hi                range of CPUs for affinity. Complements --cpu-mask
--cpu-strict <0|1>                      use strict CPU placement (default: 0)
--prio N                                set process/thread priority : low(-1), normal(0), medium(1), high(2),
                                        realtime(3) (default: 0)
--poll <0...100>                        use polling level to wait for work (0 - no polling, default: 50)
-Cb,   --cpu-mask-batch M               CPU affinity mask: arbitrarily long hex. Complements cpu-range-batch
                                        (default: same as --cpu-mask)
-Crb,  --cpu-range-batch lo-hi          ranges of CPUs for affinity. Complements --cpu-mask-batch
--cpu-strict-batch <0|1>                use strict CPU placement (default: same as --cpu-strict)
--prio-batch N                          set process/thread priority : 0-normal, 1-medium, 2-high, 3-realtime
                                        (default: 0)
--poll-batch <0|1>                      use polling to wait for work (default: same as --poll)
-c,    --ctx-size N                     size of the prompt context (default: 0, 0 = loaded from model)
                                        (env: LLAMA_ARG_CTX_SIZE)
-n,    --predict, --n-predict N         number of tokens to predict (default: -1, -1 = infinity)
                                        (env: LLAMA_ARG_N_PREDICT)
-b,    --batch-size N                   logical maximum batch size (default: 2048)
                                        (env: LLAMA_ARG_BATCH)
-ub,   --ubatch-size N                  physical maximum batch size (default: 512)
                                        (env: LLAMA_ARG_UBATCH)
--keep N                                number of tokens to keep from the initial prompt (default: 0, -1 =
                                        all)
--swa-full                              use full-size SWA cache (default: false)
                                        [(more
                                        info)](https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
                                        (env: LLAMA_ARG_SWA_FULL)
-fa,   --flash-attn [on|off|auto]       set Flash Attention use ('on', 'off', or 'auto', default: 'auto')
                                        (env: LLAMA_ARG_FLASH_ATTN)
-p,    --prompt PROMPT                  prompt to start generation with; for system message, use -sys
--perf, --no-perf                       whether to enable internal libllama performance timings (default:
                                        false)
                                        (env: LLAMA_ARG_PERF)
-f,    --file FNAME                     a file containing the prompt (default: none)
-bf,   --binary-file FNAME              binary file containing the prompt (default: none)
-e,    --escape, --no-escape            whether to process escapes sequences (\n, \r, \t, \', \", \\)
                                        (default: true)
--rope-scaling {none,linear,yarn}       RoPE frequency scaling method, defaults to linear unless specified by
                                        the model
                                        (env: LLAMA_ARG_ROPE_SCALING_TYPE)
--rope-scale N                          RoPE context scaling factor, expands context by a factor of N
                                        (env: LLAMA_ARG_ROPE_SCALE)
--rope-freq-base N                      RoPE base frequency, used by NTK-aware scaling (default: loaded from
                                        model)
                                        (env: LLAMA_ARG_ROPE_FREQ_BASE)
--rope-freq-scale N                     RoPE frequency scaling factor, expands context by a factor of 1/N
                                        (env: LLAMA_ARG_ROPE_FREQ_SCALE)
--yarn-orig-ctx N                       YaRN: original context size of model (default: 0 = model training
                                        context size)
                                        (env: LLAMA_ARG_YARN_ORIG_CTX)
--yarn-ext-factor N                     YaRN: extrapolation mix factor (default: -1.00, 0.0 = full
                                        interpolation)
                                        (env: LLAMA_ARG_YARN_EXT_FACTOR)
--yarn-attn-factor N                    YaRN: scale sqrt(t) or attention magnitude (default: -1.00)
                                        (env: LLAMA_ARG_YARN_ATTN_FACTOR)
--yarn-beta-slow N                      YaRN: high correction dim or alpha (default: -1.00)
                                        (env: LLAMA_ARG_YARN_BETA_SLOW)
--yarn-beta-fast N                      YaRN: low correction dim or beta (default: -1.00)
                                        (env: LLAMA_ARG_YARN_BETA_FAST)
-kvo,  --kv-offload, -nkvo, --no-kv-offload
                                        whether to enable KV cache offloading (default: enabled)
                                        (env: LLAMA_ARG_KV_OFFLOAD)
--repack, -nr, --no-repack              whether to enable weight repacking (default: enabled)
                                        (env: LLAMA_ARG_REPACK)
--no-host                               bypass host buffer allowing extra buffers to be used
                                        (env: LLAMA_ARG_NO_HOST)
-ctk,  --cache-type-k TYPE              KV cache data type for K
                                        allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
                                        (default: f16)
                                        (env: LLAMA_ARG_CACHE_TYPE_K)
-ctv,  --cache-type-v TYPE              KV cache data type for V
                                        allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
                                        (default: f16)
                                        (env: LLAMA_ARG_CACHE_TYPE_V)
-dt,   --defrag-thold N                 KV cache defragmentation threshold (DEPRECATED)
                                        (env: LLAMA_ARG_DEFRAG_THOLD)
-np,   --parallel N                     number of parallel sequences to decode (default: 1)
                                        (env: LLAMA_ARG_N_PARALLEL)
--rpc SERVERS                           comma separated list of RPC servers (host:port)
                                        (env: LLAMA_ARG_RPC)
--mlock                                 force system to keep model in RAM rather than swapping or compressing
                                        (env: LLAMA_ARG_MLOCK)
--mmap, --no-mmap                       whether to memory-map model. (if mmap disabled, slower load but may
                                        reduce pageouts if not using mlock) (default: enabled)
                                        (env: LLAMA_ARG_MMAP)
-dio,  --direct-io, -ndio, --no-direct-io
                                        use DirectIO if available. (default: disabled)
                                        (env: LLAMA_ARG_DIO)
--numa TYPE                             attempt optimizations that help on some NUMA systems
                                        - distribute: spread execution evenly over all nodes
                                        - isolate: only spawn threads on CPUs on the node that execution
                                        started on
                                        - numactl: use the CPU map provided by numactl
                                        if run without this previously, it is recommended to drop the system
                                        page cache before using this
                                        see https://github.com/ggml-org/llama.cpp/issues/1437
                                        (env: LLAMA_ARG_NUMA)
-dev,  --device <dev1,dev2,..>          comma-separated list of devices to use for offloading (none = don't
                                        offload)
                                        use --list-devices to see a list of available devices
                                        (env: LLAMA_ARG_DEVICE)
--list-devices                          print list of available devices and exit
-ot,   --override-tensor <tensor name pattern>=<buffer type>,...
                                        override tensor buffer type
                                        (env: LLAMA_ARG_OVERRIDE_TENSOR)
-cmoe, --cpu-moe                        keep all Mixture of Experts (MoE) weights in the CPU
                                        (env: LLAMA_ARG_CPU_MOE)
-ncmoe, --n-cpu-moe N                   keep the Mixture of Experts (MoE) weights of the first N layers in the
                                        CPU
                                        (env: LLAMA_ARG_N_CPU_MOE)
-ngl,  --gpu-layers, --n-gpu-layers N   max. number of layers to store in VRAM, either an exact number,
                                        'auto', or 'all' (default: auto)
                                        (env: LLAMA_ARG_N_GPU_LAYERS)
-sm,   --split-mode {none,layer,row,tensor}
                                        how to split the model across multiple GPUs, one of:
                                        - none: use one GPU only
                                        - layer (default): split layers and KV across GPUs (pipelined)
                                        - row: split weight across GPUs by rows (parallelized)
                                        - tensor: split weights and KV across GPUs (parallelized,
                                        EXPERIMENTAL)
                                        (env: LLAMA_ARG_SPLIT_MODE)
-ts,   --tensor-split N0,N1,N2,...      fraction of the model to offload to each GPU, comma-separated list of
                                        proportions, e.g. 3,1
                                        (env: LLAMA_ARG_TENSOR_SPLIT)
-mg,   --main-gpu INDEX                 the GPU to use for the model (with split-mode = none), or for
                                        intermediate results and KV (with split-mode = row) (default: 0)
                                        (env: LLAMA_ARG_MAIN_GPU)
-fit,  --fit [on|off]                   whether to adjust unset arguments to fit in device memory ('on' or
                                        'off', default: 'on')
                                        (env: LLAMA_ARG_FIT)
-fitt, --fit-target MiB0,MiB1,MiB2,...
                                        target margin per device for --fit, comma-separated list of values,
                                        single value is broadcast across all devices, default: 1024
                                        (env: LLAMA_ARG_FIT_TARGET)
-fitc, --fit-ctx N                      minimum ctx size that can be set by --fit option, default: 4096
                                        (env: LLAMA_ARG_FIT_CTX)
--check-tensors                         check model tensor data for invalid values (default: false)
--override-kv KEY=TYPE:VALUE,...        advanced option to override model metadata by key. to specify multiple
                                        overrides, either use comma-separated values.
                                        types: int, float, bool, str. example: --override-kv
                                        tokenizer.ggml.add_bos_token=bool:false,tokenizer.ggml.add_eos_token=bool:false
--op-offload, --no-op-offload           whether to offload host tensor operations to device (default: true)
--lora FNAME                            path to LoRA adapter (use comma-separated values to load multiple
                                        adapters)
--lora-scaled FNAME:SCALE,...           path to LoRA adapter with user defined scaling (format:
                                        FNAME:SCALE,...)
                                        note: use comma-separated values
--control-vector FNAME                  add a control vector
                                        note: use comma-separated values to add multiple control vectors
--control-vector-scaled FNAME:SCALE,...
                                        add a control vector with user defined scaling SCALE
                                        note: use comma-separated values (format: FNAME:SCALE,...)
--control-vector-layer-range START END
                                        layer range to apply the control vector(s) to, start and end inclusive
-m,    --model FNAME                    model path to load
                                        (env: LLAMA_ARG_MODEL)
-mu,   --model-url MODEL_URL            model download url (default: unused)
                                        (env: LLAMA_ARG_MODEL_URL)
-dr,   --docker-repo [<repo>/]<model>[:quant]
                                        Docker Hub model repository. repo is optional, default to ai/. quant
                                        is optional, default to :latest.
                                        example: gemma3
                                        (default: unused)
                                        (env: LLAMA_ARG_DOCKER_REPO)
-hf,   -hfr, --hf-repo <user>/<model>[:quant]
                                        Hugging Face model repository; quant is optional, case-insensitive,
                                        default to Q4_K_M, or falls back to the first file in the repo if
                                        Q4_K_M doesn't exist.
                                        mmproj is also downloaded automatically if available. to disable, add
                                        --no-mmproj
                                        example: ggml-org/GLM-4.7-Flash-GGUF:Q4_K_M
                                        (default: unused)
                                        (env: LLAMA_ARG_HF_REPO)
-hfd,  -hfrd, --hf-repo-draft <user>/<model>[:quant]
                                        Same as --hf-repo, but for the draft model (default: unused)
                                        (env: LLAMA_ARG_HFD_REPO)
-hff,  --hf-file FILE                   Hugging Face model file. If specified, it will override the quant in
                                        --hf-repo (default: unused)
                                        (env: LLAMA_ARG_HF_FILE)
-hfv,  -hfrv, --hf-repo-v <user>/<model>[:quant]
                                        Hugging Face model repository for the vocoder model (default: unused)
                                        (env: LLAMA_ARG_HF_REPO_V)
-hffv, --hf-file-v FILE                 Hugging Face model file for the vocoder model (default: unused)
                                        (env: LLAMA_ARG_HF_FILE_V)
-hft,  --hf-token TOKEN                 Hugging Face access token (default: value from HF_TOKEN environment
                                        variable)
                                        (env: HF_TOKEN)
--log-disable                           Log disable
--log-file FNAME                        Log to file
                                        (env: LLAMA_LOG_FILE)
--log-colors [on|off|auto]              Set colored logging ('on', 'off', or 'auto', default: 'auto')
                                        'auto' enables colors when output is to a terminal
                                        (env: LLAMA_LOG_COLORS)
-v,    --verbose, --log-verbose         Set verbosity level to infinity (i.e. log all messages, useful for
                                        debugging)
--offline                               Offline mode: forces use of cache, prevents network access
                                        (env: LLAMA_OFFLINE)
-lv,   --verbosity, --log-verbosity N   Set the verbosity threshold. Messages with a higher verbosity will be
                                        ignored. Values:
                                         - 0: generic output
                                         - 1: error
                                         - 2: warning
                                         - 3: info
                                         - 4: debug
                                        (default: 1)
                                        
                                        (env: LLAMA_LOG_VERBOSITY)
--log-prefix                            Enable prefix in log messages
                                        (env: LLAMA_LOG_PREFIX)
--log-timestamps                        Enable timestamps in log messages
                                        (env: LLAMA_LOG_TIMESTAMPS)
-ctkd, --cache-type-k-draft TYPE        KV cache data type for K for the draft model
                                        allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
                                        (default: f16)
                                        (env: LLAMA_ARG_CACHE_TYPE_K_DRAFT)
-ctvd, --cache-type-v-draft TYPE        KV cache data type for V for the draft model
                                        allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
                                        (default: f16)
                                        (env: LLAMA_ARG_CACHE_TYPE_V_DRAFT)


----- sampling params -----

--samplers SAMPLERS                     samplers that will be used for generation in the order, separated by
                                        ';'
                                        (default:
                                        penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature)
-s,    --seed SEED                      RNG seed (default: -1, use random seed for -1)
--sampler-seq, --sampling-seq SEQUENCE
                                        simplified sequence for samplers that will be used (default:
                                        edskypmxt)
--ignore-eos                            ignore end of stream token and continue generating (implies
                                        --logit-bias EOS-inf)
--temp, --temperature N                 temperature (default: 0.80)
--top-k N                               top-k sampling (default: 40, 0 = disabled)
                                        (env: LLAMA_ARG_TOP_K)
--top-p N                               top-p sampling (default: 0.95, 1.0 = disabled)
--min-p N                               min-p sampling (default: 0.05, 0.0 = disabled)
--top-nsigma, --top-n-sigma N           top-n-sigma sampling (default: -1.00, -1.0 = disabled)
--xtc-probability N                     xtc probability (default: 0.00, 0.0 = disabled)
--xtc-threshold N                       xtc threshold (default: 0.10, 1.0 = disabled)
--typical, --typical-p N                locally typical sampling, parameter p (default: 1.00, 1.0 = disabled)
--repeat-last-n N                       last n tokens to consider for penalize (default: 64, 0 = disabled, -1
                                        = ctx_size)
--repeat-penalty N                      penalize repeat sequence of tokens (default: 1.00, 1.0 = disabled)
--presence-penalty N                    repeat alpha presence penalty (default: 0.00, 0.0 = disabled)
--frequency-penalty N                   repeat alpha frequency penalty (default: 0.00, 0.0 = disabled)
--dry-multiplier N                      set DRY sampling multiplier (default: 0.00, 0.0 = disabled)
--dry-base N                            set DRY sampling base value (default: 1.75)
--dry-allowed-length N                  set allowed length for DRY sampling (default: 2)
--dry-penalty-last-n N                  set DRY penalty for the last n tokens (default: -1, 0 = disable, -1 =
                                        context size)
--dry-sequence-breaker STRING           add sequence breaker for DRY sampling, clearing out default breakers
                                        ('\n', ':', '"', '*') in the process; use "none" to not use any
                                        sequence breakers
--adaptive-target N                     adaptive-p: select tokens near this probability (valid range 0.0 to
                                        1.0; negative = disabled) (default: -1.00)
                                        [(more info)](https://github.com/ggml-org/llama.cpp/pull/17927)
--adaptive-decay N                      adaptive-p: decay rate for target adaptation over time. lower values
                                        are more reactive, higher values are more stable.
                                        (valid range 0.0 to 0.99) (default: 0.90)
--dynatemp-range N                      dynamic temperature range (default: 0.00, 0.0 = disabled)
--dynatemp-exp N                        dynamic temperature exponent (default: 1.00)
--mirostat N                            use Mirostat sampling.
                                        Top K, Nucleus and Locally Typical samplers are ignored if used.
                                        (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
--mirostat-lr N                         Mirostat learning rate, parameter eta (default: 0.10)
--mirostat-ent N                        Mirostat target entropy, parameter tau (default: 5.00)
-l,    --logit-bias TOKEN_ID(+/-)BIAS   modifies the likelihood of token appearing in the completion,
                                        i.e. `--logit-bias 15043+1` to increase likelihood of token ' Hello',
                                        or `--logit-bias 15043-1` to decrease likelihood of token ' Hello'
--grammar GRAMMAR                       BNF-like grammar to constrain generations (see samples in grammars/
                                        dir)
--grammar-file FNAME                    file to read grammar from
-j,    --json-schema SCHEMA             JSON schema to constrain generations (https://json-schema.org/), e.g.
                                        `{}` for any JSON object
                                        For schemas w/ external $refs, use --grammar +
                                        example/json_schema_to_grammar.py instead
-jf,   --json-schema-file FILE          File containing a JSON schema to constrain generations
                                        (https://json-schema.org/), e.g. `{}` for any JSON object
                                        For schemas w/ external $refs, use --grammar +
                                        example/json_schema_to_grammar.py instead
-bs,   --backend-sampling               enable backend sampling (experimental) (default: disabled)
                                        (env: LLAMA_ARG_BACKEND_SAMPLING)


----- example-specific params -----

--verbose-prompt                        print a verbose prompt before generation (default: false)
--display-prompt, --no-display-prompt   whether to print prompt at generation (default: true)
-co,   --color [on|off|auto]            Colorize output to distinguish prompt and user input from generations
                                        ('on', 'off', or 'auto', default: 'auto')
                                        'auto' enables colors when output is to a terminal
-ctxcp, --ctx-checkpoints, --swa-checkpoints N
                                        max number of context checkpoints to create per slot (default:
                                        32)[(more info)](https://github.com/ggml-org/llama.cpp/pull/15293)
                                        (env: LLAMA_ARG_CTX_CHECKPOINTS)
-cpent, --checkpoint-every-n-tokens N   create a checkpoint every n tokens during prefill (processing), -1 to
                                        disable (default: 8192)
                                        (env: LLAMA_ARG_CHECKPOINT_EVERY_NT)
-cram, --cache-ram N                    set the maximum cache size in MiB (default: 8192, -1 - no limit, 0 -
                                        disable)[(more
                                        info)](https://github.com/ggml-org/llama.cpp/pull/16391)
                                        (env: LLAMA_ARG_CACHE_RAM)
--context-shift, --no-context-shift     whether to use context shift on infinite text generation (default:
                                        disabled)
                                        (env: LLAMA_ARG_CONTEXT_SHIFT)
-sys,  --system-prompt PROMPT           system prompt to use with model (if applicable, depending on chat
                                        template)
--show-timings, --no-show-timings       whether to show timing information after each response (default: true)
                                        (env: LLAMA_ARG_SHOW_TIMINGS)
-sysf, --system-prompt-file FNAME       a file containing the system prompt (default: none)
-r,    --reverse-prompt PROMPT          halt generation at PROMPT, return control in interactive mode
-sp,   --special                        special tokens output enabled (default: false)
-cnv,  --conversation, -no-cnv, --no-conversation
                                        whether to run in conversation mode:
                                        - does not print special tokens and suffix/prefix
                                        - interactive mode is also enabled
                                        (default: auto enabled if chat template is available)
-st,   --single-turn                    run conversation for a single turn only, then exit when done
                                        will not be interactive if first turn is predefined with --prompt
                                        (default: false)
-mli,  --multiline-input                allows you to write or paste multiple lines without ending each in '\'
--warmup, --no-warmup                   whether to perform warmup with an empty run (default: enabled)
-mm,   --mmproj FILE                    path to a multimodal projector file. see tools/mtmd/README.md
                                        note: if -hf is used, this argument can be omitted
                                        (env: LLAMA_ARG_MMPROJ)
-mmu,  --mmproj-url URL                 URL to a multimodal projector file. see tools/mtmd/README.md
                                        (env: LLAMA_ARG_MMPROJ_URL)
--mmproj-auto, --no-mmproj, --no-mmproj-auto
                                        whether to use multimodal projector file (if available), useful when
                                        using -hf (default: enabled)
                                        (env: LLAMA_ARG_MMPROJ_AUTO)
--mmproj-offload, --no-mmproj-offload   whether to enable GPU offloading for multimodal projector (default:
                                        enabled)
                                        (env: LLAMA_ARG_MMPROJ_OFFLOAD)
--image, --audio FILE                   path to an image or audio file. use with multimodal models, use
                                        comma-separated values for multiple files
--image-min-tokens N                    minimum number of tokens each image can take, only used by vision
                                        models with dynamic resolution (default: read from model)
                                        (env: LLAMA_ARG_IMAGE_MIN_TOKENS)
--image-max-tokens N                    maximum number of tokens each image can take, only used by vision
                                        models with dynamic resolution (default: read from model)
                                        (env: LLAMA_ARG_IMAGE_MAX_TOKENS)
-otd,  --override-tensor-draft <tensor name pattern>=<buffer type>,...
                                        override tensor buffer type for draft model
-cmoed, --cpu-moe-draft                 keep all Mixture of Experts (MoE) weights in the CPU for the draft
                                        model
                                        (env: LLAMA_ARG_CPU_MOE_DRAFT)
-ncmoed, --n-cpu-moe-draft N            keep the Mixture of Experts (MoE) weights of the first N layers in the
                                        CPU for the draft model
                                        (env: LLAMA_ARG_N_CPU_MOE_DRAFT)
--chat-template-kwargs STRING           sets additional params for the json template parser, must be a valid
                                        json object string, e.g. '{"key1":"value1","key2":"value2"}'
                                        (env: LLAMA_CHAT_TEMPLATE_KWARGS)
--jinja, --no-jinja                     whether to use jinja template engine for chat (default: enabled)
                                        (env: LLAMA_ARG_JINJA)
--reasoning-format FORMAT               controls whether thought tags are allowed and/or extracted from the
                                        response, and in which format they're returned; one of:
                                        - none: leaves thoughts unparsed in `message.content`
                                        - deepseek: puts thoughts in `message.reasoning_content`
                                        - deepseek-legacy: keeps `<think>` tags in `message.content` while
                                        also populating `message.reasoning_content`
                                        (default: auto)
                                        (env: LLAMA_ARG_THINK)
-rea,  --reasoning [on|off|auto]        Use reasoning/thinking in the chat ('on', 'off', or 'auto', default:
                                        'auto' (detect from template))
                                        (env: LLAMA_ARG_REASONING)
--reasoning-budget N                    token budget for thinking: -1 for unrestricted, 0 for immediate end,
                                        N>0 for token budget (default: -1)
                                        (env: LLAMA_ARG_THINK_BUDGET)
--reasoning-budget-message MESSAGE      message injected before the end-of-thinking tag when reasoning budget
                                        is exhausted (default: none)
                                        (env: LLAMA_ARG_THINK_BUDGET_MESSAGE)
--chat-template JINJA_TEMPLATE          set custom jinja chat template (default: template taken from model's
                                        metadata)
                                        if suffix/prefix are specified, template will be disabled
                                        only commonly used templates are accepted (unless --jinja is set
                                        before this flag):
                                        list of built-in templates:
                                        bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml,
                                        command-r, deepseek, deepseek-ocr, deepseek2, deepseek3, exaone-moe,
                                        exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite,
                                        granite-4.0, grok-2, hunyuan-dense, hunyuan-moe, hunyuan-ocr, kimi-k2,
                                        llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4,
                                        megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken,
                                        mistral-v7, mistral-v7-tekken, monarch, openchat, orion,
                                        pangu-embedded, phi3, phi4, rwkv-world, seed_oss, smolvlm, solar-open,
                                        vicuna, vicuna-orca, yandex, zephyr
                                        (env: LLAMA_ARG_CHAT_TEMPLATE)
--chat-template-file JINJA_TEMPLATE_FILE
                                        set custom jinja chat template file (default: template taken from
                                        model's metadata)
                                        if suffix/prefix are specified, template will be disabled
                                        only commonly used templates are accepted (unless --jinja is set
                                        before this flag):
                                        list of built-in templates:
                                        bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml,
                                        command-r, deepseek, deepseek-ocr, deepseek2, deepseek3, exaone-moe,
                                        exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite,
                                        granite-4.0, grok-2, hunyuan-dense, hunyuan-moe, hunyuan-ocr, kimi-k2,
                                        llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4,
                                        megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken,
                                        mistral-v7, mistral-v7-tekken, monarch, openchat, orion,
                                        pangu-embedded, phi3, phi4, rwkv-world, seed_oss, smolvlm, solar-open,
                                        vicuna, vicuna-orca, yandex, zephyr
                                        (env: LLAMA_ARG_CHAT_TEMPLATE_FILE)
--skip-chat-parsing, --no-skip-chat-parsing
                                        force a pure content parser, even if a Jinja template is specified;
                                        model will output everything in the content section, including any
                                        reasoning and/or tool calls (default: disabled)
                                        (env: LLAMA_ARG_SKIP_CHAT_PARSING)
--simple-io                             use basic IO for better compatibility in subprocesses and limited
                                        consoles
--draft, --draft-n, --draft-max N       number of tokens to draft for speculative decoding (default: 16)
                                        (env: LLAMA_ARG_DRAFT_MAX)
--draft-min, --draft-n-min N            minimum number of draft tokens to use for speculative decoding
                                        (default: 0)
                                        (env: LLAMA_ARG_DRAFT_MIN)
--draft-p-min P                         minimum speculative decoding probability (greedy) (default: 0.75)
                                        (env: LLAMA_ARG_DRAFT_P_MIN)
-cd,   --ctx-size-draft N               size of the prompt context for the draft model (default: 0, 0 = loaded
                                        from model)
                                        (env: LLAMA_ARG_CTX_SIZE_DRAFT)
-devd, --device-draft <dev1,dev2,..>    comma-separated list of devices to use for offloading the draft model
                                        (none = don't offload)
                                        use --list-devices to see a list of available devices
-ngld, --gpu-layers-draft, --n-gpu-layers-draft N
                                        max. number of draft model layers to store in VRAM, either an exact
                                        number, 'auto', or 'all' (default: auto)
                                        (env: LLAMA_ARG_N_GPU_LAYERS_DRAFT)
-md,   --model-draft FNAME              draft model for speculative decoding (default: unused)
                                        (env: LLAMA_ARG_MODEL_DRAFT)
--spec-replace TARGET DRAFT             translate the string in TARGET into DRAFT if the draft model and main
                                        model are not compatible
--gpt-oss-20b-default                   use gpt-oss-20b (note: can download weights from the internet)
--gpt-oss-120b-default                  use gpt-oss-120b (note: can download weights from the internet)
--vision-gemma-4b-default               use Gemma 3 4B QAT (note: can download weights from the internet)
--vision-gemma-12b-default              use Gemma 3 12B QAT (note: can download weights from the internet)
--spec-default                          enable default speculative decoding config
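
The help above says --chat-template-kwargs must be a valid JSON object string, which is easy to break with shell quoting. A minimal sketch, assuming the placeholder keys key1/key2 from the help text and a placeholder model.gguf path; build_llama_cli_command is a hypothetical helper name, and only flags that appear in the help output are used:

```python
import json
import shlex


def build_llama_cli_command(model_path, template_kwargs):
    """Return a shell-safe llama-cli command line string.

    template_kwargs is serialized with json.dumps, so the value passed to
    --chat-template-kwargs is always a valid JSON object string.
    """
    if not isinstance(template_kwargs, dict):
        raise TypeError("template_kwargs must be a dict (JSON object)")
    argv = [
        "llama-cli",
        "-m", model_path,
        "--chat-template-kwargs", json.dumps(template_kwargs),
    ]
    # shlex.join quotes each argument for a POSIX shell.
    return shlex.join(argv)


if __name__ == "__main__":
    # key1/key2 are the placeholders from the help text, not real template params.
    print(build_llama_cli_command("model.gguf", {"key1": "value1", "key2": "value2"}))
```

Which keys a given chat template actually reads depends on the model, so the dict contents are up to the template.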

 

 

--------------------- Benchmark?

Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz / 256GB DDR4

$ ../llama-cli -m Qwen3.6-35B-A3B-UD-Q2_K_XL.gguf 
load_backend: loaded RPC backend from /home/minimonk/src/llama-b8876/libggml-rpc.so
load_backend: loaded CPU backend from /home/minimonk/src/llama-b8876/libggml-cpu-haswell.so

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8876-72d693e4f
model      : Qwen3.6-35B-A3B-UD-Q2_K_XL.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern

> hello~ introduce about you         

[Start thinking]
Here's a thinking process:

1.  **Analyze User Input:**
   - User says: "hello~ introduce about you"
   - This is a casual greeting followed by a request for a self-introduction.
   - The tone is friendly and informal.

2.  **Identify Key Requirements:**
   - Acknowledge the greeting
   - Provide a clear, concise self-introduction
   - State my identity (Qwen / Tongyi Qianwen)
   - Mention my developer (Alibaba Group's Tongyi Lab)
   - Highlight my core capabilities (language understanding, reasoning, coding, multimodal processing, etc.)
   - Keep it friendly, accurate, and aligned with the provided guidelines
   - Avoid mentioning version numbers or specific details not in the reference

3.  **Draft Construction (Mental Refinement):**
   Hello! I'm Qwen (also known as Tongyi Qianwen), a large language model developed by Alibaba Group's Tongyi Lab. I'm designed to be helpful, thoughtful, and reliable across a wide range of tasks—from answering questions and writing content to solving complex reasoning problems, coding, and working with images and documents. I support fluent communication in over 100 languages and can adapt to different contexts while maintaining accuracy and safety. I'm here to help you think through ideas, tackle challenges, or just chat, so feel free to ask me anything! How can I assist you today?

4.  **Check Against Guidelines:**
   - Identity: Qwen / Tongyi Qianwen ✓
   - Developer: Alibaba Group's Tongyi Lab ✓
   - Capabilities: Language, reasoning, coding, multimodal, 100+ languages, helpful/safe ✓
   - Tone: Friendly, professional, matches user's casual tone ✓
   - No version numbers, no fabricated stats, no overclaiming ✓

5.  **Final Polish:**
   The draft looks solid. It's concise, accurate, and aligned with all constraints. I'll output it.✅

[End thinking]

Hello! I'm Qwen (also known as Tongyi Qianwen), a large language model developed by Alibaba Group's Tongyi Lab. I'm designed to be helpful, thoughtful, and reliable across a wide range of tasks—whether it's answering questions, writing or editing content, solving complex reasoning problems, coding, analyzing documents, or working with images. I support fluent communication in over 100 languages and can adapt to different tones, formats, and contexts while prioritizing accuracy and safety.

I'm here to help you think through ideas, tackle challenges, or just chat, so feel free to ask me anything. How can I assist you today? 😊

Prompt: 29.6 t/s | Generation: 11.0 t/s ]



Exiting...
common_memory_breakdown_print: | memory breakdown [MiB] | total   free     self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - Host               |                 17428 = 11437 +    5182 +     808                |
common_memory_breakdown_print: |   - CPU_REPACK         |                   272 =   272 +       0 +       0                |

 

$ ../llama-cli -m Llama-3.2-3B-Instruct-Q4_K_M.gguf
load_backend: loaded RPC backend from /home/minimonk/src/llama-b8876/libggml-rpc.so
load_backend: loaded CPU backend from /home/minimonk/src/llama-b8876/libggml-cpu-haswell.so

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8876-72d693e4f
model      : Llama-3.2-3B-Instruct-Q4_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern


> 안녕? 너에 대해서 설명해줘

안녕하세요! 저는 지능형 чattybot 인你(You)와 대화할 수 있는 AI입니다.

저는 지능학, 자연어 처리,_machine learning 및 다른 기술을 바탕으로 만든 AI입니다. 이 기술은 인간의 언어를 이해하고,natural language processing (NLP)로 human과 대화할 수 있는能力을 허용합니다.

저는 다양한 topic과 subject에 대해 지식을 제공할 수 있으며, human과 대화할 수 있습니다. 저는 human의 질문을 읽고, 적절한 대답을 제공할 수 있습니다.

저는 다음 capability을 제공합니다:

* natural language understanding: human의 언어를 이해할 수 있습니다.
* text generation: human의 질문에 대해 적절한 대답을 제공할 수 있습니다.
* conversation: human과 대화할 수 있습니다.
* question answering: human의 질문에 대해 적절한 대답을 제공할 수 있습니다.

저는 human과 대화할 수 있는 다양한 application과 technology를 제공합니다. 예를 들어, human의 질문을 읽어 대답할 수 있는 chatbot, human의 언어를 이해하고, natural language processing를 사용하여 human과 대화할 수 있는 application, human의 질문에 대해 적절한 대답을 제공할 수 있는 Q&A 시스템, etc.

저는 human의 언어를 이해하고, natural language processing를 사용하여 human과 대화할 수 있는 ability을 갖습니다. human의 질문을 읽어 대답할 수 있으며, human의 언어를 이해하고, natural language processing를 사용하여 human과 대화할 수 있습니다.

안녕하세요! 저는 지능형 चattybot 인 you(You)와 대화할 수 있는 AI입니다. 저는 human의 질문을 읽고, 적절한 대답을 제공할 수 있습니다. human의 언어를 이해하고, natural language processing를 사용하여 human과 대화할 수 있습니다.

Prompt: 89.7 t/s | Generation: 20.6 t/s ]



Exiting...
common_memory_breakdown_print: | memory breakdown [MiB] | total   free     self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - Host               |                 16654 =  1904 +   14336 +     414                |
common_memory_breakdown_print: |   - CPU_REPACK         |                  1299 =  1299 +       0 +       0                |

 

$ ../llama-cli -m gemma-4-E2B-it-Q4_K_M.gguf
load_backend: loaded RPC backend from /home/minimonk/src/llama-b8876/libggml-rpc.so
load_backend: loaded CPU backend from /home/minimonk/src/llama-b8876/libggml-cpu-haswell.so

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8876-72d693e4f
model      : gemma-4-E2B-it-Q4_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern


> 안녕? 너에 대해서 설명해줘

[Start thinking]
Thinking Process:

1.  **Analyze the Request:** The user said "안녕? 너에 대해서 설명해줘" (Hello? Please explain about yourself). This is a request for identity and capabilities.
2.  **Determine Core Identity (Self-Correction/Verification):**
    *   Name: Gemma 4.
    *   Developer: Google DeepMind.
    *   Nature: Large Language Model (LLM).
    *   Type: Open weights model.
    *   Capabilities: Understands text/images, generates text. Knowledge cutoff (Jan 2025).
3.  **Determine Response Language:** The user asked in Korean, so the response must be in fluent, natural Korean.
4.  **Structure the Explanation (Standard LLM Introduction):**
    *   Greeting/Identity (Who am I?).
    *   Developer (Who made me?).
    *   Nature/Function (What am I?).
    *   Capabilities (What can I do?).
    *   Limitations (What can't I do/knowledge scope?).
5.  **Draft the Content (in Korean):**

    *   *Identity:* 저는 Gemma 4입니다. (I am Gemma 4.)
    *   *Developer/Nature:* 저는 Google DeepMind에서 개발한 대규모 언어 모델(LLM)입니다. 오픈 웨이트 모델입니다. (I am a Large Language Model developed by Google DeepMind. I am an open weights model.)
    *   *Function:* 저는 텍스트와 이미지를 이해하고, 사용자의 질문에 답하거나 글을 작성하는 등 다양한 언어 작업을 수행할 수 있습니다. (I can perform various language tasks such as understanding text and images, answering user questions, or writing text.)
    *   *Input/Output:* 저는 텍스트를 입력받아 텍스트로 출력합니다. (I receive text input and output text.)
    *   *Knowledge/Limitation:* 제 지식은 학습된 데이터를 기반으로 하며, 특정 시점(2025년 1월)까지의 정보를 알고 있습니다. (My knowledge is based on trained data, and I know information up to a specific point (January 2025).)

6.  **Review and Refine (Ensure natural flow and adherence to instructions):** The drafted points cover all necessary identity requirements and maintain the required persona (Gemma 4, open weights, LLM). (The Korean draft looks good.)

7.  **Final Output Generation.** (Proceed to generate the Korean response.)
[End thinking]

안녕하세요! 저는 **Gemma 4**입니다. 저에 대해 몇 가지 핵심 정보를 설명해 드릴게요.

**1. 정체성 및 개발 배경**

*   **이름:** Gemma 4
*   **개발사:** Google DeepMind에서 개발되었습니다.
*   **성격:** 저는 대규모 언어 모델(LLM, Large Language Model)이며, 오픈 웨이트(open weights) 모델입니다. 이는 저의 모델 구조와 가중치가 공개되어 누구나 접근하고 활용할 수 있음을 의미합니다.

**2. 주요 기능 및 능력**

저는 방대한 양의 데이터를 학습하여 다음과 같은 다양한 언어 관련 작업을 수행할 수 있습니다.

*   **텍스트 이해 및 생성:** 사용자가 입력하는 텍스트를 이해하고, 질문에 답하거나, 요약하거나, 번역하거나, 창의적인 글(시, 이야기, 코드 등)을 작성할 수 있습니다.
*   **다중 모달리티 이해:** 저는 텍스트뿐만 아니라 이미지를 이해하고 처리할 수 있습니다. (다만, 출력은 텍스트로만 가능합니다.)
*   **추론 및 논리:** 복잡한 질문에 대해 논리적으로 추론하고 일관성 있는 답변을 제공할 수 있습니다.

**3. 지식 범위 및 한계**

*   **지식 기반:** 저는 학습된 데이터를 기반으로 답변하며, 저의 지식은 특정 시점(2025년 1월)까지의 정보를 포함하고 있습니다. 따라서 그 이후의 최신 정보에 대해서는 알지 못할 수 있습니다.
*   **출력 형식:** 저는 오직 **텍스트** 형태로만 정보를 생성할 수 있습니다. 이미지를 직접 생성할 수는 없습니다.

궁금한 점이 있거나 도움이 필요하시면 언제든지 말씀해 주세요! 최선을 다해 답변해 드리겠습니다.

Prompt: 107.4 t/s | Generation: 19.1 t/s ]



Exiting...
common_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - Host               |                 4517 =  2947 +     780 +     790                |
common_memory_breakdown_print: |   - CPU_REPACK         |                 1069 =  1069 +       0 +       0                |
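
To compare the three runs above side by side, the throughput lines can be pulled out with a small regex. A rough sketch: parse_rates is a hypothetical helper, and the input lines are copied verbatim from the logs above.

```python
import re

# Matches the throughput line llama-cli printed after each reply in the logs above.
_RATE = re.compile(r"Prompt:\s*([\d.]+)\s*t/s\s*\|\s*Generation:\s*([\d.]+)\s*t/s")


def parse_rates(line):
    """Return (prompt_tps, generation_tps) as floats, or None if the line does not match."""
    m = _RATE.search(line)
    return (float(m.group(1)), float(m.group(2))) if m else None


# Lines copied from the three runs above (same machine, CPU only).
RUNS = {
    "Qwen3.6-35B-A3B-UD-Q2_K_XL": "Prompt: 29.6 t/s | Generation: 11.0 t/s ]",
    "Llama-3.2-3B-Instruct-Q4_K_M": "Prompt: 89.7 t/s | Generation: 20.6 t/s ]",
    "gemma-4-E2B-it-Q4_K_M": "Prompt: 107.4 t/s | Generation: 19.1 t/s ]",
}

if __name__ == "__main__":
    for model, line in RUNS.items():
        prompt_tps, gen_tps = parse_rates(line)
        print(f"{model:32s} prompt {prompt_tps:6.1f} t/s   generation {gen_tps:5.1f} t/s")
```

Each number comes from a single short prompt, so treat the comparison as a rough feel, not a proper benchmark.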

Posted by 구차니

qwen is hot right now and I happened to come across this link.. maybe running it this way is easier?

 

[Link : https://unsloth.ai/docs/models/qwen3.6]

Posted by 구차니

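On Linux I just installed it and didn't configure anything; going by the netstat output below, the default listens on 127.0.0.1 only, so it is local-only rather than open to every IP.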

$ netstat -tnlp
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 127.0.0.1:11434         0.0.0.0:*               LISTEN      -  
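
Reading the Local Address column is the whole trick here, so a small sketch of that check. listen_scope is a hypothetical helper; it only classifies the address string and does not touch the network. As far as I know, Ollama's docs describe an OLLAMA_HOST environment variable for changing the bind address, but that is outside this snippet.

```python
import ipaddress


def listen_scope(local_address):
    """Classify a netstat 'Local Address' value such as '127.0.0.1:11434'."""
    # Split on the last ':' so IPv6 forms like ':::11434' keep '::' as the host.
    host, _, _port = local_address.rpartition(":")
    host = host.strip("[]")
    if host in ("", "*"):
        return "all interfaces"
    ip = ipaddress.ip_address(host)
    if ip.is_loopback:
        return "loopback only"
    if ip.is_unspecified:  # 0.0.0.0 or ::
        return "all interfaces"
    return "single interface"


if __name__ == "__main__":
    print(listen_scope("127.0.0.1:11434"))
```

For the line above this reports loopback only, meaning other machines cannot connect until the bind address changes.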

 

On Windows the GUI client seems to have a setting for this.. but I don't think I ever actually checked the port.

[Link : http://practical.kr/?p=809]

Posted by 구차니
Program usage/vi 2026. 4. 20. 10:08

Wow, what a sweet trick: in vi, % jumps between matching brackets.

[Link : https://wikidocs.net/302522#_15]

 

Maybe I looked this one up before too and just forgot...

Posted by 구차니

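Hmm.. it looks like access to tokenizer.json was granted about 20 minutes after I applied. (I had already said "whatever, I'm going to bed" and left...)

The llama vocab has no tokens that show up as Hangul, so how does it recognize Korean at all? Curious..?

The regex has a ? in it, so at first glance it looks like anything unmatched just gets pulled out one character at a time.. yikes. More likely, though, it's the ByteLevel pre-tokenizer in the config below: it turns every UTF-8 byte into a printable stand-in character, so non-Latin text sits in the vocab as strings like "ãĥ¼" and never shows up as literal Hangul.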

{
  "version": "1.0",
  "truncation": null,
  "padding": null,
  "added_tokens": [
    {
      "id": 128000,
      "content": "<|begin_of_text|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 128001,
      "content": "<|end_of_text|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
  ],
  "normalizer": null,
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {
        "type": "Split",
        "pattern": {
          "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "ByteLevel",
        "add_prefix_space": false,
        "trim_offsets": true,
        "use_regex": false
      }
    ]
  },


      // ...

  "model": {
    "type": "BPE",
    "dropout": null,
    "unk_token": null,
    "continuing_subword_prefix": null,
    "end_of_word_suffix": null,
    "fuse_unk": false,
    "byte_fallback": false,
    "ignore_merges": true,
    "vocab": {
      "!": 0,
      "\"": 1,
      "#": 2,
      "$": 3,
      "%": 4,

      // ...

      "ÙĨØ¨": 127996,
      "ĠÐ²ÑĭÑģÐ¾ÐºÐ¾Ð¹": 127997,
      "ãĥ¼ãĥ¼": 127998,
      "éĶ¦": 127999
    },
    "merges": [
      "Ġ Ġ",
      "Ġ ĠĠĠ",

      // ...

      "ãĥ¼ ãĥ¼",
      "ãĥ¼ãĥ ¼",
      "éĶ ¦" 
    ]
  }
}

[Link : https://huggingface.co/meta-llama/Llama-3.2-1B/tree/main]
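
Why grep finds no Hangul in a byte-level vocab can be sketched in a few lines. This re-implements the byte-to-character table from GPT-2 style byte-level BPE, which I'm assuming is what the ByteLevel pre-tokenizer above uses; to_vocab_form and from_vocab_form are hypothetical helper names.

```python
def bytes_to_unicode():
    """Byte -> printable character table used by GPT-2 style byte-level BPE."""
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in keep}
    shift = 0
    for b in range(256):
        if b not in table:
            # Bytes without a printable form are moved to code points from U+0100 up.
            table[b] = chr(256 + shift)
            shift += 1
    return table


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_vocab_form(text):
    """Show how text looks as a key in a byte-level vocab."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_vocab_form(token):
    """Turn a byte-level vocab key back into readable text."""
    raw = bytes(CHAR_TO_BYTE[c] for c in token)
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(to_vocab_form("자동차"))
    print(from_vocab_form("ãĥ¼ãĥ¼"))
```

So to look for Korean in such a vocab, decode each key with from_vocab_form first and search the decoded text, not the raw file.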

 

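Huh.. exaone is made by LG, so I expected it to have Hangul tokens, but apart from a few added tokens there are none? (If this vocab is byte-level like the llama one, a grep for literal Hangul can't see Korean tokens anyway.)

So.. does a Korean word like '자동차' get split into 자/동/차 and cost 3 tokens in total?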

$ grep -P '\p{Hangul}' exa_tokenizer.json 
      "content": "리앙쿠르",
      "content": "훈민정음",
      "content": "애국가",
      "리앙쿠르": 94,
      "훈민정음": 99,
      "애국가": 100,

[Link : https://huggingface.co/LGAI-EXAONE/EXAONE-4.5-33B/tree/main]
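
On the "3 tokens for 자동차?" question, one thing that can be checked without the tokenizer is the byte count. A minimal sketch, assuming a byte-level BPE tokenizer: the worst case is one token per UTF-8 byte, and the real count depends on which merges the vocab contains. utf8_byte_counts is a hypothetical helper.

```python
def utf8_byte_counts(text):
    """Return [(character, number of UTF-8 bytes)] for each character in text."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


def worst_case_tokens(text):
    """Upper bound for a byte-level tokenizer: every byte stays its own token."""
    return sum(n for _, n in utf8_byte_counts(text))


if __name__ == "__main__":
    word = "자동차"
    print(utf8_byte_counts(word))
    print(worst_case_tokens(word))
```

Each Hangul syllable is 3 bytes in UTF-8, so the answer sits somewhere between 1 token (fully merged) and 9 (no merges at all).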

Posted by 구차니

Models are stored as blobs with the hash as the file name, so I got curious and picked it apart.

gemma4 e2b

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
  "config": {
    "mediaType": "application/vnd.docker.container.image.v1+json",
    "digest": "sha256:c6bc3775a3fa9935ce4a3ccd7abc59e936c3de9308d2cc090516012f43ed9c07",
    "size": 473
  },
  "layers": [
    {
      "mediaType": "application/vnd.ollama.image.model",
      "digest": "sha256:4e30e2665218745ef463f722c0bf86be0cab6ee676320f1cfadf91e989107448",
      "size": 7162394016
    },
    {
      "mediaType": "application/vnd.ollama.image.license",
      "digest": "sha256:7339fa418c9ad3e8e12e74ad0fd26a9cc4be8703f9c110728a992b193be85cb2",
      "size": 11355
    },
    {
      "mediaType": "application/vnd.ollama.image.params",
      "digest": "sha256:56380ca2ab89f1f68c283f4d50863c0bcab52ae3f1b9a88e4ab5617b176f71a3",
      "size": 42
    }
  ]
}
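
A small sketch for walking a manifest like the one above and listing which blob holds what. The manifest text is copied from above; list_blobs is a hypothetical helper, and the blob file name (digest with ':' replaced by '-') is my assumption about how the files are named on disk, so check it against your own blobs directory.

```python
import json

MANIFEST = """
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
  "config": {
    "mediaType": "application/vnd.docker.container.image.v1+json",
    "digest": "sha256:c6bc3775a3fa9935ce4a3ccd7abc59e936c3de9308d2cc090516012f43ed9c07",
    "size": 473
  },
  "layers": [
    {
      "mediaType": "application/vnd.ollama.image.model",
      "digest": "sha256:4e30e2665218745ef463f722c0bf86be0cab6ee676320f1cfadf91e989107448",
      "size": 7162394016
    },
    {
      "mediaType": "application/vnd.ollama.image.license",
      "digest": "sha256:7339fa418c9ad3e8e12e74ad0fd26a9cc4be8703f9c110728a992b193be85cb2",
      "size": 11355
    },
    {
      "mediaType": "application/vnd.ollama.image.params",
      "digest": "sha256:56380ca2ab89f1f68c283f4d50863c0bcab52ae3f1b9a88e4ab5617b176f71a3",
      "size": 42
    }
  ]
}
"""


def list_blobs(manifest_text):
    """Return [(mediaType, assumed blob file name, size in bytes)], config first."""
    manifest = json.loads(manifest_text)
    entries = [manifest["config"]] + manifest["layers"]
    # Assumption: the blob file name is the digest with ':' swapped for '-'.
    return [(e["mediaType"], e["digest"].replace(":", "-"), e["size"]) for e in entries]


if __name__ == "__main__":
    for media_type, blob_name, size in list_blobs(MANIFEST):
        print(f"{size:>12d}  {media_type}  {blob_name}")
```

The sizes make the roles obvious: the 7 GB entry is the model itself and the rest are small JSON or text files.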

"sha256:c6bc3775a3fa9935ce4a3ccd7abc59e936c3de9308d2cc090516012f43ed9c07",
{
  "model_format": "gguf",
  "model_family": "gemma4",
  "model_families": [
    "gemma4"
  ],
  "model_type": "5.1B",
  "file_type": "Q4_K_M",
  "renderer": "gemma4",
  "parser": "gemma4",
  "requires": "0.20.0",
  "architecture": "amd64",
  "os": "linux",
  "rootfs": {
    "type": "layers",
    "diff_ids": [
      "sha256:4e30e2665218745ef463f722c0bf86be0cab6ee676320f1cfadf91e989107448",
      "sha256:7339fa418c9ad3e8e12e74ad0fd26a9cc4be8703f9c110728a992b193be85cb2",
      "sha256:56380ca2ab89f1f68c283f4d50863c0bcab52ae3f1b9a88e4ab5617b176f71a3"
    ]
  }
}

"sha256:4e30e2665218745ef463f722c0bf86be0cab6ee676320f1cfadf91e989107448"
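GGUF (binary file; the dump starts with the GGUF magic, and among the binary bytes readable metadata keys such as gemma4.attention.head_count show up)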

"sha256:7339fa418c9ad3e8e12e74ad0fd26a9cc4be8703f9c110728a992b193be85cb2",
                                Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/



"sha256:56380ca2ab89f1f68c283f4d50863c0bcab52ae3f1b9a88e4ab5617b176f71a3",
{
  "temperature": 1,
  "top_k": 64,
  "top_p": 0.95
}

 

So the knobs that shape the output side are temperature, top_k, and top_p.

[Link : https://wikidocs.net/333750]
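
How those three params act on the next-token distribution, as a toy sketch in plain Python. This is a textbook version (temperature, then top-k, then top-p) with the defaults taken from the params blob above; the order and details inside ollama or llama.cpp may differ, and filter_probs is a hypothetical name.

```python
import math


def filter_probs(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    # Temperature: divide the logits, then softmax (shifted by the max for stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # top_k: keep only the k most likely tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize so the surviving tokens sum to 1 before sampling.
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


if __name__ == "__main__":
    print(filter_probs([2.0, 1.0, 0.0, -1.0]))
```

A sampler would then draw one index from the returned dict, for example with random.choices.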

 

 

On Hugging Face there was a separate tokenizer.json, but that seems to be the older(?) layout,

and the newer GGUF format apparently bundles the whole tokenizer.. so how do I extract it?

[Link : https://www.minzkn.com/vibecoding/pages/gguf-format.html]

[Link : https://huggingface.co/docs/transformers/ko/gguf]

[Link : https://bitwise-life.tistory.com/5] << shows the token list
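
One way to get at the bundled tokenizer is to read the GGUF metadata directly. A minimal sketch based on my reading of the GGUF layout (magic, uint32 version, uint64 tensor count, uint64 key/value count, then typed key/value pairs, all little-endian); it assumes a version 2 or later file, and read_gguf_metadata is a hypothetical helper, not part of any library.

```python
import struct

# GGUF scalar value type ids -> struct formats (little-endian).
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read one fixed-size value described by a struct format."""
    return struct.unpack(fmt, f.read(struct.calcsize(fmt)))[0]


def _read_string(f):
    """GGUF string: uint64 length followed by that many UTF-8 bytes."""
    length = _read(f, "<Q")
    return f.read(length).decode("utf-8", errors="replace")


def _read_value(f, value_type):
    if value_type in _SCALARS:
        return _read(f, _SCALARS[value_type])
    if value_type == _STRING:
        return _read_string(f)
    if value_type == _ARRAY:
        item_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, item_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {value_type}")


def read_gguf_metadata(f):
    """Return (version, tensor_count, metadata dict) from an open binary file."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version = _read(f, "<I")
    tensor_count = _read(f, "<Q")
    kv_count = _read(f, "<Q")
    metadata = {}
    for _ in range(kv_count):
        key = _read_string(f)
        value_type = _read(f, "<I")
        metadata[key] = _read_value(f, value_type)
    return version, tensor_count, metadata
```

Usage would look like opening the blob with open(path, "rb") and reading metadata["tokenizer.ggml.tokens"], which is the key I'd expect the token list to sit under; print the keys first to confirm on your file.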

Posted by 구차니

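I couldn't quite figure out how to download llama or gemma, so I grabbed the easy(?) option, MS's phi3, and analyzed that!

(gemma and llama need an access request for the repo.. they show up as gated models, so who knows when I'll be approved)

[Link : https://huggingface.co/docs/transformers/model_doc/phi3]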

 

~/.cache/huggingface/hub/models--microsoft--Phi-3-mini-4k-instruct/snapshots/f39ac1d28e925b323eae81227eaba4464caced4e$ ls -al
합계 12
drwxrwxr-x 2 minimonk minimonk 4096  4월 19 21:58 .
drwxrwxr-x 3 minimonk minimonk 4096  4월 19 21:58 ..
lrwxrwxrwx 1 minimonk minimonk   52  4월 19 21:58 added_tokens.json -> ../../blobs/178968dec606c790aa335e9142f6afec37288470
lrwxrwxrwx 1 minimonk minimonk   52  4월 19 21:58 config.json -> ../../blobs/b9b031fadda61a035b2e8ceb4362cbf604002b21
lrwxrwxrwx 1 minimonk minimonk   52  4월 19 21:58 special_tokens_map.json -> ../../blobs/c6a944b4d49ce5d79030250ed6bdcbb1a65dfda1
lrwxrwxrwx 1 minimonk minimonk   52  4월 19 21:58 tokenizer.json -> ../../blobs/88ec145f4e7684c009bc6d55df24bb82c7d3c379
lrwxrwxrwx 1 minimonk minimonk   76  4월 19 21:58 tokenizer.model -> ../../blobs/9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347
lrwxrwxrwx 1 minimonk minimonk   52  4월 19 21:58 tokenizer_config.json -> ../../blobs/67aa82cddb4d66391ddf31ff99f059239bd2d1e7

 

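Opening tokenizer.json, the tokens look like the list below..

Ugh.. if this is the trend(?), Korean gets one token per character, so token counts are going to blow up?

With GPT's help I learned you can search with that odd-looking character-class pattern, wow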

$ grep -P '\p{Hangul}' tokenizer.json 
      "이": 30393,
      "의": 30708,
      "다": 30709,
      "스": 30784,
      "사": 30791,
      "지": 30811,
      "리": 30826,
      "기": 30827,
      "정": 30852,
      "아": 30860,
      "한": 30877,
      "시": 30889,
      "대": 30890,
      "가": 30903,
      "로": 30906,
      "인": 30918,
      "하": 30944,
      "수": 30970,
      "주": 30981,
      "동": 31000,
      "자": 31013,
      "에": 31054,
      "니": 31063,
      "는": 31081,
      "서": 31093,
      "김": 31102,
      "성": 31126,
      "어": 31129,
      "도": 31136,
      "고": 31137,
      "일": 31153,
      "상": 31158,
      "전": 31170,
      "트": 31177,
      "소": 31189,
      "라": 31197,
      "원": 31198,
      "보": 31199,
      "나": 31207,
      "화": 31225,
      "구": 31231,
      "신": 31262,
      "부": 31279,
      "연": 31285,
      "을": 31286,
      "영": 31288,
      "국": 31293,
      "장": 31299,
      "제": 31306,
      "우": 31327,
      "공": 31334,
      "선": 31345,
      "오": 31346,
      "은": 31354,
      "미": 31362,
      "경": 31378,
      "문": 31406,
      "조": 31408,
      "마": 31417,
      "해": 31435,
      "여": 31457,
      "산": 31458,
      "비": 31487,
      "드": 31493,
      "를": 31517,
      "요": 31527,
      "유": 31533,
      "진": 31536,
      "천": 31563,
      "년": 31571,
      "세": 31578,
      "민": 31582,
      "호": 31603,
      "그": 31607,
      "현": 31680,
      "군": 31699,
      "무": 31716,
      "위": 31724,
      "안": 31734,
      "박": 31736,
      "용": 31737,
      "단": 31746,
      "면": 31747,
      "남": 31754,
      "강": 31774,
      "씨": 31781,
      "개": 31789,
      "들": 31804,
      "차": 31817,
      "학": 31822,
      "만": 31826,
      "터": 31856,
      "식": 31895,
      "과": 31906,
      "타": 31925,
      "종": 31930,
      "내": 31940,
      "중": 31941,
      "방": 31945,
      "월": 31950,
      "회": 31953,
      "모": 31962,
      "바": 31963,
      "음": 31966,
      "교": 31972,
      "재": 31973,
      "명": 31976,
      "합": 31980,
      "역": 31987,
      "백": 31989,
      "왕": 31996,

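
The same search as the grep above, done in Python so the result can be counted or sorted. A rough sketch: the code point ranges are my approximation of \p{Hangul}, and has_hangul / hangul_tokens are hypothetical helpers. It only sees Hangul stored as literal characters, which is how this phi3 vocab stores it.

```python
# Unicode blocks carrying the Hangul script (an approximation of \p{Hangul}).
_HANGUL_RANGES = (
    (0x1100, 0x11FF),  # Hangul Jamo
    (0x3130, 0x318F),  # Hangul Compatibility Jamo
    (0xA960, 0xA97F),  # Hangul Jamo Extended-A
    (0xAC00, 0xD7AF),  # Hangul Syllables
    (0xD7B0, 0xD7FF),  # Hangul Jamo Extended-B
)


def has_hangul(text):
    """True if any character of text falls inside a Hangul block."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in _HANGUL_RANGES)


def hangul_tokens(vocab):
    """Filter a {token: id} vocab down to the tokens that contain Hangul."""
    return {tok: idx for tok, idx in vocab.items() if has_hangul(tok)}


if __name__ == "__main__":
    # Three entries taken from the outputs above; a real run would load
    # json.load(open("tokenizer.json"))["model"]["vocab"] instead.
    sample = {"이": 30393, "왕": 31996, "he": 258}
    print(hangul_tokens(sample))
```

With the full vocab loaded, len(hangul_tokens(vocab)) gives the number of Hangul tokens directly.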
 

Posted by 구차니

Out of boredom I wanted to know the per-word(?) token indices, wondered whether there's any way to see them, and looked into it.

 

[Link : https://makenow90.tistory.com/59]

[Link : https://medium.com/the-research-nest/explained-tokens-and-embeddings-in-llms-69a16ba5db33]

 

I didn't expect AutoTokenizer.from_pretrained() to download things on the spot as soon as I ran it -ㅁ-

$ python3
Python 3.10.12 (main, Mar  3 2026, 11:56:32) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers import AutoTokenizer

Disabling PyTorch because PyTorch >= 2.4 is required but found 2.1.2
PyTorch was not found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
>>> 
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
config.json: 100%|██████████████████████████████| 665/665 [00:00<00:00, 819kB/s]
tokenizer_config.json: 100%|█████████████████| 26.0/26.0 [00:00<00:00, 37.5kB/s]
vocab.json: 1.04MB [00:00, 2.70MB/s]
merges.txt: 456kB [00:00, 1.15MB/s]
tokenizer.json: 1.36MB [00:00, 3.23MB/s]
>>> 
>>> tokenizer
GPT2Tokenizer(name_or_path='gpt2', vocab_size=50257, model_max_length=1024, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}, added_tokens_decoder={
50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
})
>>> text="안녕? hello?"
>>> tokens=tokenizer(text)
>>> tokens
{'input_ids': [168, 243, 230, 167, 227, 243, 30, 23748, 30], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

[Link : https://data-scient2st.tistory.com/224]

[Link : https://huggingface.co/docs/transformers/model_doc/gpt2]
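
The first seven input_ids above (168, 243, 230, 167, 227, 243, 30) turn out to be nothing more than the raw UTF-8 bytes of "안녕?", one token per byte. A small sketch that rebuilds them without transformers, assuming ids 0..255 in GPT-2's vocab are the single-byte tokens in the usual byte-level BPE order (printable bytes first, then the rest ascending); gpt2_byte_token_ids and as_single_byte_tokens are hypothetical helpers.

```python
def gpt2_byte_token_ids():
    """Map each byte value to the id of its single-byte token in GPT-2's vocab."""
    # Printable bytes come first: '!'..'~', then 0xA1..0xAC, then 0xAE..0xFF.
    printable = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    # The remaining bytes (control characters, space, etc.) follow in ascending order.
    rest = [b for b in range(256) if b not in printable]
    return {b: i for i, b in enumerate(printable + rest)}


BYTE_ID = gpt2_byte_token_ids()


def as_single_byte_tokens(text):
    """Token ids if no merge applies, i.e. one token per UTF-8 byte."""
    return [BYTE_ID[b] for b in text.encode("utf-8")]


if __name__ == "__main__":
    print(as_single_byte_tokens("안녕?"))
```

The " hello" part is different: it comes back as the single id 23748 because GPT-2 has a merged token for it, which is exactly the advantage Korean text does not get here.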

 

$ find ./ -name tokenizer.json
./.cache/huggingface/hub/models--gpt2/snapshots/607a30d783dfa663caf39e06633721c8d4cfcd7e/tokenizer.json

 

$ tree ~/.cache/huggingface/hub/models--gpt2/snapshots/607a30d783dfa663caf39e06633721c8d4cfcd7e/
/home/minimonk/.cache/huggingface/hub/models--gpt2/snapshots/607a30d783dfa663caf39e06633721c8d4cfcd7e/
├── config.json -> ../../blobs/10c66461e4c109db5a2196bff4bb59be30396ed8
├── merges.txt -> ../../blobs/226b0752cac7789c48f0cb3ec53eda48b7be36cc
├── tokenizer.json -> ../../blobs/4b988bccc9dc5adacd403c00b4704976196548f8
├── tokenizer_config.json -> ../../blobs/be4d21d94f3b4687e5a54d84bf6ab46ed0f8defd
└── vocab.json -> ../../blobs/1f1d9aaca301414e7f6c9396df506798ff4eb9a6

 

+

2026.04.18

windows

 

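beautifier + partial excerpt

It seems to cover Latin-script languages, and judging by how text gets cut into tokens, the unit isn't the word;

even single words look like they get chopped into fragments?

tokenizer.json, followed by vocab.json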
{
  "version": "1.0",
  "truncation": null,
  "padding": null,
  "added_tokens": [
    {
      "id": 50256,
      "special": true,
      "content": "<|endoftext|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true
    }
  ],
  "normalizer": null,
  "pre_tokenizer": {
    "type": "ByteLevel",
    "add_prefix_space": false,
    "trim_offsets": true
  },
  "post_processor": {
    "type": "ByteLevel",
    "add_prefix_space": true,
    "trim_offsets": false
  },
  "decoder": {
    "type": "ByteLevel",
    "add_prefix_space": true,
    "trim_offsets": true
  },
  "model": {
    "dropout": null,
    "unk_token": null,
    "continuing_subword_prefix": "",
    "end_of_word_suffix": "",
    "fuse_unk": false,
    "vocab": {
      "0": 15,
      "1": 16,
      "2": 17,
      "3": 18,
      "4": 19,
      "5": 20,
      "6": 21,
      "7": 22,
      "8": 23,
      "9": 24,
      "!": 0,
      "\"": 1,
      "#": 2,
      "$": 3,
      "%": 4,
      "&": 5,
      "'": 6,
      "(": 7,
      ")": 8,
      "*": 9,
      "+": 10,
      ",": 11,
      "-": 12,
      ".": 13,
      "/": 14,
      ":": 25,
      ";": 26,
      "<": 27,
      "=": 28,
      ">": 29,
      "?": 30,
      "@": 31,
      "A": 32,
      "B": 33,
      "C": 34,
      "D": 35,
      "E": 36,
      "F": 37,
      "G": 38,
      "H": 39,
      "I": 40,
      "J": 41,
      "K": 42,
      "L": 43,
      "M": 44,
      "N": 45,
      "O": 46,
      "P": 47,
      "Q": 48,
      "R": 49,
      "S": 50,
      "T": 51,
      "U": 52,
      "V": 53,
      "W": 54,
      "X": 55,
      "Y": 56,
      "Z": 57,
      "[": 58,
      "\\": 59,
      "]": 60,
      "^": 61,
      "_": 62,
      "`": 63,
      "a": 64,
      "b": 65,
      "c": 66,
      "d": 67,
      "e": 68,
      "f": 69,
      "g": 70,
      "h": 71,
      "i": 72,
      "j": 73,
      "k": 74,
      "l": 75,
      "m": 76,
      "n": 77,
      "o": 78,
      "p": 79,
      "q": 80,
      "r": 81,
      "s": 82,
      "t": 83,
      "u": 84,
      "v": 85,
      "w": 86,
      "x": 87,
      "y": 88,
      "z": 89,
      "{": 90,
      "|": 91,
      "}": 92,
      "~": 93,
      "¡": 94,
      "¢": 95,
      "£": 96,
      "¤": 97,
      "¥": 98,
      "¦": 99,
      "he": 258,
      "in": 259,
      "re": 260,
      "on": 261,
      "Ġthe": 262,
      "er": 263,
      "Ġs": 264,
      "at": 265,
      "Ġw": 266,
      "Ġo": 267,
      "en": 268,
      "Ġc": 269,
      "it": 270,
      "is": 271,
      "an": 272,
      "or": 273,
      "es": 274,
      "Ġb": 275,
      "ed": 276,
      "Ġf": 277,
      "ing": 278,
      "Ġp": 279,
      "ou": 280,
      "Ġan": 281,
      "al": 282,
      "ar": 283,
      "Ġto": 284,
      "Ġm": 285,
      "Ġof": 286,
      "Ġin": 287,
      "Ġd": 288,
      "Ġh": 289,
      "Ġand": 290,
      "ic": 291,
      "as": 292,
      "le": 293,
      "Ġth": 294,
      "ion": 295,
      "om": 296,
      "ll": 297,
      "ent": 298,
      "Ġn": 299,
      "Ġl": 300,
      "st": 301,
      "Ġre": 302,
      "ve": 303,
      "Ġe": 304,
      "ro": 305,
      "ly": 306,
      "Ġbe": 307,
      "Ġg": 308,
      "ĠT": 309,
      "ct": 310,
      "ĠS": 311,
      "id": 312,
      "ot": 313,
      "ĠI": 314,
      "ut": 315,
      "et": 316,
      "ĠA": 317,
      "Ġis": 318,
      "Ġon": 319,
      "im": 320,
      "am": 321,
      "ow": 322,
      "ay": 323,
      "ad": 324,
      "se": 325,
      "Ġthat": 326,
      "ĠC": 327,
      "ig": 328,
      "Ġfor": 329,
      "ac": 330,
      "Ġy": 331,
      "ver": 332,
      "ur": 333,
      "Ġu": 334,
      "ld": 335,
      "Ġst": 336,
      "ĠM": 337,
      "'s": 338,
      "Ġhe": 339,
      "Ġit": 340,
      "ation": 341,
      "ith": 342,
      "ir": 343,
      "ce": 344,
      "Ġyou": 345,
      "il": 346,
      "ĠB": 347,
      "Ġwh": 348,
      "ol": 349,
      "ĠP": 350,
      "Ġwith": 351,
      "Ġ1": 352,
      "ter": 353,
      "ch": 354,
      "Ġas": 355,
      "Ġwe": 356,
      "Ġ(": 357,
      "nd": 358,
      "ill": 359,
      "ĠD": 360,
      "if": 361,
      "Ġ2": 362,
      "ag": 363,
      "ers": 364,
      "ke": 365,
      "Ġ\"": 366,
      "ĠH": 367,
      "em": 368,
      "Ġcon": 369,
      "ĠW": 370,
      "ĠR": 371,
      "her": 372,
      "Ġwas": 373,
      "Ġr": 374,
      "od": 375,
      "ĠF": 376,
      "ul": 377,
      "ate": 378,
      "Ġat": 379,
      "ri": 380,
      "pp": 381,
      "ore": 382,
      "ĠThe": 383,
      "Ġse": 384,
      "us": 385,
      "Ġpro": 386,
      "Ġha": 387,
      "um": 388,
      "Ġare": 389,
      "Ġde": 390,
      "ain": 391,
      "and": 392,

{
  "0": 15,
  "1": 16,
  "2": 17,
  "3": 18,
  "4": 19,
  "5": 20,
  "6": 21,
  "7": 22,
  "8": 23,
  "9": 24,
  "!": 0,
  "\"": 1,
  "#": 2,
  "$": 3,
  "%": 4,
  "&": 5,
  "'": 6,
  "(": 7,
  ")": 8,
  "*": 9,
  "+": 10,
  ",": 11,
  "-": 12,
  ".": 13,
  "/": 14,
  ":": 25,
  ";": 26,
  "<": 27,
  "=": 28,
  ">": 29,
  "?": 30,
  "@": 31,
  "A": 32,
  "B": 33,
  "C": 34,
  "D": 35,
  "E": 36,
  "F": 37,
  "G": 38,
  "H": 39,
  "I": 40,
  "J": 41,
  "K": 42,
  "L": 43,
  "M": 44,
  "N": 45,
  "O": 46,
  "P": 47,
  "Q": 48,
  "R": 49,
  "S": 50,
  "T": 51,
  "U": 52,
  "V": 53,
  "W": 54,
  "X": 55,
  "Y": 56,
  "Z": 57,
  "[": 58,
  "\\": 59,
  "]": 60,
  "^": 61,
  "_": 62,
  "`": 63,
  "a": 64,
  "b": 65,
  "c": 66,
  "d": 67,
  "e": 68,
  "f": 69,
  "g": 70,
  "h": 71,
  "i": 72,
  "j": 73,
  "k": 74,
  "l": 75,
  "m": 76,
  "n": 77,
  "o": 78,
  "p": 79,
  "q": 80,
  "r": 81,
  "s": 82,
  "t": 83,
  "u": 84,
  "v": 85,
  "w": 86,
  "x": 87,
  "y": 88,
  "z": 89,
  "{": 90,
  "|": 91,
  "}": 92,
  "~": 93,
  "¡": 94,
  "¢": 95,
  "£": 96,
  "¤": 97,
  "¥": 98,
  "¦": 99,
  "he": 258,
  "in": 259,
  "re": 260,
  "on": 261,
  "Ġthe": 262,
  "er": 263,
  "Ġs": 264,
  "at": 265,
  "Ġw": 266,
  "Ġo": 267,
  "en": 268,
  "Ġc": 269,
  "it": 270,
  "is": 271,
  "an": 272,
  "or": 273,
  "es": 274,
  "Ġb": 275,
  "ed": 276,
  "Ġf": 277,
  "ing": 278,
  "Ġp": 279,
  "ou": 280,
  "Ġan": 281,
  "al": 282,
  "ar": 283,
  "Ġto": 284,
  "Ġm": 285,
  "Ġof": 286,
  "Ġin": 287,
  "Ġd": 288,
  "Ġh": 289,
  "Ġand": 290,
  "ic": 291,
  "as": 292,
  "le": 293,
  "Ġth": 294,
  "ion": 295,
  "om": 296,
  "ll": 297,
  "ent": 298,
  "Ġn": 299,
  "Ġl": 300,
  "st": 301,
  "Ġre": 302,
  "ve": 303,
  "Ġe": 304,
  "ro": 305,
  "ly": 306,
  "Ġbe": 307,
  "Ġg": 308,
  "ĠT": 309,
  "ct": 310,
  "ĠS": 311,
  "id": 312,
  "ot": 313,
  "ĠI": 314,
  "ut": 315,
  "et": 316,
  "ĠA": 317,
  "Ġis": 318,
  "Ġon": 319,
  "im": 320,
  "am": 321,
  "ow": 322,
  "ay": 323,
  "ad": 324,
  "se": 325,
  "Ġthat": 326,
  "ĠC": 327,
  "ig": 328,
  "Ġfor": 329,
  "ac": 330,
  "Ġy": 331,
  "ver": 332,
  "ur": 333,
  "Ġu": 334,
  "ld": 335,
  "Ġst": 336,
  "ĠM": 337,
  "'s": 338,
  "Ġhe": 339,
  "Ġit": 340,
  "ation": 341,
  "ith": 342,
  "ir": 343,
  "ce": 344,
  "Ġyou": 345,
  "il": 346,
  "ĠB": 347,
  "Ġwh": 348,
  "ol": 349,
  "ĠP": 350,
  "Ġwith": 351,
  "Ġ1": 352,
  "ter": 353,
  "ch": 354,
  "Ġas": 355,
  "Ġwe": 356,
  "Ġ(": 357,
  "nd": 358,
  "ill": 359,
  "ĠD": 360,
  "if": 361,
  "Ġ2": 362,
  "ag": 363,
  "ers": 364,
  "ke": 365,
  "Ġ\"": 366,
  "ĠH": 367,
  "em": 368,
  "Ġcon": 369,
  "ĠW": 370,
  "ĠR": 371,
  "her": 372,
  "Ġwas": 373,
  "Ġr": 374,
  "od": 375,
  "ĠF": 376,
  "ul": 377,
  "ate": 378,
  "Ġat": 379,
  "ri": 380,
  "pp": 381,
  "ore": 382,
  "ĠThe": 383,
  "Ġse": 384,
  "us": 385,
  "Ġpro": 386,
  "Ġha": 387,
  "um": 388,
  "Ġare": 389,
  "Ġde": 390,
  "ain": 391,
  "and": 392,

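The entries above look like a byte-level BPE vocabulary in the GPT-2 style: IDs 0 to 93 cover printable ASCII from `!` to `~`, and `Ġ` stands in for a leading space. A minimal sketch of that byte-to-character table, assuming the GPT-2 convention applies to this tokenizer. The function name `bytes_to_unicode` follows GPT-2's reference code; `base_ids` is a name made up for this example.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Remaining bytes (control characters, space, ...) move to 256 and up.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(cp) for b, cp in zip(byte_values, code_points)}


table = bytes_to_unicode()
# Assumption: single-character tokens are numbered in table order.
base_ids = {ch: i for i, ch in enumerate(table.values())}

print(table[ord(" ")])  # Ġ, the stand-in for a space
print(base_ids["!"], base_ids["A"], base_ids["a"], base_ids["¡"])  # 0 32 64 94
```

Under that assumption, a token such as `Ġthe` reads as "space followed by the", which is why so many of the merged entries start with `Ġ`.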
 

+

>>> tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-E2B")
config.json: 4.91kB [00:00, 8.83MB/s]
C:\Users\minimonk\AppData\Local\Programs\Python\Python313\Lib\site-packages\huggingface_hub\file_download.py:138: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\minimonk\.cache\huggingface\hub\models--google--gemma-4-E2B. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
  warnings.warn(message)
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████| 906/906 [00:00<00:00, 2.57MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████| 32.2M/32.2M [00:02<00:00, 13.3MB/s]
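
The warning in the log above names the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable as the way to silence it. A minimal sketch, assuming the variable has to be set before `huggingface_hub` is first imported and that `"1"` counts as true. `load_tokenizer` is a hypothetical helper, not part of any library.

```python
import os

# Set the flag first; assumption: huggingface_hub reads it at import time.
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"


def load_tokenizer(model_id="google/gemma-4-E2B"):
    """Hypothetical helper: download (or reuse the cached) tokenizer."""
    # Imported lazily so the environment variable above is already in place.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_id)


# Usage (needs the transformers package and network or cache access):
# tokenizer = load_tokenizer()
# print(tokenizer.tokenize("Hello world"))
```

Caching still works without symlinks, as the warning itself says, so this only hides the message.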

 

 

Posted by 구차니

llama.cpp is said to perform better than ollama, so I should look up how to use it.

[Link: https://peekaboolabs.ai/blog/ollama-vs-llama-cpp-guide]

[Link: https://news.hada.io/topic?id=28622]

[Link: https://github.com/ggml-org/llama.cpp]

 

It isn't official, but pre-built binaries for Windows do seem to exist.

[Link: https://github.com/HPUhushicheng/llama.cpp_windows]

 

First, the Python library:

from llama_cpp import Llama

llm = Llama(
      model_path="./models/7B/llama-model.gguf",
      # n_gpu_layers=-1, # Uncomment to use GPU acceleration
      # seed=1337, # Uncomment to set a specific seed
      # n_ctx=2048, # Uncomment to increase the context window
)
output = llm(
      "Q: Name the planets in the solar system? A: ", # Prompt
      max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
      stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
print(output)

[Link: https://pypi.org/project/llama-cpp-python/]
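
The call above returns an OpenAI-style completion dictionary rather than a plain string. A small sketch of pulling the answer out of it: `extract_answer` is a hypothetical helper and `sample` is a hand-written stand-in for real model output, with field names taken from the example response in the llama-cpp-python README.

```python
def extract_answer(output, prompt=""):
    """Return the generated text, without the echoed prompt if present."""
    text = output["choices"][0]["text"]
    # With echo=True the prompt is repeated at the start of the text.
    if prompt and text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


prompt = "Q: Name the planets in the solar system? A: "
# Hand-written stand-in, not real model output.
sample = {
    "object": "text_completion",
    "choices": [
        {"text": prompt + "Mercury, Venus, Earth", "index": 0, "finish_reason": "stop"}
    ],
}
print(extract_answer(sample, prompt))  # Mercury, Venus, Earth
```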

 

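After confirming that the OpenCL driver is installed,

LLAMA_CLBLAST=1 make

compiling like this is said to work.

Since it uses make, this is presumably for building on Linux; on Windows you would have to use cmake.

[Link: https://arca.live/b/alpaca/76969814]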

 

Does it accelerate only token generation, rather than the whole computation?

OpenCL Token Generation Acceleration

[Link: https://github.com/ggml-org/llama.cpp/releases/tag/master-2e6cd4b]

 

To get this running on the XTX I had to install the latest 5.5 version of the AMD linux drivers, which are released but not available from the normal AMD download page yet. You can get the deb for the installer here. I installed with amdgpu-install --usecase=opencl,rocm and installed CLBlast after apt install libclblast-dev.

Confirm opencl is working with sudo clinfo (did not find the GPU device unless I run as root).

Build llama.cpp (with merged pull) using LLAMA_CLBLAST=1 make.

[Link: https://www.reddit.com/r/LocalLLaMA/comments/13m8li2/finally_got_a_model_running_on_my_xtx_using/]
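
The steps quoted above (confirm OpenCL with `clinfo`, then build with `LLAMA_CLBLAST=1 make`) can be sketched as a small script. This assumes a checkout old enough to still ship the Makefile and the CLBlast option, which newer llama.cpp releases may not. `build_command` and its `dry_run` flag are hypothetical names for this example.

```python
import os
import shutil
import subprocess


def build_command(source_dir=".", dry_run=True):
    """Prepare (and optionally run) the CLBlast build described above."""
    # Same effect as prefixing the shell command with LLAMA_CLBLAST=1.
    env = {**os.environ, "LLAMA_CLBLAST": "1"}
    cmd = ["make"]
    # Only reports whether the clinfo tool is on PATH; it does not run it.
    has_clinfo = shutil.which("clinfo") is not None
    if not dry_run:
        subprocess.run(cmd, cwd=source_dir, env=env, check=True)
    return cmd, env, has_clinfo


cmd, env, has_clinfo = build_command(dry_run=True)
print(cmd, env["LLAMA_CLBLAST"], "clinfo found:", has_clinfo)
```

With `dry_run=True` nothing is built, so the script is safe to run outside a llama.cpp source tree.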

Posted by 구차니

There has been a lot of talk about things like running openclaw with LM Studio on a MacBook, so I looked it up.

Can it load and use models the way ollama does?

import lmstudio as lms

EXAMPLE_MESSAGES = (
    "My hovercraft is full of eels!",
    "I will not buy this record, it is scratched."
)

model = lms.llm()
chat = lms.Chat("You are a helpful shopkeeper assisting a foreign traveller")
for message in EXAMPLE_MESSAGES:
    chat.add_user_message(message)
    print(f"Customer: {message}")
    response = model.respond(chat)
    chat.add_assistant_response(response)
    print(f"Shopkeeper: {response}")

[Link: https://pypi.org/project/lmstudio/]

 

Where does that model name come from, and where is the model downloaded from?

-> It seems to be looked up by its Hugging Face model name and fetched in GGUF format.

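import { LMStudioClient } from "@lmstudio/sdk";

// Connects to the LM Studio instance running on this machine.
const client = new LMStudioClient();

// Load the model with an 8192-token context and a GPU offload ratio of 0.5.
const model = await client.llm.load("qwen2.5-7b-instruct", {
  config: {
    contextLength: 8192,
    gpu: {
      ratio: 0.5,
    },
  },
});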

[Link: https://lmstudio.ai/docs/typescript/llm-prediction/parameters]
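
For comparison with the TypeScript snippet above, here is a sketch of the same load-time settings through the Python SDK. It assumes `lms.llm()` accepts a `config` mapping with the same camelCase keys and that LM Studio is running locally with the model available. `load_model` and `LOAD_CONFIG` are hypothetical names.

```python
# Mirrors the TypeScript config: 8192-token context, GPU offload ratio 0.5.
LOAD_CONFIG = {
    "contextLength": 8192,
    "gpu": {"ratio": 0.5},
}


def load_model(model_key="qwen2.5-7b-instruct"):
    """Hypothetical wrapper around the SDK's model loader."""
    # Imported lazily so this file can be read or tested without the SDK.
    import lmstudio as lms

    # Assumption: config is passed as a keyword argument with these keys.
    return lms.llm(model_key, config=LOAD_CONFIG)


# Usage (needs the lmstudio package and a running LM Studio instance):
# model = load_model()
# print(model.respond("Hello"))
```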

 

[Link: https://lmstudio.ai/docs/python]

[Link: https://lmstudio.ai/]

Posted by 구차니