Too lazy to build it myself, so I'm just going with the pre-built binary lol

 

Linux:

[Link : https://github.com/ggml-org/llama.cpp/releases]

 

llama.cpp needs a model in GGUF format, so I gave up on the official Qwen3.6 repo

[Link : https://huggingface.co/Qwen/Qwen3.6-35B-A3B/tree/main]
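
A quick way to tell whether a file is actually GGUF: GGUF files begin with the four ASCII bytes "GGUF". Below is a minimal sketch of that check. The helper name is_gguf is made up for this post and is not part of llama.cpp, and it only looks at the magic bytes, so it says nothing about whether the rest of the file is intact.

```shell
# is_gguf FILE: succeed if FILE starts with the 4-byte magic "GGUF".
# (Hypothetical helper for illustration; not a llama.cpp tool.)
is_gguf() {
    # Read the first 4 bytes and render them as hex: G G U F -> 47 47 55 46
    magic=$(head -c 4 "$1" 2>/dev/null | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "47475546" ]
}

# Usage (the file name is a placeholder):
#   is_gguf model.gguf && echo "looks like GGUF" || echo "not GGUF"
```

This is also handy after a download, since it catches the case where an HTML page was saved instead of the model.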

 

Then I stumbled upon unsloth's quantized models!

By the way, if the format were called ggul (꿀, "honey" in Korean) instead of gguf, it would have tasted even sweeter.. oh well

 

Click "copy download link" and fetch the file with wget. I'm going to run it CPU-only for now, so I'm trying the Q2 model first..

[Link : https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/blob/main/Qwen3.6-35B-A3B-UD-Q2_K_XL.gguf]

[Link : https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/tree/main]
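
The download step could look roughly like this. The /blob/ link above opens the Hugging Face web page rather than the file itself. I'm assuming the copied download link is the same path with /resolve/ in place of /blob/, which is how Hugging Face file URLs are usually laid out. The wget line is left commented out because the file is large.

```shell
# Page URL from the model repo (the "blob" view is an HTML page, not the file).
blob_url="https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/blob/main/Qwen3.6-35B-A3B-UD-Q2_K_XL.gguf"

# Assumption: the direct download URL swaps /blob/ for /resolve/.
download_url=$(printf '%s\n' "$blob_url" | sed 's#/blob/#/resolve/#')
echo "$download_url"

# Uncomment to actually download; -c resumes a partially downloaded file.
# wget -c "$download_url"
```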

 

For now, llama-cli should be enough to try it out.

llama-cli
llama-cli is the command-line executor:

$ llama-cli -m model.gguf
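
For a CPU-only run, a slightly fuller invocation might look like the sketch below. The flags (-m, -t, -c, -n, -st, -p) all appear in the --help output further down. The model path, context size, token count, and prompt are placeholder values I picked, not recommendations.

```shell
# Hypothetical local path to the downloaded Q2 model.
MODEL="./Qwen3.6-35B-A3B-UD-Q2_K_XL.gguf"

# Use every core the system reports; fall back to 4 if nproc is unavailable.
THREADS=$(nproc 2>/dev/null || echo 4)

# Build the command as a string first so it can be inspected before running.
# -c 4096: context size, -n 256: max tokens to generate,
# -st: single turn, then exit; -p: the prompt for that turn.
CMD="./llama-cli -m $MODEL -t $THREADS -c 4096 -n 256 -st -p 'Hello'"
echo "$CMD"

# To run it for real:
# eval "$CMD"
```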

 

llama-server
llama-server launches an API server with a built-in WebUI:

$ llama-server --host address --port port -m model.gguf

[Link : https://wiki.archlinux.org/title/Llama.cpp]

[Link : https://www.lainyzine.com/ko/article/using-llama-cpp-for-local-llm-execution/]
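
If I end up trying llama-server instead, filling in the placeholders from the command above might look like this. The host, port, and model path are assumed values. The /health endpoint comes from the llama-server documentation rather than from anything tested in this post, so treat that line as an assumption too.

```shell
HOST="127.0.0.1"   # assumed: listen on localhost only
PORT="8080"        # assumed port
MODEL="./Qwen3.6-35B-A3B-UD-Q2_K_XL.gguf"   # hypothetical local path

# Same shape as the wiki example, with the placeholders filled in.
SERVER_CMD="./llama-server --host $HOST --port $PORT -m $MODEL"
echo "$SERVER_CMD"

# Once the server is up, the WebUI should be reachable in a browser at:
echo "http://$HOST:$PORT"

# Assumed health check (endpoint per the llama-server docs):
# curl "http://$HOST:$PORT/health"
```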

 

---------------------

Help output

$ ./llama-cli --help
load_backend: loaded RPC backend from /home/minimonk/src/llama-b8876/libggml-rpc.so
load_backend: loaded CPU backend from /home/minimonk/src/llama-b8876/libggml-cpu-haswell.so
----- common params -----

-h,    --help, --usage                  print usage and exit
--version                               show version and build info
--license                               show source code license and dependencies
-cl,   --cache-list                     show list of models in cache
--completion-bash                       print source-able bash completion script for llama.cpp
-t,    --threads N                      number of CPU threads to use during generation (default: -1)
                                        (env: LLAMA_ARG_THREADS)
-tb,   --threads-batch N                number of threads to use during batch and prompt processing (default:
                                        same as --threads)
-C,    --cpu-mask M                     CPU affinity mask: arbitrarily long hex. Complements cpu-range
                                        (default: "")
-Cr,   --cpu-range lo-hi                range of CPUs for affinity. Complements --cpu-mask
--cpu-strict <0|1>                      use strict CPU placement (default: 0)
--prio N                                set process/thread priority : low(-1), normal(0), medium(1), high(2),
                                        realtime(3) (default: 0)
--poll <0...100>                        use polling level to wait for work (0 - no polling, default: 50)
-Cb,   --cpu-mask-batch M               CPU affinity mask: arbitrarily long hex. Complements cpu-range-batch
                                        (default: same as --cpu-mask)
-Crb,  --cpu-range-batch lo-hi          ranges of CPUs for affinity. Complements --cpu-mask-batch
--cpu-strict-batch <0|1>                use strict CPU placement (default: same as --cpu-strict)
--prio-batch N                          set process/thread priority : 0-normal, 1-medium, 2-high, 3-realtime
                                        (default: 0)
--poll-batch <0|1>                      use polling to wait for work (default: same as --poll)
-c,    --ctx-size N                     size of the prompt context (default: 0, 0 = loaded from model)
                                        (env: LLAMA_ARG_CTX_SIZE)
-n,    --predict, --n-predict N         number of tokens to predict (default: -1, -1 = infinity)
                                        (env: LLAMA_ARG_N_PREDICT)
-b,    --batch-size N                   logical maximum batch size (default: 2048)
                                        (env: LLAMA_ARG_BATCH)
-ub,   --ubatch-size N                  physical maximum batch size (default: 512)
                                        (env: LLAMA_ARG_UBATCH)
--keep N                                number of tokens to keep from the initial prompt (default: 0, -1 =
                                        all)
--swa-full                              use full-size SWA cache (default: false)
                                        [(more
                                        info)](https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
                                        (env: LLAMA_ARG_SWA_FULL)
-fa,   --flash-attn [on|off|auto]       set Flash Attention use ('on', 'off', or 'auto', default: 'auto')
                                        (env: LLAMA_ARG_FLASH_ATTN)
-p,    --prompt PROMPT                  prompt to start generation with; for system message, use -sys
--perf, --no-perf                       whether to enable internal libllama performance timings (default:
                                        false)
                                        (env: LLAMA_ARG_PERF)
-f,    --file FNAME                     a file containing the prompt (default: none)
-bf,   --binary-file FNAME              binary file containing the prompt (default: none)
-e,    --escape, --no-escape            whether to process escapes sequences (\n, \r, \t, \', \", \\)
                                        (default: true)
--rope-scaling {none,linear,yarn}       RoPE frequency scaling method, defaults to linear unless specified by
                                        the model
                                        (env: LLAMA_ARG_ROPE_SCALING_TYPE)
--rope-scale N                          RoPE context scaling factor, expands context by a factor of N
                                        (env: LLAMA_ARG_ROPE_SCALE)
--rope-freq-base N                      RoPE base frequency, used by NTK-aware scaling (default: loaded from
                                        model)
                                        (env: LLAMA_ARG_ROPE_FREQ_BASE)
--rope-freq-scale N                     RoPE frequency scaling factor, expands context by a factor of 1/N
                                        (env: LLAMA_ARG_ROPE_FREQ_SCALE)
--yarn-orig-ctx N                       YaRN: original context size of model (default: 0 = model training
                                        context size)
                                        (env: LLAMA_ARG_YARN_ORIG_CTX)
--yarn-ext-factor N                     YaRN: extrapolation mix factor (default: -1.00, 0.0 = full
                                        interpolation)
                                        (env: LLAMA_ARG_YARN_EXT_FACTOR)
--yarn-attn-factor N                    YaRN: scale sqrt(t) or attention magnitude (default: -1.00)
                                        (env: LLAMA_ARG_YARN_ATTN_FACTOR)
--yarn-beta-slow N                      YaRN: high correction dim or alpha (default: -1.00)
                                        (env: LLAMA_ARG_YARN_BETA_SLOW)
--yarn-beta-fast N                      YaRN: low correction dim or beta (default: -1.00)
                                        (env: LLAMA_ARG_YARN_BETA_FAST)
-kvo,  --kv-offload, -nkvo, --no-kv-offload
                                        whether to enable KV cache offloading (default: enabled)
                                        (env: LLAMA_ARG_KV_OFFLOAD)
--repack, -nr, --no-repack              whether to enable weight repacking (default: enabled)
                                        (env: LLAMA_ARG_REPACK)
--no-host                               bypass host buffer allowing extra buffers to be used
                                        (env: LLAMA_ARG_NO_HOST)
-ctk,  --cache-type-k TYPE              KV cache data type for K
                                        allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
                                        (default: f16)
                                        (env: LLAMA_ARG_CACHE_TYPE_K)
-ctv,  --cache-type-v TYPE              KV cache data type for V
                                        allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
                                        (default: f16)
                                        (env: LLAMA_ARG_CACHE_TYPE_V)
-dt,   --defrag-thold N                 KV cache defragmentation threshold (DEPRECATED)
                                        (env: LLAMA_ARG_DEFRAG_THOLD)
-np,   --parallel N                     number of parallel sequences to decode (default: 1)
                                        (env: LLAMA_ARG_N_PARALLEL)
--rpc SERVERS                           comma separated list of RPC servers (host:port)
                                        (env: LLAMA_ARG_RPC)
--mlock                                 force system to keep model in RAM rather than swapping or compressing
                                        (env: LLAMA_ARG_MLOCK)
--mmap, --no-mmap                       whether to memory-map model. (if mmap disabled, slower load but may
                                        reduce pageouts if not using mlock) (default: enabled)
                                        (env: LLAMA_ARG_MMAP)
-dio,  --direct-io, -ndio, --no-direct-io
                                        use DirectIO if available. (default: disabled)
                                        (env: LLAMA_ARG_DIO)
--numa TYPE                             attempt optimizations that help on some NUMA systems
                                        - distribute: spread execution evenly over all nodes
                                        - isolate: only spawn threads on CPUs on the node that execution
                                        started on
                                        - numactl: use the CPU map provided by numactl
                                        if run without this previously, it is recommended to drop the system
                                        page cache before using this
                                        see https://github.com/ggml-org/llama.cpp/issues/1437
                                        (env: LLAMA_ARG_NUMA)
-dev,  --device <dev1,dev2,..>          comma-separated list of devices to use for offloading (none = don't
                                        offload)
                                        use --list-devices to see a list of available devices
                                        (env: LLAMA_ARG_DEVICE)
--list-devices                          print list of available devices and exit
-ot,   --override-tensor <tensor name pattern>=<buffer type>,...
                                        override tensor buffer type
                                        (env: LLAMA_ARG_OVERRIDE_TENSOR)
-cmoe, --cpu-moe                        keep all Mixture of Experts (MoE) weights in the CPU
                                        (env: LLAMA_ARG_CPU_MOE)
-ncmoe, --n-cpu-moe N                   keep the Mixture of Experts (MoE) weights of the first N layers in the
                                        CPU
                                        (env: LLAMA_ARG_N_CPU_MOE)
-ngl,  --gpu-layers, --n-gpu-layers N   max. number of layers to store in VRAM, either an exact number,
                                        'auto', or 'all' (default: auto)
                                        (env: LLAMA_ARG_N_GPU_LAYERS)
-sm,   --split-mode {none,layer,row,tensor}
                                        how to split the model across multiple GPUs, one of:
                                        - none: use one GPU only
                                        - layer (default): split layers and KV across GPUs (pipelined)
                                        - row: split weight across GPUs by rows (parallelized)
                                        - tensor: split weights and KV across GPUs (parallelized,
                                        EXPERIMENTAL)
                                        (env: LLAMA_ARG_SPLIT_MODE)
-ts,   --tensor-split N0,N1,N2,...      fraction of the model to offload to each GPU, comma-separated list of
                                        proportions, e.g. 3,1
                                        (env: LLAMA_ARG_TENSOR_SPLIT)
-mg,   --main-gpu INDEX                 the GPU to use for the model (with split-mode = none), or for
                                        intermediate results and KV (with split-mode = row) (default: 0)
                                        (env: LLAMA_ARG_MAIN_GPU)
-fit,  --fit [on|off]                   whether to adjust unset arguments to fit in device memory ('on' or
                                        'off', default: 'on')
                                        (env: LLAMA_ARG_FIT)
-fitt, --fit-target MiB0,MiB1,MiB2,...
                                        target margin per device for --fit, comma-separated list of values,
                                        single value is broadcast across all devices, default: 1024
                                        (env: LLAMA_ARG_FIT_TARGET)
-fitc, --fit-ctx N                      minimum ctx size that can be set by --fit option, default: 4096
                                        (env: LLAMA_ARG_FIT_CTX)
--check-tensors                         check model tensor data for invalid values (default: false)
--override-kv KEY=TYPE:VALUE,...        advanced option to override model metadata by key. to specify multiple
                                        overrides, either use comma-separated values.
                                        types: int, float, bool, str. example: --override-kv
                                        tokenizer.ggml.add_bos_token=bool:false,tokenizer.ggml.add_eos_token=bool:false
--op-offload, --no-op-offload           whether to offload host tensor operations to device (default: true)
--lora FNAME                            path to LoRA adapter (use comma-separated values to load multiple
                                        adapters)
--lora-scaled FNAME:SCALE,...           path to LoRA adapter with user defined scaling (format:
                                        FNAME:SCALE,...)
                                        note: use comma-separated values
--control-vector FNAME                  add a control vector
                                        note: use comma-separated values to add multiple control vectors
--control-vector-scaled FNAME:SCALE,...
                                        add a control vector with user defined scaling SCALE
                                        note: use comma-separated values (format: FNAME:SCALE,...)
--control-vector-layer-range START END
                                        layer range to apply the control vector(s) to, start and end inclusive
-m,    --model FNAME                    model path to load
                                        (env: LLAMA_ARG_MODEL)
-mu,   --model-url MODEL_URL            model download url (default: unused)
                                        (env: LLAMA_ARG_MODEL_URL)
-dr,   --docker-repo [<repo>/]<model>[:quant]
                                        Docker Hub model repository. repo is optional, default to ai/. quant
                                        is optional, default to :latest.
                                        example: gemma3
                                        (default: unused)
                                        (env: LLAMA_ARG_DOCKER_REPO)
-hf,   -hfr, --hf-repo <user>/<model>[:quant]
                                        Hugging Face model repository; quant is optional, case-insensitive,
                                        default to Q4_K_M, or falls back to the first file in the repo if
                                        Q4_K_M doesn't exist.
                                        mmproj is also downloaded automatically if available. to disable, add
                                        --no-mmproj
                                        example: ggml-org/GLM-4.7-Flash-GGUF:Q4_K_M
                                        (default: unused)
                                        (env: LLAMA_ARG_HF_REPO)
-hfd,  -hfrd, --hf-repo-draft <user>/<model>[:quant]
                                        Same as --hf-repo, but for the draft model (default: unused)
                                        (env: LLAMA_ARG_HFD_REPO)
-hff,  --hf-file FILE                   Hugging Face model file. If specified, it will override the quant in
                                        --hf-repo (default: unused)
                                        (env: LLAMA_ARG_HF_FILE)
-hfv,  -hfrv, --hf-repo-v <user>/<model>[:quant]
                                        Hugging Face model repository for the vocoder model (default: unused)
                                        (env: LLAMA_ARG_HF_REPO_V)
-hffv, --hf-file-v FILE                 Hugging Face model file for the vocoder model (default: unused)
                                        (env: LLAMA_ARG_HF_FILE_V)
-hft,  --hf-token TOKEN                 Hugging Face access token (default: value from HF_TOKEN environment
                                        variable)
                                        (env: HF_TOKEN)
--log-disable                           Log disable
--log-file FNAME                        Log to file
                                        (env: LLAMA_LOG_FILE)
--log-colors [on|off|auto]              Set colored logging ('on', 'off', or 'auto', default: 'auto')
                                        'auto' enables colors when output is to a terminal
                                        (env: LLAMA_LOG_COLORS)
-v,    --verbose, --log-verbose         Set verbosity level to infinity (i.e. log all messages, useful for
                                        debugging)
--offline                               Offline mode: forces use of cache, prevents network access
                                        (env: LLAMA_OFFLINE)
-lv,   --verbosity, --log-verbosity N   Set the verbosity threshold. Messages with a higher verbosity will be
                                        ignored. Values:
                                         - 0: generic output
                                         - 1: error
                                         - 2: warning
                                         - 3: info
                                         - 4: debug
                                        (default: 1)
                                        
                                        (env: LLAMA_LOG_VERBOSITY)
--log-prefix                            Enable prefix in log messages
                                        (env: LLAMA_LOG_PREFIX)
--log-timestamps                        Enable timestamps in log messages
                                        (env: LLAMA_LOG_TIMESTAMPS)
-ctkd, --cache-type-k-draft TYPE        KV cache data type for K for the draft model
                                        allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
                                        (default: f16)
                                        (env: LLAMA_ARG_CACHE_TYPE_K_DRAFT)
-ctvd, --cache-type-v-draft TYPE        KV cache data type for V for the draft model
                                        allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
                                        (default: f16)
                                        (env: LLAMA_ARG_CACHE_TYPE_V_DRAFT)


----- sampling params -----

--samplers SAMPLERS                     samplers that will be used for generation in the order, separated by
                                        ';'
                                        (default:
                                        penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature)
-s,    --seed SEED                      RNG seed (default: -1, use random seed for -1)
--sampler-seq, --sampling-seq SEQUENCE
                                        simplified sequence for samplers that will be used (default:
                                        edskypmxt)
--ignore-eos                            ignore end of stream token and continue generating (implies
                                        --logit-bias EOS-inf)
--temp, --temperature N                 temperature (default: 0.80)
--top-k N                               top-k sampling (default: 40, 0 = disabled)
                                        (env: LLAMA_ARG_TOP_K)
--top-p N                               top-p sampling (default: 0.95, 1.0 = disabled)
--min-p N                               min-p sampling (default: 0.05, 0.0 = disabled)
--top-nsigma, --top-n-sigma N           top-n-sigma sampling (default: -1.00, -1.0 = disabled)
--xtc-probability N                     xtc probability (default: 0.00, 0.0 = disabled)
--xtc-threshold N                       xtc threshold (default: 0.10, 1.0 = disabled)
--typical, --typical-p N                locally typical sampling, parameter p (default: 1.00, 1.0 = disabled)
--repeat-last-n N                       last n tokens to consider for penalize (default: 64, 0 = disabled, -1
                                        = ctx_size)
--repeat-penalty N                      penalize repeat sequence of tokens (default: 1.00, 1.0 = disabled)
--presence-penalty N                    repeat alpha presence penalty (default: 0.00, 0.0 = disabled)
--frequency-penalty N                   repeat alpha frequency penalty (default: 0.00, 0.0 = disabled)
--dry-multiplier N                      set DRY sampling multiplier (default: 0.00, 0.0 = disabled)
--dry-base N                            set DRY sampling base value (default: 1.75)
--dry-allowed-length N                  set allowed length for DRY sampling (default: 2)
--dry-penalty-last-n N                  set DRY penalty for the last n tokens (default: -1, 0 = disable, -1 =
                                        context size)
--dry-sequence-breaker STRING           add sequence breaker for DRY sampling, clearing out default breakers
                                        ('\n', ':', '"', '*') in the process; use "none" to not use any
                                        sequence breakers
--adaptive-target N                     adaptive-p: select tokens near this probability (valid range 0.0 to
                                        1.0; negative = disabled) (default: -1.00)
                                        [(more info)](https://github.com/ggml-org/llama.cpp/pull/17927)
--adaptive-decay N                      adaptive-p: decay rate for target adaptation over time. lower values
                                        are more reactive, higher values are more stable.
                                        (valid range 0.0 to 0.99) (default: 0.90)
--dynatemp-range N                      dynamic temperature range (default: 0.00, 0.0 = disabled)
--dynatemp-exp N                        dynamic temperature exponent (default: 1.00)
--mirostat N                            use Mirostat sampling.
                                        Top K, Nucleus and Locally Typical samplers are ignored if used.
                                        (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
--mirostat-lr N                         Mirostat learning rate, parameter eta (default: 0.10)
--mirostat-ent N                        Mirostat target entropy, parameter tau (default: 5.00)
-l,    --logit-bias TOKEN_ID(+/-)BIAS   modifies the likelihood of token appearing in the completion,
                                        i.e. `--logit-bias 15043+1` to increase likelihood of token ' Hello',
                                        or `--logit-bias 15043-1` to decrease likelihood of token ' Hello'
--grammar GRAMMAR                       BNF-like grammar to constrain generations (see samples in grammars/
                                        dir)
--grammar-file FNAME                    file to read grammar from
-j,    --json-schema SCHEMA             JSON schema to constrain generations (https://json-schema.org/), e.g.
                                        `{}` for any JSON object
                                        For schemas w/ external $refs, use --grammar +
                                        example/json_schema_to_grammar.py instead
-jf,   --json-schema-file FILE          File containing a JSON schema to constrain generations
                                        (https://json-schema.org/), e.g. `{}` for any JSON object
                                        For schemas w/ external $refs, use --grammar +
                                        example/json_schema_to_grammar.py instead
-bs,   --backend-sampling               enable backend sampling (experimental) (default: disabled)
                                        (env: LLAMA_ARG_BACKEND_SAMPLING)


----- example-specific params -----

--verbose-prompt                        print a verbose prompt before generation (default: false)
--display-prompt, --no-display-prompt   whether to print prompt at generation (default: true)
-co,   --color [on|off|auto]            Colorize output to distinguish prompt and user input from generations
                                        ('on', 'off', or 'auto', default: 'auto')
                                        'auto' enables colors when output is to a terminal
-ctxcp, --ctx-checkpoints, --swa-checkpoints N
                                        max number of context checkpoints to create per slot (default:
                                        32)[(more info)](https://github.com/ggml-org/llama.cpp/pull/15293)
                                        (env: LLAMA_ARG_CTX_CHECKPOINTS)
-cpent, --checkpoint-every-n-tokens N   create a checkpoint every n tokens during prefill (processing), -1 to
                                        disable (default: 8192)
                                        (env: LLAMA_ARG_CHECKPOINT_EVERY_NT)
-cram, --cache-ram N                    set the maximum cache size in MiB (default: 8192, -1 - no limit, 0 -
                                        disable)[(more
                                        info)](https://github.com/ggml-org/llama.cpp/pull/16391)
                                        (env: LLAMA_ARG_CACHE_RAM)
--context-shift, --no-context-shift     whether to use context shift on infinite text generation (default:
                                        disabled)
                                        (env: LLAMA_ARG_CONTEXT_SHIFT)
-sys,  --system-prompt PROMPT           system prompt to use with model (if applicable, depending on chat
                                        template)
--show-timings, --no-show-timings       whether to show timing information after each response (default: true)
                                        (env: LLAMA_ARG_SHOW_TIMINGS)
-sysf, --system-prompt-file FNAME       a file containing the system prompt (default: none)
-r,    --reverse-prompt PROMPT          halt generation at PROMPT, return control in interactive mode
-sp,   --special                        special tokens output enabled (default: false)
-cnv,  --conversation, -no-cnv, --no-conversation
                                        whether to run in conversation mode:
                                        - does not print special tokens and suffix/prefix
                                        - interactive mode is also enabled
                                        (default: auto enabled if chat template is available)
-st,   --single-turn                    run conversation for a single turn only, then exit when done
                                        will not be interactive if first turn is predefined with --prompt
                                        (default: false)
-mli,  --multiline-input                allows you to write or paste multiple lines without ending each in '\'
--warmup, --no-warmup                   whether to perform warmup with an empty run (default: enabled)
-mm,   --mmproj FILE                    path to a multimodal projector file. see tools/mtmd/README.md
                                        note: if -hf is used, this argument can be omitted
                                        (env: LLAMA_ARG_MMPROJ)
-mmu,  --mmproj-url URL                 URL to a multimodal projector file. see tools/mtmd/README.md
                                        (env: LLAMA_ARG_MMPROJ_URL)
--mmproj-auto, --no-mmproj, --no-mmproj-auto
                                        whether to use multimodal projector file (if available), useful when
                                        using -hf (default: enabled)
                                        (env: LLAMA_ARG_MMPROJ_AUTO)
--mmproj-offload, --no-mmproj-offload   whether to enable GPU offloading for multimodal projector (default:
                                        enabled)
                                        (env: LLAMA_ARG_MMPROJ_OFFLOAD)
--image, --audio FILE                   path to an image or audio file. use with multimodal models, use
                                        comma-separated values for multiple files
--image-min-tokens N                    minimum number of tokens each image can take, only used by vision
                                        models with dynamic resolution (default: read from model)
                                        (env: LLAMA_ARG_IMAGE_MIN_TOKENS)
--image-max-tokens N                    maximum number of tokens each image can take, only used by vision
                                        models with dynamic resolution (default: read from model)
                                        (env: LLAMA_ARG_IMAGE_MAX_TOKENS)
-otd,  --override-tensor-draft <tensor name pattern>=<buffer type>,...
                                        override tensor buffer type for draft model
-cmoed, --cpu-moe-draft                 keep all Mixture of Experts (MoE) weights in the CPU for the draft
                                        model
                                        (env: LLAMA_ARG_CPU_MOE_DRAFT)
-ncmoed, --n-cpu-moe-draft N            keep the Mixture of Experts (MoE) weights of the first N layers in the
                                        CPU for the draft model
                                        (env: LLAMA_ARG_N_CPU_MOE_DRAFT)
--chat-template-kwargs STRING           sets additional params for the json template parser, must be a valid
                                        json object string, e.g. '{"key1":"value1","key2":"value2"}'
                                        (env: LLAMA_CHAT_TEMPLATE_KWARGS)
--jinja, --no-jinja                     whether to use jinja template engine for chat (default: enabled)
                                        (env: LLAMA_ARG_JINJA)
--reasoning-format FORMAT               controls whether thought tags are allowed and/or extracted from the
                                        response, and in which format they're returned; one of:
                                        - none: leaves thoughts unparsed in `message.content`
                                        - deepseek: puts thoughts in `message.reasoning_content`
                                        - deepseek-legacy: keeps `<think>` tags in `message.content` while
                                        also populating `message.reasoning_content`
                                        (default: auto)
                                        (env: LLAMA_ARG_THINK)
-rea,  --reasoning [on|off|auto]        Use reasoning/thinking in the chat ('on', 'off', or 'auto', default:
                                        'auto' (detect from template))
                                        (env: LLAMA_ARG_REASONING)
--reasoning-budget N                    token budget for thinking: -1 for unrestricted, 0 for immediate end,
                                        N>0 for token budget (default: -1)
                                        (env: LLAMA_ARG_THINK_BUDGET)
--reasoning-budget-message MESSAGE      message injected before the end-of-thinking tag when reasoning budget
                                        is exhausted (default: none)
                                        (env: LLAMA_ARG_THINK_BUDGET_MESSAGE)
--chat-template JINJA_TEMPLATE          set custom jinja chat template (default: template taken from model's
                                        metadata)
                                        if suffix/prefix are specified, template will be disabled
                                        only commonly used templates are accepted (unless --jinja is set
                                        before this flag):
                                        list of built-in templates:
                                        bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml,
                                        command-r, deepseek, deepseek-ocr, deepseek2, deepseek3, exaone-moe,
                                        exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite,
                                        granite-4.0, grok-2, hunyuan-dense, hunyuan-moe, hunyuan-ocr, kimi-k2,
                                        llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4,
                                        megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken,
                                        mistral-v7, mistral-v7-tekken, monarch, openchat, orion,
                                        pangu-embedded, phi3, phi4, rwkv-world, seed_oss, smolvlm, solar-open,
                                        vicuna, vicuna-orca, yandex, zephyr
                                        (env: LLAMA_ARG_CHAT_TEMPLATE)
--chat-template-file JINJA_TEMPLATE_FILE
                                        set custom jinja chat template file (default: template taken from
                                        model's metadata)
                                        if suffix/prefix are specified, template will be disabled
                                        only commonly used templates are accepted (unless --jinja is set
                                        before this flag):
                                        list of built-in templates:
                                        bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml,
                                        command-r, deepseek, deepseek-ocr, deepseek2, deepseek3, exaone-moe,
                                        exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite,
                                        granite-4.0, grok-2, hunyuan-dense, hunyuan-moe, hunyuan-ocr, kimi-k2,
                                        llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4,
                                        megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken,
                                        mistral-v7, mistral-v7-tekken, monarch, openchat, orion,
                                        pangu-embedded, phi3, phi4, rwkv-world, seed_oss, smolvlm, solar-open,
                                        vicuna, vicuna-orca, yandex, zephyr
                                        (env: LLAMA_ARG_CHAT_TEMPLATE_FILE)
--skip-chat-parsing, --no-skip-chat-parsing
                                        force a pure content parser, even if a Jinja template is specified;
                                        model will output everything in the content section, including any
                                        reasoning and/or tool calls (default: disabled)
                                        (env: LLAMA_ARG_SKIP_CHAT_PARSING)
--simple-io                             use basic IO for better compatibility in subprocesses and limited
                                        consoles
--draft, --draft-n, --draft-max N       number of tokens to draft for speculative decoding (default: 16)
                                        (env: LLAMA_ARG_DRAFT_MAX)
--draft-min, --draft-n-min N            minimum number of draft tokens to use for speculative decoding
                                        (default: 0)
                                        (env: LLAMA_ARG_DRAFT_MIN)
--draft-p-min P                         minimum speculative decoding probability (greedy) (default: 0.75)
                                        (env: LLAMA_ARG_DRAFT_P_MIN)
-cd,   --ctx-size-draft N               size of the prompt context for the draft model (default: 0, 0 = loaded
                                        from model)
                                        (env: LLAMA_ARG_CTX_SIZE_DRAFT)
-devd, --device-draft <dev1,dev2,..>    comma-separated list of devices to use for offloading the draft model
                                        (none = don't offload)
                                        use --list-devices to see a list of available devices
-ngld, --gpu-layers-draft, --n-gpu-layers-draft N
                                        max. number of draft model layers to store in VRAM, either an exact
                                        number, 'auto', or 'all' (default: auto)
                                        (env: LLAMA_ARG_N_GPU_LAYERS_DRAFT)
-md,   --model-draft FNAME              draft model for speculative decoding (default: unused)
                                        (env: LLAMA_ARG_MODEL_DRAFT)
--spec-replace TARGET DRAFT             translate the string in TARGET into DRAFT if the draft model and main
                                        model are not compatible
--gpt-oss-20b-default                   use gpt-oss-20b (note: can download weights from the internet)
--gpt-oss-120b-default                  use gpt-oss-120b (note: can download weights from the internet)
--vision-gemma-4b-default               use Gemma 3 4B QAT (note: can download weights from the internet)
--vision-gemma-12b-default              use Gemma 3 12B QAT (note: can download weights from the internet)
--spec-default                          enable default speculative decoding config

 

 

--------------------- Benchmark?

Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz / 256GB DDR4

$ ../llama-cli -m Qwen3.6-35B-A3B-UD-Q2_K_XL.gguf 
load_backend: loaded RPC backend from /home/minimonk/src/llama-b8876/libggml-rpc.so
load_backend: loaded CPU backend from /home/minimonk/src/llama-b8876/libggml-cpu-haswell.so

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8876-72d693e4f
model      : Qwen3.6-35B-A3B-UD-Q2_K_XL.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern

> hello~ introduce about you         

[Start thinking]
Here's a thinking process:

1.  **Analyze User Input:**
   - User says: "hello~ introduce about you"
   - This is a casual greeting followed by a request for a self-introduction.
   - The tone is friendly and informal.

2.  **Identify Key Requirements:**
   - Acknowledge the greeting
   - Provide a clear, concise self-introduction
   - State my identity (Qwen / Tongyi Qianwen)
   - Mention my developer (Alibaba Group's Tongyi Lab)
   - Highlight my core capabilities (language understanding, reasoning, coding, multimodal processing, etc.)
   - Keep it friendly, accurate, and aligned with the provided guidelines
   - Avoid mentioning version numbers or specific details not in the reference

3.  **Draft Construction (Mental Refinement):**
   Hello! I'm Qwen (also known as Tongyi Qianwen), a large language model developed by Alibaba Group's Tongyi Lab. I'm designed to be helpful, thoughtful, and reliable across a wide range of tasks—from answering questions and writing content to solving complex reasoning problems, coding, and working with images and documents. I support fluent communication in over 100 languages and can adapt to different contexts while maintaining accuracy and safety. I'm here to help you think through ideas, tackle challenges, or just chat, so feel free to ask me anything! How can I assist you today?

4.  **Check Against Guidelines:**
   - Identity: Qwen / Tongyi Qianwen ✓
   - Developer: Alibaba Group's Tongyi Lab ✓
   - Capabilities: Language, reasoning, coding, multimodal, 100+ languages, helpful/safe ✓
   - Tone: Friendly, professional, matches user's casual tone ✓
   - No version numbers, no fabricated stats, no overclaiming ✓

5.  **Final Polish:**
   The draft looks solid. It's concise, accurate, and aligned with all constraints. I'll output it.✅

[End thinking]

Hello! I'm Qwen (also known as Tongyi Qianwen), a large language model developed by Alibaba Group's Tongyi Lab. I'm designed to be helpful, thoughtful, and reliable across a wide range of tasks—whether it's answering questions, writing or editing content, solving complex reasoning problems, coding, analyzing documents, or working with images. I support fluent communication in over 100 languages and can adapt to different tones, formats, and contexts while prioritizing accuracy and safety.

I'm here to help you think through ideas, tackle challenges, or just chat, so feel free to ask me anything. How can I assist you today? 😊

Prompt: 29.6 t/s | Generation: 11.0 t/s ]



Exiting...
common_memory_breakdown_print: | memory breakdown [MiB] | total   free     self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - Host               |                 17428 = 11437 +    5182 +     808                |
common_memory_breakdown_print: |   - CPU_REPACK         |                   272 =   272 +       0 +       0                |

 

$ ../llama-cli -m Llama-3.2-3B-Instruct-Q4_K_M.gguf
load_backend: loaded RPC backend from /home/minimonk/src/llama-b8876/libggml-rpc.so
load_backend: loaded CPU backend from /home/minimonk/src/llama-b8876/libggml-cpu-haswell.so

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8876-72d693e4f
model      : Llama-3.2-3B-Instruct-Q4_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern


> 안녕? 너에 대해서 설명해줘

안녕하세요! 저는 지능형 чattybot 인你(You)와 대화할 수 있는 AI입니다.

저는 지능학, 자연어 처리,_machine learning 및 다른 기술을 바탕으로 만든 AI입니다. 이 기술은 인간의 언어를 이해하고,natural language processing (NLP)로 human과 대화할 수 있는能力을 허용합니다.

저는 다양한 topic과 subject에 대해 지식을 제공할 수 있으며, human과 대화할 수 있습니다. 저는 human의 질문을 읽고, 적절한 대답을 제공할 수 있습니다.

저는 다음 capability을 제공합니다:

* natural language understanding: human의 언어를 이해할 수 있습니다.
* text generation: human의 질문에 대해 적절한 대답을 제공할 수 있습니다.
* conversation: human과 대화할 수 있습니다.
* question answering: human의 질문에 대해 적절한 대답을 제공할 수 있습니다.

저는 human과 대화할 수 있는 다양한 application과 technology를 제공합니다. 예를 들어, human의 질문을 읽어 대답할 수 있는 chatbot, human의 언어를 이해하고, natural language processing를 사용하여 human과 대화할 수 있는 application, human의 질문에 대해 적절한 대답을 제공할 수 있는 Q&A 시스템, etc.

저는 human의 언어를 이해하고, natural language processing를 사용하여 human과 대화할 수 있는 ability을 갖습니다. human의 질문을 읽어 대답할 수 있으며, human의 언어를 이해하고, natural language processing를 사용하여 human과 대화할 수 있습니다.

안녕하세요! 저는 지능형 चattybot 인 you(You)와 대화할 수 있는 AI입니다. 저는 human의 질문을 읽고, 적절한 대답을 제공할 수 있습니다. human의 언어를 이해하고, natural language processing를 사용하여 human과 대화할 수 있습니다.

Prompt: 89.7 t/s | Generation: 20.6 t/s ]



Exiting...
common_memory_breakdown_print: | memory breakdown [MiB] | total   free     self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - Host               |                 16654 =  1904 +   14336 +     414                |
common_memory_breakdown_print: |   - CPU_REPACK         |                  1299 =  1299 +       0 +       0                |

 

$ ../llama-cli -m gemma-4-E2B-it-Q4_K_M.gguf
load_backend: loaded RPC backend from /home/minimonk/src/llama-b8876/libggml-rpc.so
load_backend: loaded CPU backend from /home/minimonk/src/llama-b8876/libggml-cpu-haswell.so

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8876-72d693e4f
model      : gemma-4-E2B-it-Q4_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern


> 안녕? 너에 대해서 설명해줘

[Start thinking]
Thinking Process:

1.  **Analyze the Request:** The user said "안녕? 너에 대해서 설명해줘" (Hello? Please explain about yourself). This is a request for identity and capabilities.
2.  **Determine Core Identity (Self-Correction/Verification):**
    *   Name: Gemma 4.
    *   Developer: Google DeepMind.
    *   Nature: Large Language Model (LLM).
    *   Type: Open weights model.
    *   Capabilities: Understands text/images, generates text. Knowledge cutoff (Jan 2025).
3.  **Determine Response Language:** The user asked in Korean, so the response must be in fluent, natural Korean.
4.  **Structure the Explanation (Standard LLM Introduction):**
    *   Greeting/Identity (Who am I?).
    *   Developer (Who made me?).
    *   Nature/Function (What am I?).
    *   Capabilities (What can I do?).
    *   Limitations (What can't I do/knowledge scope?).
5.  **Draft the Content (in Korean):**

    *   *Identity:* 저는 Gemma 4입니다. (I am Gemma 4.)
    *   *Developer/Nature:* 저는 Google DeepMind에서 개발한 대규모 언어 모델(LLM)입니다. 오픈 웨이트 모델입니다. (I am a Large Language Model developed by Google DeepMind. I am an open weights model.)
    *   *Function:* 저는 텍스트와 이미지를 이해하고, 사용자의 질문에 답하거나 글을 작성하는 등 다양한 언어 작업을 수행할 수 있습니다. (I can perform various language tasks such as understanding text and images, answering user questions, or writing text.)
    *   *Input/Output:* 저는 텍스트를 입력받아 텍스트로 출력합니다. (I receive text input and output text.)
    *   *Knowledge/Limitation:* 제 지식은 학습된 데이터를 기반으로 하며, 특정 시점(2025년 1월)까지의 정보를 알고 있습니다. (My knowledge is based on trained data, and I know information up to a specific point (January 2025).)

6.  **Review and Refine (Ensure natural flow and adherence to instructions):** The drafted points cover all necessary identity requirements and maintain the required persona (Gemma 4, open weights, LLM). (The Korean draft looks good.)

7.  **Final Output Generation.** (Proceed to generate the Korean response.)
[End thinking]

안녕하세요! 저는 **Gemma 4**입니다. 저에 대해 몇 가지 핵심 정보를 설명해 드릴게요.

**1. 정체성 및 개발 배경**

*   **이름:** Gemma 4
*   **개발사:** Google DeepMind에서 개발되었습니다.
*   **성격:** 저는 대규모 언어 모델(LLM, Large Language Model)이며, 오픈 웨이트(open weights) 모델입니다. 이는 저의 모델 구조와 가중치가 공개되어 누구나 접근하고 활용할 수 있음을 의미합니다.

**2. 주요 기능 및 능력**

저는 방대한 양의 데이터를 학습하여 다음과 같은 다양한 언어 관련 작업을 수행할 수 있습니다.

*   **텍스트 이해 및 생성:** 사용자가 입력하는 텍스트를 이해하고, 질문에 답하거나, 요약하거나, 번역하거나, 창의적인 글(시, 이야기, 코드 등)을 작성할 수 있습니다.
*   **다중 모달리티 이해:** 저는 텍스트뿐만 아니라 이미지를 이해하고 처리할 수 있습니다. (다만, 출력은 텍스트로만 가능합니다.)
*   **추론 및 논리:** 복잡한 질문에 대해 논리적으로 추론하고 일관성 있는 답변을 제공할 수 있습니다.

**3. 지식 범위 및 한계**

*   **지식 기반:** 저는 학습된 데이터를 기반으로 답변하며, 저의 지식은 특정 시점(2025년 1월)까지의 정보를 포함하고 있습니다. 따라서 그 이후의 최신 정보에 대해서는 알지 못할 수 있습니다.
*   **출력 형식:** 저는 오직 **텍스트** 형태로만 정보를 생성할 수 있습니다. 이미지를 직접 생성할 수는 없습니다.

궁금한 점이 있거나 도움이 필요하시면 언제든지 말씀해 주세요! 최선을 다해 답변해 드리겠습니다.

Prompt: 107.4 t/s | Generation: 19.1 t/s ]



Exiting...
common_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - Host               |                 4517 =  2947 +     780 +     790                |
common_memory_breakdown_print: |   - CPU_REPACK         |                 1069 =  1069 +       0 +       0                |
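
To compare the three runs at a glance, the `Prompt: ... | Generation: ...` summary lines can be parsed and sorted. This is only a rough sketch: the regex is my guess at the line format based on the three lines above, and `parse_speeds` / `summarize` are made-up helper names, not part of llama.cpp.

```python
import re

# Summary lines copied from the three llama-cli runs above.
RUNS = {
    "Qwen3.6-35B-A3B-UD-Q2_K_XL": "Prompt: 29.6 t/s | Generation: 11.0 t/s ]",
    "Llama-3.2-3B-Instruct-Q4_K_M": "Prompt: 89.7 t/s | Generation: 20.6 t/s ]",
    "gemma-4-E2B-it-Q4_K_M": "Prompt: 107.4 t/s | Generation: 19.1 t/s ]",
}

# Assumed line format, inferred only from the three lines above.
SPEED_RE = re.compile(
    r"Prompt:\s*([\d.]+)\s*t/s\s*\|\s*Generation:\s*([\d.]+)\s*t/s"
)


def parse_speeds(line):
    """Return (prompt_tps, generation_tps) from one summary line."""
    match = SPEED_RE.search(line)
    if match is None:
        raise ValueError(f"no speed summary in: {line!r}")
    return float(match.group(1)), float(match.group(2))


def summarize(runs):
    """Parse every run and sort by generation speed, fastest first."""
    rows = [(name, *parse_speeds(line)) for name, line in runs.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    for name, prompt_tps, gen_tps in summarize(RUNS):
        print(f"{name:30s} prompt {prompt_tps:6.1f} t/s  gen {gen_tps:5.1f} t/s")
```

Sorting by generation speed is a choice made here because that is the number felt most while chatting; a single short prompt per model is of course not a real benchmark.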

Posted by 구차니

Qwen is hot right now, and I happened to come across this link.. maybe this makes it easier to run?

 

[Link : https://unsloth.ai/docs/models/qwen3.6]


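On Linux I just installed it without configuring anything in particular. I assumed the default allowed connections from any IP, but the output below shows it listening on 127.0.0.1:11434 only, so by default it looks local-only (to accept outside connections the bind address has to be changed, e.g. via OLLAMA_HOST).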

$ netstat -tnlp
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 127.0.0.1:11434         0.0.0.0:*               LISTEN      -  

 

On Windows there seemed to be a setting for this in the GUI client.. but I don't think I actually checked the port.

[Link : http://practical.kr/?p=809]
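
Whether other machines can connect comes down to the `Local Address` column in the netstat output. Here is a small sketch that classifies such an address, assuming netstat-style `host:port` strings; `listen_scope` is a hypothetical helper name used only for this illustration.

```python
import ipaddress


def listen_scope(local_address):
    """Classify a netstat 'Local Address' value such as '127.0.0.1:11434'."""
    # The port is whatever follows the last colon (also true for ':::11434').
    host, _, port = local_address.rpartition(":")
    host = host.strip("[]")
    if host in ("", "*"):
        return "all interfaces", int(port)
    ip = ipaddress.ip_address(host)
    if ip.is_unspecified:      # 0.0.0.0 or ::
        scope = "all interfaces"
    elif ip.is_loopback:       # 127.0.0.0/8 or ::1
        scope = "loopback only"
    else:
        scope = "single interface"
    return scope, int(port)


if __name__ == "__main__":
    for address in ("127.0.0.1:11434", "0.0.0.0:11434", ":::11434"):
        print(address, "->", listen_scope(address))
```

The `127.0.0.1:11434` line from the output above falls into the loopback case, which is why nothing outside the machine can reach it as installed.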

embeded/jetson 2026. 4. 20. 16:05

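Found an odd keyword while looking up jetracer.

So instead of using an RC car, you just build it roughly(?) with a 3D printer?

Anyway, it seems to run on an OpenCV-based image recognition library.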

 

[Link : https://robertchoi.gitbook.io/jetracer/2.1/2.1.3]

[Link : https://www.waveshare.com/wiki/DonkeyCar_for_Jetson_Nano-WEB_Control?srsltid=AfmBOorSke-fiZEMba2b0M7w8IwbQlsZjIt_j1XW-GdimRkN-YkjlPyv]

 

[Link : https://github.com/martyniak/fork-waveshare-donkeycar]


AI really is the best for this kind of thing (tracking down sources to verify it, though, ugh...)

 

Boot sequence

SEC (Security Phase)

PEI (Pre-EFI Initialization)

DXE (Driver Execution Environment)

BDS (Boot Device Selection)

[Link : https://chatgpt.com/share/69e5bbfb-e140-83e8-81fb-9ef5439c131f]

 

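If the error codes follow the boot sequence.. then they should show up in the order SEC / PEI / DXE / BDS.. I guess?

Anyway, it mostly gets stuck at 67 / 69 or b7 / b9, and it seems to hang during the System Agent DXE initialization stage.

From what I read, PEI just configures the memory controller,

while DXE seems to be where data is actually loaded into memory to continue booting and where things like channel configuration are actually done,

so in a way the message has nothing at all to do with memory, yet it appears to actually be caused by a memory problem.

[Link : https://www.asus.com/kr/motherboards-components/motherboards/workstation/z10ped16_ws/]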

File room 2026. 4. 20. 10:55

Every now and then I have downloaded and run something with the appimage extension, but I had no idea how they are made, so I looked it up.

 

 

[Link : https://appimagelauncher.com/]

[Link : https://github.com/TheAssassin/AppImageLauncher]

 

[Link : https://github.com/linuxdeploy/linuxdeploy-plugin-appimage]

Program usage/vi 2026. 4. 20. 10:08

Wow, what a sweet trick

[Link : https://wikidocs.net/302522#_15]

 

Maybe I already searched for this once and then forgot about it...


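Hmm.. it looks like access to tokenizer.json was approved about 20 minutes after I requested it. (I had said "whatever, I'm going to bed" and left...)

Grepping for Hangul finds no tokens at all in llama's vocab, so how does it recognize Korean? Curious..?

Looking at the config, the ByteLevel pre-tokenizer is probably the reason: the vocab is stored as byte-mapped characters (that is what strings like "ãĥ¼ãĥ¼" are), so Korean tokens would not show up as literal Hangul. The `\p{L}+` part of the regex should still match a run of Hangul letters, and BPE then merges the UTF-8 bytes of each piece; at worst it falls back to individual bytes.. wow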

{
  "version": "1.0",
  "truncation": null,
  "padding": null,
  "added_tokens": [
    {
      "id": 128000,
      "content": "<|begin_of_text|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 128001,
      "content": "<|end_of_text|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
  ],
  "normalizer": null,
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {
        "type": "Split",
        "pattern": {
          "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "ByteLevel",
        "add_prefix_space": false,
        "trim_offsets": true,
        "use_regex": false
      }
    ]
  },


      // ...

  "model": {
    "type": "BPE",
    "dropout": null,
    "unk_token": null,
    "continuing_subword_prefix": null,
    "end_of_word_suffix": null,
    "fuse_unk": false,
    "byte_fallback": false,
    "ignore_merges": true,
    "vocab": {
      "!": 0,
      "\"": 1,
      "#": 2,
      "$": 3,
      "%": 4,

      // ...

      "ÙĨØ¨": 127996,
      "ĠÐ²ÑĭÑģÐ¾ÐºÐ¾Ð¹": 127997,
      "ãĥ¼ãĥ¼": 127998,
      "éĶ¦": 127999
    },
    "merges": [
      "Ġ Ġ",
      "Ġ ĠĠĠ",

      // ...

      "ãĥ¼ ãĥ¼",
      "ãĥ¼ãĥ ¼",
      "éĶ ¦" 
    ]
  }
}

[Link : https://huggingface.co/meta-llama/Llama-3.2-1B/tree/main]
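
The odd-looking vocab keys in the JSON above are what UTF-8 bytes look like after a ByteLevel pre-tokenizer has mapped them to printable characters. The sketch below rebuilds the GPT-2 style byte table that this kind of tokenizer is generally understood to use. `to_vocab_form` is a made-up helper name, and I am assuming Llama 3.2 follows the same table.

```python
def bytes_to_unicode():
    """GPT-2 style table mapping every byte value to a printable character."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in printable}
    shift = 0
    for b in range(256):
        if b not in table:
            # Non-printable bytes are moved to code points 256 and up.
            table[b] = chr(256 + shift)
            shift += 1
    return table


BYTE_TO_CHAR = bytes_to_unicode()


def to_vocab_form(text):
    """Show how a string would be spelled inside a byte-level vocab."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


if __name__ == "__main__":
    for word in (" hello", "ーー", "자동차"):
        print(repr(word), "->", to_vocab_form(word))
```

With this table `to_vocab_form("ーー")` comes out as `ãĥ¼ãĥ¼`, the same spelling as token 127998 in the excerpt. That is why a grep for literal Hangul comes up empty even if Korean pieces are in the vocab.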

 

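Huh.. exaone is made by LG, so I expected it to have Hangul tokens, but there are hardly any?

So.. would a Korean word like '자동차' eat 3 tokens in total, 자/동/차? (Then again, this grep only matches literal Hangul, so any byte-level encoded tokens would not be counted here.)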

$ grep -P '\p{Hangul}' exa_tokenizer.json 
      "content": "리앙쿠르",
      "content": "훈민정음",
      "content": "애국가",
      "리앙쿠르": 94,
      "훈민정음": 99,
      "애국가": 100,

[Link : https://huggingface.co/LGAI-EXAONE/EXAONE-4.5-33B/tree/main]
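
A grep for `\p{Hangul}` only sees keys written in literal Hangul. If the vocab is byte-level encoded (an assumption for exaone, I have not confirmed it), the keys have to be decoded back to bytes first. A rough sketch follows; `decode_token` and `hangul_entries` are hypothetical names, and the demo vocab with its ids is made up.

```python
def byte_level_decoder():
    """Inverse of the GPT-2 byte-to-character table: character -> byte value."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    decoder = {chr(b): b for b in printable}
    remapped = [b for b in range(256) if b not in printable]
    for offset, b in enumerate(remapped):
        decoder[chr(256 + offset)] = b
    return decoder


CHAR_TO_BYTE = byte_level_decoder()


def decode_token(token):
    """Return the text a byte-level key stands for, or None if it is not whole UTF-8."""
    try:
        return bytes(CHAR_TO_BYTE[ch] for ch in token).decode("utf-8")
    except (KeyError, UnicodeDecodeError):
        return None


def hangul_entries(vocab):
    """Pick vocab entries whose decoded text contains a Hangul syllable."""
    found = {}
    for token, token_id in vocab.items():
        text = decode_token(token)
        if text and any("\uac00" <= ch <= "\ud7a3" for ch in text):
            found[text] = token_id
    return found


if __name__ == "__main__":
    # Hypothetical mini vocab; the ids are made up for the demo.
    demo_vocab = {"ìĿ´": 1, "Ġhello": 2, "ãĥ¼ãĥ¼": 3}
    print(hangul_entries(demo_vocab))
```

Keys that are already literal Hangul (like the `훈민정음` entry above) return None here, which is fine since grep already finds those. To run it on a real file, load the JSON and pass `data["model"]["vocab"]`.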


The files are stored as blobs with the hash as the file name, and I got curious about this and that, so I dug in.

gemma4 e2b

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
  "config": {
    "mediaType": "application/vnd.docker.container.image.v1+json",
    "digest": "sha256:c6bc3775a3fa9935ce4a3ccd7abc59e936c3de9308d2cc090516012f43ed9c07",
    "size": 473
  },
  "layers": [
    {
      "mediaType": "application/vnd.ollama.image.model",
      "digest": "sha256:4e30e2665218745ef463f722c0bf86be0cab6ee676320f1cfadf91e989107448",
      "size": 7162394016
    },
    {
      "mediaType": "application/vnd.ollama.image.license",
      "digest": "sha256:7339fa418c9ad3e8e12e74ad0fd26a9cc4be8703f9c110728a992b193be85cb2",
      "size": 11355
    },
    {
      "mediaType": "application/vnd.ollama.image.params",
      "digest": "sha256:56380ca2ab89f1f68c283f4d50863c0bcab52ae3f1b9a88e4ab5617b176f71a3",
      "size": 42
    }
  ]
}

"sha256:c6bc3775a3fa9935ce4a3ccd7abc59e936c3de9308d2cc090516012f43ed9c07",
{
  "model_format": "gguf",
  "model_family": "gemma4",
  "model_families": [
    "gemma4"
  ],
  "model_type": "5.1B",
  "file_type": "Q4_K_M",
  "renderer": "gemma4",
  "parser": "gemma4",
  "requires": "0.20.0",
  "architecture": "amd64",
  "os": "linux",
  "rootfs": {
    "type": "layers",
    "diff_ids": [
      "sha256:4e30e2665218745ef463f722c0bf86be0cab6ee676320f1cfadf91e989107448",
      "sha256:7339fa418c9ad3e8e12e74ad0fd26a9cc4be8703f9c110728a992b193be85cb2",
      "sha256:56380ca2ab89f1f68c283f4d50863c0bcab52ae3f1b9a88e4ab5617b176f71a3"
    ]
  }
}

"sha256:4e30e2665218745ef463f722c0bf86be0cab6ee676320f1cfadf91e989107448"
GGUF
(binary header bytes, not printable)
gemma4.attention.head_count

"sha256:7339fa418c9ad3e8e12e74ad0fd26a9cc4be8703f9c110728a992b193be85cb2",
                                Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/



"sha256:56380ca2ab89f1f68c283f4d50863c0bcab52ae3f1b9a88e4ab5617b176f71a3",
{
  "temperature": 1,
  "top_k": 64,
  "top_p": 0.95
}
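
Putting the pieces together, each `digest` in the manifest points at one blob file. Below is a minimal sketch that lists them. The blob directory and the `sha256-<hex>` file naming are assumptions based on what I saw on my machine, and `blob_path` / `list_layers` are made-up helper names.

```python
import json
from pathlib import Path

# Assumed default location for a per-user install; adjust if yours differs.
BLOB_DIR = Path.home() / ".ollama" / "models" / "blobs"


def blob_path(digest, blob_dir=BLOB_DIR):
    """Map 'sha256:<hex>' from a manifest to the assumed 'sha256-<hex>' file name."""
    algo, _, hexdigest = digest.partition(":")
    return Path(blob_dir) / f"{algo}-{hexdigest}"


def list_layers(manifest_text, blob_dir=BLOB_DIR):
    """Return (mediaType, size, blob path) for the config and every layer."""
    manifest = json.loads(manifest_text)
    entries = [manifest["config"], *manifest["layers"]]
    return [
        (entry["mediaType"], entry["size"], blob_path(entry["digest"], blob_dir))
        for entry in entries
    ]


if __name__ == "__main__":
    import sys

    # Usage: python list_layers.py <path to a manifest file>
    with open(sys.argv[1], encoding="utf-8") as handle:
        for media_type, size, path in list_layers(handle.read()):
            print(f"{size:>12d}  {media_type:45s} {path.name}")
```

For the gemma4 e2b manifest above this would show the roughly 7 GB `application/vnd.ollama.image.model` layer as the GGUF file, with the license and params as tiny side files.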

 

So the knobs for tuning the output stage are temperature, top_k and top_p.

[Link : https://wikidocs.net/333750]

 

 

On Hugging Face there was a tokenizer.json, but that seems to be the older(?) style,

and the newer GGUF format apparently bundles the tokenizer inside the file.. so how do I extract it?

[Link : https://www.minzkn.com/vibecoding/pages/gguf-format.html]

[Link : https://huggingface.co/docs/transformers/ko/gguf]

[Link : https://bitwise-life.tistory.com/5] << shows a token list
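
One way to get at the tokenizer without extra tools might be to read the GGUF metadata directly. This is a minimal sketch, assuming a little-endian GGUF v2/v3 file laid out as in the published format description (magic, version, tensor count, key/value count, then typed key/value pairs). `read_gguf_metadata` is my own helper name, not something llama.cpp provides.

```python
import io
import struct

# Scalar value types from the GGUF format: type id -> struct format.
SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
           6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
TYPE_STRING, TYPE_ARRAY = 8, 9


def _unpack(stream, fmt):
    """Read exactly one value of the given struct format."""
    return struct.unpack(fmt, stream.read(struct.calcsize(fmt)))[0]


def _read_string(stream):
    """GGUF strings are a uint64 length followed by UTF-8 bytes."""
    length = _unpack(stream, "<Q")
    return stream.read(length).decode("utf-8", errors="replace")


def _read_value(stream, value_type):
    if value_type in SCALARS:
        return _unpack(stream, SCALARS[value_type])
    if value_type == TYPE_STRING:
        return _read_string(stream)
    if value_type == TYPE_ARRAY:
        item_type = _unpack(stream, "<I")
        count = _unpack(stream, "<Q")
        return [_read_value(stream, item_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {value_type}")


def read_gguf_metadata(stream):
    """Return (version, tensor_count, metadata dict) from an open binary stream."""
    if stream.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version = _unpack(stream, "<I")
    tensor_count = _unpack(stream, "<Q")
    kv_count = _unpack(stream, "<Q")
    metadata = {}
    for _ in range(kv_count):
        key = _read_string(stream)
        metadata[key] = _read_value(stream, _unpack(stream, "<I"))
    return version, tensor_count, metadata


def read_gguf_metadata_bytes(data):
    """Convenience wrapper for data that is already in memory."""
    return read_gguf_metadata(io.BytesIO(data))
```

Opening the model blob with `open(path, "rb")` and passing it to `read_gguf_metadata` should give a dict in which the token list sits under `tokenizer.ggml.tokens`. Only the header is read, so the multi-GB tensor data is never touched.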


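Downloading llama or gemma turned out to be confusing, so I grabbed the easy(?) target, MS's phi3, and analyzed that instead!

(gemma and llama need a repository access request.. it says gated model, I wonder when it will get approved)

[Link : https://huggingface.co/docs/transformers/model_doc/phi3]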

 

~/.cache/huggingface/hub/models--microsoft--Phi-3-mini-4k-instruct/snapshots/f39ac1d28e925b323eae81227eaba4464caced4e$ ls -al
합계 12
drwxrwxr-x 2 minimonk minimonk 4096  4월 19 21:58 .
drwxrwxr-x 3 minimonk minimonk 4096  4월 19 21:58 ..
lrwxrwxrwx 1 minimonk minimonk   52  4월 19 21:58 added_tokens.json -> ../../blobs/178968dec606c790aa335e9142f6afec37288470
lrwxrwxrwx 1 minimonk minimonk   52  4월 19 21:58 config.json -> ../../blobs/b9b031fadda61a035b2e8ceb4362cbf604002b21
lrwxrwxrwx 1 minimonk minimonk   52  4월 19 21:58 special_tokens_map.json -> ../../blobs/c6a944b4d49ce5d79030250ed6bdcbb1a65dfda1
lrwxrwxrwx 1 minimonk minimonk   52  4월 19 21:58 tokenizer.json -> ../../blobs/88ec145f4e7684c009bc6d55df24bb82c7d3c379
lrwxrwxrwx 1 minimonk minimonk   76  4월 19 21:58 tokenizer.model -> ../../blobs/9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347
lrwxrwxrwx 1 minimonk minimonk   52  4월 19 21:58 tokenizer_config.json -> ../../blobs/67aa82cddb4d66391ddf31ff99f059239bd2d1e7
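
Every file in the snapshot directory is just a symlink into `blobs/`. The link lengths in the listing (52 vs 76) match `../../blobs/` plus a 40 or 64 character hex name. Here is a small sketch that prints the mapping, with made-up helper names; treating the 40/64 lengths as sha1/sha256 is my assumption.

```python
import os
from pathlib import Path


def snapshot_blobs(snapshot_dir):
    """Map each file name in a snapshot directory to the blob it links to."""
    mapping = {}
    for entry in sorted(Path(snapshot_dir).iterdir()):
        if entry.is_symlink():
            mapping[entry.name] = Path(os.readlink(entry)).name
    return mapping


def hash_kind(blob_name):
    """Guess the hash type from the hex length of a blob file name (assumption)."""
    return {40: "sha1-sized (40 hex)", 64: "sha256-sized (64 hex)"}.get(
        len(blob_name), "unknown"
    )


if __name__ == "__main__":
    import sys

    # Usage: python snapshot_blobs.py <path to a snapshots/<revision> directory>
    for name, blob in snapshot_blobs(sys.argv[1]).items():
        print(f"{name:28s} -> {blob}  [{hash_kind(blob)}]")
```

In the listing above only `tokenizer.model` has the longer 64 character name; the JSON files all have 40 character ones.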

 

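Opening tokenizer.json shows tokens like the ones below..

Ugh.. at this rate(?) Korean gets one token per syllable, which is going to get messy?

With GPT's help I learned that you can search with that odd-looking pattern (\p{Hangul}), wow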

$ grep -P '\p{Hangul}' tokenizer.json 
      "이": 30393,
      "의": 30708,
      "다": 30709,
      "스": 30784,
      "사": 30791,
      "지": 30811,
      "리": 30826,
      "기": 30827,
      "정": 30852,
      "아": 30860,
      "한": 30877,
      "시": 30889,
      "대": 30890,
      "가": 30903,
      "로": 30906,
      "인": 30918,
      "하": 30944,
      "수": 30970,
      "주": 30981,
      "동": 31000,
      "자": 31013,
      "에": 31054,
      "니": 31063,
      "는": 31081,
      "서": 31093,
      "김": 31102,
      "성": 31126,
      "어": 31129,
      "도": 31136,
      "고": 31137,
      "일": 31153,
      "상": 31158,
      "전": 31170,
      "트": 31177,
      "소": 31189,
      "라": 31197,
      "원": 31198,
      "보": 31199,
      "나": 31207,
      "화": 31225,
      "구": 31231,
      "신": 31262,
      "부": 31279,
      "연": 31285,
      "을": 31286,
      "영": 31288,
      "국": 31293,
      "장": 31299,
      "제": 31306,
      "우": 31327,
      "공": 31334,
      "선": 31345,
      "오": 31346,
      "은": 31354,
      "미": 31362,
      "경": 31378,
      "문": 31406,
      "조": 31408,
      "마": 31417,
      "해": 31435,
      "여": 31457,
      "산": 31458,
      "비": 31487,
      "드": 31493,
      "를": 31517,
      "요": 31527,
      "유": 31533,
      "진": 31536,
      "천": 31563,
      "년": 31571,
      "세": 31578,
      "민": 31582,
      "호": 31603,
      "그": 31607,
      "현": 31680,
      "군": 31699,
      "무": 31716,
      "위": 31724,
      "안": 31734,
      "박": 31736,
      "용": 31737,
      "단": 31746,
      "면": 31747,
      "남": 31754,
      "강": 31774,
      "씨": 31781,
      "개": 31789,
      "들": 31804,
      "차": 31817,
      "학": 31822,
      "만": 31826,
      "터": 31856,
      "식": 31895,
      "과": 31906,
      "타": 31925,
      "종": 31930,
      "내": 31940,
      "중": 31941,
      "방": 31945,
      "월": 31950,
      "회": 31953,
      "모": 31962,
      "바": 31963,
      "음": 31966,
      "교": 31972,
      "재": 31973,
      "명": 31976,
      "합": 31980,
      "역": 31987,
      "백": 31989,
      "왕": 31996,

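To get a feel for what "one token per syllable" costs, here is a back-of-the-envelope estimate. It assumes that a syllable found in the vocab costs 1 token and that anything else falls back to one token per UTF-8 byte. Byte fallback for phi3 is an assumption on my part, and merges and the leading space marker are ignored. The ids are copied from the grep output above; `estimate_tokens` is a hypothetical helper.

```python
# Single-syllable Hangul tokens copied from the grep output above (subset).
HANGUL_VOCAB = {"자": 31013, "동": 31000, "차": 31817, "한": 30877, "국": 31293}


def estimate_tokens(word, vocab=HANGUL_VOCAB):
    """Rough count: 1 per syllable in the vocab, else one per UTF-8 byte (assumed fallback)."""
    total = 0
    for ch in word:
        total += 1 if ch in vocab else len(ch.encode("utf-8"))
    return total


if __name__ == "__main__":
    for word in ("자동차", "한국", "한글"):
        print(word, estimate_tokens(word))
```

Under these assumptions '자동차' costs 3 tokens because 자, 동 and 차 are all in the list, while '한글' costs 4 because 글 is not and would take three byte tokens. The real tokenizer may differ, so treat this as a rough estimate only.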
 
