gemma 12b, tesla t4 16GB / 1080 ti 11GB * 2 / 3070 8GB

구차니 2026. 6. 4. 15:28

Platform: llama.cpp b9500 Vulkan / Ubuntu 22.04 / 32GB

Command line

$ llama-b9500/llama-server --host 0.0.0.0 --model ./model/gemma4-12b/gemma-4-12b-it-Q4_0.gguf -mm ./model/gemma4-12b/mmproj-F16.gguf -sm layer

Conclusion: oddly, the T4 cannot pull its weight on the 12B model. Even the 1080 Ti reaches only about half the performance it gets with e4b.

Hardware: NVIDIA Tesla T4 16GB

gemma-4 12B it Q4_0.gguf Reading Generation 25 tokens 1.5s 16.46 t/s
gemma-4 12B it Q4_0.gguf Reading Generation 260 tokens 17s 15.23 t/s
gemma-4 12B it Q4_0.gguf Reading Generation 1,262 tokens 1min 35s 13.23 t/s

gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 36 tokens 2.1s 17.06 t/s
gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 360 tokens 22s 16.29 t/s
gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 1,379 tokens 2min 2s 11.27 t/s

Hardware: 1080 Ti, -sm none

gemma-4 12B it Q4_0.gguf Reading Generation 25 tokens 0.9s 27.94 t/s
gemma-4 12B it Q4_0.gguf Reading Generation 255 tokens 8.9s 28.78 t/s
gemma-4 12B it Q4_0.gguf Reading Generation 1,404 tokens 55s 25.45 t/s

gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 29 tokens 1.2s 23.71 t/s
gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 373 tokens 16s 22.28 t/s
gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 806 tokens 37s 21.34 t/s (output broke down)

Hardware: 1080 Ti, -sm layer

gemma-4 12B it Q4_0.gguf Reading Generation 25 tokens 0.8s 31.04 t/s
gemma-4 12B it Q4_0.gguf Reading Generation 265 tokens 9.0s 29.60 t/s
gemma-4 12B it Q4_0.gguf Reading Generation 1,340 tokens 54s 24.43 t/s

gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 31 tokens 1.3s 24.16 t/s
gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 263 tokens 11s 23.70 t/s
gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 620 tokens 29s 20.70 t/s (output broke down)
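The result lines above all share one shape (model file, token count, elapsed time, t/s), so they can be cross-checked mechanically. The following is a minimal sketch, not part of the original measurements: `parse_line` is a hypothetical helper name, and the pattern assumes the line format is exactly as pasted in this post. The recomputed rate will differ slightly from the reported one, presumably because the displayed time is rounded.

```python
import re

# Matches the summary lines pasted in this post, for example:
# "gemma-4 12B it Q4_0.gguf Reading Generation 1,262 tokens 1min 35s 13.23 t/s"
LINE_RE = re.compile(
    r"^(?P<model>.+?\.gguf) Reading Generation "
    r"(?P<tokens>[\d,]+) tokens "
    r"(?:(?P<min>\d+)min )?(?P<sec>[\d.]+)s "
    r"(?P<tps>[\d.]+) t/s"
)


def parse_line(line):
    """Parse one result line; return None if it does not match the format."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    tokens = int(m.group("tokens").replace(",", ""))
    seconds = 60 * int(m.group("min") or 0) + float(m.group("sec"))
    return {
        "model": m.group("model"),
        "tokens": tokens,
        "seconds": seconds,
        "reported_tps": float(m.group("tps")),
        # tokens divided by the (rounded) displayed time
        "recomputed_tps": tokens / seconds,
    }


SAMPLE = [
    "gemma-4 12B it Q4_0.gguf Reading Generation 1,262 tokens 1min 35s 13.23 t/s",
    "gemma-4 12B it Q4_0.gguf Reading Generation 1,404 tokens 55s 25.45 t/s",
]

for row in map(parse_line, SAMPLE):
    print(row["model"], row["tokens"], row["reported_tps"],
          round(row["recomputed_tps"], 2))
```

The two sample lines are the longest Q4_0 runs on the T4 and on the 1080 Ti with -sm none, which is the comparison the conclusion at the top is based on.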

---------

Maybe llama.cpp needs a version bump..

/mnt/Downloads$ llama-b9305/llama-server --host 0.0.0.0 --model ./model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf -mm ./model/gemma4-12b/mmproj-F32.gguf -sm layer
0.00.253.692 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.253.695 I device_info:
0.00.260.623 I   - Vulkan0 : Intel(R) UHD Graphics 630 (CFL GT2) (23816 MiB, 23816 MiB free)
0.00.266.859 I   - Vulkan1 : NVIDIA GeForce GTX 1080 Ti (11510 MiB, 11247 MiB free)
0.00.274.269 I   - Vulkan2 : NVIDIA GeForce GTX 1080 Ti (11510 MiB, 11389 MiB free)
0.00.274.274 I   - CPU     : Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz (31754 MiB, 31754 MiB free)
0.00.274.306 I system_info: n_threads = 6 (n_threads_batch = 6) / 6 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.00.274.309 I srv  llama_server: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
0.00.274.334 I srv          init: running without SSL
0.00.274.366 I srv          init: using 8 threads for HTTP server
0.00.274.493 I srv         start: binding port with default address family
0.00.275.753 I srv  llama_server: loading model
0.00.275.766 I srv    load_model: loading model './model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf'
0.00.319.114 E mtmd_get_memory_usage: error: Failed to load CLIP model from ./model/gemma4-12b/mmproj-F32.gguf

0.00.319.119 E srv    load_model: [mtmd] failed to get memory usage of mmproj
0.00.319.134 I common_init_result: fitting params to device memory ...
0.00.319.134 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.01.922.555 W load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
0.01.923.284 W load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
0.01.933.392 W load: control-looking token:      1 '<eos>' was not control-type; this is probably a bug in the model. its type will be overridden
0.01.966.022 W load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
0.04.567.079 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.04.903.218 E clip_init: failed to load model './model/gemma4-12b/mmproj-F32.gguf': load_hparams: unknown projector type: gemma4uv

0.04.903.588 E mtmd_init_from_file: error: Failed to load CLIP model from ./model/gemma4-12b/mmproj-F32.gguf

0.04.903.600 E srv    load_model: failed to load multimodal model, './model/gemma4-12b/mmproj-F32.gguf'
0.04.903.603 I srv    operator(): operator(): cleaning up before exit...
0.04.904.452 E srv  llama_server: exiting due to model loading error
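In a log like the one above, the decisive line is easy to miss among the warnings. Here is a small sketch that keeps only the lines logged at level `E` and pulls out the projector type named in the `unknown projector type` message. The function names are hypothetical, and the pattern assumes the prefix layout seen in this post (a dotted timestamp, one level letter, then the message).

```python
import re

# Assumed prefix layout, taken from the log above:
# "<dotted timestamp> <level letter> <message>"
LOG_RE = re.compile(r"^(?P<ts>\d+\.\d+\.\d+\.\d+) (?P<level>[A-Z]) (?P<msg>.*)$")


def error_lines(log_text):
    """Return the message part of every line logged at level E."""
    errors = []
    for line in log_text.splitlines():
        m = LOG_RE.match(line)
        if m and m.group("level") == "E":
            errors.append(m.group("msg"))
    return errors


def unknown_projector(log_text):
    """Return the projector type from an 'unknown projector type' error, or None."""
    for msg in error_lines(log_text):
        m = re.search(r"unknown projector type: (\S+)", msg)
        if m:
            return m.group(1)
    return None


# A few lines copied from the b9305 log in this post.
SAMPLE_LOG = """\
0.00.275.753 I srv  llama_server: loading model
0.01.933.392 W load: control-looking token:      1 '<eos>' was not control-type; this is probably a bug in the model. its type will be overridden
0.04.903.218 E clip_init: failed to load model './model/gemma4-12b/mmproj-F32.gguf': load_hparams: unknown projector type: gemma4uv
0.04.904.452 E srv  llama_server: exiting due to model loading error
"""

print(unknown_projector(SAMPLE_LOG))
for msg in error_lines(SAMPLE_LOG):
    print(msg)
```

A build that does not know the projector type named here cannot load this mmproj file, whatever the other flags are.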

b9500 is already out, so time to jump to the latest.

mtmd: enable non-causal vision for gemma 4 unified (#24082)

[Link: https://github.com/ggml-org/llama.cpp/releases/tag/b9494]

Giving up on multimodal on the 1080 Ti for now, the results of testing with -sm none are as follows.

gemma-4 12B it Q4_0.gguf Reading Generation 302 tokens 10s 29.78 t/s
gemma-4 12B it Q4_0.gguf Reading Generation 1,262 tokens 42s 29.96 t/s
gemma-4 12B it Q4_0.gguf Reading Generation 2,390 tokens 1min 21s 29.47 t/s

gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 327 tokens 13s 24.12 t/s
gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 943 tokens 38s 24.48 t/s
gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 2,135 tokens 1min 38s 21.78 t/s

When asked for a Python program, the output breaks down after a while and repeats endlessly, so I'm not sure it is usable.

With b9500 it runs without problems.

/mnt/Downloads$ llama-b9500/llama-server --host 0.0.0.0 --model ./model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf -mm ./model/gemma4-12b/mmproj-F32.gguf -sm layer
0.00.007.970 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.007.972 I device_info:
0.00.007.994 I   - CPU     : Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz (31754 MiB, 31754 MiB free)
0.00.008.016 I system_info: n_threads = 6 (n_threads_batch = 6) / 6 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.00.008.019 I srv  llama_server: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
0.00.008.053 I srv          init: running without SSL
0.00.008.087 I srv          init: using 8 threads for HTTP server
0.00.008.191 I srv         start: binding port with default address family
0.00.009.347 I srv  llama_server: loading model
0.00.009.369 I srv    load_model: loading model './model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf'
0.00.143.685 I srv    load_model: [mtmd] estimated worst-case memory usage of mmproj is 373.20 MiB
0.00.143.699 I common_init_result: fitting params to device memory ...
0.00.143.699 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.00.982.328 I common_params_fit_impl: projected to use 8171 MiB of host memory vs. 31754 MiB of total host memory
0.01.591.123 W load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
0.01.591.878 W load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
0.01.602.087 W load: control-looking token:      1 '<eos>' was not control-type; this is probably a bug in the model. its type will be overridden
0.01.635.174 W load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
0.04.047.792 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.04.547.906 W init_audio: audio input is in experimental stage and may have reduced quality:
    https://github.com/ggml-org/llama.cpp/discussions/13759
0.04.547.912 I srv    load_model: loaded multimodal model, './model/gemma4-12b/mmproj-F32.gguf'
0.04.547.935 I srv    load_model: initializing slots, n_slots = 4
0.05.193.680 W common_speculative_init: no implementations specified for speculative decoding
0.05.193.688 I slot   load_model: id  0 | task -1 | new slot, n_ctx = 131072
0.05.193.694 I slot   load_model: id  1 | task -1 | new slot, n_ctx = 131072
0.05.193.694 I slot   load_model: id  2 | task -1 | new slot, n_ctx = 131072
0.05.193.694 I slot   load_model: id  3 | task -1 | new slot, n_ctx = 131072
0.05.193.753 I srv    load_model: prompt cache is enabled, size limit: 8192 MiB
0.05.193.753 I srv    load_model: use `--cache-ram 0` to disable the prompt cache
0.05.193.754 I srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
0.05.193.754 I srv    load_model: context checkpoints enabled, max = 32, min spacing = 256
0.05.193.776 I srv          init: idle slots will be saved to prompt cache and cleared upon starting a new task
0.05.202.571 I init: chat template, example_format: '<|turn>system
<|think|>
You are a helpful assistant<turn|>
<|turn>user
Hello<turn|>
<|turn>model
Hi there<turn|>
<|turn>user
How are you?<turn|>
<|turn>model
'
0.05.203.449 I srv          init: init: chat template, thinking = 1
0.05.203.479 I srv  llama_server: model loaded
0.05.203.483 I srv  llama_server: server is listening on http://0.0.0.0:8080
0.05.203.488 I srv  update_slots: all slots are idle

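It was slow, so I switched to -sm none, and now it fails to start. What is this? Is it a problem introduced by the newer llama.cpp version.. or did memory consumption go up..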

$ llama-b9500/llama-server --host 0.0.0.0 --model ./model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf -mm ./model/gemma4-12b/mmproj-F16.gguf -sm none
0.00.007.825 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.007.827 I device_info:
0.00.007.849 I   - CPU     : Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz (31754 MiB, 31754 MiB free)
0.00.007.871 I system_info: n_threads = 6 (n_threads_batch = 6) / 6 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.00.007.873 I srv  llama_server: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
0.00.007.906 I srv          init: running without SSL
0.00.007.941 I srv          init: using 8 threads for HTTP server
0.00.008.026 I srv         start: binding port with default address family
0.00.009.238 I srv  llama_server: loading model
0.00.009.273 I srv    load_model: loading model './model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf'
0.00.083.085 I srv    load_model: [mtmd] estimated worst-case memory usage of mmproj is 239.14 MiB
0.00.083.101 I common_init_result: fitting params to device memory ...
0.00.083.101 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.00.352.254 E llama_prepare_model_devices: invalid value for main_gpu: 0 (available devices: 0)
0.00.355.698 E llama_model_load_from_file_impl: failed to load model
0.00.355.750 E common_fit_params: encountered an error while trying to fit params to free device memory: failed to load model
0.00.604.770 E llama_prepare_model_devices: invalid value for main_gpu: 0 (available devices: 0)
0.00.609.674 E llama_model_load_from_file_impl: failed to load model
0.00.609.680 E common_init_from_params: failed to load model './model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf'
0.00.609.684 E srv    load_model: failed to load model, './model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf'
0.00.609.685 I srv    operator(): operator(): cleaning up before exit...
0.00.610.314 E srv  llama_server: exiting due to model loading error
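One thing that can be read straight from the logs: the b9305 run lists Vulkan0 to Vulkan2 under `device_info`, while both b9500 runs above list only the CPU entry. That would be consistent with the `available devices: 0` error, although this post does not confirm the cause. The sketch below extracts the device entries from a log so the two cases can be compared; the helper names are hypothetical, and the pattern assumes the `device_info` line layout shown in this post.

```python
import re

# Assumed layout of a device_info entry, taken from the logs in this post:
# "... I   - <name> : <description> (<total> MiB, <free> MiB free)"
DEVICE_RE = re.compile(
    r"- (?P<name>\S+)\s*: (?P<desc>.+?) "
    r"\((?P<total>\d+) MiB, (?P<free>\d+) MiB free\)\s*$"
)


def list_devices(log_text):
    """Return (name, description, total_mib, free_mib) for each device entry."""
    devices = []
    for line in log_text.splitlines():
        m = DEVICE_RE.search(line)
        if m:
            devices.append((m.group("name"), m.group("desc"),
                            int(m.group("total")), int(m.group("free"))))
    return devices


def gpu_devices(devices):
    """Everything except the CPU entry counts as a GPU device here."""
    return [d for d in devices if d[0] != "CPU"]


# device_info lines copied from the b9305 and b9500 logs in this post.
LOG_B9305 = """\
0.00.253.695 I device_info:
0.00.260.623 I   - Vulkan0 : Intel(R) UHD Graphics 630 (CFL GT2) (23816 MiB, 23816 MiB free)
0.00.266.859 I   - Vulkan1 : NVIDIA GeForce GTX 1080 Ti (11510 MiB, 11247 MiB free)
0.00.274.269 I   - Vulkan2 : NVIDIA GeForce GTX 1080 Ti (11510 MiB, 11389 MiB free)
0.00.274.274 I   - CPU     : Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz (31754 MiB, 31754 MiB free)
"""
LOG_B9500 = """\
0.00.007.827 I device_info:
0.00.007.849 I   - CPU     : Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz (31754 MiB, 31754 MiB free)
"""

for label, log in (("b9305", LOG_B9305), ("b9500", LOG_B9500)):
    gpus = gpu_devices(list_devices(log))
    print(label, "GPU devices:", [name for name, *_ in gpus])
```

If the GPU list comes back empty, changing -sm will not help, since there is no device to put the model on.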

TESLA T4, using llama.cpp b9500 Vulkan

The numbers come out strangely low.

gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 36 tokens 2.1s 17.06 t/s
gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 360 tokens 22s 16.29 t/s
gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 1,379 tokens 2min 2s 11.27 t/s

gemma-4 12B it Q4_0.gguf Reading Generation 25 tokens 1.5s 16.46 t/s
gemma-4 12B it Q4_0.gguf Reading Generation 260 tokens 17s 15.23 t/s
gemma-4 12B it Q4_0.gguf Reading Generation 1,262 tokens 1min 35s 13.23 t/s

2026.06.17

ubuntu 26.04 + driver 595.71.05 + CUDA 13.2

Not sure whether it was actually offloaded.

$ ./llama-b9553/llama-cli -m ./model/gemma4-12b/gemma-4-12b-it-Q4_0.gguf

[ Prompt: 5.6 t/s | Generation: 36.5 t/s ]
[ Prompt: 320.6 t/s | Generation: 37.7 t/s ]
[ Prompt: 66.3 t/s | Generation: 36.3 t/s ]

$ ./llama-b9553/llama-cli -m ./model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf

[ Prompt: 2.1 t/s | Generation: 46.0 t/s ] / [ Prompt: 138.3 t/s | Generation: 55.0 t/s ]
[ Prompt: 3.8 t/s | Generation: 53.9 t/s ]
[ Prompt: 12.9 t/s | Generation: 51.4 t/s ]
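To compare the two quantizations from these llama-cli readings, the bracketed summaries can be parsed and the generation rates averaged. This is only a sketch over the numbers pasted above: `parse_summaries` is a hypothetical helper, and the pattern assumes the `[ Prompt: ... | Generation: ... ]` layout shown here. The prompt rates vary too widely between readings to average meaningfully, so only generation is summarized.

```python
import re
from statistics import mean

# Assumed summary layout, as pasted in this post:
# "[ Prompt: 320.6 t/s | Generation: 37.7 t/s ]"
SUMMARY_RE = re.compile(
    r"\[ Prompt: (?P<prompt>[\d.]+) t/s \| Generation: (?P<gen>[\d.]+) t/s \]"
)


def parse_summaries(text):
    """Return a list of (prompt_tps, generation_tps) pairs found in text."""
    return [(float(m.group("prompt")), float(m.group("gen")))
            for m in SUMMARY_RE.finditer(text)]


# Readings copied from this post (distinct entries only).
Q4_0 = """\
[ Prompt: 5.6 t/s | Generation: 36.5 t/s ]
[ Prompt: 320.6 t/s | Generation: 37.7 t/s ]
[ Prompt: 66.3 t/s | Generation: 36.3 t/s ]
"""
UD_Q2_K_XL = """\
[ Prompt: 2.1 t/s | Generation: 46.0 t/s ] / [ Prompt: 138.3 t/s | Generation: 55.0 t/s ]
[ Prompt: 3.8 t/s | Generation: 53.9 t/s ]
[ Prompt: 12.9 t/s | Generation: 51.4 t/s ]
"""

for label, text in (("Q4_0", Q4_0), ("UD Q2_K_XL", UD_Q2_K_XL)):
    gen = [g for _, g in parse_summaries(text)]
    print(label, "readings:", len(gen), "mean generation t/s:", round(mean(gen), 2))
```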
