gemma 12b, tesla t4 16GB / 1080 ti 11GB * 2 / 3070 8GB

프로그램 사용/ai 프로그램2026. 6. 4. 15:28

gemma 12b, tesla t4 16GB / 1080 ti 11GB * 2 / 3070 8GB

플랫폼 llama.cpp B9500 vulkan / ubuntu 22.04 / 32GB

명령줄

$ llama-b9500/llama-server --host 0.0.0.0 --model ./model/gemma4-12b/gemma-4-12b-it-Q4_0.gguf -mm ./model/gemma4-12b/mmproj-F16.gguf -sm layer

결론 : t4가 이상하게 12B 모델은 힘을 못쓴다. e4b에 비하면 1080 ti 도 절반정도 성능.

하드웨어 nvidia tesla t4 16GB

gemma-4 12B it Q4_0.gguf Reading Generation 25 tokens 1.5s 16.46 t/s
gemma-4 12B it Q4_0.gguf Reading Generation 260 tokens 17s 15.23 t/s
gemma-4 12B it Q4_0.gguf Reading Generation 1,262 tokens 1min 35s 13.23 t/s

gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 36 tokens 2.1s 17.06 t/s
gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 360 tokens 22s 16.29 t/s
gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 1,379 tokens 2min 2s 11.27 t/s

하드웨어 1080 ti -sm none

gemma-4 12B it Q4_0.gguf Reading Generation 25 tokens 0.9s 27.94 t/s
gemma-4 12B it Q4_0.gguf Reading Generation 255 tokens 8.9s 28.78 t/s
gemma-4 12B it Q4_0.gguf Reading Generation 1,404 tokens 55s 25.45 t/s

gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 29 tokens 1.2s 23.71 t/s
gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 373 tokens 16s 22.28 t/s
gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 806 tokens 37s 21.34 t/s (터짐)

하드웨어 1080 ti -sm layer

gemma-4 12B it Q4_0.gguf Reading Generation 25 tokens 0.8s 31.04 t/s
gemma-4 12B it Q4_0.gguf Reading Generation 265 tokens 9.0s 29.60 t/s
gemma-4 12B it Q4_0.gguf Reading Generation 1,340 tokens 54s 24.43 t/s

gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 31 tokens 1.3s 24.16 t/s
gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 263 tokens 11s 23.70 t/s
gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 620 tokens 29s 20.70 t/s (터짐)

---------

llama.cpp 버전업을 해야 하려나..

/mnt/Downloads$ llama-b9305/llama-server --host 0.0.0.0 --model ./model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf -mm ./model/gemma4-12b/mmproj-F32.gguf -sm layer
0.00.253.692 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.253.695 I device_info:
0.00.260.623 I   - Vulkan0 : Intel(R) UHD Graphics 630 (CFL GT2) (23816 MiB, 23816 MiB free)
0.00.266.859 I   - Vulkan1 : NVIDIA GeForce GTX 1080 Ti (11510 MiB, 11247 MiB free)
0.00.274.269 I   - Vulkan2 : NVIDIA GeForce GTX 1080 Ti (11510 MiB, 11389 MiB free)
0.00.274.274 I   - CPU     : Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz (31754 MiB, 31754 MiB free)
0.00.274.306 I system_info: n_threads = 6 (n_threads_batch = 6) / 6 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.00.274.309 I srv  llama_server: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
0.00.274.334 I srv          init: running without SSL
0.00.274.366 I srv          init: using 8 threads for HTTP server
0.00.274.493 I srv         start: binding port with default address family
0.00.275.753 I srv  llama_server: loading model
0.00.275.766 I srv    load_model: loading model './model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf'
0.00.319.114 E mtmd_get_memory_usage: error: Failed to load CLIP model from ./model/gemma4-12b/mmproj-F32.gguf

0.00.319.119 E srv    load_model: [mtmd] failed to get memory usage of mmproj
0.00.319.134 I common_init_result: fitting params to device memory ...
0.00.319.134 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.01.922.555 W load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
0.01.923.284 W load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
0.01.933.392 W load: control-looking token:      1 '<eos>' was not control-type; this is probably a bug in the model. its type will be overridden
0.01.966.022 W load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
0.04.567.079 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.04.903.218 E clip_init: failed to load model './model/gemma4-12b/mmproj-F32.gguf': load_hparams: unknown projector type: gemma4uv

0.04.903.588 E mtmd_init_from_file: error: Failed to load CLIP model from ./model/gemma4-12b/mmproj-F32.gguf

0.04.903.600 E srv    load_model: failed to load multimodal model, './model/gemma4-12b/mmproj-F32.gguf'
0.04.903.603 I srv    operator(): operator(): cleaning up before exit...
0.04.904.452 E srv  llama_server: exiting due to model loading error

b9500 까지 나왔으니 언넝 최신으로 ㄱㄱ

mtmd: enable non-causal vision for gemma 4 unified (#24082)

[링크 : https://github.com/ggml-org/llama.cpp/releases/tag/b9494]

1080 ti 에서 멀티모달은 일단 포기하고 -sm none 으로 테스트 한 결과는 아래와 같다.

gemma-4 12B it Q4_0.gguf Reading Generation 302 tokens 10s 29.78 t/s
gemma-4 12B it Q4_0.gguf Reading Generation 1,262 tokens 42s 29.96 t/s
gemma-4 12B it Q4_0.gguf Reading Generation 2,390 tokens 1min 21s 29.47 t/s

gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 327 tokens 13s 24.12 t/s
gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 943 tokens 38s 24.48 t/s
gemma-4 12B it UDQ2_K_XL.gguf Reading Generation 2,135 tokens 1min 38s 21.78 t/s

파이썬 프로그램은 좀 생성한다 싶으면 터져서 무한반복해서 쓸 수 있나 모르겠다.

b9500 으로 하니 문제없이 실행된다.

/mnt/Downloads$ llama-b9500/llama-server --host 0.0.0.0 --model ./model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf -mm ./model/gemma4-12b/mmproj-F32.gguf -sm layer
0.00.007.970 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.007.972 I device_info:
0.00.007.994 I   - CPU     : Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz (31754 MiB, 31754 MiB free)
0.00.008.016 I system_info: n_threads = 6 (n_threads_batch = 6) / 6 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.00.008.019 I srv  llama_server: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
0.00.008.053 I srv          init: running without SSL
0.00.008.087 I srv          init: using 8 threads for HTTP server
0.00.008.191 I srv         start: binding port with default address family
0.00.009.347 I srv  llama_server: loading model
0.00.009.369 I srv    load_model: loading model './model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf'
0.00.143.685 I srv    load_model: [mtmd] estimated worst-case memory usage of mmproj is 373.20 MiB
0.00.143.699 I common_init_result: fitting params to device memory ...
0.00.143.699 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.00.982.328 I common_params_fit_impl: projected to use 8171 MiB of host memory vs. 31754 MiB of total host memory
0.01.591.123 W load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
0.01.591.878 W load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
0.01.602.087 W load: control-looking token:      1 '<eos>' was not control-type; this is probably a bug in the model. its type will be overridden
0.01.635.174 W load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
0.04.047.792 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.04.547.906 W init_audio: audio input is in experimental stage and may have reduced quality:
    https://github.com/ggml-org/llama.cpp/discussions/13759
0.04.547.912 I srv    load_model: loaded multimodal model, './model/gemma4-12b/mmproj-F32.gguf'
0.04.547.935 I srv    load_model: initializing slots, n_slots = 4
0.05.193.680 W common_speculative_init: no implementations specified for speculative decoding
0.05.193.688 I slot   load_model: id  0 | task -1 | new slot, n_ctx = 131072
0.05.193.694 I slot   load_model: id  1 | task -1 | new slot, n_ctx = 131072
0.05.193.694 I slot   load_model: id  2 | task -1 | new slot, n_ctx = 131072
0.05.193.694 I slot   load_model: id  3 | task -1 | new slot, n_ctx = 131072
0.05.193.753 I srv    load_model: prompt cache is enabled, size limit: 8192 MiB
0.05.193.753 I srv    load_model: use `--cache-ram 0` to disable the prompt cache
0.05.193.754 I srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
0.05.193.754 I srv    load_model: context checkpoints enabled, max = 32, min spacing = 256
0.05.193.776 I srv          init: idle slots will be saved to prompt cache and cleared upon starting a new task
0.05.202.571 I init: chat template, example_format: '<|turn>system
<|think|>
You are a helpful assistant<turn|>
<|turn>user
Hello<turn|>
<|turn>model
Hi there<turn|>
<|turn>user
How are you?<turn|>
<|turn>model
'
0.05.203.449 I srv          init: init: chat template, thinking = 1
0.05.203.479 I srv  llama_server: model loaded
0.05.203.483 I srv  llama_server: server is listening on http://0.0.0.0:8080
0.05.203.488 I srv  update_slots: all slots are idle

느려서 sm none 하니까 터진다. 머냐? llama.cpp 버전 올라가면서 문제인가.. 아니면 메모리 소모량이 늘은거냐..

$ llama-b9500/llama-server --host 0.0.0.0 --model ./model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf -mm ./model/gemma4-12b/mmproj-F16.gguf -sm none
0.00.007.825 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.007.827 I device_info:
0.00.007.849 I   - CPU     : Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz (31754 MiB, 31754 MiB free)
0.00.007.871 I system_info: n_threads = 6 (n_threads_batch = 6) / 6 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.00.007.873 I srv  llama_server: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
0.00.007.906 I srv          init: running without SSL
0.00.007.941 I srv          init: using 8 threads for HTTP server
0.00.008.026 I srv         start: binding port with default address family
0.00.009.238 I srv  llama_server: loading model
0.00.009.273 I srv    load_model: loading model './model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf'
0.00.083.085 I srv    load_model: [mtmd] estimated worst-case memory usage of mmproj is 239.14 MiB
0.00.083.101 I common_init_result: fitting params to device memory ...
0.00.083.101 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.00.352.254 E llama_prepare_model_devices: invalid value for main_gpu: 0 (available devices: 0)
0.00.355.698 E llama_model_load_from_file_impl: failed to load model
0.00.355.750 E common_fit_params: encountered an error while trying to fit params to free device memory: failed to load model
0.00.604.770 E llama_prepare_model_devices: invalid value for main_gpu: 0 (available devices: 0)
0.00.609.674 E llama_model_load_from_file_impl: failed to load model
0.00.609.680 E common_init_from_params: failed to load model './model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf'
0.00.609.684 E srv    load_model: failed to load model, './model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf'
0.00.609.685 I srv    operator(): operator(): cleaning up before exit...
0.00.610.314 E srv  llama_server: exiting due to model loading error

TESLA T4 / llama.cpp B9500 vulkan 사용

이상하게 낮게 나오네.

gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 36 tokens 2.1s 17.06 t/s
gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 360 tokens 22s 16.29 t/s
gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 1,379 tokens 2min 2s 11.27 t/s

gemma-4 12B it Q4_0.gguf Reading Generation 25 tokens 1.5s 16.46 t/s
gemma-4 12B it Q4_0.gguf Reading Generation 260 tokens 17s 15.23 t/s
gemma-4 12B it Q4_0.gguf Reading Generation 1,262 tokens 1min 35s 13.23 t/s

2026.06.17

ubunt 26.04 + driver 595.71.05 + CUDA 13.2

오프로딩 된 건진 모르겠음

$ ./llama-b9553/llama-cli -m ./model/gemma4-12b/gemma-4-12b-it-Q4_0.gguf

[ Prompt: 5.6 t/s | Generation: 36.5 t/s ] [ Prompt: 5.6 t/s | Generation: 36.5 t/s ]
[ Prompt: 320.6 t/s | Generation: 37.7 t/s ]
[ Prompt: 66.3 t/s | Generation: 36.3 t/s ]

$ ./llama-b9553/llama-cli -m ./model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf

[ Prompt: 2.1 t/s | Generation: 46.0 t/s ] / [ Prompt: 138.3 t/s | Generation: 55.0 t/s ]
[ Prompt: 3.8 t/s | Generation: 53.9 t/s ]
[ Prompt: 12.9 t/s | Generation: 51.4 t/s ]

저작자표시 (새창열림)

'프로그램 사용 > ai 프로그램' 카테고리의 다른 글

sigLIP, CLIP (0)	2026.06.05
chatML (0)	2026.06.04
nvidia tesla t4 16GB (0)	2026.06.02
llama.cpp reasoning 옵션 (0)	2026.06.01
torchvision model (0)	2026.06.01

Posted by 구차니

구차니의 잡동사니 모음

gemma 12b, tesla t4 16GB / 1080 ti 11GB * 2 / 3070 8GB

'프로그램 사용 > ai 프로그램' 카테고리의 다른 글

카테고리

공지사항

태그목록

최근에 올라온 글

최근에 달린 댓글

티스토리툴바