플랫폼 llama.cpp B9500 vulkan / ubuntu 22.04 / 32GB
명령줄
| $ llama-b9500/llama-server --host 0.0.0.0 --model ./model/gemma4-12b/gemma-4-12b-it-Q4_0.gguf -mm ./model/gemma4-12b/mmproj-F16.gguf -sm layer |
결론 : t4가 이상하게 12B 모델은 힘을 못쓴다. e4b에 비하면 1080 ti 도 절반정도 성능.
하드웨어 nvidia tesla t4 16GB
| gemma-4 12B it Q4_0.gguf Reading Generation 25 tokens 1.5s 16.46 t/s gemma-4 12B it Q4_0.gguf Reading Generation 260 tokens 17s 15.23 t/s gemma-4 12B it Q4_0.gguf Reading Generation 1,262 tokens 1min 35s 13.23 t/s gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 36 tokens 2.1s 17.06 t/s gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 360 tokens 22s 16.29 t/s gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 1,379 tokens 2min 2s 11.27 t/s |
하드웨어 1080 ti -sm none
| gemma-4 12B it Q4_0.gguf Reading Generation 25 tokens 0.9s 27.94 t/s gemma-4 12B it Q4_0.gguf Reading Generation 255 tokens 8.9s 28.78 t/s gemma-4 12B it Q4_0.gguf Reading Generation 1,404 tokens 55s 25.45 t/s gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 29 tokens 1.2s 23.71 t/s gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 373 tokens 16s 22.28 t/s gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 806 tokens 37s 21.34 t/s (터짐) |
하드웨어 1080 ti -sm layer
| gemma-4 12B it Q4_0.gguf Reading Generation 25 tokens 0.8s 31.04 t/s gemma-4 12B it Q4_0.gguf Reading Generation 265 tokens 9.0s 29.60 t/s gemma-4 12B it Q4_0.gguf Reading Generation 1,340 tokens 54s 24.43 t/s gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 31 tokens 1.3s 24.16 t/s gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 263 tokens 11s 23.70 t/s gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 620 tokens 29s 20.70 t/s (터짐) |
---------
llama.cpp 버전업을 해야 하려나..
| /mnt/Downloads$ llama-b9305/llama-server --host 0.0.0.0 --model ./model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf -mm ./model/gemma4-12b/mmproj-F32.gguf -sm layer 0.00.253.692 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg) 0.00.253.695 I device_info: 0.00.260.623 I - Vulkan0 : Intel(R) UHD Graphics 630 (CFL GT2) (23816 MiB, 23816 MiB free) 0.00.266.859 I - Vulkan1 : NVIDIA GeForce GTX 1080 Ti (11510 MiB, 11247 MiB free) 0.00.274.269 I - Vulkan2 : NVIDIA GeForce GTX 1080 Ti (11510 MiB, 11389 MiB free) 0.00.274.274 I - CPU : Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz (31754 MiB, 31754 MiB free) 0.00.274.306 I system_info: n_threads = 6 (n_threads_batch = 6) / 6 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 0.00.274.309 I srv llama_server: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true 0.00.274.334 I srv init: running without SSL 0.00.274.366 I srv init: using 8 threads for HTTP server 0.00.274.493 I srv start: binding port with default address family 0.00.275.753 I srv llama_server: loading model 0.00.275.766 I srv load_model: loading model './model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf' 0.00.319.114 E mtmd_get_memory_usage: error: Failed to load CLIP model from ./model/gemma4-12b/mmproj-F32.gguf 0.00.319.119 E srv load_model: [mtmd] failed to get memory usage of mmproj 0.00.319.134 I common_init_result: fitting params to device memory ... 0.00.319.134 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on) 0.01.922.555 W load: control-looking token: 212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden 0.01.923.284 W load: control-looking token: 50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden 0.01.933.392 W load: control-looking token: 1 '<eos>' was not control-type; this is probably a bug in the model. its type will be overridden 0.01.966.022 W load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list 0.04.567.079 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) 0.04.903.218 E clip_init: failed to load model './model/gemma4-12b/mmproj-F32.gguf': load_hparams: unknown projector type: gemma4uv 0.04.903.588 E mtmd_init_from_file: error: Failed to load CLIP model from ./model/gemma4-12b/mmproj-F32.gguf 0.04.903.600 E srv load_model: failed to load multimodal model, './model/gemma4-12b/mmproj-F32.gguf' 0.04.903.603 I srv operator(): operator(): cleaning up before exit... 0.04.904.452 E srv llama_server: exiting due to model loading error |
b9500 까지 나왔으니 언넝 최신으로 ㄱㄱ
| mtmd: enable non-causal vision for gemma 4 unified (#24082) |
[링크 : https://github.com/ggml-org/llama.cpp/releases/tag/b9494]
1080 ti 에서 멀티모달은 일단 포기하고 -sm none 으로 테스트 한 결과는 아래와 같다.
| gemma-4 12B it Q4_0.gguf Reading Generation 302 tokens 10s 29.78 t/s gemma-4 12B it Q4_0.gguf Reading Generation 1,262 tokens 42s 29.96 t/s gemma-4 12B it Q4_0.gguf Reading Generation 2,390 tokens 1min 21s 29.47 t/s gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 327 tokens 13s 24.12 t/s gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 943 tokens 38s 24.48 t/s gemma-4 12B it UDQ2_K_XL.gguf Reading Generation 2,135 tokens 1min 38s 21.78 t/s |
파이썬 프로그램은 좀 생성한다 싶으면 터져서 무한반복해서 쓸 수 있나 모르겠다.
+
b9500 으로 하니 문제없이 실행된다.
| /mnt/Downloads$ llama-b9500/llama-server --host 0.0.0.0 --model ./model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf -mm ./model/gemma4-12b/mmproj-F32.gguf -sm layer 0.00.007.970 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg) 0.00.007.972 I device_info: 0.00.007.994 I - CPU : Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz (31754 MiB, 31754 MiB free) 0.00.008.016 I system_info: n_threads = 6 (n_threads_batch = 6) / 6 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 0.00.008.019 I srv llama_server: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true 0.00.008.053 I srv init: running without SSL 0.00.008.087 I srv init: using 8 threads for HTTP server 0.00.008.191 I srv start: binding port with default address family 0.00.009.347 I srv llama_server: loading model 0.00.009.369 I srv load_model: loading model './model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf' 0.00.143.685 I srv load_model: [mtmd] estimated worst-case memory usage of mmproj is 373.20 MiB 0.00.143.699 I common_init_result: fitting params to device memory ... 0.00.143.699 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on) 0.00.982.328 I common_params_fit_impl: projected to use 8171 MiB of host memory vs. 31754 MiB of total host memory 0.01.591.123 W load: control-looking token: 212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden 0.01.591.878 W load: control-looking token: 50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden 0.01.602.087 W load: control-looking token: 1 '<eos>' was not control-type; this is probably a bug in the model. its type will be overridden 0.01.635.174 W load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list 0.04.047.792 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable) 0.04.547.906 W init_audio: audio input is in experimental stage and may have reduced quality: https://github.com/ggml-org/llama.cpp/discussions/13759 0.04.547.912 I srv load_model: loaded multimodal model, './model/gemma4-12b/mmproj-F32.gguf' 0.04.547.935 I srv load_model: initializing slots, n_slots = 4 0.05.193.680 W common_speculative_init: no implementations specified for speculative decoding 0.05.193.688 I slot load_model: id 0 | task -1 | new slot, n_ctx = 131072 0.05.193.694 I slot load_model: id 1 | task -1 | new slot, n_ctx = 131072 0.05.193.694 I slot load_model: id 2 | task -1 | new slot, n_ctx = 131072 0.05.193.694 I slot load_model: id 3 | task -1 | new slot, n_ctx = 131072 0.05.193.753 I srv load_model: prompt cache is enabled, size limit: 8192 MiB 0.05.193.753 I srv load_model: use `--cache-ram 0` to disable the prompt cache 0.05.193.754 I srv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391 0.05.193.754 I srv load_model: context checkpoints enabled, max = 32, min spacing = 256 0.05.193.776 I srv init: idle slots will be saved to prompt cache and cleared upon starting a new task 0.05.202.571 I init: chat template, example_format: '<|turn>system <|think|> You are a helpful assistant<turn|> <|turn>user Hello<turn|> <|turn>model Hi there<turn|> <|turn>user How are you?<turn|> <|turn>model ' 0.05.203.449 I srv init: init: chat template, thinking = 1 0.05.203.479 I srv llama_server: model loaded 0.05.203.483 I srv llama_server: server is listening on http://0.0.0.0:8080 0.05.203.488 I srv update_slots: all slots are idle |
느려서 sm none 하니까 터진다. 머냐? llama.cpp 버전 올라가면서 문제인가.. 아니면 메모리 소모량이 늘은거냐..
| $ llama-b9500/llama-server --host 0.0.0.0 --model ./model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf -mm ./model/gemma4-12b/mmproj-F16.gguf -sm none 0.00.007.825 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg) 0.00.007.827 I device_info: 0.00.007.849 I - CPU : Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz (31754 MiB, 31754 MiB free) 0.00.007.871 I system_info: n_threads = 6 (n_threads_batch = 6) / 6 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 0.00.007.873 I srv llama_server: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true 0.00.007.906 I srv init: running without SSL 0.00.007.941 I srv init: using 8 threads for HTTP server 0.00.008.026 I srv start: binding port with default address family 0.00.009.238 I srv llama_server: loading model 0.00.009.273 I srv load_model: loading model './model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf' 0.00.083.085 I srv load_model: [mtmd] estimated worst-case memory usage of mmproj is 239.14 MiB 0.00.083.101 I common_init_result: fitting params to device memory ... 0.00.083.101 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on) 0.00.352.254 E llama_prepare_model_devices: invalid value for main_gpu: 0 (available devices: 0) 0.00.355.698 E llama_model_load_from_file_impl: failed to load model 0.00.355.750 E common_fit_params: encountered an error while trying to fit params to free device memory: failed to load model 0.00.604.770 E llama_prepare_model_devices: invalid value for main_gpu: 0 (available devices: 0) 0.00.609.674 E llama_model_load_from_file_impl: failed to load model 0.00.609.680 E common_init_from_params: failed to load model './model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf' 0.00.609.684 E srv load_model: failed to load model, './model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf' 0.00.609.685 I srv operator(): operator(): cleaning up before exit... 0.00.610.314 E srv llama_server: exiting due to model loading error |
+
TESLA T4 / llama.cpp B9500 vulkan 사용
이상하게 낮게 나오네.
| gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 36 tokens 2.1s 17.06 t/s gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 360 tokens 22s 16.29 t/s gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 1,379 tokens 2min 2s 11.27 t/s gemma-4 12B it Q4_0.gguf Reading Generation 25 tokens 1.5s 16.46 t/s gemma-4 12B it Q4_0.gguf Reading Generation 260 tokens 17s 15.23 t/s gemma-4 12B it Q4_0.gguf Reading Generation 1,262 tokens 1min 35s 13.23 t/s |
'프로그램 사용 > ai 프로그램' 카테고리의 다른 글
| chatML (0) | 2026.06.04 |
|---|---|
| nvidia tesla t4 16GB (0) | 2026.06.02 |
| llama.cpp reasoning 옵션 (0) | 2026.06.01 |
| safetensors to gguf 일단 실패 (0) | 2026.06.01 |
| antigravity 2 (0) | 2026.05.28 |
