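Summary
QAT does not seem to make much difference in generation speed. I will have to actually use it to judge the output quality.
MTP seems to improve generation speed by up to about 50%? (The 2026.06.18 re-measurement further down shows closer to 30-35%.)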
---
QAT
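Ooh, a fresh model from just 3-4 days ago!
It is only about 3-4 GB, so I am really curious how it performs.
[Link : https://huggingface.co/unsloth/gemma-4-E4B-it-qat-GGUF]
What I had been testing before was Q4_K_M, so I am not sure the results are comparable.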
$ ../../llama-b9553/llama-cli -m gemma-4-E4B-it-qat-UD-Q2_K_XL.gguf -sm none
[ Prompt: 16.8 t/s | Generation: 38.6 t/s ]
[ Prompt: 97.9 t/s | Generation: 41.1 t/s ]
[ Prompt: 196.1 t/s | Generation: 39.9 t/s ]

$ ../../llama-b9553/llama-cli -m gemma-4-E4B-it-qat-UD-Q4_K_XL.gguf -sm none
[ Prompt: 737.0 t/s | Generation: 62.5 t/s ]
[ Prompt: 238.5 t/s | Generation: 61.4 t/s ]
[ Prompt: 292.3 t/s | Generation: 58.0 t/s ]
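Comparing these timing lines by eye gets tedious, so here is a small sketch that pulls the generation speeds out of pasted llama-cli output and averages them. The regular expression only assumes the `[ Prompt: … t/s | Generation: … t/s ]` shape seen above; the function and variable names are my own.

```python
import re
from statistics import mean

# Matches one "[ Prompt: 16.8 t/s | Generation: 38.6 t/s ]" block as pasted above.
TIMING = re.compile(
    r"\[\s*Prompt:\s*([\d.]+)\s*t/s\s*\|\s*Generation:\s*([\d.]+)\s*t/s\s*\]"
)

def generation_speeds(log_text):
    """Return every generation speed (t/s) found in the pasted log text."""
    return [float(gen) for _prompt, gen in TIMING.findall(log_text)]

# The two QAT runs from this post, copied verbatim.
q2_log = ("[ Prompt: 16.8 t/s | Generation: 38.6 t/s ] "
          "[ Prompt: 97.9 t/s | Generation: 41.1 t/s ] "
          "[ Prompt: 196.1 t/s | Generation: 39.9 t/s ]")
q4_log = ("[ Prompt: 737.0 t/s | Generation: 62.5 t/s ] "
          "[ Prompt: 238.5 t/s | Generation: 61.4 t/s ] "
          "[ Prompt: 292.3 t/s | Generation: 58.0 t/s ]")

q2_avg = mean(generation_speeds(q2_log))
q4_avg = mean(generation_speeds(q4_log))
print(f"UD-Q2_K_XL: {q2_avg:.1f} t/s, UD-Q4_K_XL: {q4_avg:.1f} t/s")
```

For these two runs that comes out to roughly 39.9 t/s for UD-Q2_K_XL and 60.6 t/s for UD-Q4_K_XL, so in this particular test the smaller file was not the faster one.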
MTP
Like multimodal, MTP needs two model files..
First I would have to build with CUDA enabled.. I wonder whether the SDK will cause trouble.. sigh
./build/bin/llama-server \
  -m gemma-4-12b-it-Q4_K_M.gguf \
  --model-draft MTP/gemma-4-12B-it-MTP-Q8_0.gguf \
  --spec-type draft-mtp --spec-draft-n-max 4 \
  -ngl 999 -fa on
Multi GPU: add --spec-draft-device CUDA0 -sm layer.
[Link : https://huggingface.co/unsloth/gemma-4-12b-it-GGUF/blob/main/MTP/README.md]
+
Hmm.. the build attempt went down in flames, lol
$ cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=61 CMAKE_BUILD_TYPE=Release
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- GGML_SYSTEM_ARCH: x86
-- Including CPU backend
-- x86 detected
-- Adding CPU backend variant ggml-cpu: -march=native
-- Unable to find cublas_v2.h in either "/usr/local/cuda/include" or "/usr/math_libs/include"
-- CUDA Toolkit found
CMake Error at /usr/share/cmake-3.22/Modules/CMakeDetermineCompilerId.cmake:726 (message):
Compiling the CUDA compiler identification source file "CMakeCUDACompilerId.cu" failed.
Compiler: /usr/local/cuda/bin/nvcc
Build flags:
Id flags: --keep;--keep-dir;tmp;-gencode=arch=compute_61,code=sm_61 -v
The output was:
1
nvcc fatal : Unsupported gpu architecture 'compute_61'
Call Stack (most recent call first):
/usr/share/cmake-3.22/Modules/CMakeDetermineCompilerId.cmake:6 (CMAKE_DETERMINE_COMPILER_ID_BUILD)
/usr/share/cmake-3.22/Modules/CMakeDetermineCompilerId.cmake:48 (__determine_compiler_id_test)
/usr/share/cmake-3.22/Modules/CMakeDetermineCUDACompiler.cmake:298 (CMAKE_DETERMINE_COMPILER_ID)
ggml/src/ggml-cuda/CMakeLists.txt:59 (enable_language)
-- Configuring incomplete, errors occurred!
See also "/home/falinux/src/llama.cpp/build/CMakeFiles/CMakeOutput.log".
See also "/home/falinux/src/llama.cpp/build/CMakeFiles/CMakeError.log".
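The line that matters in this log is `nvcc fatal : Unsupported gpu architecture 'compute_61'`: the nvcc under /usr/local/cuda does not accept the architecture passed via CMAKE_CUDA_ARCHITECTURES (61 is what the 1080 ti needs). A rough sketch for checking what the installed nvcc accepts before configuring; it assumes the toolkit is recent enough to understand `nvcc --list-gpu-arch`, and the sample output below is made up for illustration, not captured from this machine.

```python
import subprocess

def parse_gpu_archs(nvcc_output):
    """Turn the text printed by `nvcc --list-gpu-arch` into a set like {"compute_75", ...}."""
    return {
        line.strip()
        for line in nvcc_output.splitlines()
        if line.strip().startswith("compute_")
    }

def nvcc_supports(arch, nvcc="/usr/local/cuda/bin/nvcc"):
    """Ask the installed nvcc which virtual architectures it accepts.

    Assumption: the toolkit knows --list-gpu-arch; older toolkits will exit
    with an error, which check=True turns into CalledProcessError.
    """
    result = subprocess.run(
        [nvcc, "--list-gpu-arch"], capture_output=True, text=True, check=True
    )
    return arch in parse_gpu_archs(result.stdout)

# Hypothetical output, for illustration only.
sample = "compute_75\ncompute_80\ncompute_86\n"
print("compute_61" in parse_gpu_archs(sample))
```

On the real machine, `nvcc_supports("compute_61")` would tell whether a different CUDA toolkit version is needed rather than a different cmake option.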
+
Is b9500 just too old.. or does it fail because this is the Vulkan build?
$ ../../llama-b9500/llama-cli -m gemma-4-12b-it-Q4_0.gguf --model-draft gemma-4-12B-it-MTP-Q8_0.gguf --spec-type draft-mtp --spec-draft-n-max 4 -ngl 999 -fa on --verbose
0.19.888.319 E llama_model_load: error loading model: unknown model architecture: 'gemma4-assistant'
0.19.888.322 E llama_model_load_from_file_impl: failed to load model
0.19.888.324 E srv load_model: failed to load draft model, 'gemma-4-12B-it-MTP-Q8_0.gguf'
With b9553 it runs.
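The b9500 error says the draft file declares an architecture ('gemma4-assistant') that this build does not know. To see what a GGUF file declares without loading it, something like the sketch below can read the `general.architecture` key straight from the header. It assumes the little-endian GGUF v2/v3 layout (magic, version, tensor count, key/value count, then key/value pairs); the function names are mine.

```python
import struct

GGUF_STRING, GGUF_ARRAY = 8, 9
# Byte sizes of the fixed-width GGUF value types (type id -> size in bytes).
SCALAR_SIZE = {0: 1, 1: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 7: 1, 10: 8, 11: 8, 12: 8}

def _read_string(f):
    """GGUF string: uint64 length followed by that many UTF-8 bytes."""
    (length,) = struct.unpack("<Q", f.read(8))
    return f.read(length).decode("utf-8")

def _skip_value(f, vtype):
    """Advance the file position past one metadata value of the given type."""
    if vtype == GGUF_STRING:
        _read_string(f)
    elif vtype == GGUF_ARRAY:
        elem_type, count = struct.unpack("<IQ", f.read(12))
        for _ in range(count):
            _skip_value(f, elem_type)
    else:
        f.seek(SCALAR_SIZE[vtype], 1)

def gguf_architecture(path):
    """Return the general.architecture string of a GGUF file, or None if absent."""
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        _version, _tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))
        for _ in range(kv_count):
            key = _read_string(f)
            (vtype,) = struct.unpack("<I", f.read(4))
            if key == "general.architecture" and vtype == GGUF_STRING:
                return _read_string(f)
            _skip_value(f, vtype)
    return None

# Usage (path as in the commands above):
#   gguf_architecture("gemma-4-12B-it-MTP-Q8_0.gguf")
```

If the returned name is not one the installed llama.cpp build recognizes, upgrading the build (as happened here with b9553) is the thing to try first.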
1080 ti 11GB / -sm none
$ ../../llama-b9553/llama-cli -m gemma-4-12b-it-Q4_0.gguf --model-draft gemma-4-12B-it-MTP-Q8_0.gguf --spec-type draft-mtp --spec-draft-n-max 4 -ngl 999 -fa on -sm none
Q4_0 [ Prompt: 48.6 t/s | Generation: 42.6 t/s ] [ Prompt: 231.5 t/s | Generation: 36.6 t/s ] [ Prompt: 241.1 t/s | Generation: 34.0 t/s ]
UD_Q2_K_XL [ Prompt: 5.0 t/s | Generation: 21.1 t/s ] [ Prompt: 80.7 t/s | Generation: 29.2 t/s ] [ Prompt: 45.0 t/s | Generation: 24.4 t/s ]
1080 ti 11GB / -sm layer
$ ../../llama-b9553/llama-cli -m gemma-4-12b-it-Q4_0.gguf --model-draft gemma-4-12B-it-MTP-Q8_0.gguf --spec-type draft-mtp --spec-draft-n-max 4 -ngl 999 -fa on
Q4_0 [ Prompt: 66.8 t/s | Generation: 28.5 t/s ] [ Prompt: 126.1 t/s | Generation: 19.3 t/s ] [ Prompt: 88.2 t/s | Generation: 16.3 t/s ]
UD_Q2_K_XL [ Prompt: 36.5 t/s | Generation: 24.6 t/s ] [ Prompt: 32.1 t/s | Generation: 17.1 t/s ] [ Prompt: 47.3 t/s | Generation: 12.6 t/s ] (blew up once)
>>>>> For reference >>>>>
Hardware: 1080 ti, -sm none
gemma-4 12B it Q4_0.gguf Reading Generation 25 tokens 0.9s 27.94 t/s
gemma-4 12B it Q4_0.gguf Reading Generation 255 tokens 8.9s 28.78 t/s
gemma-4 12B it Q4_0.gguf Reading Generation 1,404 tokens 55s 25.45 t/s
gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 29 tokens 1.2s 23.71 t/s
gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 373 tokens 16s 22.28 t/s
gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 806 tokens 37s 21.34 t/s (blew up)
Hardware: 1080 ti, -sm layer
gemma-4 12B it Q4_0.gguf Reading Generation 25 tokens 0.8s 31.04 t/s
gemma-4 12B it Q4_0.gguf Reading Generation 265 tokens 9.0s 29.60 t/s
gemma-4 12B it Q4_0.gguf Reading Generation 1,340 tokens 54s 24.43 t/s
gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 31 tokens 1.3s 24.16 t/s
gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 263 tokens 11s 23.70 t/s
gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 620 tokens 29s 20.70 t/s (blew up)
2026.06.04 - [Program usage/AI programs] - gemma 12b, tesla t4 16GB / 1080 ti 11GB * 2
<<<< For reference <<<<
+
2026.06.18
gemma-4-12b-it-Q4_0.gguf + gemma-4-12B-it-MTP-Q8_0.gguf
| Generation (t/s) | No MTP | MTP 5 | MTP 4 | MTP 3 | MTP 2 | MTP 1 |
|---|---|---|---|---|---|---|
| Short prompt | 31.4 | 37.6 | 37.0 | 40.2 | 42.3 | 42.5 |
| Medium prompt | 30.0 | 31.6 | 37.1 | 37.8 | 39.6 | 39.7 |
| Long prompt | 28.5 | 29.6 | 35.4 | 34.4 | 37.1 | 37.0 |
gemma-4-12b-it-UD-Q2_K_XL.gguf + gemma-4-12B-it-MTP-Q8_0.gguf
| Generation (t/s) | No MTP | MTP 5 | MTP 4 | MTP 3 | MTP 2 | MTP 1 |
|---|---|---|---|---|---|---|
| Short prompt | 24.8 | 27.0 | 29.7 | 32.8 | 32.2 | 32.9 |
| Medium prompt | 24.2 | 24.3 | 26.8 | 29.3 | 31.0 | 31.2 |
| Long prompt | 23.0 | 22.2 | 26.7 | 26.0 | 30.7 | 28.7 |
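To put a number on the gain, here is a short sketch that turns the two tables above into percentage speedups over the run without MTP and picks the best --spec-draft-n-max per prompt. The values are copied from the tables; the dictionary layout and function names are just my own arrangement.

```python
# Generation speed (t/s) copied from the two tables above.
# Inner key = --spec-draft-n-max value, None = run without MTP.
Q4_0 = {
    "short":  {None: 31.4, 5: 37.6, 4: 37.0, 3: 40.2, 2: 42.3, 1: 42.5},
    "medium": {None: 30.0, 5: 31.6, 4: 37.1, 3: 37.8, 2: 39.6, 1: 39.7},
    "long":   {None: 28.5, 5: 29.6, 4: 35.4, 3: 34.4, 2: 37.1, 1: 37.0},
}
UD_Q2_K_XL = {
    "short":  {None: 24.8, 5: 27.0, 4: 29.7, 3: 32.8, 2: 32.2, 1: 32.9},
    "medium": {None: 24.2, 5: 24.3, 4: 26.8, 3: 29.3, 2: 31.0, 1: 31.2},
    "long":   {None: 23.0, 5: 22.2, 4: 26.7, 3: 26.0, 2: 30.7, 1: 28.7},
}

def speedup_pct(row, n):
    """Percentage change in generation speed for n-max = n versus no MTP."""
    return (row[n] / row[None] - 1.0) * 100.0

def best_setting(row):
    """Return (n-max, speedup %) for the fastest MTP run in one table row."""
    n = max((k for k in row if k is not None), key=lambda k: row[k])
    return n, speedup_pct(row, n)

for name, table in (("Q4_0", Q4_0), ("UD-Q2_K_XL", UD_Q2_K_XL)):
    for prompt, row in table.items():
        n, pct = best_setting(row)
        print(f"{name:11s} {prompt:6s} best n-max={n} ({pct:+.0f}% vs no MTP)")
```

On these numbers the best setting is always n-max 1 or 2, and the gain lands at roughly 29% to 35%; n-max 5 helps the least and is slightly slower than no MTP for the long prompt on UD-Q2_K_XL.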
Prompts (typed in Korean, reproduced verbatim; the same three are used for every run below):
Short: 안녕? ("Hi?")
Medium: 너에 대해 설명해줘 ("Tell me about yourself")
Long: 파이썬으로 셀레니움을 통해 웹을 서칭하고 텍스트만 추출하고 makrdown 으로 변환후 md 파일과 pdf로 저장하는 기능을 구현해줘 ("In Python, implement a feature that searches the web with Selenium, extracts only the text, converts it to markdown, and saves it as an md file and a PDF")

$ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-12b/gemma-4-12b-it-Q4_0.gguf -sm none
Short: [ Prompt: 77.6 t/s | Generation: 31.1 t/s ] [ Prompt: 147.7 t/s | Generation: 31.4 t/s ]
Medium: [ Prompt: 197.2 t/s | Generation: 30.0 t/s ]
Long: [ Prompt: 254.4 t/s | Generation: 28.5 t/s ]

$ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-12b/gemma-4-12b-it-Q4_0.gguf --model-draft /mnt/Downloads/model/gemma4-12b/gemma-4-12B-it-MTP-Q8_0.gguf --spec-type draft-mtp --spec-draft-n-max 5 -fit off -ngl 999 -fa on -sm none
Short: [ Prompt: 74.1 t/s | Generation: 42.6 t/s ] [ Prompt: 127.1 t/s | Generation: 37.6 t/s ]
Medium: [ Prompt: 126.8 t/s | Generation: 31.6 t/s ]
Long: [ Prompt: 240.4 t/s | Generation: 29.6 t/s ]

$ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-12b/gemma-4-12b-it-Q4_0.gguf --model-draft /mnt/Downloads/model/gemma4-12b/gemma-4-12B-it-MTP-Q8_0.gguf --spec-type draft-mtp --spec-draft-n-max 4 -fit off -ngl 999 -fa on -sm none
Short: [ Prompt: 75.8 t/s | Generation: 45.9 t/s ] [ Prompt: 115.5 t/s | Generation: 37.0 t/s ]
Medium: [ Prompt: 161.9 t/s | Generation: 37.1 t/s ]
Long: [ Prompt: 257.1 t/s | Generation: 35.4 t/s ]

$ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-12b/gemma-4-12b-it-Q4_0.gguf --model-draft /mnt/Downloads/model/gemma4-12b/gemma-4-12B-it-MTP-Q8_0.gguf --spec-type draft-mtp --spec-draft-n-max 3 -fit off -ngl 999 -fa on -sm none
Short: [ Prompt: 74.7 t/s | Generation: 42.7 t/s ] [ Prompt: 198.6 t/s | Generation: 40.2 t/s ]
Medium: [ Prompt: 146.5 t/s | Generation: 37.8 t/s ]
Long: [ Prompt: 248.9 t/s | Generation: 34.4 t/s ]

$ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-12b/gemma-4-12b-it-Q4_0.gguf --model-draft /mnt/Downloads/model/gemma4-12b/gemma-4-12B-it-MTP-Q8_0.gguf --spec-type draft-mtp --spec-draft-n-max 2 -fit off -ngl 999 -fa on -sm none
Short: [ Prompt: 75.1 t/s | Generation: 44.5 t/s ] [ Prompt: 115.3 t/s | Generation: 42.3 t/s ]
Medium: [ Prompt: 203.8 t/s | Generation: 39.6 t/s ]
Long: [ Prompt: 250.9 t/s | Generation: 37.1 t/s ]

$ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-12b/gemma-4-12b-it-Q4_0.gguf --model-draft /mnt/Downloads/model/gemma4-12b/gemma-4-12B-it-MTP-Q8_0.gguf --spec-type draft-mtp --spec-draft-n-max 1 -fit off -ngl 999 -fa on -sm none
Short: [ Prompt: 75.8 t/s | Generation: 41.5 t/s ] [ Prompt: 200.5 t/s | Generation: 42.5 t/s ]
Medium: [ Prompt: 132.7 t/s | Generation: 39.7 t/s ]
Long: [ Prompt: 249.6 t/s | Generation: 37.0 t/s ]
$ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf -sm none
[ Prompt: 50.7 t/s | Generation: 23.3 t/s ] [ Prompt: 136.9 t/s | Generation: 24.8 t/s ]
[ Prompt: 68.8 t/s | Generation: 24.2 t/s ]
[ Prompt: 96.2 t/s | Generation: 23.0 t/s ] blew up

$ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf --model-draft /mnt/Downloads/model/gemma4-12b/gemma-4-12B-it-MTP-Q8_0.gguf --spec-type draft-mtp --spec-draft-n-max 5 -fit off -ngl 999 -fa on -sm none
[ Prompt: 50.0 t/s | Generation: 31.8 t/s ] [ Prompt: 142.0 t/s | Generation: 27.0 t/s ]
[ Prompt: 70.4 t/s | Generation: 24.3 t/s ]
[ Prompt: 90.8 t/s | Generation: 22.2 t/s ]

$ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf --model-draft /mnt/Downloads/model/gemma4-12b/gemma-4-12B-it-MTP-Q8_0.gguf --spec-type draft-mtp --spec-draft-n-max 4 -fit off -ngl 999 -fa on -sm none
[ Prompt: 50.0 t/s | Generation: 31.7 t/s ] [ Prompt: 131.0 t/s | Generation: 29.7 t/s ]
[ Prompt: 53.8 t/s | Generation: 26.8 t/s ]
[ Prompt: 97.6 t/s | Generation: 26.7 t/s ] blew up

$ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf --model-draft /mnt/Downloads/model/gemma4-12b/gemma-4-12B-it-MTP-Q8_0.gguf --spec-type draft-mtp --spec-draft-n-max 3 -fit off -ngl 999 -fa on -sm none
[ Prompt: 50.0 t/s | Generation: 31.8 t/s ] [ Prompt: 123.8 t/s | Generation: 32.8 t/s ]
[ Prompt: 62.0 t/s | Generation: 29.3 t/s ]
[ Prompt: 106.4 t/s | Generation: 26.0 t/s ]

$ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf --model-draft /mnt/Downloads/model/gemma4-12b/gemma-4-12B-it-MTP-Q8_0.gguf --spec-type draft-mtp --spec-draft-n-max 2 -fit off -ngl 999 -fa on -sm none
[ Prompt: 33.7 t/s | Generation: 26.5 t/s ] [ Prompt: 159.8 t/s | Generation: 32.2 t/s ]
[ Prompt: 73.9 t/s | Generation: 31.0 t/s ]
[ Prompt: 96.7 t/s | Generation: 30.7 t/s ] blew up

$ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-12b/gemma-4-12b-it-UD-Q2_K_XL.gguf --model-draft /mnt/Downloads/model/gemma4-12b/gemma-4-12B-it-MTP-Q8_0.gguf --spec-type draft-mtp --spec-draft-n-max 1 -fit off -ngl 999 -fa on -sm none
[ Prompt: 40.7 t/s | Generation: 32.3 t/s ] [ Prompt: 139.6 t/s | Generation: 32.9 t/s ]
[ Prompt: 69.6 t/s | Generation: 31.2 t/s ]
[ Prompt: 92.4 t/s | Generation: 28.7 t/s ] blew up
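Typing out twelve nearly identical command lines by hand invites mistakes, so next time the sweep could be generated. A minimal sketch that only builds and prints the command lines (it does not run llama-cli); the paths and flags are the ones used in the runs above, while the function names are mine.

```python
import shlex

LLAMA_CLI = "/mnt/Downloads/llama-b9553/llama-cli"
MODEL_DIR = "/mnt/Downloads/model/gemma4-12b"
DRAFT_MODEL = "gemma-4-12B-it-MTP-Q8_0.gguf"

def mtp_command(model, n_max):
    """Build the argv for one run; n_max=None means the baseline without a draft model."""
    argv = [LLAMA_CLI, "--model", f"{MODEL_DIR}/{model}"]
    if n_max is not None:
        argv += [
            "--model-draft", f"{MODEL_DIR}/{DRAFT_MODEL}",
            "--spec-type", "draft-mtp",
            "--spec-draft-n-max", str(n_max),
            "-fit", "off", "-ngl", "999", "-fa", "on",
        ]
    argv += ["-sm", "none"]
    return argv

def sweep(model, n_values=(None, 5, 4, 3, 2, 1)):
    """One command per n-max value, in the same order as the table columns."""
    return [mtp_command(model, n) for n in n_values]

for model in ("gemma-4-12b-it-Q4_0.gguf", "gemma-4-12b-it-UD-Q2_K_XL.gguf"):
    for argv in sweep(model):
        print(shlex.join(argv))
```

Each printed line can be pasted into the shell as is, or the argv lists could be handed to `subprocess.run` once the prompts are scripted as well.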