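I converted the model and ran it on my own machine, and there is... no speedup?
Maybe my graphics card is just too weak. If that's not it, maybe I got the conversion wrong,
or maybe llama.cpp doesn't actually support this?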
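Generation speed in t/s for the first prompt of each run. "MTP n" means `--spec-draft-n-max n`; "-" means not run.

| Draft model | No MTP | MTP 8 | MTP 4 | MTP 3 | MTP 2 | MTP 1 |
|---|---|---|---|---|---|---|
| Self-converted | 61.1 | 18.6 | 40.9 | 58.4 | 55.6 | 61.7 |
| unsloth | 61.1 | - | 45.0 | 58.6 | 62.4 | 68.4 |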
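To make the comparison easier to read, the generation speeds in the table can be expressed as ratios against the no-MTP baseline. The following is a minimal sketch: the numbers are copied from the table, while the names `BASELINE_TPS`, `RESULTS`, and `speedup_table` are made up for this illustration.

```python
# Generation speeds (t/s) for the first prompt, copied from the table.
# The names below are illustrative, not part of llama.cpp.
BASELINE_TPS = 61.1  # run without MTP

RESULTS = {
    # key: --spec-draft-n-max value, value: generation speed in t/s
    "self-converted": {8: 18.6, 4: 40.9, 3: 58.4, 2: 55.6, 1: 61.7},
    "unsloth": {4: 45.0, 3: 58.6, 2: 62.4, 1: 68.4},
}


def speedup_table(results, baseline):
    """Return {model: {n_max: ratio}}; a ratio above 1.0 is faster than baseline."""
    return {
        model: {n_max: round(tps / baseline, 2) for n_max, tps in runs.items()}
        for model, runs in results.items()
    }


if __name__ == "__main__":
    for model, ratios in speedup_table(RESULTS, BASELINE_TPS).items():
        for n_max, ratio in sorted(ratios.items(), reverse=True):
            print(f"{model:15s} n-max={n_max}: {ratio:.2f}x")
```

With these numbers, only the self-converted draft at n-max 1 and the unsloth draft at n-max 1 and 2 come out above 1.0, and the best case is roughly 1.12x (unsloth, n-max 1).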
-------
Baseline (no MTP)
| $ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-e4b/gemma-4-E4B-it-Q4_K_M.gguf -mm ./model/gemma4-e4b/mmproj-F16.gguf -sm none #--reasoning off ...wnloads/llama-b9553/llama-cli 6435MiB > 안녕? [ Prompt: 105.7 t/s | Generation: 61.1 t/s ] > 빨라? [ Prompt: 51.1 t/s | Generation: 60.6 t/s ] |
Self-converted draft model (not quantized)
| $ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-e4b/gemma-4-E4B-it-Q4_K_M.gguf -mm ./model/gemma4-e4b/mmproj-F16.gguf --model-draft ./gemma-4-E4B-it-assistant/gemma-4-E4B-it-assistant.gguf --spec-type draft-mtp --spec-draft-n-max 8 -fit off -ngl 999 -fa on -sm none #--reasoning off ...wnloads/llama-b9553/llama-cli 6735MiB > 안녕? [ Prompt: 101.1 t/s | Generation: 18.6 t/s ] > 빨라? [ Prompt: 351.2 t/s | Generation: 16.9 t/s ] |
| $ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-e4b/gemma-4-E4B-it-Q4_K_M.gguf -mm ./model/gemma4-e4b/mmproj-F16.gguf --model-draft ./gemma-4-E4B-it-assistant/gemma-4-E4B-it-assistant.gguf --spec-type draft-mtp --spec-draft-n-max 4 -fit off -ngl 999 -fa on -sm none #--reasoning off ...wnloads/llama-b9553/llama-cli 6735MiB > 안녕? [ Prompt: 292.5 t/s | Generation: 40.9 t/s ] > 빨라? [ Prompt: 207.7 t/s | Generation: 46.6 t/s ] |
| $ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-e4b/gemma-4-E4B-it-Q4_K_M.gguf -mm ./model/gemma4-e4b/mmproj-F16.gguf --model-draft ./gemma-4-E4B-it-assistant/gemma-4-E4B-it-assistant.gguf --spec-type draft-mtp --spec-draft-n-max 3 -fit off -ngl 999 -fa on -sm none #--reasoning off ...wnloads/llama-b9553/llama-cli 6735MiB > 안녕? [ Prompt: 398.8 t/s | Generation: 58.4 t/s ] > 빨라? [ Prompt: 236.3 t/s | Generation: 60.9 t/s ] |
| $ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-e4b/gemma-4-E4B-it-Q4_K_M.gguf -mm ./model/gemma4-e4b/mmproj-F16.gguf --model-draft ./gemma-4-E4B-it-assistant/gemma-4-E4B-it-assistant.gguf --spec-type draft-mtp --spec-draft-n-max 2 -fit off -ngl 999 -fa on -sm none #--reasoning off ...wnloads/llama-b9553/llama-cli 6735MiB > 안녕? [ Prompt: 360.7 t/s | Generation: 55.6 t/s ] > 빨라? [ Prompt: 284.9 t/s | Generation: 62.7 t/s ] |
| $ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-e4b/gemma-4-E4B-it-Q4_K_M.gguf -mm ./model/gemma4-e4b/mmproj-F16.gguf --model-draft ./gemma-4-E4B-it-assistant/gemma-4-E4B-it-assistant.gguf --spec-type draft-mtp --spec-draft-n-max 1 -fit off -ngl 999 -fa on -sm none #--reasoning off ...wnloads/llama-b9553/llama-cli 6735MiB > 안녕? [ Prompt: 314.1 t/s | Generation: 61.7 t/s ] > 빨라? [ Prompt: 441.2 t/s | Generation: 63.7 t/s ] |
unsloth model
[Link: https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF]
| $ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-e4b/gemma-4-E4B-it-Q4_K_M.gguf -mm ./model/gemma4-e4b/mmproj-F16.gguf --model-draft ./gemma-4-E4B-it-assistant/mtp-gemma-4-E4B-it.gguf --spec-type draft-mtp --spec-draft-n-max 4 -fit off -ngl 999 -fa on -sm none #--reasoning off ...wnloads/llama-b9553/llama-cli 6666MiB > 안녕? [ Prompt: 42.4 t/s | Generation: 45.0 t/s ] > 빨라? [ Prompt: 302.6 t/s | Generation: 47.4 t/s ] |
| $ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-e4b/gemma-4-E4B-it-Q4_K_M.gguf -mm ./model/gemma4-e4b/mmproj-F16.gguf --model-draft ./gemma-4-E4B-it-assistant/mtp-gemma-4-E4B-it.gguf --spec-type draft-mtp --spec-draft-n-max 3 -fit off -ngl 999 -fa on -sm none #--reasoning off ...wnloads/llama-b9553/llama-cli 6666MiB > 안녕? [ Prompt: 174.0 t/s | Generation: 58.6 t/s ] > 빨라? [ Prompt: 327.7 t/s | Generation: 60.2 t/s ] |
| $ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-e4b/gemma-4-E4B-it-Q4_K_M.gguf -mm ./model/gemma4-e4b/mmproj-F16.gguf --model-draft ./gemma-4-E4B-it-assistant/mtp-gemma-4-E4B-it.gguf --spec-type draft-mtp --spec-draft-n-max 2 -fit off -ngl 999 -fa on -sm none #--reasoning off ...wnloads/llama-b9553/llama-cli 6666MiB > 안녕? [ Prompt: 98.5 t/s | Generation: 62.4 t/s ] > 빨라? [ Prompt: 331.4 t/s | Generation: 64.7 t/s ] |
| $ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-e4b/gemma-4-E4B-it-Q4_K_M.gguf -mm ./model/gemma4-e4b/mmproj-F16.gguf --model-draft ./gemma-4-E4B-it-assistant/mtp-gemma-4-E4B-it.gguf --spec-type draft-mtp --spec-draft-n-max 1 -fit off -ngl 999 -fa on -sm none #--reasoning off ...wnloads/llama-b9553/llama-cli 6666MiB > 안녕? [ Prompt: 168.7 t/s | Generation: 68.4 t/s ] > 빨라? [ Prompt: 343.2 t/s | Generation: 67.2 t/s ] |
[Link: https://huggingface.co/google/gemma-4-E4B-it-assistant]
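Every run in the logs above reports speed in the same `[ Prompt: X t/s | Generation: Y t/s ]` form, so the numbers could be pulled out with a script instead of copied by hand. This is only a rough sketch: the regular expression is inferred from the log lines in this post rather than from any documented llama.cpp output format, and the function names are made up for this example.

```python
import re
from statistics import mean

# Pattern inferred from the log lines in this post (an assumption, not a
# documented llama.cpp format): "[ Prompt: 105.7 t/s | Generation: 61.1 t/s ]"
TIMING_RE = re.compile(
    r"\[\s*Prompt:\s*([\d.]+)\s*t/s\s*\|\s*Generation:\s*([\d.]+)\s*t/s\s*\]"
)


def parse_timings(log_text):
    """Return a list of (prompt_tps, generation_tps) tuples found in log_text."""
    return [(float(p), float(g)) for p, g in TIMING_RE.findall(log_text)]


def mean_generation_tps(log_text):
    """Average the generation speed over all timing lines in log_text."""
    timings = parse_timings(log_text)
    if not timings:
        raise ValueError("no timing lines found")
    return mean(g for _, g in timings)


# Timing lines copied from the baseline run in this post.
SAMPLE = """> 안녕? [ Prompt: 105.7 t/s | Generation: 61.1 t/s ]
> 빨라? [ Prompt: 51.1 t/s | Generation: 60.6 t/s ]"""

if __name__ == "__main__":
    print(parse_timings(SAMPLE))
    print(f"mean generation: {mean_generation_tps(SAMPLE):.2f} t/s")
```

Averaging over both prompts instead of reading only the first one would smooth out some run-to-run noise, though two short prompts per setting is still a very small sample.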
