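I converted the model and ran it on my own machine, and there is... no performance difference?

Maybe it's because my graphics card is weak.. or if it's not that.. maybe I did the conversion wrong,

or maybe llama.cpp just doesn't support it?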
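One more thing worth keeping in mind while reading the numbers: the usual back-of-the-envelope model for speculative decoding says that a longer draft only pays off when most drafted tokens are accepted. The sketch below is that textbook model, not anything taken from llama.cpp. The acceptance probability and the relative draft cost are made-up illustrative values (nothing here was measured), and it assumes a verification pass costs about the same as one normal decode step.

```python
def expected_tokens(accept_prob: float, n_draft: int) -> float:
    """Expected tokens produced per target-model pass, assuming each drafted
    token is accepted independently with probability accept_prob."""
    if accept_prob >= 1.0:
        return float(n_draft + 1)
    return (1.0 - accept_prob ** (n_draft + 1)) / (1.0 - accept_prob)


def estimated_speedup(accept_prob: float, n_draft: int, draft_cost: float) -> float:
    """Rough speedup over plain decoding.

    draft_cost is the cost of drafting one token relative to one
    target-model pass (hypothetical value, not measured).
    """
    return expected_tokens(accept_prob, n_draft) / (n_draft * draft_cost + 1.0)


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show the shape of the curve.
    ACCEPT_PROB = 0.5
    DRAFT_COST = 0.2
    for n in (1, 2, 3, 4, 8):
        print(f"n-max={n}: ~{estimated_speedup(ACCEPT_PROB, n, DRAFT_COST):.2f}x")
```

With inputs like these the estimate drops below 1x for long drafts, so a slowdown at a large draft length does not by itself prove the conversion was wrong.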

 

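Generation speed (t/s, first prompt of each run). "MTP x" = no draft model, "MTP n" = --spec-draft-n-max n, "-" = not measured.

                MTP x  MTP 8  MTP 4  MTP 3  MTP 2  MTP 1
self-converted   61.1   18.6   40.9   58.4   55.6   61.7
unsloth          61.1      -   45.0   58.6   62.4   68.4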
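To make the table easier to read, here is a small sketch that turns each figure into a multiple of the baseline. The numbers are copied from the table; the names `runs` and `relative_speed` are just for this example.

```python
# Generation speed (t/s) for the first prompt of each run, copied from the table.
BASELINE_TPS = 61.1  # run without a draft model

runs = {
    "self-converted": {8: 18.6, 4: 40.9, 3: 58.4, 2: 55.6, 1: 61.7},
    "unsloth": {4: 45.0, 3: 58.6, 2: 62.4, 1: 68.4},  # n-max 8 was not measured
}


def relative_speed(tps: float, baseline: float = BASELINE_TPS) -> float:
    """Generation speed as a multiple of the baseline (1.0 means no change)."""
    return tps / baseline


if __name__ == "__main__":
    for name, by_n_max in runs.items():
        for n_max, tps in sorted(by_n_max.items(), reverse=True):
            ratio = relative_speed(tps)
            print(f"{name:15s} n-max={n_max}: {tps:5.1f} t/s ({ratio:.2f}x baseline)")
```

Seen this way, only the unsloth draft at n-max 1 is clearly above the baseline, and the margin is small enough that two prompts per setting cannot rule out noise.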

 

-------

Baseline (no draft model). The two prompts used in every run are 안녕? ("Hi?") and 빨라? ("Is it fast?").

$ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-e4b/gemma-4-E4B-it-Q4_K_M.gguf -mm ./model/gemma4-e4b/mmproj-F16.gguf  -sm none #--reasoning off                                                                                                   
...wnloads/llama-b9553/llama-cli       6435MiB
> 안녕?
[ Prompt: 105.7 t/s | Generation: 61.1 t/s ]

> 빨라?
[ Prompt: 51.1 t/s | Generation: 60.6 t/s ]
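Copying the speeds by hand gets tedious, so here is a minimal sketch for pulling them out of a saved log. It assumes the status lines look exactly like the `[ Prompt: ... t/s | Generation: ... t/s ]` lines above; `parse_speeds` is a name made up for this example.

```python
import re

# Matches lines such as: [ Prompt: 105.7 t/s | Generation: 61.1 t/s ]
SPEED_RE = re.compile(
    r"\[\s*Prompt:\s*([\d.]+)\s*t/s\s*\|\s*Generation:\s*([\d.]+)\s*t/s\s*\]"
)


def parse_speeds(log_text: str) -> list[tuple[float, float]]:
    """Return (prompt_tps, generation_tps) pairs in the order they appear."""
    return [(float(p), float(g)) for p, g in SPEED_RE.findall(log_text)]


# The baseline run from this post, pasted as-is.
sample = """> 안녕?
[ Prompt: 105.7 t/s | Generation: 61.1 t/s ]

> 빨라?
[ Prompt: 51.1 t/s | Generation: 60.6 t/s ]
"""

speeds = parse_speeds(sample)

if __name__ == "__main__":
    for prompt_tps, gen_tps in speeds:
        print(f"prompt {prompt_tps} t/s, generation {gen_tps} t/s")
```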

 

Self-converted (not quantized)

$ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-e4b/gemma-4-E4B-it-Q4_K_M.gguf -mm ./model/gemma4-e4b/mmproj-F16.gguf --model-draft ./gemma-4-E4B-it-assistant/gemma-4-E4B-it-assistant.gguf --spec-type draft-mtp --spec-draft-n-max 8 -fit off -ngl 999 -fa on -sm none #--reasoning off
...wnloads/llama-b9553/llama-cli       6735MiB
> 안녕? 
[ Prompt: 101.1 t/s | Generation: 18.6 t/s ]

> 빨라?
[ Prompt: 351.2 t/s | Generation: 16.9 t/s ]

$ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-e4b/gemma-4-E4B-it-Q4_K_M.gguf -mm ./model/gemma4-e4b/mmproj-F16.gguf --model-draft ./gemma-4-E4B-it-assistant/gemma-4-E4B-it-assistant.gguf --spec-type draft-mtp --spec-draft-n-max 4 -fit off -ngl 999 -fa on -sm none #--reasoning  off                                                                                                                   
...wnloads/llama-b9553/llama-cli       6735MiB
> 안녕?
[ Prompt: 292.5 t/s | Generation: 40.9 t/s ]

> 빨라?
[ Prompt: 207.7 t/s | Generation: 46.6 t/s ]


$ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-e4b/gemma-4-E4B-it-Q4_K_M.gguf -mm ./model/gemma4-e4b/mmproj-F16.gguf --model-draft ./gemma-4-E4B-it-assistant/gemma-4-E4B-it-assistant.gguf --spec-type draft-mtp --spec-draft-n-max 3 -fit off -ngl 999 -fa on -sm none #--reasoning  off                                                                                                                   
...wnloads/llama-b9553/llama-cli       6735MiB
> 안녕? 
[ Prompt: 398.8 t/s | Generation: 58.4 t/s ]

> 빨라?
[ Prompt: 236.3 t/s | Generation: 60.9 t/s ]


$ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-e4b/gemma-4-E4B-it-Q4_K_M.gguf -mm ./model/gemma4-e4b/mmproj-F16.gguf --model-draft ./gemma-4-E4B-it-assistant/gemma-4-E4B-it-assistant.gguf --spec-type draft-mtp --spec-draft-n-max 2 -fit off -ngl 999 -fa on -sm none #--reasoning off                                                                                                                   
...wnloads/llama-b9553/llama-cli       6735MiB
> 안녕?
[ Prompt: 360.7 t/s | Generation: 55.6 t/s ]

> 빨라?
[ Prompt: 284.9 t/s | Generation: 62.7 t/s ]

$ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-e4b/gemma-4-E4B-it-Q4_K_M.gguf -mm ./model/gemma4-e4b/mmproj-F16.gguf --model-draft ./gemma-4-E4B-it-assistant/gemma-4-E4B-it-assistant.gguf --spec-type draft-mtp --spec-draft-n-max 1 -fit off -ngl 999 -fa on -sm none #--reasoning off                                                                                                                   
...wnloads/llama-b9553/llama-cli       6735MiB
> 안녕?
[ Prompt: 314.1 t/s | Generation: 61.7 t/s ]  

> 빨라?
[ Prompt: 441.2 t/s | Generation: 63.7 t/s ]
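The five runs above differ only in `--spec-draft-n-max`, so the command lines can be generated instead of retyped. This sketch only builds and prints the commands; it does not run anything. Paths and flags are copied verbatim from the commands in this post, and `build_command` is a hypothetical helper.

```python
import shlex

# Paths copied from the commands in this post; adjust for another machine.
LLAMA_CLI = "/mnt/Downloads/llama-b9553/llama-cli"
MODEL = "/mnt/Downloads/model/gemma4-e4b/gemma-4-E4B-it-Q4_K_M.gguf"
MMPROJ = "./model/gemma4-e4b/mmproj-F16.gguf"
DRAFT = "./gemma-4-E4B-it-assistant/gemma-4-E4B-it-assistant.gguf"


def build_command(n_max: int, draft: str = DRAFT) -> list[str]:
    """Build the argument list for one run, changing only the draft length."""
    return [
        LLAMA_CLI,
        "--model", MODEL,
        "-mm", MMPROJ,
        "--model-draft", draft,
        "--spec-type", "draft-mtp",
        "--spec-draft-n-max", str(n_max),
        "-fit", "off",
        "-ngl", "999",
        "-fa", "on",
        "-sm", "none",
    ]


if __name__ == "__main__":
    for n_max in (8, 4, 3, 2, 1):
        print(shlex.join(build_command(n_max)))
```

Passing a different `draft` path to `build_command` gives the matching commands for another draft file.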

 

unsloth model

[Link: https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF]

$ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-e4b/gemma-4-E4B-it-Q4_K_M.gguf -mm ./model/gemma4-e4b/mmproj-F16.gguf --model-draft ./gemma-4-E4B-it-assistant/mtp-gemma-4-E4B-it.gguf --spec-type draft-mtp --spec-draft-n-max 4 -fit off -ngl 999 -fa on -sm none #--reasoning off
...wnloads/llama-b9553/llama-cli       6666MiB
> 안녕?
[ Prompt: 42.4 t/s | Generation: 45.0 t/s ]

> 빨라?
[ Prompt: 302.6 t/s | Generation: 47.4 t/s ]


$ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-e4b/gemma-4-E4B-it-Q4_K_M.gguf -mm ./model/gemma4-e4b/mmproj-F16.gguf --model-draft ./gemma-4-E4B-it-assistant/mtp-gemma-4-E4B-it.gguf --spec-type draft-mtp --spec-draft-n-max 3 -fit off -ngl 999 -fa on -sm none #--reasoning off
...wnloads/llama-b9553/llama-cli       6666MiB
> 안녕?
[ Prompt: 174.0 t/s | Generation: 58.6 t/s ]

> 빨라?
[ Prompt: 327.7 t/s | Generation: 60.2 t/s ]


$ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-e4b/gemma-4-E4B-it-Q4_K_M.gguf -mm ./model/gemma4-e4b/mmproj-F16.gguf --model-draft ./gemma-4-E4B-it-assistant/mtp-gemma-4-E4B-it.gguf --spec-type draft-mtp --spec-draft-n-max 2 -fit off -ngl 999 -fa on -sm none #--reasoning off
...wnloads/llama-b9553/llama-cli       6666MiB
> 안녕?
[ Prompt: 98.5 t/s | Generation: 62.4 t/s ]

> 빨라?
[ Prompt: 331.4 t/s | Generation: 64.7 t/s ]  

$ /mnt/Downloads/llama-b9553/llama-cli --model /mnt/Downloads/model/gemma4-e4b/gemma-4-E4B-it-Q4_K_M.gguf -mm ./model/gemma4-e4b/mmproj-F16.gguf --model-draft ./gemma-4-E4B-it-assistant/mtp-gemma-4-E4B-it.gguf --spec-type draft-mtp --spec-draft-n-max 1 -fit off -ngl 999 -fa on -sm none #--reasoning off
...wnloads/llama-b9553/llama-cli       6666MiB
> 안녕?
[ Prompt: 168.7 t/s | Generation: 68.4 t/s ]  


> 빨라?
[ Prompt: 343.2 t/s | Generation: 67.2 t/s ]
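To compare the two draft files side by side, the sketch below averages both prompts for each setting and also works out how much extra memory each draft adds over the baseline run. All figures are copied from the logs in this post. Treating the MiB value printed next to llama-cli as GPU memory is my assumption.

```python
from statistics import mean

BASELINE_MIB = 6435  # memory shown for the run without a draft model

# Generation t/s for (first prompt, second prompt), keyed by --spec-draft-n-max.
# n-max 8 is left out because it was only measured for the self-converted file.
drafts = {
    "self-converted": {
        "mib": 6735,
        "gen_tps": {4: (40.9, 46.6), 3: (58.4, 60.9), 2: (55.6, 62.7), 1: (61.7, 63.7)},
    },
    "unsloth": {
        "mib": 6666,
        "gen_tps": {4: (45.0, 47.4), 3: (58.6, 60.2), 2: (62.4, 64.7), 1: (68.4, 67.2)},
    },
}


def memory_overhead(mib: int, baseline: int = BASELINE_MIB) -> int:
    """Extra memory in MiB compared with the baseline run."""
    return mib - baseline


if __name__ == "__main__":
    for name, info in drafts.items():
        print(f"{name}: +{memory_overhead(info['mib'])} MiB over baseline")
        for n_max, pair in info["gen_tps"].items():
            print(f"  n-max={n_max}: mean generation {mean(pair):.1f} t/s")
```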

 

[Link: https://huggingface.co/google/gemma-4-E4B-it-assistant]

Posted by 구차니