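It downloads everything on its own, and it converts Korean text really well too.

A quick search suggests this is a Korean company: a HYBE subsidiary that seems to be known for voice changers used in games and the like?

I still need to read the license properly, but from a rough pass through a translator it looks like even SaaS use is allowed..

 

Unlike outetts, no build step is needed; it installs with pip alone, so good!

And with auto_download it busily downloads something and takes care of the rest by itself.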

D:\study\llm>pip install supertonic
D:\study\llm>python
Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug  1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from supertonic import TTS
>>> tts = TTS(auto_download=True)
Downloading (incomplete total...): 0.00B [00:00, ?B/s]
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
Fetching 26 files: 100%|███████████████████████████████████████████████████████████████| 26/26 [00:36<00:00,  1.40s/it]
Download complete: : 404MB [00:36, 19.4MB/s]
>>> style = tts.get_voice_style(voice_name="M1")
>>>
>>> text = "A gentle breeze moved through the open window while everyone listened to the story."
>>> wav, duration = tts.synthesize(text, voice_style=style, lang="en")
>>>
>>> tts.save_audio(wav, "output.wav")
>>> print(f"Generated {duration:.2f}s of audio")

>>> text = "안녕? 난 잼미니야 만나서 반가워"
>>> wav, duration = tts.synthesize(text, voice_style=style, lang="ko")
>>> tts.save_audio(wav, "output_ko.wav")
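
The interactive session above can be folded into a small reusable helper. This is a minimal sketch, not the library's documented usage: it relies only on the calls seen in the session (`get_voice_style`, `synthesize`, `save_audio`) and assumes they behave as they did there. The function name `synthesize_to_files` and the `(text, lang, path)` job layout are made up for illustration. The `tts` object is passed in, so the helper itself never imports `supertonic`.

```python
from typing import Iterable, List, Tuple

# (text, language code, output path): a hypothetical layout chosen for this sketch.
Job = Tuple[str, str, str]


def synthesize_to_files(tts, jobs: Iterable[Job], voice_name: str = "M1") -> List[tuple]:
    """Synthesize each job with one shared voice style and save it as a WAV file.

    `tts` is expected to expose the methods used in the session above:
    get_voice_style(voice_name=...), synthesize(text, voice_style=..., lang=...)
    and save_audio(wav, path).
    """
    # Look the voice style up once and reuse it for every sentence.
    style = tts.get_voice_style(voice_name=voice_name)
    results = []
    for text, lang, path in jobs:
        wav, duration = tts.synthesize(text, voice_style=style, lang=lang)
        tts.save_audio(wav, path)
        # Keep the duration exactly as the library returned it.
        results.append((path, duration))
    return results
```

With the real package this would presumably be called as `synthesize_to_files(TTS(auto_download=True), [("A gentle breeze ...", "en", "output.wav"), ("안녕? 난 잼미니야 만나서 반가워", "ko", "output_ko.wav")])` after `from supertonic import TTS`, which mirrors the two runs in the session.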

 

[Link : https://huggingface.co/Supertone/supertonic-3]

[Link : https://www.supertone.ai/ko]

Posted by 구차니

I tried this on Windows, and

I hit a wall right at step 1. If a compiler is required, then check for one first instead of installing things like mad!!! Grr, grr!

D:\study\llm> pip install outetts
      *** CMake configuration failed
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for llama-cpp-python
Failed to build llama-cpp-python
error: failed-wheel-build-for-install

× Failed to build installable wheels for some pyproject.toml based projects
╰─> llama-cpp-python

D:\study\llm>
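
The failure above comes from pip trying to build llama-cpp-python from source, which stops at the CMake configuration step. A quick pre-flight check along the following lines could be run before `pip install outetts`. It is only a sketch: the list of candidate executables is an assumption made here, not something taken from outetts or pip, and finding a tool on PATH does not guarantee that the build will succeed.

```python
import shutil
from typing import Callable, Dict, List, Optional

# Candidate executables: CMake plus common C/C++ compilers (MSVC, GCC, Clang).
# This list is an assumption for the sketch, not taken from outetts or pip.
BUILD_TOOLS = {
    "cmake": ["cmake"],
    "compiler": ["cl", "gcc", "g++", "clang", "clang++"],
}


def find_build_tools(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, Optional[str]]:
    """Return the first executable found on PATH for each tool group, or None."""
    found = {}
    for group, candidates in BUILD_TOOLS.items():
        found[group] = next((p for p in map(which, candidates) if p), None)
    return found


def missing_build_tools(found: Dict[str, Optional[str]]) -> List[str]:
    """List the tool groups for which nothing was found."""
    return [group for group, path in found.items() if path is None]


if __name__ == "__main__":
    missing = missing_build_tools(find_build_tools())
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("CMake and a compiler were found on PATH.")
```

One caveat: as far as I know, MSVC's `cl` is normally on PATH only inside a Visual Studio developer prompt, so a "not found" result in a plain `cmd` window is a hint rather than proof that no compiler is installed.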

[Link : https://github.com/edwko/OuteTTS?tab=readme-ov-file#installation]

[Link : https://huggingface.co/unsloth/Llama-OuteTTS-1.0-1B]

 

 

Running the example
With both of the models generated, the LLM model and the voice decoder model,
we can run the example:

$ build/bin/llama-tts -m  ./models/outetts-0.2-0.5B-q8_0.gguf \
    -mv ./models/wavtokenizer-large-75-f16.gguf \
    -p "Hello world"
...
main: audio written to file 'output.wav'

[Link : https://git.comtegra.pl/ajastrzebski/llama-cpp/-/tree/master/examples/tts]

[Link : https://huggingface.co/OuteAI/OuteTTS-0.2-500M-GGUF/tree/main]

[Link : https://huggingface.co/ggml-org/WavTokenizer/tree/main]
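
For scripted use, the `llama-tts` call quoted above could be wrapped from Python. The sketch below uses only the three flags that appear in that example (`-m`, `-mv`, `-p`); the helper names `build_llama_tts_cmd` and `run_llama_tts` are invented here for illustration and are not part of llama.cpp.

```python
import subprocess
from typing import List


def build_llama_tts_cmd(exe: str, model: str, vocoder: str, prompt: str) -> List[str]:
    """Assemble the argument list using only the flags shown above: -m, -mv, -p."""
    return [exe, "-m", model, "-mv", vocoder, "-p", prompt]


def run_llama_tts(exe: str, model: str, vocoder: str, prompt: str) -> int:
    """Run llama-tts and return its exit code; its log goes straight to the console."""
    cmd = build_llama_tts_cmd(exe, model, vocoder, prompt)
    # Passing a list (no shell) keeps spaces and punctuation in the prompt intact.
    return subprocess.run(cmd, check=False).returncode
```

With the paths from the quoted example this would be `run_llama_tts("build/bin/llama-tts", "./models/outetts-0.2-0.5B-q8_0.gguf", "./models/wavtokenizer-large-75-f16.gguf", "Hello world")`. According to that example's output, the audio is written to `output.wav`.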

 

D:\study\llm\llama-b9093-bin-win-cuda-12.4-x64>llama-tts -m ..\OuteTTS-0.3-500M-Q8_0.gguf  -mv ..\WavTokenizer-Large-75-F16.gguf -p "hello i am sam. how are you?"
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 6143 MiB):
  Device 0: NVIDIA GeForce GTX 1060 6GB, compute capability 6.1, VMM: yes, VRAM: 6143 MiB
load_backend: loaded CUDA backend from D:\study\llm\llama-b9093-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\study\llm\llama-b9093-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\study\llm\llama-b9093-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
common_params_fit_impl: getting device memory data for initial parameters:
common_memory_breakdown_print: | memory breakdown [MiB]   | total   free    self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - CUDA0 (GTX 1060 6GB) |  6143 = 5197 + ( 931 =   506 +      96 +     329) +          15 |
common_memory_breakdown_print: |   - Host                 |                  162 =   143 +       0 +      19                |
common_params_fit_impl: projected to use 931 MiB of device memory vs. 5197 MiB of free device memory
common_params_fit_impl: will leave 4265 >= 1024 MiB of free device memory, no changes needed
common_fit_params: successfully fit params to free device memory
common_fit_params: fitting params to free memory took 0.44 seconds
llama_model_loader: loaded meta data with 25 key-value pairs and 290 tensors from ..\OuteTTS-0.3-500M-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = OuteTTS 0.3 500M
llama_model_loader: - kv   3:                           general.basename str              = OuteTTS-0.3
llama_model_loader: - kv   4:                         general.size_label str              = 500M
llama_model_loader: - kv   5:                          qwen2.block_count u32              = 24
llama_model_loader: - kv   6:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   7:                     qwen2.embedding_length u32              = 896
llama_model_loader: - kv   8:                  qwen2.feed_forward_length u32              = 4864
llama_model_loader: - kv   9:                 qwen2.attention.head_count u32              = 14
llama_model_loader: - kv  10:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  11:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,157696]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,157696]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 151644
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 151645
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = outetts-0.3
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - kv  24:                          general.file_type u32              = 7
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q8_0:  169 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 506.02 MiB (8.50 BPW)
llama_prepare_model_devices: using device CUDA0 (NVIDIA GeForce GTX 1060 6GB) (0000:01:00.0) - 5197 MiB free
load: 0 unused tokens
load: control-looking token: 128247 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
load: printing all EOG tokens:
load:   - 128247 ('</s>')
load:   - 151643 ('<|endoftext|>')
load:   - 151645 ('<|im_end|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 5152
load: token to piece cache size = 0.9712 MB
print_info: arch                  = qwen2
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 32768
print_info: n_embd                = 896
print_info: n_embd_inp            = 896
print_info: n_layer               = 24
print_info: n_head                = 14
print_info: n_head_kv             = 2
print_info: n_rot                 = 64
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 64
print_info: n_embd_head_v         = 64
print_info: n_gqa                 = 7
print_info: n_embd_k_gqa          = 128
print_info: n_embd_v_gqa          = 128
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-06
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: f_attn_value_scale    = 0.0000
print_info: n_ff                  = 4864
print_info: n_expert              = 0
print_info: n_expert_used         = 0
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = -1
print_info: rope type             = 2
print_info: rope scaling          = linear
print_info: freq_base_train       = 1000000.0
print_info: freq_scale_train      = 1
print_info: n_ctx_orig_yarn       = 32768
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: model type            = 1B
print_info: model params          = 499.19 M
print_info: general.name          = OuteTTS 0.3 500M
print_info: vocab type            = BPE
print_info: n_vocab               = 157696
print_info: n_merges              = 151387
print_info: BOS token             = 151644 '<|im_start|>'
print_info: EOS token             = 151645 '<|im_end|>'
print_info: EOT token             = 151645 '<|im_end|>'
print_info: PAD token             = 151645 '<|im_end|>'
print_info: LF token              = 198 'Ċ'
print_info: FIM PRE token         = 151659 '<|fim_prefix|>'
print_info: FIM SUF token         = 151661 '<|fim_suffix|>'
print_info: FIM MID token         = 151660 '<|fim_middle|>'
print_info: FIM PAD token         = 151662 '<|fim_pad|>'
print_info: FIM REP token         = 151663 '<|repo_name|>'
print_info: FIM SEP token         = 151664 '<|file_sep|>'
print_info: EOG token             = 128247 '</s>'
print_info: EOG token             = 151643 '<|endoftext|>'
print_info: EOG token             = 151645 '<|im_end|>'
print_info: EOG token             = 151662 '<|fim_pad|>'
print_info: EOG token             = 151663 '<|repo_name|>'
print_info: EOG token             = 151664 '<|file_sep|>'
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 23 repeating layers to GPU
load_tensors: offloaded 25/25 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   143.17 MiB
load_tensors:        CUDA0 model buffer size =   506.07 MiB
..........................................................
common_init_result: added </s> logit bias = -inf
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 8192
llama_context: n_ctx_seq     = 8192
llama_context: n_batch       = 8192
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (8192) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.60 MiB
llama_kv_cache:      CUDA0 KV buffer size =    96.00 MiB
llama_kv_cache: size =   96.00 MiB (  8192 cells,  24 layers,  1/1 seqs), K (f16):   48.00 MiB, V (f16):   48.00 MiB
llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 64
llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 64
sched_reserve: reserving ...
sched_reserve: Flash Attention was auto, set to enabled
sched_reserve: resolving fused Gated Delta Net support:
sched_reserve: fused Gated Delta Net (autoregressive) enabled
sched_reserve: fused Gated Delta Net (chunked) enabled
sched_reserve:      CUDA0 compute buffer size =   329.26 MiB
sched_reserve:  CUDA_Host compute buffer size =    19.51 MiB
sched_reserve: graph nodes  = 823
sched_reserve: graph splits = 2
sched_reserve: reserve took 9.05 ms, sched copies = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
common_params_fit_impl: getting device memory data for initial parameters:
common_memory_breakdown_print: | memory breakdown [MiB]   | total   free    self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - CUDA0 (GTX 1060 6GB) |  6143 = 4255 + ( 496 =   120 +       0 +     376) +        1392 |
common_memory_breakdown_print: |   - Host                 |                   36 =     4 +       0 +      32                |
common_params_fit_impl: projected to use 496 MiB of device memory vs. 4255 MiB of free device memory
common_params_fit_impl: will leave 3758 >= 1024 MiB of free device memory, no changes needed
common_fit_params: successfully fit params to free device memory
common_fit_params: fitting params to free memory took -0.78 seconds
llama_model_loader: loaded meta data with 25 key-value pairs and 161 tensors from ..\WavTokenizer-Large-75-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = wavtokenizer-dec
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = WavTokenizer Large Speech 75token
llama_model_loader: - kv   3:                           general.finetune str              = speech-75token
llama_model_loader: - kv   4:                           general.basename str              = WavTokenizer
llama_model_loader: - kv   5:                         general.size_label str              = large
llama_model_loader: - kv   6:                            general.license str              = mit
llama_model_loader: - kv   7:               wavtokenizer-dec.block_count u32              = 12
llama_model_loader: - kv   8:            wavtokenizer-dec.context_length u32              = 8192
llama_model_loader: - kv   9:          wavtokenizer-dec.embedding_length u32              = 1282
llama_model_loader: - kv  10:      wavtokenizer-dec.attention.head_count u32              = 1
llama_model_loader: - kv  11: wavtokenizer-dec.attention.layer_norm_epsilon f32              = 0.000001
llama_model_loader: - kv  12:                          general.file_type u32              = 1
llama_model_loader: - kv  13:                wavtokenizer-dec.vocab_size u32              = 4096
llama_model_loader: - kv  14:           wavtokenizer-dec.features_length u32              = 512
llama_model_loader: - kv  15:       wavtokenizer-dec.feed_forward_length u32              = 2304
llama_model_loader: - kv  16: wavtokenizer-dec.attention.group_norm_epsilon f32              = 0.000001
llama_model_loader: - kv  17: wavtokenizer-dec.attention.group_norm_groups u32              = 32
llama_model_loader: - kv  18:   wavtokenizer-dec.posnet.embedding_length u32              = 768
llama_model_loader: - kv  19:        wavtokenizer-dec.posnet.block_count u32              = 6
llama_model_loader: - kv  20: wavtokenizer-dec.convnext.embedding_length u32              = 768
llama_model_loader: - kv  21:      wavtokenizer-dec.convnext.block_count u32              = 12
llama_model_loader: - kv  22:          wavtokenizer-dec.attention.causal bool             = false
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = none
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  110 tensors
llama_model_loader: - type  f16:   51 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = F16
print_info: file size   = 124.15 MiB (16.03 BPW)
llama_prepare_model_devices: using device CUDA0 (NVIDIA GeForce GTX 1060 6GB) (0000:01:00.0) - 4255 MiB free
load: adding 4096 dummy tokens
print_info: arch                  = wavtokenizer-dec
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 8192
print_info: n_embd                = 512
print_info: n_embd_inp            = 512
print_info: n_layer               = 12
print_info: n_head                = 1
print_info: n_head_kv             = 1
print_info: n_rot                 = 512
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 512
print_info: n_embd_head_v         = 512
print_info: n_gqa                 = 1
print_info: n_embd_k_gqa          = 512
print_info: n_embd_v_gqa          = 512
print_info: f_norm_eps            = 1.0e-06
print_info: f_norm_rms_eps        = 0.0e+00
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: f_attn_value_scale    = 0.0000
print_info: n_ff                  = 2304
print_info: n_expert              = 0
print_info: n_expert_used         = 0
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 0
print_info: pooling type          = -1
print_info: rope type             = -1
print_info: rope scaling          = linear
print_info: freq_base_train       = 10000.0
print_info: freq_scale_train      = 1
print_info: n_ctx_orig_yarn       = 8192
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: model type            = ?B
print_info: model params          = 64.98 M
print_info: general.name          = WavTokenizer Large Speech 75token
print_info: vocab type            = no vocab
print_info: n_vocab               = 4096
print_info: n_merges              = 0
print_info: max token length      = 0
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 11 repeating layers to GPU
load_tensors: offloaded 13/13 layers to GPU
load_tensors:   CPU_Mapped model buffer size =     4.00 MiB
load_tensors:        CUDA0 model buffer size =   120.15 MiB
.......................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 8192
llama_context: n_ctx_seq     = 8192
llama_context: n_batch       = 8192
llama_context: n_ubatch      = 8192
llama_context: causal_attn   = 0
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context:  CUDA_Host  output buffer size =     0.02 MiB
sched_reserve: reserving ...
sched_reserve: Flash Attention was auto, set to enabled
sched_reserve: resolving fused Gated Delta Net support:
sched_reserve: fused Gated Delta Net (autoregressive) enabled
sched_reserve: fused Gated Delta Net (chunked) enabled
sched_reserve:      CUDA0 compute buffer size =   376.00 MiB
sched_reserve:  CUDA_Host compute buffer size =    32.03 MiB
sched_reserve: graph nodes  = 401
sched_reserve: graph splits = 2
sched_reserve: reserve took 14.06 ms, sched copies = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
sampler seed: 0
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
        top_k = 4, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000, adaptive_target = -1.000, adaptive_decay = 0.900
sampler chain: logits -> top-k -> dist
main: loading done
main: constructing prompt ..
main: prompt: 'hello<|space|>i<|space|>am<|space|>sam<|space|>how<|space|>are<|space|>you'


main: llama tokens: 151667, 198, 1782, 155780, 151929, 152412, 152308, 152585, 152460, 153375, 156777, 198, 74455, 155808, 151799, 151873, 151863, 152446, 152372, 152204, 152728, 152229, 152470, 151970, 153413, 152419, 153334, 153289, 153374, 153199, 152040, 153260, 152721, 152680, 153297, 152419, 153248, 152400, 152691, 153368, 153437, 156777, 198, 1722, 155828, 152607, 152256, 152991, 152299, 152688, 153163, 153016, 152789, 153198, 152712, 151911, 153107, 152623, 152170, 152395, 152852, 152207, 152461, 153321, 153309, 151750, 152137, 153340, 152573, 152267, 153347, 151789, 152681, 153339, 151992, 152512, 151751, 152179, 153434, 153180, 152900, 153440, 152474, 153122, 153129, 151904, 152311, 156777, 198, 1499, 155791, 152276, 152454, 153354, 152544, 153204, 153272, 152708, 153433, 152319, 153226, 153043, 152325, 153267, 152622, 156777, 198, 4250, 155797, 153454, 153342, 151989, 152458, 153420, 152303, 152271, 152827, 153036, 153196, 151708, 153263, 152561, 153207, 152213, 152112, 153204, 151722, 152542, 156777, 198, 19789, 155796, 153353, 153182, 152345, 152471, 152477, 153014, 152002, 152191, 151734, 152312, 152810, 152237, 153224, 153169, 153224, 152244, 153387, 153404, 156777, 198, 16069, 155811, 152265, 151946, 151808, 152412, 152363, 152305, 153156, 152733, 152810, 153157, 152016, 152100, 152069, 153234, 152317, 152589, 152707, 153121, 153341, 152159, 152114, 153156, 153001, 153504, 153376, 152272, 152433, 152325, 151941, 156777, 198, 285, 155788, 152238, 152255, 153427, 152318, 153009, 152381, 152474, 152680, 152157, 153255, 152324, 151682, 156777, 198, 32955, 155804, 153490, 153419, 152364, 152405, 152682, 152206, 152078, 153369, 152725, 153193, 153027, 152946, 152488, 153070, 151883, 152890, 152489, 153144, 153375, 152358, 151685, 152494, 152117, 152740, 156777, 198, 37448, 480, 155840, 151902, 152720, 153377, 152027, 152378, 152821, 153207, 153459, 153028, 153068, 152507, 153255, 152158, 152921, 151958, 152609, 152748, 152822, 152286, 151714, 152730, 
152377, 152353, 152470, 152606, 152162, 152186, 153071, 152244, 153118, 153375, 153018, 152712, 153098, 152976, 152336, 151843, 153202, 152297, 151736, 153380, 153502, 152702, 152115, 153181, 152735, 153277, 153457, 152393, 153112, 152595, 156777, 198, 19098, 155808, 152464, 153452, 152595, 153312, 151937, 151933, 153197, 152239, 153163, 152922, 153402, 152034, 152591, 153438, 152215, 151673, 152005, 151785, 152642, 151924, 153278, 151805, 151974, 153482, 152718, 152862, 153347, 156777, 198, 72, 155780, 151795, 152111, 152746, 152377, 153471, 152309, 156777, 198, 19016, 155788, 153181, 152271, 152190, 152842, 152224, 152701, 152939, 152536, 152091, 151815, 152733, 151672, 156777, 198, 14689, 155788, 152291, 152072, 152942, 151734, 153042, 153504, 152589, 153333, 151839, 151941, 153038, 153180, 156777, 198, 36996, 8303, 155832, 152231, 152256, 152835, 152801, 152985, 153400, 152393, 152818, 152765, 152249, 152600, 151699, 152302, 152752, 153018, 153009, 151992, 153054, 152847, 153354, 153228, 152662, 153355, 152532, 153393, 151782, 152458, 152048, 152757, 152428, 153195, 151906, 153006, 153178, 153250, 152331, 152284, 152780, 153138, 153319, 151980, 153142, 152418, 152228, 152733, 156777, 198, 9096, 155801, 151698, 153321, 152217, 153039, 152935, 153400, 152122, 152531, 153106, 152169, 152892, 152957, 151851, 152427, 152826, 152451, 151851, 152901, 152885, 152594, 153446, 153080, 156777, 198, 14689, 155795, 152658, 151700, 153321, 152450, 152530, 153191, 151673, 151690, 151698, 152714, 152846, 152981, 153171, 153384, 153364, 153188, 153246, 156777, 198, 1055, 155779, 151869, 152388, 152711, 153334, 151736, 156777, 198, 1782, 155780, 153483, 153240, 152241, 152558, 152697, 153046, 156777, 198, 5804, 1363, 155820, 152941, 152764, 152605, 153034, 153434, 153372, 153347, 151887, 152453, 152758, 152133, 152510, 152694, 152431, 152321, 153088, 152676, 152223, 152581, 152459, 152015, 152502, 153063, 152712, 153294, 153451, 153032, 152903, 152859, 152989, 151748, 152669, 
152661, 152650, 152409, 151861, 156777, 198, 300, 7973, 155828, 153095, 152469, 152988, 152894, 151819, 152391, 153019, 152058, 153062, 153230, 151826, 152112, 152306, 152264, 152769, 153390, 152384, 152435, 152790, 153393, 152983, 152540, 152252, 152034, 153107, 152540, 151919, 151893, 152558, 152817, 152946, 152956, 152129, 152715, 153131, 153490, 151734, 152271, 152707, 151734, 153321, 152450, 156777, 198, 8088, 155792, 152452, 153497, 153353, 152679, 152533, 152382, 152374, 152611, 153341, 153163, 152285, 153411, 152495, 153141, 152320, 156777, 198, 1199, 155781, 151764, 152360, 153295, 152634, 153342, 152199, 152271, 156777, 198, 43366, 155799, 152308, 151682, 152889, 152016, 152385, 152629, 152495, 151826, 153321, 152958, 152180, 151886, 153432, 152922, 152128, 153024, 153040, 152593, 152287, 151677, 156777, 198, 53660, 155808, 151727, 152092, 152680, 153331, 151699, 152316, 152938, 152289, 152433, 153384, 151781, 153137, 153259, 152175, 153213, 152291, 151869, 152691, 152489, 151941, 152049, 152034, 153053, 152179, 153160, 151676, 153367, 156777, 198, 268, 4123, 480, 155821, 152350, 152173, 152536, 151991, 151960, 153144, 153013, 152358, 152234, 153135, 152291, 153235, 152143, 152583, 152402, 153483, 152678, 152192, 152533, 152946, 151797, 153103, 152310, 152293, 151825, 152548, 153442, 152109, 152659, 153325, 152781, 152570, 152957, 151752, 152265, 153381, 152515, 156777, 198, 437, 155787, 152957, 152659, 151975, 152709, 152402, 152836, 152174, 151792, 153409, 153327, 152990, 156777, 198, 275, 155781, 152520, 153038, 152067, 153273, 153185, 152265, 152974, 156777, 198, 94273, 155799, 152953, 152938, 153427, 152244, 151920, 153423, 152929, 152367, 153052, 152129, 152331, 152257, 152987, 152777, 153448, 152408, 151696, 152408, 152326, 152699, 156777, 198, 385, 16239, 155828, 152306, 152268, 153438, 153228, 152978, 152957, 153153, 153393, 152795, 152110, 152918, 152923, 152467, 152331, 153053, 153330, 151889, 153444, 152234, 152624, 151779, 152801, 152784, 
152139, 152222, 152751, 152512, 153287, 153141, 153052, 151840, 152589, 152508, 153499, 152109, 152255, 151739, 152267, 152759, 153318, 153165, 153349, 156777,


<|im_start|>
<|text_start|>the<|space|>overall<|space|>package<|space|>from<|space|>just<|space|>two<|space|>people<|space|>is<|space|>pretty<|space|>remarkable<|space|>sure<|space|>i<|space|>have<|space|>some<|space|>critiques<|space|>about<|space|>some<|space|>of<|space|>the<|space|>gameplay<|space|>aspects<|space|>but<|space|>its<|space|>still<|space|>really<|space|>enjoyable<|space|>and<|space|>it<|space|>looks<|space|>lovely<|space|>hello<|space|>i<|space|>am<|space|>sam<|space|>how<|space|>are<|space|>you<|text_end|>
<|audio_start|>
the<|t_0.08|><|257|><|740|><|636|><|913|><|788|><|1703|><|space|>
overall<|t_0.36|><|127|><|201|><|191|><|774|><|700|><|532|><|1056|><|557|><|798|><|298|><|1741|><|747|><|1662|><|1617|><|1702|><|1527|><|368|><|1588|><|1049|><|1008|><|1625|><|747|><|1576|><|728|><|1019|><|1696|><|1765|><|space|>
package<|t_0.56|><|935|><|584|><|1319|><|627|><|1016|><|1491|><|1344|><|1117|><|1526|><|1040|><|239|><|1435|><|951|><|498|><|723|><|1180|><|535|><|789|><|1649|><|1637|><|78|><|465|><|1668|><|901|><|595|><|1675|><|117|><|1009|><|1667|><|320|><|840|><|79|><|507|><|1762|><|1508|><|1228|><|1768|><|802|><|1450|><|1457|><|232|><|639|><|space|>
from<|t_0.19|><|604|><|782|><|1682|><|872|><|1532|><|1600|><|1036|><|1761|><|647|><|1554|><|1371|><|653|><|1595|><|950|><|space|>
just<|t_0.25|><|1782|><|1670|><|317|><|786|><|1748|><|631|><|599|><|1155|><|1364|><|1524|><|36|><|1591|><|889|><|1535|><|541|><|440|><|1532|><|50|><|870|><|space|>
two<|t_0.24|><|1681|><|1510|><|673|><|799|><|805|><|1342|><|330|><|519|><|62|><|640|><|1138|><|565|><|1552|><|1497|><|1552|><|572|><|1715|><|1732|><|space|>
people<|t_0.39|><|593|><|274|><|136|><|740|><|691|><|633|><|1484|><|1061|><|1138|><|1485|><|344|><|428|><|397|><|1562|><|645|><|917|><|1035|><|1449|><|1669|><|487|><|442|><|1484|><|1329|><|1832|><|1704|><|600|><|761|><|653|><|269|><|space|>
is<|t_0.16|><|566|><|583|><|1755|><|646|><|1337|><|709|><|802|><|1008|><|485|><|1583|><|652|><|10|><|space|>
pretty<|t_0.32|><|1818|><|1747|><|692|><|733|><|1010|><|534|><|406|><|1697|><|1053|><|1521|><|1355|><|1274|><|816|><|1398|><|211|><|1218|><|817|><|1472|><|1703|><|686|><|13|><|822|><|445|><|1068|><|space|>
remarkable<|t_0.68|><|230|><|1048|><|1705|><|355|><|706|><|1149|><|1535|><|1787|><|1356|><|1396|><|835|><|1583|><|486|><|1249|><|286|><|937|><|1076|><|1150|><|614|><|42|><|1058|><|705|><|681|><|798|><|934|><|490|><|514|><|1399|><|572|><|1446|><|1703|><|1346|><|1040|><|1426|><|1304|><|664|><|171|><|1530|><|625|><|64|><|1708|><|1830|><|1030|><|443|><|1509|><|1063|><|1605|><|1785|><|721|><|1440|><|923|><|space|>
sure<|t_0.36|><|792|><|1780|><|923|><|1640|><|265|><|261|><|1525|><|567|><|1491|><|1250|><|1730|><|362|><|919|><|1766|><|543|><|1|><|333|><|113|><|970|><|252|><|1606|><|133|><|302|><|1810|><|1046|><|1190|><|1675|><|space|>
i<|t_0.08|><|123|><|439|><|1074|><|705|><|1799|><|637|><|space|>
have<|t_0.16|><|1509|><|599|><|518|><|1170|><|552|><|1029|><|1267|><|864|><|419|><|143|><|1061|><|0|><|space|>
some<|t_0.16|><|619|><|400|><|1270|><|62|><|1370|><|1832|><|917|><|1661|><|167|><|269|><|1366|><|1508|><|space|>
critiques<|t_0.60|><|559|><|584|><|1163|><|1129|><|1313|><|1728|><|721|><|1146|><|1093|><|577|><|928|><|27|><|630|><|1080|><|1346|><|1337|><|320|><|1382|><|1175|><|1682|><|1556|><|990|><|1683|><|860|><|1721|><|110|><|786|><|376|><|1085|><|756|><|1523|><|234|><|1334|><|1506|><|1578|><|659|><|612|><|1108|><|1466|><|1647|><|308|><|1470|><|746|><|556|><|1061|><|space|>
about<|t_0.29|><|26|><|1649|><|545|><|1367|><|1263|><|1728|><|450|><|859|><|1434|><|497|><|1220|><|1285|><|179|><|755|><|1154|><|779|><|179|><|1229|><|1213|><|922|><|1774|><|1408|><|space|>
some<|t_0.23|><|986|><|28|><|1649|><|778|><|858|><|1519|><|1|><|18|><|26|><|1042|><|1174|><|1309|><|1499|><|1712|><|1692|><|1516|><|1574|><|space|>
of<|t_0.07|><|197|><|716|><|1039|><|1662|><|64|><|space|>
the<|t_0.08|><|1811|><|1568|><|569|><|886|><|1025|><|1374|><|space|>
gameplay<|t_0.48|><|1269|><|1092|><|933|><|1362|><|1762|><|1700|><|1675|><|215|><|781|><|1086|><|461|><|838|><|1022|><|759|><|649|><|1416|><|1004|><|551|><|909|><|787|><|343|><|830|><|1391|><|1040|><|1622|><|1779|><|1360|><|1231|><|1187|><|1317|><|76|><|997|><|989|><|978|><|737|><|189|><|space|>
aspects<|t_0.56|><|1423|><|797|><|1316|><|1222|><|147|><|719|><|1347|><|386|><|1390|><|1558|><|154|><|440|><|634|><|592|><|1097|><|1718|><|712|><|763|><|1118|><|1721|><|1311|><|868|><|580|><|362|><|1435|><|868|><|247|><|221|><|886|><|1145|><|1274|><|1284|><|457|><|1043|><|1459|><|1818|><|62|><|599|><|1035|><|62|><|1649|><|778|><|space|>
but<|t_0.20|><|780|><|1825|><|1681|><|1007|><|861|><|710|><|702|><|939|><|1669|><|1491|><|613|><|1739|><|823|><|1469|><|648|><|space|>
its<|t_0.09|><|92|><|688|><|1623|><|962|><|1670|><|527|><|599|><|space|>
still<|t_0.27|><|636|><|10|><|1217|><|344|><|713|><|957|><|823|><|154|><|1649|><|1286|><|508|><|214|><|1760|><|1250|><|456|><|1352|><|1368|><|921|><|615|><|5|><|space|>
really<|t_0.36|><|55|><|420|><|1008|><|1659|><|27|><|644|><|1266|><|617|><|761|><|1712|><|109|><|1465|><|1587|><|503|><|1541|><|619|><|197|><|1019|><|817|><|269|><|377|><|362|><|1381|><|507|><|1488|><|4|><|1695|><|space|>
enjoyable<|t_0.49|><|678|><|501|><|864|><|319|><|288|><|1472|><|1341|><|686|><|562|><|1463|><|619|><|1563|><|471|><|911|><|730|><|1811|><|1006|><|520|><|861|><|1274|><|125|><|1431|><|638|><|621|><|153|><|876|><|1770|><|437|><|987|><|1653|><|1109|><|898|><|1285|><|80|><|593|><|1709|><|843|><|space|>
and<|t_0.15|><|1285|><|987|><|303|><|1037|><|730|><|1164|><|502|><|120|><|1737|><|1655|><|1318|><|space|>
it<|t_0.09|><|848|><|1366|><|395|><|1601|><|1513|><|593|><|1302|><|space|>
looks<|t_0.27|><|1281|><|1266|><|1755|><|572|><|248|><|1751|><|1257|><|695|><|1380|><|457|><|659|><|585|><|1315|><|1105|><|1776|><|736|><|24|><|736|><|654|><|1027|><|space|>
lovely<|t_0.56|><|634|><|596|><|1766|><|1556|><|1306|><|1285|><|1481|><|1721|><|1123|><|438|><|1246|><|1251|><|795|><|659|><|1381|><|1658|><|217|><|1772|><|562|><|952|><|107|><|1129|><|1112|><|467|><|550|><|1079|><|840|><|1615|><|1469|><|1380|><|168|><|917|><|836|><|1827|><|437|><|583|><|67|><|595|><|1087|><|1646|><|1493|><|1677|><|space|>main: prompt size: 871

main: time for prompt: 252.929 ms

000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

main: time for decoder:       1412.620 ms
common_perf_print:    sampling time =      66.79 ms
common_perf_print:    samplers time =      26.02 ms /   199 tokens
common_perf_print:        load time =     655.95 ms
common_perf_print: prompt eval time =     234.97 ms /   871 tokens (    0.27 ms per token,  3706.90 tokens per second)
common_perf_print:        eval time =    1341.80 ms /   198 runs   (    6.78 ms per token,   147.56 tokens per second)
common_perf_print:       total time =    1075.97 ms /  1069 tokens
common_perf_print: unaccounted time =    -567.58 ms / -52.8 %      (total - sampling - prompt eval - eval) / (total)
common_perf_print:    graphs reused =        196
common_memory_breakdown_print: | memory breakdown [MiB]   | total   free    self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - CUDA0 (GTX 1060 6GB) |  6143 = 3733 + ( 931 =   506 +      96 +     329) +        1479 |
common_memory_breakdown_print: |   - Host                 |                  162 =   143 +       0 +      19                |

codes: '
hello<|t_0.96|><|865|><|1506|><|865|><|1419|><|1819|><|838|><|624|><|1251|><|899|><|954|><|1096|><|710|><|1152|><|1418|><|710|><|1301|><|1120|><|17|><|1456|><|1405|><|776|><|1668|><|1390|><|86|><|1292|><|1023|><|1683|><|1589|><|1092|><|1556|><|1479|><|1294|><|1292|><|805|><|1683|><|1430|><|900|><|1714|><|995|><|1294|><|1432|><|1007|><|1622|><|1120|><|861|><|1803|><|995|><|1092|><|1668|><|710|><|1433|><|933|><|670|><|32|><|1293|><|1251|><|1134|><|1701|><|1347|><|816|><|642|><|95|><|508|><|48|><|503|><|653|><|1707|><|1041|><|267|><|1817|><|248|><|1754|><|space|>
i<|t_0.28|><|73|><|642|><|169|><|614|><|983|><|169|><|843|><|443|><|1092|><|752|><|252|><|1378|><|1315|><|221|><|1448|><|1083|><|565|><|866|><|93|><|767|><|1697|><|space|>
am<|t_0.16|><|422|><|852|><|408|><|847|><|1007|><|550|><|874|><|673|><|191|><|127|><|220|><|716|><|space|>
sam<|t_0.43|><|775|><|487|><|646|><|519|><|493|><|1513|><|1|><|1166|><|640|><|556|><|0|><|1061|><|18|><|333|><|719|><|632|><|693|><|907|><|430|><|1312|><|1086|><|1098|><|1333|><|974|><|816|><|440|><|1755|><|1324|><|1534|><|662|><|1812|><|385|><|space|>
how<|t_0.20|><|1663|><|1028|><|1488|><|1314|><|1393|><|1723|><|1303|><|1497|><|951|><|1181|><|789|><|142|><|1475|><|66|><|297|><|space|>
are<|t_0.13|><|798|><|1803|><|562|><|123|><|756|><|968|><|381|><|890|><|1773|><|1039|><|space|>
you<|t_0.08|><|193|><|92|><|1221|><|1334|><|562|><|1415|>
<|audio_end|>
<|im_end|>'
main: codes size: 199
codes audio: '<|865|><|1506|><|865|><|1419|><|1819|><|838|><|624|><|1251|><|899|><|954|><|1096|><|710|><|1152|><|1418|><|710|><|1301|><|1120|><|17|><|1456|><|1405|><|776|><|1668|><|1390|><|86|><|1292|><|1023|><|1683|><|1589|><|1092|><|1556|><|1479|><|1294|><|1292|><|805|><|1683|><|1430|><|900|><|1714|><|995|><|1294|><|1432|><|1007|><|1622|><|1120|><|861|><|1803|><|995|><|1092|><|1668|><|710|><|1433|><|933|><|670|><|32|><|1293|><|1251|><|1134|><|1701|><|1347|><|816|><|642|><|95|><|508|><|48|><|503|><|653|><|1707|><|1041|><|267|><|1817|><|248|><|1754|><|73|><|642|><|169|><|614|><|983|><|169|><|843|><|443|><|1092|><|752|><|252|><|1378|><|1315|><|221|><|1448|><|1083|><|565|><|866|><|93|><|767|><|1697|><|422|><|852|><|408|><|847|><|1007|><|550|><|874|><|673|><|191|><|127|><|220|><|716|><|775|><|487|><|646|><|519|><|493|><|1513|><|1|><|1166|><|640|><|556|><|0|><|1061|><|18|><|333|><|719|><|632|><|693|><|907|><|430|><|1312|><|1086|><|1098|><|1333|><|974|><|816|><|440|><|1755|><|1324|><|1534|><|662|><|1812|><|385|><|1663|><|1028|><|1488|><|1314|><|1393|><|1723|><|1303|><|1497|><|951|><|1181|><|789|><|142|><|1475|><|66|><|297|><|798|><|1803|><|562|><|123|><|756|><|968|><|381|><|890|><|1773|><|1039|><|193|><|92|><|1221|><|1334|><|562|><|1415|>'
main: codes audio size: 168
main: time for vocoder:      220.671 ms
main: time for spectral ops: 850.860 ms
main: total time:            2737.218 ms
main: audio written to file 'output.wav'
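
Judging from the log above, the `codes audio` string looks like the `codes` output with the words, the `<|t_…|>` duration tags and the `<|space|>` markers stripped out, and each duration tag matches that word's code count divided by 75 (the `75` in `WavTokenizer-Large-75`). A minimal Python sketch of that reading follows. The function names and the 75 codes-per-second rate are assumptions inferred from this single run, not taken from the llama.cpp source.

```python
import re

# Matches purely numeric tokens such as <|865|>. Tags like <|t_0.16|>,
# <|space|> or <|audio_end|> are skipped because they are not all digits.
AUDIO_CODE = re.compile(r"<\|(\d+)\|>")

# Assumption: 75 audio codes per second, inferred from the log
# (e.g. "am" has 12 codes and is tagged t_0.16, and 12 / 75 = 0.16).
CODES_PER_SECOND = 75


def extract_audio_codes(codes_text):
    """Return the numeric audio codes found in llama-tts 'codes' output."""
    return [int(m) for m in AUDIO_CODE.findall(codes_text)]


def estimated_seconds(n_codes, rate=CODES_PER_SECOND):
    """Estimate the audio length implied by a number of audio codes."""
    return n_codes / rate


sample = ("am<|t_0.16|><|422|><|852|><|408|><|847|><|1007|><|550|>"
          "<|874|><|673|><|191|><|127|><|220|><|716|><|space|>")
codes = extract_audio_codes(sample)
print(len(codes), round(estimated_seconds(len(codes)), 2))
```

Under the same assumption, the 168 audio codes reported for this run would correspond to roughly 2.24 seconds of speech.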

 

English works well, but Korean does not seem to.

output.wav
0.10MB

 

I asked GPT to rewrite "안녕? 난 잼미니야 만나서 반가워" in English spelling that is friendlier to the TTS,

but the '안녕? 난' part gets dropped and the rest sounds roughly like '잼미니야 맨나서 빵가워'.

D:\study\llm\llama-b9093-bin-win-cuda-12.4-x64>llama-tts -m ..\OuteTTS-0.3-500M-Q8_0.gguf  -mv ..\WavTokenizer-Large-75-F16.gguf -p "Annyoung? Nahn Jemmini-ya. Mannaseo bangawo."

 

hello_ko.wav
0.15MB

 

 

+

It turns out Hugging Face has a dedicated text-to-speech task type in its model filter.

[Link : https://huggingface.co/models?pipeline_tag=text-to-speech]

Posted by 구차니

It looks like a repository could be fetched the same way as with git clone, but..

Anyway, here is how to download everything in one go from Python.

 

step 1.

On the Hugging Face model page, click the copy icon next to the repository name (just left of "like 2") to copy the repo id.

[Link : https://huggingface.co/lysandre/arxiv-nlp]

 

step 2.

Install the huggingface_hub package with pip.

 

step 3. 

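Start python and follow the steps below.

Downloading an entire repository
The snapshot_download() function downloads an entire repository at a given revision. It uses hf_hub_download() internally, so every downloaded file is cached on the local disk. Files are fetched concurrently, which keeps the download fast.

To download a whole repository, pass repo_id and repo_type (repo_type can be omitted for model repositories, as below):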

from huggingface_hub import snapshot_download
snapshot_download(repo_id="lysandre/arxiv-nlp")

[Link : https://huggingface.co/docs/huggingface_hub/ko/guides/download]

 

Hmm.. maybe I should just download the files one by one -_-

Download complete: : 2.52GB [03:13, 50.8MB/s]              'C:\\Users\\minimonk\\.cache\\huggingface\\hub\\models--unsloth--Llama-OuteTTS-1.0-1B\\snapshots\\52b901173c6cf817148fdaef981e52408332c3ca'
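
The path returned above hints at how the hub cache is laid out: a folder named `models--<owner>--<name>` with a `snapshots/<commit hash>` directory inside it. The sketch below rebuilds that folder name and lists the snapshots already on disk, using only the standard library. The helper names are hypothetical, and the default root assumes the cache location has not been moved with an environment variable.

```python
from pathlib import Path


def cache_repo_dir(repo_id, repo_type="model", cache_root=None):
    """Return the hub cache folder for a repo, e.g. models--owner--name."""
    # Assumption: default cache root, i.e. no custom cache directory is set.
    root = Path(cache_root) if cache_root else Path.home() / ".cache" / "huggingface" / "hub"
    return root / f"{repo_type}s--{repo_id.replace('/', '--')}"


def list_snapshots(repo_id, cache_root=None):
    """Return the snapshot (commit hash) folder names already downloaded."""
    snapshot_dir = cache_repo_dir(repo_id, cache_root=cache_root) / "snapshots"
    if not snapshot_dir.is_dir():
        return []
    return sorted(p.name for p in snapshot_dir.iterdir() if p.is_dir())


print(cache_repo_dir("unsloth/Llama-OuteTTS-1.0-1B"))
print(list_snapshots("unsloth/Llama-OuteTTS-1.0-1B"))
```

This is handy for checking whether a model is already cached before starting another multi-gigabyte download.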

Posted by 구차니

[Link : https://selgyun.tistory.com/4]

 

(keyword) - increases emphasis

[keyword] - decreases emphasis

batch - generates several images at once
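
To make the (keyword) / [keyword] emphasis syntax above more concrete, here is a small Python sketch that turns one bracketed token into a weight. It assumes the AUTOMATIC1111 web UI convention where each pair of parentheses multiplies attention by 1.1, each pair of square brackets divides it by 1.1, and (keyword:1.5) sets the weight explicitly. The function name is hypothetical and this is not the web UI's own parser.

```python
import re

STEP = 1.1  # assumption: per-bracket factor used by the AUTOMATIC1111 web UI


def emphasis_weight(token):
    """Return (keyword, weight) for a single bracketed prompt token."""
    token = token.strip()
    # Explicit form: (keyword:1.5)
    explicit = re.fullmatch(r"\(([^()\[\]:]+):([0-9]*\.?[0-9]+)\)", token)
    if explicit:
        return explicit.group(1).strip(), float(explicit.group(2))
    weight = 1.0
    # Peel matching outer brackets one layer at a time.
    while len(token) >= 2:
        if token[0] == "(" and token[-1] == ")":
            weight *= STEP
        elif token[0] == "[" and token[-1] == "]":
            weight /= STEP
        else:
            break
        token = token[1:-1].strip()
    return token, weight


for example in ["cat", "(cat)", "((cat))", "[cat]", "(cat:1.5)"]:
    keyword, weight = emphasis_weight(example)
    print(example, "->", keyword, round(weight, 3))
```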

[Link : https://selgyun.tistory.com/5] txt2img

[Link : https://selgyun.tistory.com/6] img2img

[Link : https://selgyun.tistory.com/7] LoRA - a model type?

 

+

2026.05.11

How to get models, and the kinds of models

[Link : https://healtable.tistory.com/7]

 

+

--ckpt CKPT model.ckpt Path to checkpoint of Stable Diffusion model; if specified, this checkpoint will be added to the list of checkpoints and loaded.
--ckpt-dir CKPT_DIR None Path to directory with Stable Diffusion checkpoints.

[Link : https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Command-Line-Arguments-and-Settings]

[Link : https://www.reddit.com/r/StableDiffusion/comments/117p89v/how_can_i_choose_which_checkpoint_to_load_when/?tl=ko]

 

+

VAE stands for variational autoencoder
EMA and MSE. (Exponential Moving Average and Mean Square Error)

[Link : https://stable-diffusion-art.com/automatic1111/]

[Link : https://stable-diffusion-art.com/how-to-use-vae/]

[Link : https://huggingface.co/stabilityai/sd-vae-ft-ema#visual]

 

What I'm seeing now looks similar to this, so I should read through it.

[Link : https://samablog.tistory.com/m/203]

 

+

I don't know what the previous model was, but the one I just downloaded is SDXL-based.. so I'm told to set the VAE to automatic,

but I still haven't found where that setting lives..

Illustrious XL is an advanced Stable Diffusion XL (SD XL)-based model, developed by OnomaAI Research, optimized specifically for illustration and animation tasks. It is built upon the Kohaku XL-Beta - Revision 5 checkpoint, leveraging its robust foundation to deliver high-quality generative capabilities.

[Link : https://civitai.com/models/795765/illustrious-xl]

 

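According to GPT the VAE is the main suspect, though CLIP could also be at fault.. and it might be an fp16 VAE (half-vae?) issue as well.

All I changed was the model, though.. so a mismatch between the model and the VAE does look likely.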

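1. The VAE does not match the model

This is the most common cause.

For example:

SD1.5-family models
SDXL-family models
Anime-focused models
Photorealistic models

Each has a slightly different latent distribution.

If you attach a VAE from a different family:

Off colors
Collapsed faces
Loss of detail
Weaker prompt adherence
Over- or under-saturation
Malformed eyes/hands

and similar problems appear.

Examples:

Using an SD1.5 VAE with an SDXL model
Using a photorealistic VAE with an Anything-family model

→ You may type “cat” and get something that looks like a strange creature.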


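5. Often the real culprit is CLIP

Many users assume it is a VAE problem, but in practice:

CLIP encoding
CFG scale
sampler
step count
LoRA conflicts

are very often the actual cause.

Examples:

CFG too high → overfitting
Euler a → prompt drift
LoRA weight too high → distorted keywords

Then a prompt like:

"1girl, red hair"

suddenly produces:

A cluttered background
Multiple characters
Odd clothing

and so on.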



6. fp16 VAE problems

Common on low-spec hardware.

An fp16 VAE offers:

Lower VRAM use
Higher speed

but it can also cause:

Black images
Washed-out output
NaN
Loss of detail

This shows up especially often on AMD and older CUDA GPUs.

[Link : https://chatgpt.com/share/6a01e096-dc48-83e9-8aa3-cfe937d3b9e6]

 

 

 

+

2026.05.12

ft-MSE looks better to my eyes.

[Link : https://huggingface.co/stabilityai/sd-vae-ft-mse/tree/main]

[Link : https://huggingface.co/stabilityai/sd-vae-ft-mse-original/tree/main]

[Link : https://huggingface.co/stabilityai/sd-vae-ft-ema/tree/main]

[Link : https://huggingface.co/stabilityai/sd-vae-ft-ema-original/tree/main]

 

+

[Link : https://www.reddit.com/r/StableDiffusion/comments/1594hbj/minimum_nonsquare_resolution_with_sdxl/]

 

+

With the --medvram-sdxl option, the model gets moved out of VRAM and parked in main memory. No wonder I kept running out of system memory -_-

Posted by 구차니

 

from diffusers import StableDiffusionPipeline

[Link : https://www.reddit.com/r/Python/comments/10g5nay/use_python_to_build_a_free_stable_diffusion_app/?tl=ko]

[Link : https://www.assemblyai.com/blog/build-a-free-stable-diffusion-app-with-a-gpu-backend]

 

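Memory is obviously going to blow up otherwise, so run e4b on GPU 0

and stable diffusion on GPU 1, and hook the two together:

play around with img2img, or hand the request to txt2img whenever the message contains "그림 그려" ("draw a picture").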
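
The routing idea in the plan above could be sketched as a small dispatcher. Everything here is a hypothetical illustration: the function name, the backend labels and the rule that an attached image plus a drawing request goes to img2img are assumptions, not something the setup already implements.

```python
# Phrases that signal a drawing request; "그림 그려" means "draw a picture".
DRAW_KEYWORDS = ("그림 그려",)


def pick_backend(text, has_image=False):
    """Decide which service should handle a chat message.

    Returns "img2img" for a drawing request with an attached image,
    "txt2img" for a drawing request without one, and "llm" otherwise.
    """
    wants_drawing = any(keyword in text for keyword in DRAW_KEYWORDS)
    if wants_drawing and has_image:
        return "img2img"
    if wants_drawing:
        return "txt2img"
    return "llm"


print(pick_backend("고양이 그림 그려줘"))
print(pick_backend("이 사진으로 그림 그려줘", has_image=True))
print(pick_backend("안녕"))
```

Keeping the decision in one pure function makes it easy to extend the keyword list later without touching the code that talks to either GPU.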

---

by gpt

Enable API mode first:

./webui.sh --api

 

txt2img

import requests
import base64

url = "http://127.0.0.1:7860/sdapi/v1/txt2img"

payload = {
    "prompt": "masterpiece, ultra detailed, cyberpunk girl, neon city, rain",
    "negative_prompt": "low quality, blurry",
    "steps": 30,
    "width": 768,
    "height": 768,
    "cfg_scale": 7,
    "sampler_name": "DPM++ 2M Karras"
}

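response = requests.post(url, json=payload)
response.raise_for_status()  # stop here on an HTTP error instead of failing on the JSON below

result = response.json()

# Decode the base64 image and save it
image_data = base64.b64decode(result["images"][0])

with open("generated.png", "wb") as f:
    f.write(image_data)

print("generated.png saved")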

 

Oh.. it works.
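
The script above only saves `result["images"][0]`. If the response carries more than one image (for example when the payload requests a batch, which I assume is done with a `batch_size` key), a helper like the following could save them all. The helper name and the file naming scheme are hypothetical.

```python
import base64
from pathlib import Path


def save_images(result, out_dir=".", prefix="generated"):
    """Decode every base64 image in a txt2img/img2img response and save it.

    `result` is the parsed JSON body; only its "images" list is used.
    Returns the list of written paths.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, encoded in enumerate(result.get("images", [])):
        path = out / f"{prefix}_{index:02d}.png"
        path.write_bytes(base64.b64decode(encoded))
        paths.append(path)
    return paths


# Stand-in response so the sketch runs without the web UI.
fake_result = {"images": [base64.b64encode(b"not a real png").decode("ascii")]}
print(save_images(fake_result, out_dir="sd_out"))
```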

 

img2img

import requests
import base64

# Read the input image and base64-encode it
with open("input.png", "rb") as f:
    image_base64 = base64.b64encode(f.read()).decode()

url = "http://127.0.0.1:7860/sdapi/v1/img2img"

payload = {
    "init_images": [image_base64],

    "prompt": "cyberpunk style, neon lights, futuristic",
    "negative_prompt": "low quality, blurry",

    "denoising_strength": 0.55,

    "steps": 30,
    "cfg_scale": 7,
    "width": 768,
    "height": 768,
    "sampler_name": "DPM++ 2M Karras"
}

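response = requests.post(url, json=payload)
response.raise_for_status()  # stop here on an HTTP error instead of failing on the JSON below

result = response.json()

# Decode the base64 image and save it
image_data = base64.b64decode(result["images"][0])

with open("modified.png", "wb") as f:
    f.write(image_data)

print("modified.png saved")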

 

[Link : https://chatgpt.com/share/69fd90a2-2bbc-83e9-8f04-6cecdddc1b41]

Posted by 구차니

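llama-swap / llama.cpp configuration

Oddly, gemma4-e2b actually fails unless -sm none (split mode none, which keeps the whole model on a single GPU) is given, perhaps because the model is so small.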

$ cat config.yaml 
models:
  gemma4-e2b:
    cmd: /mnt/Downloads/llama-b8925/llama-server --port ${PORT} --model /mnt/Downloads/model/gemma4-e2b/gemma-4-E2B-it-Q4_K_M.gguf -mm ./model/gemma4-e2b/mmproj-F16.gguf -sm none
  gemma4-e4b:
    cmd: /mnt/Downloads/llama-b8925/llama-server --port ${PORT} --model /mnt/Downloads/model/gemma4-e4b/gemma-4-E4B-it-Q4_K_M.gguf -mm ./model/gemma4-e4b/mmproj-F16.gguf
  gemma4-26B_img:
    cmd: /mnt/Downloads/llama-b8925/llama-server --port ${PORT} --model /mnt/Downloads/model/gemma4-26b/gemma-4-26B-A4B-it-UD-IQ2_M.gguf -mm ./model/gemma4-26b/mmproj-F16.gguf
  qwen3.6-35b_img:
    cmd: /mnt/Downloads/llama-b8925/llama-server --port ${PORT} --model /mnt/Downloads/model/qwen3.6_35B/Qwen3.6-35B-A3B-UD-Q2_K_XL.gguf -mm ./model/qwen3.6_35B/mmproj-F16.gguf
  qwen3.6-35b:
    cmd: /mnt/Downloads/llama-b8925/llama-server --port ${PORT} --model /mnt/Downloads/model/qwen3.6_35B/Qwen3.6-35B-A3B-UD-Q2_K_XL.gguf

 

opencode configuration

Adding this alone does not seem to be enough to make it work.

~/.config/opencode$ cat opencode.json 
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llama.cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama-server (local)",
      "options": {
        "baseURL": "http://127.0.0.1:8080/v1"
      },
      "models": {
        "gemma4-e2b": {
          "name": "gemma4 e2b (local)",
          "limit": {
            "context": 128000,
            "output": 65536
          }
        }
      }
    }
  }
}

[Link : https://opencode.ai/docs/providers#llamacpp]
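
Before involving opencode, the endpoint itself can be sanity-checked with a plain chat request. The sketch below uses only the standard library and the same baseURL as the opencode.json above. It assumes the server speaks the usual OpenAI-style /chat/completions route and that, as I understand llama-swap, the `model` field has to match one of the keys under `models:` in config.yaml. The function names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080/v1"  # same baseURL as in opencode.json


def build_chat_request(model, prompt, base_url=BASE_URL):
    """Build (but do not send) a chat completion request for `model`."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model, prompt, timeout=600):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With the server running, `print(ask("gemma4-e2b", "hello"))` should come back with a short greeting. The generous timeout is there because the first request to a model may have to wait for it to load.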

 

You also have to go through the /connect command inside opencode before it runs against the local server.

 

 

 

 

I was bored, so let's try having it write something locally!

 

Ooh, it's eating up the local machine lol

(this is with e2b on gpu0 and stable diffusion on gpu1)

 

Adding another model to opencode.json!

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llama.cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama-server (local)",
      "options": {
        "baseURL": "http://127.0.0.1:8080/v1"
      },
      "models": {
        "gemma4-e2b": {
          "name": "gemma4 e2b (local)",
          "limit": {
            "context": 128000,
            "output": 65536
          }
        },
        "qwen3.6-35b": {
          "name": "qwen3.6-35b (local)",
          "limit": {
            "context": 65536,
            "output": 65536
          }
        }
      }
    }
  }
}

 

Switching from gemma4-e2b to qwen3.6-35b!

 

Getting it to simply say hello back is hard work lol

 

Still, both GPUs are pinned at 100%, and the power draw is in the 150 W class * 2..

Posted by 구차니

Huh.. why did it suddenly stop working?

Fixed with -sm none.

Posted by 구차니

Device ids start from 0.

Launching Web UI with arguments: --server-name=0.0.0.0 --device-id=1

[Link : https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/10665]

Posted by 구차니

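Trying again: Telegram sends voice messages as ogg, and feeding that in directly fails because it is not wav.

After converting the ogg to wav with ffmpeg, it does process something.

The recognition rate is noticeably poor, though, maybe because the volume is low.. (gemma4:e2b)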
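
The ogg-to-wav step described above can be scripted. The sketch below builds the ffmpeg command and runs it through `subprocess`. The function names are hypothetical, and the mono / 16 kHz settings are an assumption about what a speech model is likely to want, not something confirmed in this test.

```python
import subprocess


def ffmpeg_to_wav_cmd(src, dst, sample_rate=16000):
    """Build the ffmpeg command that converts `src` (e.g. ogg) to mono wav."""
    return [
        "ffmpeg", "-y",            # overwrite dst without asking
        "-i", str(src),            # input file (the Telegram voice note)
        "-ac", "1",                # downmix to a single channel
        "-ar", str(sample_rate),   # resample (16 kHz is an assumption)
        str(dst),
    ]


def convert_to_wav(src, dst, sample_rate=16000):
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(ffmpeg_to_wav_cmd(src, dst, sample_rate), check=True)
    return dst


print(" ".join(ffmpeg_to_wav_cmd("voice.ogg", "voice.wav")))
```

If low volume really is what hurts recognition, adding one of ffmpeg's audio filters (loudnorm, for instance) to the command might be worth trying; that is untested here.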

 

----

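I'm going through ollama; is this just not implemented yet..

Or is audio simply unsupported when gemma4:e2b runs under ollama... how do I check that?

And why is it treating the input as an image in the meantime?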

--- [Prompt to AI] (Model: gemma4:e2b) ---
[{'role': 'user', 'content': [{'type': 'text', 'text': '이 내용을 분석해줘'}, {'type': 'input_audio', 'input_audio': {'data': 'encoded data', 'format': 'wav'}}]}]


 {'error': {'message': 'Failed to create new sequence: failed to process inputs: imageunknown format', 'type': 'api_error', 'param': None, 'code': None}}
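
For reference, the message printed in the prompt log above can be assembled from raw wav bytes as follows. This is only a sketch of the request shape seen in that log; the function name is hypothetical, and building the message this way does not by itself get around the error returned by the server.

```python
import base64


def audio_user_message(prompt, wav_bytes):
    """Build a user message carrying text plus base64-encoded wav audio."""
    encoded = base64.b64encode(wav_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "input_audio",
             "input_audio": {"data": encoded, "format": "wav"}},
        ],
    }


# Placeholder bytes; in practice read them from the converted wav file.
message = audio_user_message("이 내용을 분석해줘", b"RIFF")
print(message["content"][1]["input_audio"]["format"])
```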
 

 

$ ollama show gemma4:e2b
  Model
    architecture        gemma4    
    parameters          5.1B      
    context length      131072    
    embedding length    1536      
    quantization        Q4_K_M    
    requires            0.20.0    

  Capabilities
    completion    
    vision        
    audio         
    tools         
    thinking      

  Parameters
    temperature    1       
    top_k          64      
    top_p          0.95    

  License
    Apache License               
    Version 2.0, January 2004    
    ...                          
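
One way to answer the "how do I check that?" question from a script is to read the Capabilities section out of the `ollama show` text. The parser below is a sketch written against the layout printed above, and the function names are hypothetical. Note that in this case `audio` is listed even though the request failed, so the listing alone does not prove that the OpenAI-compatible endpoint accepts `input_audio`.

```python
import subprocess


def parse_capabilities(show_output):
    """Return the entries listed under 'Capabilities' in `ollama show` text."""
    capabilities = []
    in_section = False
    for line in show_output.splitlines():
        stripped = line.strip()
        if stripped == "Capabilities":
            in_section = True
        elif in_section and stripped:
            capabilities.append(stripped)
        elif in_section and capabilities:
            break  # the blank line after the entries ends the section
    return capabilities


def model_capabilities(model):
    """Run `ollama show <model>` and parse its Capabilities section."""
    completed = subprocess.run(
        ["ollama", "show", model], capture_output=True, text=True, check=True
    )
    return parse_capabilities(completed.stdout)


sample = "  Capabilities\n    completion    \n    vision        \n    audio         \n\n  Parameters\n"
print(parse_capabilities(sample))
```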

 

import OpenAI from "openai";
const openai = new OpenAI();

// Fetch an audio file and convert it to a base64 string
const url = "https://cdn.openai.com/API/docs/audio/alloy.wav";
const audioResponse = await fetch(url);
const buffer = await audioResponse.arrayBuffer();
const base64str = Buffer.from(buffer).toString("base64");

const response = await openai.chat.completions.create({
  model: "gpt-audio",
  modalities: ["text", "audio"],
  audio: { voice: "alloy", format: "wav" },
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "What is in this recording?" },
        { type: "input_audio", input_audio: { data: base64str, format: "wav" }}
      ]
    }
  ],
  store: true,
});

console.log(response.choices[0]);

[Link : https://developers.openai.com/api/docs/guides/audio?example=audio-in]

 

 

[Link : https://community.openai.com/t/how-can-i-pass-a-system-prompt-and-audio-user-input-to-get-a-text-output-back/1002483/3]

[Link : https://github.com/ggml-org/llama.cpp/issues/21868]

Posted by 구차니

I was wondering how to use this, and it turns out a post dated May 15, 2025 already lists the models below as supported.

[Link : https://ollama.com/blog/multimodal-models]

[Link : https://ollama.com/search?c=vision]

 

The file path used below is

/home/minimonk/1804-lubuntu-default-wallpaper.png

and if you simply put the absolute path in the prompt, it reads the file and looks at it on its own.

Whoa..(!)

 

$ ollama run gemma4:e2b
>>> what do you see in this image?  /home/minimonk/1804-lubuntu-default-wall
... paper.png
Added image '/home/minimonk/1804-lubuntu-default-wallpaper.png'
Thinking...
Here's a thinking process to analyze the image and generate the 
description:

1.  **Analyze the Request:** The user provided an image and asked, "what 
do you see in this image?"

2.  **Examine the Image (Visual Analysis):**
    *   **Overall Impression/Mood:** The image is dark, deep, atmospheric, 
and somewhat cosmic or abstract.
    *   **Colors:** The dominant colors are deep blues (teal/cyan), 
vibrant purples (magenta/violet), and some lighter, soft areas.
    *   **Texture/Pattern:** The background is not uniformly smooth. It 
features nebulous areas and scattered dots.
    *   **Specific Elements:**
        *   **Background:** Deep blue/purple gradient.
        *   **Texture/Effects:** There are faint, wispy clouds or nebulas, 
especially in the upper right area, suggesting a cosmic or smoky effect.
        *   **Particles:** Numerous small, scattered white or 
light-colored dots (specks of light/stars/dust) are distributed across the 
frame.
        *   **Light/Glow:** There is a soft, glowing, lighter purple/pink 
area, particularly visible towards the right.

3.  **Synthesize the Description (Drafting the Response):** Combine the 
observations into a coherent description. Start with the general theme and 
then detail the specific elements.

    *   *(Initial thought: It's a space background.)*
    *   *(Refinement: It's an abstract digital art piece, often called a 
cosmic or nebula background.)*

4.  **Review and Refine (Adding evocative language):** Ensure the 
description is accurate and engaging. (Self-Correction: Avoid making 
definitive claims like "it is space," but use terms that imply it, like 
"nebula" or "cosmic.")

5.  **Final Output Generation.** (This leads to the provided good 
response.)
...done thinking.

This image is an **abstract, atmospheric background**, often resembling a 
**cosmic or nebula scene**.

Here is a detailed breakdown of what I see:

1.  **Color Palette:** The dominant colors are deep and rich, featuring a 
strong gradient blend of **deep blues** and **vibrant purples/magentas**.
2.  **Texture/Atmosphere:** The background has a soft, smoky, or nebulous 
quality, suggesting deep space or an ethereal cloud formation. There are 
lighter, diffused areas, particularly on the right side, which glow with 
pink and purple hues.
3.  **Particles:** The image is scattered with numerous small, bright 
**white dots**, which give the impression of stars, dust, or glittering 
particles floating in the dark space.
4.  **Overall Mood:** The image evokes a feeling of depth, mystery, 
serenity, and vastness. It is highly typical of digital art used as 
wallpapers or abstract graphics.
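
The same image question can be asked from a script instead of the interactive prompt. The sketch below targets Ollama's REST endpoint /api/generate on the default port 11434 and sends the image as a base64 string in the `images` list. The function names are hypothetical, and the sketch assumes a local Ollama server with gemma4:e2b already pulled.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"  # Ollama's default port


def build_image_payload(model, prompt, image_bytes):
    """Build a non-streaming /api/generate body with one attached image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def describe_image(model, prompt, image_path, url=OLLAMA_URL, timeout=600):
    """Send the image to Ollama and return the generated text."""
    with open(image_path, "rb") as f:
        payload = build_image_payload(model, prompt, f.read())
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["response"]
```

For example, `describe_image("gemma4:e2b", "what do you see in this image?", "/home/minimonk/1804-lubuntu-default-wallpaper.png")` would mirror the session above.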

 

What I struggled so hard to get working with llama.cpp turns out to be this easy T_T

Posted by 구차니