Communicating with llama-swap from Python
구차니
2026. 4. 29. 12:23
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello"}],
)
[Link: https://pypi.org/project/llama-stack/]
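The snippet above is aimed at a llama-stack server (8321 is llama-stack's default port; see the link). The same openai client is all that's needed for llama-swap below; only base_url and the model name change. As a minimal continuation of that snippet, the reply text sits in the usual place on the response object:

# Follows on from the snippet above: `response` is the ChatCompletion
# returned by chat.completions.create().
print(response.choices[0].message.content)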
llama.cpp + llama-swap
$ cat config.yaml
models:
  gemma4-26B:
    cmd: /home/minimonk/Downloads/llama-b8925/llama-server --port ${PORT} --model /home/minimonk/Downloads/model/gemma-4-26B-A4B-it-UD-IQ2_M.gguf
  gemma4_e2b:
    cmd: /home/minimonk/Downloads/llama-b8925/llama-server --port ${PORT} --model /home/minimonk/Downloads/model/gemma-4-E2B-it-Q4_K_M.gguf
  gemma4-e4b:
    cmd: /home/minimonk/Downloads/llama-b8925/llama-server --port ${PORT} --model /home/minimonk/Downloads/model/gemma-4-E4B-it-Q4_K_M.gguf
  qwen3.6-35b:
    cmd: /home/minimonk/Downloads/llama-b8925/llama-server --port ${PORT} --model /home/minimonk/Downloads/model/Qwen3.6-35B-A3B-UD-Q2_K_XL.gguf
Drop in the config above and run it. But why does it come up as tcp6 whichever way you start it... well, it accepts connections, so let's move on. (llama-swap is a Go program, and a Go wildcard listener is created as a dual-stack IPv6 socket, so netstat reports it as tcp6 even though IPv4 clients can still connect.)
$ ./llama-swap --listen 0.0.0.0:8080
llama-swap listening on http://0.0.0.0:8080
$ ./llama-swap
llama-swap listening on http://:8080
$ netstat -tnlp
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.53:53           0.0.0.0:*               LISTEN      -
tcp        0      0 127.0.0.1:631           0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      -
tcp6       0      0 :::8080                 :::*                    LISTEN      92882/./llama-swap
tcp6       0      0 ::1:631                 :::*                    LISTEN      -
tcp6       0      0 :::22                   :::*                    LISTEN      -
$ ./llama-swap --help
Usage of ./llama-swap:
  -config string
        config file name (default "config.yaml")
  -listen string
        listen ip/port
  -tls-cert-file string
        TLS certificate file
  -tls-key-file string
        TLS key file
  -version
        show version of build
  -watch-config
        Automatically reload config file on change
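Before firing a chat request, it's worth confirming the proxy is up and the configured models are visible. A minimal sketch, assuming llama-swap passes the standard OpenAI /v1/models listing through with the names from config.yaml:

from openai import OpenAI

# Point the client at llama-swap (address/port from the session above).
client = OpenAI(base_url="http://192.168.50.228:8080/v1", api_key="fake")

# The ids printed here should match the keys under "models:" in
# config.yaml (gemma4-26B, gemma4_e2b, gemma4-e4b, qwen3.6-35b).
for m in client.models.list():
    print(m.id)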
from openai import OpenAI

client = OpenAI(base_url="http://192.168.50.228:8080/v1", api_key="fake")
response = client.chat.completions.create(
    model="qwen3.6-35b",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response)
| ChatCompletion( id="chatcmpl-oUJ7KngFXBitwpOpd7tG8ndeRdmzVh5l", choices=[ Choice( finish_reason="stop", index=0, logprobs=None, message=ChatCompletionMessage( content="Hello! How can I help you today?", refusal=None, role="assistant", annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content='Here\'s a thinking process:\n\n1. **Analyze User Input:** The user said "Hello"\n - This is a standard greeting.\n - No specific question or task is provided.\n\n2. **Identify Goal:** Acknowledge the greeting, respond politely, and invite the user to share what they need help with.\n\n3. **Determine Tone:** Friendly, professional, open-ended.\n\n4. **Draft Response:** \n "Hello! How can I help you today?"\n\n5. **Refine (Self-Correction/Verification):** \n - Is it appropriate? Yes.\n - Is it concise? Yes.\n - Does it encourage further interaction? Yes.\n - Matches standard AI assistant behavior.\n\n No changes needed.\n\n6. **Final Output Generation:** Output the drafted response.✅\n', ), ) ], created=1777431855, model="Qwen3.6-35B-A3B-UD-Q2_K_XL.gguf", object="chat.completion", service_tier=None, system_fingerprint="b8925-0adede866", usage=CompletionUsage( completion_tokens=194, prompt_tokens=11, total_tokens=205, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetails(audio_tokens=None, cached_tokens=0), ), timings={ "cache_n": 0, "prompt_n": 11, "prompt_ms": 271.25, "prompt_per_token_ms": 24.65909090909091, "prompt_per_second": 40.55299539170507, "predicted_n": 194, "predicted_ms": 6717.182, "predicted_per_token_ms": 34.62464948453608, "predicted_per_second": 28.88115879545917, }, ) |
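The response above only arrives once all 194 tokens have been generated. llama.cpp's server also streams over the same OpenAI-compatible endpoint, so a token-by-token version is a small change; a sketch under the same assumptions as above:

from openai import OpenAI

client = OpenAI(base_url="http://192.168.50.228:8080/v1", api_key="fake")

# stream=True yields ChatCompletionChunk objects as tokens are
# generated instead of one final ChatCompletion.
stream = client.chat.completions.create(
    model="qwen3.6-35b",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:  # some servers send a final usage-only chunk
        continue
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()

Asking for a different name from config.yaml is what makes llama-swap swap the backing llama-server, as in the next request: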
from openai import OpenAI

client = OpenAI(base_url="http://192.168.50.228:8080/v1", api_key="fake")
response = client.chat.completions.create(
    model="gemma4-26B",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response)
| ChatCompletion( id="chatcmpl-s6esFph2EBjs4bgtPB2wWHxfU79d2GOG", choices=[ Choice( finish_reason="stop", index=0, logprobs=None, message=ChatCompletionMessage( content="Hello! How can I help you today?", refusal=None, role="assistant", annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content='* User says: "Hello"\n * Intent: Greeting, starting a conversation.\n * Tone: Neutral/Friendly.\n\n * Acknowledge the greeting.\n * Offer assistance.\n * Maintain a helpful and polite tone.\n\n * "Hello! How can I help you today?" (Standard, efficient)\n * "Hi there! What\'s on your mind?" (Casual, friendly)\n * "Greetings! Is there anything specific you\'d like to discuss or learn about?" (Formal, structured)\n\n * "Hello! How can I help you today?" is the most versatile and appropriate response for an AI assistant.\n\n * "Hello! How can I help you today?"', ), ) ], created=1777432320, model="gemma-4-26B-A4B-it-UD-IQ2_M.gguf", object="chat.completion", service_tier=None, system_fingerprint="b8925-0adede866", usage=CompletionUsage( completion_tokens=177, prompt_tokens=17, total_tokens=194, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetails(audio_tokens=None, cached_tokens=0), ), timings={ "cache_n": 0, "prompt_n": 17, "prompt_ms": 919.073, "prompt_per_token_ms": 54.063117647058824, "prompt_per_second": 18.496898505341793, "predicted_n": 177, "predicted_ms": 4819.842, "predicted_per_token_ms": 27.230745762711862, "predicted_per_second": 36.72319549064057, }, ) |
The reasoning trace and the final answer can be pulled out of the message separately:

print(response.choices[0].message.reasoning_content)

print(response.choices[0].message.content)
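Like reasoning_content, the timings block in the dumps above is a llama.cpp server extension rather than part of the OpenAI schema. The openai client keeps unrecognized fields on the response object (as the reprs above show), so generation speed can be read back; a sketch, assuming plain attribute access works for the extra field:

# "timings" comes from llama.cpp's server, visible in the dumps above.
# Attribute access on this extra field is an assumption, but the field
# clearly survives into the ChatCompletion object.
t = response.timings
print(f"prompt:    {t['prompt_n']:>4} tokens at {t['prompt_per_second']:.1f} tok/s")
print(f"predicted: {t['predicted_n']:>4} tokens at {t['predicted_per_second']:.1f} tok/s")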