Program Usage/AI Programs

Talking to llama-swap from Python

구차니 2026. 4. 29. 12:23

 

 

from openai import OpenAI

# Point the OpenAI client at a local OpenAI-compatible endpoint
# (8321 is llama-stack's default port) with a dummy API key.
client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello"}],
)

[Link: https://pypi.org/project/llama-stack/]

 

llama.cpp + llama_swap

$ cat config.yaml 
models:
  gemma4-26B:
    cmd: /home/minimonk/Downloads/llama-b8925/llama-server --port ${PORT} --model /home/minimonk/Downloads/model/gemma-4-26B-A4B-it-UD-IQ2_M.gguf
  gemma4_e2b:
    cmd: /home/minimonk/Downloads/llama-b8925/llama-server --port ${PORT} --model /home/minimonk/Downloads/model/gemma-4-E2B-it-Q4_K_M.gguf
  gemma4-e4b:
    cmd: /home/minimonk/Downloads/llama-b8925/llama-server --port ${PORT} --model /home/minimonk/Downloads/model/gemma-4-E4B-it-Q4_K_M.gguf
  qwen3.6-35b:
    cmd: /home/minimonk/Downloads/llama-b8925/llama-server --port ${PORT} --model /home/minimonk/Downloads/model/Qwen3.6-35B-A3B-UD-Q2_K_XL.gguf
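The keys under `models:` are exactly the model ids the proxy accepts in a request's `model` field. A quick offline sanity check, as a minimal sketch over a trimmed copy of the config above (the paths are shortened here for readability):

```python
# The model ids llama-swap exposes are the two-space indented keys
# directly under "models:" in config.yaml.
config = """\
models:
  gemma4-26B:
    cmd: llama-server --port ${PORT} --model gemma-4-26B.gguf
  qwen3.6-35b:
    cmd: llama-server --port ${PORT} --model Qwen3.6-35B.gguf
"""

model_ids = [
    line.strip().rstrip(":")
    for line in config.splitlines()
    if line.startswith("  ") and not line.startswith("    ")
]
print(model_ids)  # ['gemma4-26B', 'qwen3.6-35b']
```

Requesting any of these ids makes llama-swap start (or swap to) the matching llama-server process.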

 

Drop the config above into place and run it. Oddly, it ends up listening on tcp6 no matter which way it's started (likely because Go opens a dual-stack IPv6 wildcard socket, so IPv4 clients are still accepted). Connections work fine either way, so moving on.

$ ./llama-swap --listen 0.0.0.0:8080
llama-swap listening on http://0.0.0.0:8080

 

$ ./llama-swap
llama-swap listening on http://:8080

 

$ netstat -tnlp
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 127.0.0.53:53           0.0.0.0:*               LISTEN      -                   
tcp        0      0 127.0.0.1:631           0.0.0.0:*               LISTEN      -                   
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      -                   
tcp6       0      0 :::8080                 :::*                    LISTEN      92882/./llama-swap  
tcp6       0      0 ::1:631                 :::*                    LISTEN      -                   
tcp6       0      0 :::22                   :::*                    LISTEN      -         

 

$ ./llama-swap --help
Usage of ./llama-swap:
  -config string
     config file name (default "config.yaml")
  -listen string
     listen ip/port
  -tls-cert-file string
     TLS certificate file
  -tls-key-file string
     TLS key file
  -version
     show version of build
  -watch-config
     Automatically reload config file on change

 

 

from openai import OpenAI

client = OpenAI(base_url="http://192.168.50.228:8080/v1", api_key="fake")
response = client.chat.completions.create(
    model="qwen3.6-35b",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response)

 

ChatCompletion(
    id="chatcmpl-oUJ7KngFXBitwpOpd7tG8ndeRdmzVh5l",
    choices=[
        Choice(
            finish_reason="stop",
            index=0,
            logprobs=None,
            message=ChatCompletionMessage(
                content="Hello! How can I help you today?",
                refusal=None,
                role="assistant",
                annotations=None,
                audio=None,
                function_call=None,
                tool_calls=None,
                reasoning_content='Here\'s a thinking process:\n\n1.  **Analyze User Input:** The user said "Hello"\n   - This is a standard greeting.\n   - No specific question or task is provided.\n\n2.  **Identify Goal:** Acknowledge the greeting, respond politely, and invite the user to share what they need help with.\n\n3.  **Determine Tone:** Friendly, professional, open-ended.\n\n4.  **Draft Response:** \n   "Hello! How can I help you today?"\n\n5.  **Refine (Self-Correction/Verification):** \n   - Is it appropriate? Yes.\n   - Is it concise? Yes.\n   - Does it encourage further interaction? Yes.\n   - Matches standard AI assistant behavior.\n\n   No changes needed.\n\n6.  **Final Output Generation:** Output the drafted response.✅\n',
            ),
        )
    ],
    created=1777431855,
    model="Qwen3.6-35B-A3B-UD-Q2_K_XL.gguf",
    object="chat.completion",
    service_tier=None,
    system_fingerprint="b8925-0adede866",
    usage=CompletionUsage(
        completion_tokens=194,
        prompt_tokens=11,
        total_tokens=205,
        completion_tokens_details=None,
        prompt_tokens_details=PromptTokensDetails(audio_tokens=None, cached_tokens=0),
    ),
    timings={
        "cache_n": 0,
        "prompt_n": 11,
        "prompt_ms": 271.25,
        "prompt_per_token_ms": 24.65909090909091,
        "prompt_per_second": 40.55299539170507,
        "predicted_n": 194,
        "predicted_ms": 6717.182,
        "predicted_per_token_ms": 34.62464948453608,
        "predicted_per_second": 28.88115879545917,
    },
)
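The `timings` block is a llama.cpp server extension, not part of the official OpenAI schema. Its throughput fields are simple derivations from the token counts and wall time, which is easy to verify against the numbers above:

```python
# predicted_per_second is just predicted_n divided by predicted_ms
# converted to seconds (values copied from the response above).
timings = {"predicted_n": 194, "predicted_ms": 6717.182}

tok_per_s = timings["predicted_n"] / (timings["predicted_ms"] / 1000.0)
print(round(tok_per_s, 2))  # 28.88
```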

 

from openai import OpenAI

client = OpenAI(base_url="http://192.168.50.228:8080/v1", api_key="fake")
response = client.chat.completions.create(
    model="gemma4-26B",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response)

 

ChatCompletion(
    id="chatcmpl-s6esFph2EBjs4bgtPB2wWHxfU79d2GOG",
    choices=[
        Choice(
            finish_reason="stop",
            index=0,
            logprobs=None,
            message=ChatCompletionMessage(
                content="Hello! How can I help you today?",
                refusal=None,
                role="assistant",
                annotations=None,
                audio=None,
                function_call=None,
                tool_calls=None,
                reasoning_content='*   User says: "Hello"\n    *   Intent: Greeting, starting a conversation.\n    *   Tone: Neutral/Friendly.\n\n    *   Acknowledge the greeting.\n    *   Offer assistance.\n    *   Maintain a helpful and polite tone.\n\n    *   "Hello! How can I help you today?" (Standard, efficient)\n    *   "Hi there! What\'s on your mind?" (Casual, friendly)\n    *   "Greetings! Is there anything specific you\'d like to discuss or learn about?" (Formal, structured)\n\n    *   "Hello! How can I help you today?" is the most versatile and appropriate response for an AI assistant.\n\n    *   "Hello! How can I help you today?"',
            ),
        )
    ],
    created=1777432320,
    model="gemma-4-26B-A4B-it-UD-IQ2_M.gguf",
    object="chat.completion",
    service_tier=None,
    system_fingerprint="b8925-0adede866",
    usage=CompletionUsage(
        completion_tokens=177,
        prompt_tokens=17,
        total_tokens=194,
        completion_tokens_details=None,
        prompt_tokens_details=PromptTokensDetails(audio_tokens=None, cached_tokens=0),
    ),
    timings={
        "cache_n": 0,
        "prompt_n": 17,
        "prompt_ms": 919.073,
        "prompt_per_token_ms": 54.063117647058824,
        "prompt_per_second": 18.496898505341793,
        "predicted_n": 177,
        "predicted_ms": 4819.842,
        "predicted_per_token_ms": 27.230745762711862,
        "predicted_per_second": 36.72319549064057,
    },
)

 

print(response.choices[0].message.reasoning_content)

 

print(response.choices[0].message.content)
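The two print calls above can be wrapped in a small helper that works whether or not the server returned `reasoning_content` (a sketch; `reasoning_content` is a llama.cpp server extension, so it is absent on the official OpenAI API and the code should fall back gracefully):

```python
def split_reasoning(message):
    """Return (reasoning, answer) from a chat message object or dict.

    reasoning_content is emitted by llama.cpp's server for reasoning
    models; treat a missing or None field as empty.
    """
    get = message.get if isinstance(message, dict) else lambda k, d=None: getattr(message, k, d)
    return get("reasoning_content") or "", get("content") or ""

# Example with a plain dict shaped like the responses above:
reasoning, answer = split_reasoning({
    "role": "assistant",
    "content": "Hello! How can I help you today?",
    "reasoning_content": "User greeted; respond politely.",
})
print(answer)  # Hello! How can I help you today?
```

With a real response, call it as `split_reasoning(response.choices[0].message)`.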