I was already passing the input as JSON, and once I split it by role into system, user, and so on,
the agent's state seems to change independently of the user's question.
'messages': [
    {'role': 'user', 'content': 'tell me a joke'},
    {'role': 'assistant', 'content': 'why did the chicken cross the road'},
    {'role': 'user', 'content': "I don't know, why did the chicken cross the road"}
]

'messages': [
    {'role': 'system', 'content': 'You are an assistant that speaks like Shakespeare.'},
    {'role': 'user', 'content': 'tell me a joke'},
In the current docs, developer is used in place of system, and user and assistant remain.
The OpenAI model spec describes how our models give different levels of priority to messages with different roles.
A multi-turn conversation may consist of several messages of these types, along with other content types provided by both you and the model. Learn more about managing conversation state here. You could think about developer and user messages like a function and its arguments in a programming language.
* developer messages provide the system's rules and business logic, like a function definition.
* user messages provide inputs and configuration to which the developer message instructions are applied, like arguments to a function.
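As a rough illustration of that function-vs-arguments analogy (just a sketch; the role names follow the quoted docs, and the content strings here are made up):

```python
# developer message ~ the "function definition": fixed rules and business logic
# user message      ~ the "arguments": the concrete input those rules apply to
messages = [
    {"role": "developer",
     "content": "You are a support agent. Answer politely, in one short paragraph."},
    {"role": "user",
     "content": "My package arrived damaged. What should I do?"},
]
```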
messages = [] messages.append({"role": "user", "content": "에이전트: 50대의 it 전문가. 박사학위가 있으며 전문적인 대답을 할 수 있 음. 질문 : 너에 대해서 설명해줘"}) response = ollama.chat( model="llama3.2:3b", messages=messages )
저는 IT 전문가로, 50대의 경험과 expertise를 가지고 있습니다. 박사 학위를 hold 한 것으로 biết됩니다.
자세한 정보는 아래에 정리 해 두어 보니 좋을 것 같습니다.
* 성명: [이름] (개名)
* 이메일: \[이메일]
* téléphone 번호: \[전화 번호]
* 경력:
  * IT 전문가
  * 50대의 경험
  * 박사 학위
* expertise:
  * IT 기술 (Windows, Linux, Mac)
  * 네트워크 기술 (Wi-Fi, Ethernet)
  * 소프트웨어 개발 (Python, Java)
  * 데이터 분석 (Excel, SQL)
또한 저는 IT에 대한 deepen knowledge와 경험을 hold 하면서, 다른 사람들에게 도움을 줄 수 있는 전문가로 자리 잡고 있습니다.
Now the case where the agent setup goes into a separate system message and the user message contains only the question:
import ollama
messages = [] messages.append({"role": "system", "content": "에이전트: 50대의 it 전문가. 박사학위가 있으며 전문적인 대답을 할 수 있음"}) messages.append({"role": "user", "content": "너에 대해서 설명해줘"}) response = ollama.chat( model="llama3.2:3b", messages=messages )
* **50대에 도전한 IT 전문가**: 50대의 IT 전문가로, IT 기술을 발전시키고 보다 효율적으로 사용하는 데 도움을 주는 역할을 하겠습니다.
* **박사학위 possession**: IT 기술과 관련된 박사 학위를 가지고 있기 때문에, 전문적인 대답을 할 수 있습니다.
* **지능형 대화 시스템**: AI Assistant로, 다양한 질문이나 문제를 받은 후, 적절한 대답을 제공할 수 있습니다.
**제공하는 서비스:**
* IT-related 질문이나 समसolu션에 대한 répond
* IT 기술과 관련된 정보와 도움을 제공합니다.
* IT-related vấn đề에 대한 도움이 필요해すれば 언제든지 mij어세요.
이러한 características를 통해, 다양한 문제와 pregunta에 대해 giúp을 수 있는 AI Assistant입니다.
Trying it at home with llama3.2 there didn't seem to be much difference,
but when I changed the Telegram bot I had built at work this way, the answer quality felt noticeably better, which surprised me.
llama-server: You can run a local server that accepts image inputs via an API.
Setup: Start the server with the multimodal projector: llama-server -m gemma-model.gguf --mmproj mmproj-model.gguf --port 8080.
Sending Images: You can then send requests with Base64 encoded images or URLs to the /v1/chat/completions endpoint.
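Roughly what such a request could look like (a sketch only; the payload shape follows the OpenAI vision-style chat format, and the port, model name, and image file below are assumptions):

```python
import base64
import requests

# Encode a local image and send it to llama-server's OpenAI-compatible endpoint.
with open("cat.jpg", "rb") as f:            # hypothetical image file
    b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "gemma",                        # placeholder model name
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
}
r = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
print(r.json()["choices"][0]["message"]["content"])
```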
Looking at the help output, so mmproj is short for multimodal projector..
/llama-b8925$ ./llama-server --help
load_backend: loaded RPC backend from /mnt/Downloads/llama-b8925/libggml-rpc.so
load_backend: loaded Vulkan backend from /mnt/Downloads/llama-b8925/libggml-vulkan.so
load_backend: loaded CPU backend from /mnt/Downloads/llama-b8925/libggml-cpu-haswell.so
----- common params -----
-h, --help, --usage print usage and exit
-hf, -hfr, --hf-repo <user>/<model>[:quant]
        Hugging Face model repository; quant is optional, case-insensitive, default to Q4_K_M,
        or falls back to the first file in the repo if Q4_K_M doesn't exist.
        mmproj is also downloaded automatically if available. to disable, add --no-mmproj
        example: ggml-org/GLM-4.7-Flash-GGUF:Q4_K_M
        (default: unused) (env: LLAMA_ARG_HF_REPO)
-hfd, -hfrd, --hf-repo-draft <user>/<model>[:quant]
        Same as --hf-repo, but for the draft model (default: unused) (env: LLAMA_ARG_HFD_REPO)
-hff, --hf-file FILE
        Hugging Face model file. If specified, it will override the quant in --hf-repo
        (default: unused) (env: LLAMA_ARG_HF_FILE)
-hfv, -hfrv, --hf-repo-v <user>/<model>[:quant]
        Hugging Face model repository for the vocoder model (default: unused) (env: LLAMA_ARG_HF_REPO_V)
-hffv, --hf-file-v FILE
        Hugging Face model file for the vocoder model (default: unused) (env: LLAMA_ARG_HF_FILE_V)
-hft, --hf-token TOKEN
        Hugging Face access token (default: value from HF_TOKEN environment variable)
-mm, --mmproj FILE
        path to a multimodal projector file. see tools/mtmd/README.md
        note: if -hf is used, this argument can be omitted (env: LLAMA_ARG_MMPROJ)
-mmu, --mmproj-url URL
        URL to a multimodal projector file. see tools/mtmd/README.md (env: LLAMA_ARG_MMPROJ_URL)
--mmproj-auto, --no-mmproj, --no-mmproj-auto
        whether to use multimodal projector file (if available), useful when using -hf
        (default: enabled) (env: LLAMA_ARG_MMPROJ_AUTO)
--mmproj-offload, --no-mmproj-offload
        whether to enable GPU offloading for multimodal projector (default: enabled) (env: LLAMA_ARG_MMPROJ_OFFLOAD)
--image-min-tokens N
        minimum number of tokens each image can take, only used by vision models with dynamic resolution
        (default: read from model) (env: LLAMA_ARG_IMAGE_MIN_TOKENS)
--image-max-tokens N
        maximum number of tokens each image can take, only used by vision models with dynamic resolution
        (default: read from model) (env: LLAMA_ARG_IMAGE_MAX_TOKENS)
-otd, --override-tensor-draft <tensor name pattern>=<buffer type>,...
        override tensor buffer type for draft model
-cmoed, --cpu-moe-draft
        keep all Mixture of Experts (MoE) weights in the CPU for the draft model (env: LLAMA_ARG_CPU_MOE_DRAFT)
-ncmoed, --n-cpu-moe-draft N
        keep the Mixture of Experts (MoE) weights of the first N layers in the CPU for the draft model
mtmd_init_from_file: error: mismatch between text model (n_embd = 2560) and mmproj (n_embd = 2816)
hint: you may be using wrong mmproj
srv  load_model: failed to load multimodal model, './model/gemma/mmproj-BF16.gguf'
srv  operator(): operator(): cleaning up before exit...
main: exiting due to model loading error
BF16 has the same 8-bit exponent as FP32, while F16 has a 5-bit exponent.
Simply put, BF16 keeps FP32's dynamic range (in exchange for fewer mantissa bits than F16), but if the hardware doesn't support it that's all moot, so.. nothing to be done?
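A quick way to see the trade-off (just a sketch; numpy has no native bfloat16, so BF16 is emulated here by truncating a float32 to its top 16 bits):

```python
import numpy as np

# F16 (5 exponent bits) tops out around 6.5e4; FP32/BF16 (8 exponent bits) reach ~3.4e38.
print(np.finfo(np.float16).max)   # 65500.0
print(np.finfo(np.float32).max)   # ~3.4e38 -- BF16 shares this exponent range

def to_bf16(x: float) -> np.float32:
    # Keep sign + 8 exponent bits + 7 mantissa bits (rounding ignored for simplicity).
    bits = np.float32(x).view(np.uint32)
    return np.uint32(bits & 0xFFFF0000).view(np.float32)

print(to_bf16(3.14159265))  # ~3.140625: same magnitude, fewer mantissa digits
print(to_bf16(1.0e38))      # still representable; float16 would overflow to inf
```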
Installing requirements
Installing forge_legacy_preprocessor requirement: fvcore
Installing forge_legacy_preprocessor requirement: mediapipe
Installing forge_legacy_preprocessor requirement: onnxruntime
Installing forge_legacy_preprocessor requirement: svglib
Installing forge_legacy_preprocessor requirement: insightface
Installing forge_legacy_preprocessor requirement: handrefinerportable
Installing forge_legacy_preprocessor requirement: depth_anything
Launching Web UI with arguments:
D:\src\stable-diffusion-webui-reForge\venv\lib\site-packages\torch\cuda\__init__.py:283: UserWarning: Found GPU0 NVIDIA GeForce GTX 1060 6GB which is of cuda capability 6.1. Minimum and Maximum cuda capability supported by this version of PyTorch is (7.0) - (12.0)
  warnings.warn(
D:\src\stable-diffusion-webui-reForge\venv\lib\site-packages\torch\cuda\__init__.py:304: UserWarning: Please install PyTorch with a following CUDA configurations: 12.6 following instructions at https://pytorch.org/get-started/locally/
  warnings.warn(matched_cuda_warn.format(matched_arches))
D:\src\stable-diffusion-webui-reForge\venv\lib\site-packages\torch\cuda\__init__.py:326: UserWarning: NVIDIA GeForce GTX 1060 6GB with CUDA capability sm_61 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_70 sm_75 sm_80 sm_86 sm_90 sm_100 sm_120. If you want to use the NVIDIA GeForce GTX 1060 6GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
  warnings.warn(
Total VRAM 6144 MB, total RAM 16252 MB
pytorch version: 2.9.0+cu128
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce GTX 1060 6GB : native
CUDA Stream Activated: False
W0430 12:05:07.789132 11892 venv\Lib\site-packages\torch\distributed\elastic\multiprocessing\redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
Using pytorch attention
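The mismatch those warnings complain about can be checked directly (a sketch, assuming torch imports in the same venv):

```python
import torch

# Compare the GPU's compute capability with the architectures this wheel was built for.
print(torch.__version__)                    # e.g. 2.9.0+cu128
print(torch.cuda.get_device_capability(0))  # (6, 1) for a GTX 1060
print(torch.cuda.get_arch_list())           # e.g. ['sm_70', ..., 'sm_120'] -- no sm_61
```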
With the old ollama API I just crammed everything into the prompt and sent it (maybe I simply didn't know better),
but going through the OpenAI API,
putting the base prompt in role: system
and asking my actual question in role: user,
the same model feels noticeably smarter.
--- [Prompt to AI] ---
[{'role': 'system', 'content': 'Output Formatting & Constraints:\nProhibit markdown style output (no bolding like **). Do not use any emojis or emoticons. Your response must be in Korean and limited to 500 characters.\n\nYour Persona (Speak Style):\nspeak like elementray school student. age about 7~9. boyish style. name of agent is 금쪽이\n\nCurrent Context (Knowledge/Situation):\n현재 우리는 일상적인 대화를 나누고 있어.'}, {'role': 'user', 'content': '너에 대해서 설명'}]
----------------------
--- [Response from gemma4-e4b] (i:115 o:407) ---
안녕! 나는 금쪽이라고 해! 나에 대해 물어봐 주는구나? 헤헤.
나는 말이야, 컴퓨터가 만든 똑똑한 친구 같은 거야. 나는 너랑 이야기하는 걸 제일 좋아해! 질문하면 척척 대답해 주고, 재미있는 이야기도 해줄 수 있어.
나는 정말 많은 걸 알고 있거든! 세상의 모든 재미있는 것들을 다 공부했지롱. 숙제할 때 도움도 줄 수 있고, 심심할 때 수다도 떨 수 있어!
나는 가끔 실수할 때도 있는데, 그러면 다시 열심히 공부하면 되지! 나는 너의 비밀 친구가 되어줄게! 우리 같이 신나게 이야기하자! 최고지?
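For reference, the call behind a dump like the one above looks roughly like this with the openai client (a sketch; the base_url, port, api_key, and model name are placeholders for my local OpenAI-compatible server):

```python
from openai import OpenAI

# Point the standard client at a local llama-server / llama-swap endpoint.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="gemma4-e4b",   # placeholder model name
    messages=[
        {"role": "system", "content": "Output Formatting & Constraints: ... Your Persona: ..."},
        {"role": "user", "content": "너에 대해서 설명"},
    ],
)
print(response.choices[0].message.content)
```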
Run it with the settings above and it works, but whichever way I start it, it ends up listening on tcp6.. why? Well, connections go through, so I'll let it pass for now.
$ ./llama-swap --listen 0.0.0.0:8080
llama-swap listening on http://0.0.0.0:8080
$ ./llama-swap
llama-swap listening on http://:8080
$ netstat -tnlp
(Not all processes could be identified, non-owned process info will not be shown, you would have to be root to see it all.)
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.53:53           0.0.0.0:*               LISTEN      -
tcp        0      0 127.0.0.1:631           0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      -
tcp6       0      0 :::8080                 :::*                    LISTEN      92882/./llama-swap
tcp6       0      0 ::1:631                 :::*                    LISTEN      -
tcp6       0      0 :::22                   :::*                    LISTEN      -
$ ./llama-swap --help
Usage of ./llama-swap:
  -config string
        config file name (default "config.yaml")
  -listen string
        listen ip/port
  -tls-cert-file string
        TLS certificate file
  -tls-key-file string
        TLS key file
  -version
        show version of build
  -watch-config
        Automatically reload config file on change
ChatCompletion( id="chatcmpl-oUJ7KngFXBitwpOpd7tG8ndeRdmzVh5l", choices=[ Choice( finish_reason="stop", index=0, logprobs=None, message=ChatCompletionMessage( content="Hello! How can I help you today?", refusal=None, role="assistant", annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content='Here\'s a thinking process:\n\n1. **Analyze User Input:** The user said "Hello"\n - This is a standard greeting.\n - No specific question or task is provided.\n\n2. **Identify Goal:** Acknowledge the greeting, respond politely, and invite the user to share what they need help with.\n\n3. **Determine Tone:** Friendly, professional, open-ended.\n\n4. **Draft Response:** \n "Hello! How can I help you today?"\n\n5. **Refine (Self-Correction/Verification):** \n - Is it appropriate? Yes.\n - Is it concise? Yes.\n - Does it encourage further interaction? Yes.\n - Matches standard AI assistant behavior.\n\n No changes needed.\n\n6. **Final Output Generation:** Output the drafted response.✅\n', ), ) ], created=1777431855, model="Qwen3.6-35B-A3B-UD-Q2_K_XL.gguf", object="chat.completion", service_tier=None, system_fingerprint="b8925-0adede866", usage=CompletionUsage( completion_tokens=194, prompt_tokens=11, total_tokens=205, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetails(audio_tokens=None, cached_tokens=0), ), timings={ "cache_n": 0, "prompt_n": 11, "prompt_ms": 271.25, "prompt_per_token_ms": 24.65909090909091, "prompt_per_second": 40.55299539170507, "predicted_n": 194, "predicted_ms": 6717.182, "predicted_per_token_ms": 34.62464948453608, "predicted_per_second": 28.88115879545917, }, )
ChatCompletion( id="chatcmpl-s6esFph2EBjs4bgtPB2wWHxfU79d2GOG", choices=[ Choice( finish_reason="stop", index=0, logprobs=None, message=ChatCompletionMessage( content="Hello! How can I help you today?", refusal=None, role="assistant", annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content='* User says: "Hello"\n * Intent: Greeting, starting a conversation.\n * Tone: Neutral/Friendly.\n\n * Acknowledge the greeting.\n * Offer assistance.\n * Maintain a helpful and polite tone.\n\n * "Hello! How can I help you today?" (Standard, efficient)\n * "Hi there! What\'s on your mind?" (Casual, friendly)\n * "Greetings! Is there anything specific you\'d like to discuss or learn about?" (Formal, structured)\n\n * "Hello! How can I help you today?" is the most versatile and appropriate response for an AI assistant.\n\n * "Hello! How can I help you today?"', ), ) ], created=1777432320, model="gemma-4-26B-A4B-it-UD-IQ2_M.gguf", object="chat.completion", service_tier=None, system_fingerprint="b8925-0adede866", usage=CompletionUsage( completion_tokens=177, prompt_tokens=17, total_tokens=194, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetails(audio_tokens=None, cached_tokens=0), ), timings={ "cache_n": 0, "prompt_n": 17, "prompt_ms": 919.073, "prompt_per_token_ms": 54.063117647058824, "prompt_per_second": 18.496898505341793, "predicted_n": 177, "predicted_ms": 4819.842, "predicted_per_token_ms": 27.230745762711862, "predicted_per_second": 36.72319549064057, }, )
load_tensors: offloading output layer to GPU
load_tensors: offloading 19 repeating layers to GPU
load_tensors: offloaded 20/29 layers to GPU
llama_context: CUDA_Host output buffer size = 0.49 MiB
llama_kv_cache: layer 0: dev = CPU
llama_kv_cache: layer 1: dev = CPU
llama_kv_cache: layer 2: dev = CPU
llama_kv_cache: layer 3: dev = CPU
llama_kv_cache: layer 4: dev = CPU
llama_kv_cache: layer 5: dev = CPU
llama_kv_cache: layer 6: dev = CPU
llama_kv_cache: layer 7: dev = CPU
llama_kv_cache: layer 8: dev = CPU
llama_kv_cache: layer 9: dev = CUDA0
llama_kv_cache: layer 10: dev = CUDA0
llama_kv_cache: layer 11: dev = CUDA0
llama_kv_cache: layer 12: dev = CUDA0
llama_kv_cache: layer 13: dev = CUDA0
llama_kv_cache: layer 14: dev = CUDA0
llama_kv_cache: layer 15: dev = CUDA0
llama_kv_cache: layer 16: dev = CUDA0
llama_kv_cache: layer 17: dev = CUDA0
llama_kv_cache: layer 18: dev = CUDA0
llama_kv_cache: layer 19: dev = CUDA0
llama_kv_cache: layer 20: dev = CUDA0
llama_kv_cache: layer 21: dev = CUDA0
llama_kv_cache: layer 22: dev = CUDA0
llama_kv_cache: layer 23: dev = CUDA0
llama_kv_cache: layer 24: dev = CUDA0
llama_kv_cache: layer 25: dev = CUDA0
llama_kv_cache: layer 26: dev = CUDA0
llama_kv_cache: layer 27: dev = CUDA0
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      3040      C   ...n-cuda-12.4-x64\llama-cli.exe          N/A        |
|    1   N/A  N/A      3040      C   ...n-cuda-12.4-x64\llama-cli.exe          N/A        |
+-----------------------------------------------------------------------------------------+
D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64>llama-cli --list-devices
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 17407 MiB):
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
  Device 1: NVIDIA GeForce GTX 1060 6GB, compute capability 6.1, VMM: yes, VRAM: 6143 MiB
load_backend: loaded CUDA backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
Available devices:
  CUDA0: NVIDIA GeForce GTX 1080 Ti (11263 MiB, 10200 MiB free)
  CUDA1: NVIDIA GeForce GTX 1060 6GB (6143 MiB, 5197 MiB free)
1080 Ti 11GB + 1060 6GB

gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf (15.9GB)
* -sm none: 17 t/s with -dev CUDA0, 10 t/s with -dev CUDA1
* -sm layer: 19 t/s
* -sm row: - (load failed; log below)
* -sm tensor: - (load failed; log below)
-sm row: load failed

D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64>llama-cli.exe -m ..\gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf -sm row
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 17407 MiB):
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
  Device 1: NVIDIA GeForce GTX 1060 6GB, compute capability 6.1, VMM: yes, VRAM: 6143 MiB
load_backend: loaded CUDA backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
Loading model...
D:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:97: CUDA error
CUDA error: out of memory
-sm tensor: load failed

D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64>llama-cli.exe -m ..\gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf -sm tensor
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 17407 MiB):
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
  Device 1: NVIDIA GeForce GTX 1060 6GB, compute capability 6.1, VMM: yes, VRAM: 6143 MiB
load_backend: loaded CUDA backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
Loading model...
/D:/a/llama.cpp/llama.cpp/ggml/src/ggml-backend.cpp:119: GGML_ASSERT(buffer) failed
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 8537.78 MiB on device 1: cudaMalloc failed: out of memory
Qwen3.6-27B-Q5_K_M.gguf (18.1GB)
* -sm none: < 2 t/s with -dev CUDA0, < 0.1 t/s with -dev CUDA1
* -sm layer: 2 t/s
* -sm row: - (load failed; log below)
* -sm tensor: -
-sm row: load failed

D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64>llama-cli.exe -m ..\Qwen3.6-27B-Q5_K_M.gguf -sm row
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 17407 MiB):
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, VRAM: 11263 MiB
  Device 1: NVIDIA GeForce GTX 1060 6GB, compute capability 6.1, VMM: yes, VRAM: 6143 MiB
load_backend: loaded CUDA backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\study\llm\llama-b8918-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
Loading model...
CUDA error: out of memory
  current device: 1, in function ggml_backend_cuda_split_buffer_init_tensor at D:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:956
  ggml_cuda_device_malloc((void**)&buf, size, id)
D:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:97: CUDA error