Requesting just the URL (a plain GET with no body) returns 404; a web browser gets the same 404.

$ curl http://192.168.40.238:8080/v1/chat/completions -i
HTTP/1.1 404 Not Found
Date: Wed, 10 Jun 2026 04:59:30 GMT
Content-Length: 0

 

A request naming a model that isn't configured is refused as well (but at least there's a response this time!)

$ curl http://192.168.40.238:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer my-api-key" \
  -d '{
    "model": "qwen-coder",
    "messages": [
      {
        "role": "user",
        "content": "Python으로 HTTP 서버 예제를 작성해줘."
      }
    ]
  }'
could not find suitable inference handler for qwen-coder

 

Picky thing: it only answers when a proper request comes in.

$ curl http://192.168.40.238:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer my-api-key" \
  -d '{
    "model": "gemma4-e4b",
    "messages": [
      {
        "role": "user",
        "content": "Python으로 HTTP 서버 예제를 작성해줘."
      }
    ]
  }'

{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"Python은 표준 라이브러리만으로도 매우 빠르고 간단하게 HTTP 서버를 만들 수 있게 해줍니다.\n\n요구 사항에 따라 두 가지 예제를 제공해 드립니다.\n\n1. **가장 간단한 예제 (파일 서빙):** Python의 내장 `http.server` 모듈을 사용하는 방법. (가장 빠르게 서버를 띄우는 방법)\n2. **커스터마이징 예제 (맞춤 응답):** `BaseHTTPRequestHandler`를 상속받아 GET 요청 등에 대해 직접 응답 내용을 정의하는 방법. (실제 웹 개발의 기본 원리 학습)\n\n---\n\n## 💡 예제 1: 가장 간단한 HTTP 서버 (파일 서빙)\n\n이 방법은 실제 웹 프레임워크를 사용하지 않고, 현재 디렉터리에 있는 파일들을 웹에서 불러올 때(서빙할 때) 사용됩니다. 서버를 띄우는 코드를 작성할 필요 없이, 명령어 한 줄로 가능합니다.\n\n### 💻 코드 (실행 명령어)\n\n```bash\n# Python 3 환경에서 실행\npython3 -m http.server 8000\n```\n\n### 🚀 실행 방법\n\n1. 서버로 사용할 파일들(HTML, 이미지 등)이 있는 디렉터리로 이동합니다.\n2. 위의 명령어를 실행합니다.\n3. 브라우저를 열고 `http://localhost:8000` 에 접속하면 해당 디렉터리의 파일 목록이 보입니다.\n\n**✅ 장점:** 코드가 거의 필요 없고, 즉시 작동합니다.\n**❌ 단점:** 특정 URL 요청에 대해 복잡한 비즈니스 로직을 처리할 수 없습니다.\n\n---\n\n## 🛠️ 예제 2: 맞춤 응답을 처리하는 HTTP 서버 (BaseHTTPRequestHandler 사용)\n\n만약 특정 URL(`http://localhost:8000/hello`)에 접속했을 때, **사용자 정의 메시지**를 반환하고 싶다면 이 방식을 사용해야 합니다.\n\n이 코드는 `http.server` 모듈의 기본 핸들러 클래스를 상속받아, `do_GET` 메소드를 오버라이드(재정의)하여 우리가 원하는 응답을 직접 구성합니다.\n\n### 💻 코드 (`custom_server.py`)\n\n```python\nimport http.server\nimport socketserver\n\n# 사용할 포트 지정\nPORT = 8000\n\nclass CustomHTTPRequestHandler(http.server.BaseHTTPRequestHandler):\n    \"\"\"\n    HTTP 요청(GET, POST 등)을 처리하는 사용자 정의 핸들러 클래스입니다.\n    \"\"\"\n    \n    def do_GET(self):\n        \"\"\"\n        GET 요청이 들어왔을 때 실행되는 메서드입니다.\n        \"\"\"\n        print(f\"--- GET 요청 수신: {self.path} ---\")\n        \n        # 1. 특정 경로에 대한 응답 처리\n        if self.path == '/hello':\n            # 응답할 콘텐츠\n            response_content = \"안녕하세요! 
이것은 커스텀 서버가 보낸 응답입니다.\"\n            \n            # HTTP 응답 헤더 설정 (상태 코드: 200 OK, Content-Type: text/plain)\n            self.send_response(200)\n            self.send_header(\"Content-type\", \"text/plain\")\n            self.send_header(\"Content-Length\", str(len(response_content)))\n            self.end_headers()\n            \n            # 응답 본문(Body) 전송\n            self.wfile.write(bytes(response_content, \"utf-8\"))\n            \n        # 2. 기본 경로 (/)에 대한 응답 처리\n        elif self.path == '/':\n            response_content = \"환영합니다! /hello 경로로 접속해 보세요.\"\n            self.send_response(200)\n            self.send_header(\"Content-type\", \"text/html\")\n            self.end_headers()\n            self.wfile.write(bytes(response_content, \"utf-8\"))\n            \n        # 3. 정의되지 않은 경로에 대한 응답 (404 Not Found)\n        else:\n            self.send_response(404)\n            self.send_header(\"Content-type\", \"text/plain\")\n            self.end_headers()\n            self.wfile.write(bytes(\"404 Not Found\", \"utf-8\"))\n\n\n# 서버를 띄우는 설정\nHandler = CustomHTTPRequestHandler\n\nwith socketserver.TCPServer((\"\", PORT), Handler) as httpd:\n    print(f\"*** 서버가 http://localhost:{PORT} 에서 실행 중입니다. ***\")\n    print(\"테스트 경로: http://localhost:8000/hello\")\n    print(\"---------------------------------------------------\")\n    try:\n        httpd.serve_forever()\n    except KeyboardInterrupt:\n        print(\"\\n서버를 종료합니다.\")\n        httpd.server_close()\n```\n\n### 🚀 실행 방법\n\n1. 위 코드를 `custom_server.py` 파일로 저장합니다.\n2. 터미널에서 다음 명령어를 실행합니다.\n   ```bash\n   python3 custom_server.py\n   ```\n3. **테스트:**\n   * `http://localhost:8000/` 에 접속하면 \"환영합니다!\"가 보입니다.\n   * `http://localhost:8000/hello` 에 접속하면 \"안녕하세요! 
이것은 커스텀 서버가 보낸 응답입니다.\"가 보입니다.\n\n### ✨ 동작 원리 해설\n\n* **`socketserver`**: 네트워크 소켓을 관리하고 클라이언트 연결을 받아주는 역할을 합니다.\n* **`BaseHTTPRequestHandler`**: HTTP 요청을 파싱하고 응답을 보낼 수 있는 기본 템플릿을 제공합니다.\n* **`do_GET(self)`**: 클라이언트가 `GET` 요청을 보냈을 때 이 함수가 실행됩니다.\n* **`self.send_response(200)`**: 클라이언트에게 응답 상태 코드(200=성공, 404=찾을 수 없음 등)를 보냅니다.\n* **`self.send_header(...)`**: 응답의 종류(Content-type)와 크기(Content-Length) 등의 메타데이터를 보냅니다.\n* **`self.wfile.write(...)`**: 실제 웹 브라우저가 읽을 수 있는 응답 내용(HTML, 텍스트 등)을 전송합니다.\n\n---\n\n## 📚 요약 및 추천\n\n| 구분 | 예제 1: `http.server` | 예제 2: `BaseHTTPRequestHandler` |\n| :--- | :--- | :--- |\n| **용도** | 정적 파일(Static File) 서빙 | 특정 URL에 대한 동적 응답 처리 |\n| **난이도** | ⭐ (매우 쉬움) | ⭐⭐ (중간) |\n| **특징** | 코드 없이 명령어 실행 | HTTP 프로토콜을 직접 제어 |\n| **추천 상황** | 간단한 테스트 서버, 파일 배포 | **웹 개발의 기초 원리 학습** |\n| **실무 사용** | 파일 공유 시 | **Flask/Django 같은 프레임워크 학습의 출발점** |\n\n**👉 만약 실제 서비스를 만들고 싶다면:** 예제 2로 기초 원리를 이해하신 후, **Flask**나 **FastAPI** 같은 경량 프레임워크를 사용하시는 것을 강력히 추천드립니다. 이 프레임워크들이 위의 복잡한 `do_GET` 코드를 훨씬 간결하게 처리해 줍니다.","reasoning_content":"Here's a thinking process that leads to the suggested response:\n\n1.  **Analyze the Request:** The user wants an example of an HTTP server written in Python.\n\n2.  **Determine the Scope/Level of the Example:**\n    *   *Option 1: Simple, built-in module (`http.server`).* This is the easiest and fastest way to fulfill the request for a *basic* example. It demonstrates functionality without complex code.\n    *   *Option 2: Using the `socket` module.* This is much harder, requires handling HTTP protocols manually, and is overkill for a standard \"example.\"\n    *   *Option 3: Using a framework (Flask/Django).* While common in production, this isn't a *pure* Python HTTP server example using standard libraries, and it might confuse the user if they just want to see how the basic protocol works.\n\n3.  
**Select the Best Approach (Option 1: `http.server` and Customization):**\n    *   The best approach is to first provide the simplest, runnable solution using `http.server`.\n    *   Then, provide a slightly more advanced example (inheriting from `BaseHTTPRequestHandler`) to show how one would customize the handling of different HTTP methods (GET, POST, etc.).\n\n4.  **Draft the Code (Simple Server):**\n    *   Need to import `http.server` and `socketserver`.\n    *   Need to define the port (e.g., 8000).\n    *   Need to run the server. (This is the quickest way to run a basic server.)\n\n5.  **Draft the Code (Customized Server - The \"Educational\" Example):**\n    *   Need to import `http.server` and `BaseHTTPRequestHandler`.\n    *   Define a class (e.g., `SimpleHTTPRequestHandler`).\n    *   Override the `do_GET` method.\n    *   In `do_GET`, write the HTTP response structure (Status Line, Headers, Body).\n    *   Handle the server startup using `http.server.HTTPServer`.\n\n6.  **Review and Refine the Explanation (Crucial Step):**\n    *   The explanation must be clear, easy to follow, and explain *why* the code works.\n    *   *Add Instructions:* How to save, how to run, and how to test (using a browser or `curl`).\n    *   *Structure:* Present the simple example first, followed by the advanced/customizable example.\n\n7.  **Final Polish (Self-Correction during drafting):** Make sure the custom server example *properly* sends the Content-Type and Content-Length so the browser displays it correctly. 
(The provided code handles this.)\n\n**(The resulting output matches the structured response provided below.)**"}}],"created":1781067647,"model":"gemma-4-E4B-it-Q4_K_M.gguf","system_fingerprint":"b9553-9e3b928fd","object":"chat.completion","usage":{"completion_tokens":2288,"prompt_tokens":26,"total_tokens":2314,"prompt_tokens_details":{"cached_tokens":6}},"id":"chatcmpl-Mbe6PZU7iilj6DVEI59YH4eoDksRPZXj","timings":{"cache_n":6,"prompt_n":20,"prompt_ms":328.529,"prompt_per_token_ms":16.42645,"prompt_per_second":60.8774263459238,"predicted_n":2288,"predicted_ms":40048.23,"predicted_per_token_ms":17.50359702797203,"predicted_per_second":57.13111415910266}}
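The same call can be sketched in Python with only the standard library. This is a minimal sketch, not the server's official client: the helper names `build_chat_request`, `extract_reply`, and `ask` are made up for illustration, and the `reasoning_content` field is read only because it appears in the response above.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, api_key="my-api-key"):
    """Build (without sending) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def extract_reply(body):
    """Return (content, reasoning_content) from a chat.completion JSON body."""
    message = json.loads(body)["choices"][0]["message"]
    # reasoning_content may be absent, so fall back to None.
    return message["content"], message.get("reasoning_content")


def ask(base_url, model, prompt, timeout=120):
    """Send the request; urlopen raises HTTPError for HTTP error statuses."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(response.read().decode("utf-8"))
```

Calling `ask("http://192.168.40.238:8080", "gemma4-e4b", "...")` would correspond to the last curl command; the answer and the model's reasoning come back as separate strings.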

Posted by 구차니

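I've already tried a system prompt (role: system) through the web UI; now I'm checking whether the CLI can do it too.

It can be read from a file (-sysf) or passed up front (-sys), but apparently it can't be changed at runtime?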

llama-b9553$ ./llama-cli --help | grep -i sys
-p,    --prompt PROMPT                  prompt to start generation with; for system message, use -sys
--mlock                                 force system to keep model in RAM rather than swapping or compressing
--numa TYPE                             attempt optimizations that help on some NUMA systems
                                        if run without this previously, it is recommended to drop the system
-sys,  --system-prompt PROMPT           system prompt to use with model (if applicable, depending on chat
-sysf, --system-prompt-file FNAME       a file containing the system prompt (default: none)
                                        hunyuan-vl, kimi-k2, llama2, llama2-sys, llama2-sys-bos,
                                        llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1,
                                        hunyuan-vl, kimi-k2, llama2, llama2-sys, llama2-sys-bos,
                                        llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1,

 

llama-b9553$ ./llama-cli --help | grep -i prom
-tb,   --threads-batch N                number of threads to use during batch and prompt processing (default:
-c,    --ctx-size N                     size of the prompt context (default: 0, 0 = loaded from model)
--keep N                                number of tokens to keep from the initial prompt (default: 0, -1 =
-p,    --prompt PROMPT                  prompt to start generation with; for system message, use -sys
-f,    --file FNAME                     a file containing the prompt (default: none)
-bf,   --binary-file FNAME              binary file containing the prompt (default: none)
                                        number of threads to use during batch and prompt processing (default:
--verbose-prompt                        print a verbose prompt before generation (default: false)
--display-prompt, --no-display-prompt   whether to print prompt at generation (default: true)
-co,   --color [on|off|auto]            Colorize output to distinguish prompt and user input from generations
-sys,  --system-prompt PROMPT           system prompt to use with model (if applicable, depending on chat
-sysf, --system-prompt-file FNAME       a file containing the system prompt (default: none)
-r,    --reverse-prompt PROMPT          halt generation at PROMPT, return control in interactive mode
                                        will not be interactive if first turn is predefined with --prompt
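As a small sketch of how the two options found above could be wired into a launcher script: the function `llama_cli_args` is hypothetical, it only uses the `-m`, `-p`, `-sys`, and `-sysf` flags that appear in this post's help output and commands, and rejecting the combination of `-sys` and `-sysf` is my own choice rather than documented llama-cli behavior.

```python
import shlex


def llama_cli_args(model, system_prompt=None, system_prompt_file=None,
                   prompt=None, binary="./llama-cli"):
    """Assemble an argv list for llama-cli with an optional system prompt."""
    if system_prompt is not None and system_prompt_file is not None:
        # Assumption: treat the two sources as mutually exclusive.
        raise ValueError("pass either system_prompt or system_prompt_file")
    args = [binary, "-m", model]
    if system_prompt is not None:
        args += ["-sys", system_prompt]
    if system_prompt_file is not None:
        args += ["-sysf", system_prompt_file]
    if prompt is not None:
        args += ["-p", prompt]
    return args


# Print a copy-pasteable shell command instead of running anything.
print(shlex.join(llama_cli_args("model.gguf",
                                system_prompt="You are a terse assistant.")))
```

The list can be handed to `subprocess.run` as-is, which avoids shell quoting problems with prompts that contain spaces or quotes.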

 

Posted by 구차니
Program usage/gcc, 2026. 6. 10. 12:06

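gcc's -MD means something different from MSVC's /MD option, which selects the DLL runtime library (Claude fooled me on this one!)

Should I just call it a build-time helper flag for debugging..?

It writes the files each source file depends on (the headers it includes) into a .d file as a make rule. This is about compile-time include dependencies, not dynamic linking.

[Link: https://dmake.tistory.com/26]

[Link: https://m.blog.naver.com/wonmylover/220771036728]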

 

-M
Instead of outputting the result of preprocessing, output a rule suitable for make describing the dependencies of the main source file. The preprocessor outputs one make rule containing the object file name for that source file, a colon, and the names of all the included files, including those coming from -include or -imacros command line options.

Unless specified explicitly (with -MT or -MQ), the object file name consists of the name of the source file with any suffix replaced with object file suffix and with any leading directory parts removed. If there are many included files then the rule is split into several lines using \-newline. The rule has no commands.
This option does not suppress the preprocessor's debug output, such as -dM. To avoid mixing such debug output with the dependency rules you should explicitly specify the dependency output file with -MF, or use an environment variable like DEPENDENCIES_OUTPUT . Debug output will still be sent to the regular output stream as normal.

Passing -M to the driver implies -E, and suppresses warnings with an implicit -w.

-MM
Like -M but do not mention header files that are found in system header directories, nor header files that are included, directly or indirectly, from such a header.

This implies that the choice of angle brackets or double quotes in an #include directive does not in itself determine whether that header will appear in -MM dependency output. This is a slight change in semantics from GCC versions 3.0 and earlier.
-MF file
When used with -M or -MM, specifies a file to write the dependencies to. If no -MF switch is given the preprocessor sends the rules to the same place it would have sent preprocessed output.
When used with the driver options -MD or -MMD, -MF overrides the default dependency output file.

-MG
In conjunction with an option such as -M requesting dependency generation, -MG assumes missing header files are generated files and adds them to the dependency list without raising an error. The dependency filename is taken directly from the "#include" directive without prepending any path. -MG also suppresses preprocessed output, as a missing header file renders this useless.

This feature is used in automatic updating of makefiles.
-MP
This option instructs CPP to add a phony target for each dependency other than the main file, causing each to depend on nothing. These dummy rules work around errors make gives if you remove header files without updating the Makefile to match.

This is typical output:
test.o: test.c test.h

test.h:
-MT target
Change the target of the rule emitted by dependency generation. By default CPP takes the name of the main input file, deletes any directory components and any file suffix such as .c, and appends the platform's usual object suffix. The result is the target.
An -MT option will set the target to be exactly the string you specify. If you want multiple targets, you can specify them as a single argument to -MT, or use multiple -MT options.

For example, -MT '$(objpfx)foo.o' might give

$(objpfx)foo.o: foo.c
-MQ target
Same as -MT, but it quotes any characters which are special to Make. -MQ '$(objpfx)foo.o' gives
$$(objpfx)foo.o: foo.c
The default target is automatically quoted, as if it were given with -MQ.
-MD
-MD is equivalent to -M -MF file, except that -E is not implied. The driver determines file based on whether an -o option is given. If it is, the driver uses its argument but with a suffix of .d, otherwise it takes the name of the input file, removes any directory components and suffix, and applies a .d suffix.

If -MD is used in conjunction with -E, any -o switch is understood to specify the dependency output file, but if used without -E, each -o is understood to specify a target object file.
Since -E is not implied, -MD can be used to generate a dependency output file as a side-effect of the compilation process.

-MMD
Like -MD except mention only user header files, not system header files.

[Link: https://linux.die.net/man/1/gcc]
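To see what ends up in a .d file, the make rule format described in the man page excerpt above can be read back with a few lines of Python. This is a simplified sketch: `parse_depfile` is a made-up helper, and it assumes paths without escaped spaces or colons and does not split multiple targets on one rule.

```python
def parse_depfile(text):
    """Parse make-style dependency rules into {target: [prerequisites]}."""
    # Long rules are split with backslash-newline; join them back first.
    joined = text.replace("\\\n", " ")
    rules = {}
    for line in joined.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        target, sep, deps = line.partition(":")
        if not sep:
            continue  # not a rule line
        rules.setdefault(target.strip(), []).extend(deps.split())
    return rules


# The "typical output" of -MP from the man page, with a continuation added.
sample = "test.o: test.c \\\n test.h\n\ntest.h:\n"
print(parse_depfile(sample))
```

The empty `test.h:` entry is the phony target that -MP adds so that make does not fail when a header is deleted.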

Posted by 구차니
Program usage/gcc, 2026. 6. 10. 11:57

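When you build with gcc on Linux, the binary resolves its shared libraries from absolute system paths, so to speak.

In the case below it picks up the .so files under /lib/x86_64-linux-gnu/ (the loader's search path, cf. LD_LIBRARY_PATH),

and rpath (see the links below) seems to be the link-time option that makes it look in a path relative to the executable's location instead.

Maybe it's meant for shipping .so files together with the executable?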

 

$ ldd untitled
linux-vdso.so.1 (0x00007fff3f7a7000)
libQt5Widgets.so.5 => /lib/x86_64-linux-gnu/libQt5Widgets.so.5 (0x00007a58ffe00000)
libQt5Gui.so.5 => /lib/x86_64-linux-gnu/libQt5Gui.so.5 (0x00007a58ff600000)
libQt5Core.so.5 => /lib/x86_64-linux-gnu/libQt5Core.so.5 (0x00007a58ff000000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007a58fec00000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007a5900f63000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007a58fe800000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007a5900519000)
libGL.so.1 => /lib/x86_64-linux-gnu/libGL.so.1 (0x00007a58ffd79000)
libpng16.so.16 => /lib/x86_64-linux-gnu/libpng16.so.16 (0x00007a59004de000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007a5900f45000)
libharfbuzz.so.0 => /lib/x86_64-linux-gnu/libharfbuzz.so.0 (0x00007a58fef31000)
libmd4c.so.0 => /lib/x86_64-linux-gnu/libmd4c.so.0 (0x00007a59004cc000)
libdouble-conversion.so.3 => /lib/x86_64-linux-gnu/libdouble-conversion.so.3 (0x00007a58ffd64000)
libicui18n.so.70 => /lib/x86_64-linux-gnu/libicui18n.so.70 (0x00007a58fe400000)
libicuuc.so.70 => /lib/x86_64-linux-gnu/libicuuc.so.70 (0x00007a58fe205000)
libpcre2-16.so.0 => /lib/x86_64-linux-gnu/libpcre2-16.so.0 (0x00007a58ff576000)
libzstd.so.1 => /lib/x86_64-linux-gnu/libzstd.so.1 (0x00007a58fee62000)
libglib-2.0.so.0 => /lib/x86_64-linux-gnu/libglib-2.0.so.0 (0x00007a58feac5000)
/lib64/ld-linux-x86-64.so.2 (0x00007a5900faa000)
libGLdispatch.so.0 => /lib/x86_64-linux-gnu/libGLdispatch.so.0 (0x00007a58fe747000)
libGLX.so.0 => /lib/x86_64-linux-gnu/libGLX.so.0 (0x00007a58ffd30000)
libfreetype.so.6 => /lib/x86_64-linux-gnu/libfreetype.so.6 (0x00007a58fe13d000)
libgraphite2.so.3 => /lib/x86_64-linux-gnu/libgraphite2.so.3 (0x00007a58ffd09000)
libicudata.so.70 => /lib/x86_64-linux-gnu/libicudata.so.70 (0x00007a58fc400000)
libpcre.so.3 => /lib/x86_64-linux-gnu/libpcre.so.3 (0x00007a58fea4f000)
libX11.so.6 => /lib/x86_64-linux-gnu/libX11.so.6 (0x00007a58fc2c0000)
libbrotlidec.so.1 => /lib/x86_64-linux-gnu/libbrotlidec.so.1 (0x00007a58ffcfb000)
libxcb.so.1 => /lib/x86_64-linux-gnu/libxcb.so.1 (0x00007a58fee38000)
libbrotlicommon.so.1 => /lib/x86_64-linux-gnu/libbrotlicommon.so.1 (0x00007a58fea2c000)
libXau.so.6 => /lib/x86_64-linux-gnu/libXau.so.6 (0x00007a5900f37000)
libXdmcp.so.6 => /lib/x86_64-linux-gnu/libXdmcp.so.6 (0x00007a59004c4000)
libbsd.so.0 => /lib/x86_64-linux-gnu/libbsd.so.0 (0x00007a58ffce3000)
libmd.so.0 => /lib/x86_64-linux-gnu/libmd.so.0 (0x00007a58ff569000)
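One way to check whether a relative rpath actually took effect is to parse ldd output like the listing above and list the libraries that still resolve outside the expected directory. A minimal sketch under stated assumptions: `resolved_libs` and `outside_prefix` are hypothetical helpers, and lines without `=>` (linux-vdso, the loader itself) or marked `not found` are simply skipped.

```python
import re

# Matches lines such as "libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x...)".
LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(\S+)\s+\(0x[0-9a-f]+\)\s*$")


def resolved_libs(ldd_output):
    """Map each library name to the path the loader resolved it to."""
    libs = {}
    for line in ldd_output.splitlines():
        match = LDD_LINE.match(line)
        if match:
            libs[match.group(1)] = match.group(2)
    return libs


def outside_prefix(libs, prefix):
    """Return the names of libraries whose resolved path is not under prefix."""
    return sorted(name for name, path in libs.items()
                  if not path.startswith(prefix))
```

For the listing above, `outside_prefix(resolved_libs(text), "/lib/x86_64-linux-gnu/")` would come back empty, since every library is taken from the system directory.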

 

[Link: https://velog.io/@wjddms206/RPATH-한번에-이해하기]

[Link: https://stackoverflow.com/questions/6324131/rpath-origin-not-having-desired-effect]

[Link: https://stackoverflow.com/questions/38058041/correct-usage-of-rpath-relative-vs-absolute]

Posted by 구차니

I'm looking for something like vLLM that serves multiple GPUs to multiple users..

Is this... the one?

 

[Link: https://github.com/turboderp-org/exllamav2]

[Link: https://github.com/turboderp-org/exllamav3]

[Link: https://github.com/theroyallab/tabbyAPI/]  API server that uses exllama as its backend

   [Link: https://www.reddit.com/r/LocalLLaMA/comments/1ijw4l5/stop_wasting_your_multigpu_setup_with_llamacpp/]

Posted by 구차니

Hmm.. can my GPU even survive this? lol

 

The material itself doesn't look easy, so I clicked "see wiki" and am reading through it to figure out what to do.

[Link: https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Textual-Inversion]

 

+

[Link: https://joonojoono.tistory.com/19]

[Link: https://www.internetmap.kr/entry/Automatic1111-GUI-Beginners-Guide] solid on everything except training T_T

 

The usual(?) approach is to train for a long time and back up checkpoints along the way.. but the disk space T_T

[Link: https://www.reddit.com/r/StableDiffusion/comments/zw7qzo/automatic1111_dreambooth_how_to_continue_training/?tl=ko]

Posted by 구차니

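Summary

QAT doesn't seem to change generation speed much. Output quality can only be judged by actually using it.

MTP seems to give around a 50% speedup, at least in the best run (42.6 vs. 27.94 t/s for Q4_0 with -sm none)?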

---

QAT

Ooh, a fresh model from 3-4 days ago!

It's only about 3-4 GB, so I'm really curious how it does.

[Link: https://huggingface.co/unsloth/gemma-4-E4B-it-qat-GGUF]

 

What I tested before was Q4_K_M, so I'm not sure the numbers are comparable.

$ ../../llama-b9553/llama-cli -m gemma-4-E4B-it-qat-UD-Q2_K_XL.gguf -sm none
[ Prompt: 16.8 t/s | Generation: 38.6 t/s ]
[ Prompt: 97.9 t/s | Generation: 41.1 t/s ]
[ Prompt: 196.1 t/s | Generation: 39.9 t/s ]

 

$ ../../llama-b9553/llama-cli -m gemma-4-E4B-it-qat-UD-Q4_K_XL.gguf -sm none
[ Prompt: 737.0 t/s | Generation: 62.5 t/s ]
[ Prompt: 238.5 t/s | Generation: 61.4 t/s ]
[ Prompt: 292.3 t/s | Generation: 58.0 t/s ]

 

 

MTP

So MTP needs two model files, like multimodal does..

First I need a CUDA-enabled build.. hopefully the SDK won't be a problem.. sigh

./build/bin/llama-server \
  -m gemma-4-12b-it-Q4_K_M.gguf \
  --model-draft MTP/gemma-4-12B-it-MTP-Q8_0.gguf \
  --spec-type draft-mtp --spec-draft-n-max 4 \
  -ngl 999 -fa on
Multi GPU: add --spec-draft-device CUDA0 -sm layer.

[Link: https://huggingface.co/unsloth/gemma-4-12b-it-GGUF/blob/main/MTP/README.md]

 

+

Hmm.. the build attempt blew up spectacularly lol

$ cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=61
CMAKE_BUILD_TYPE=Release
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- GGML_SYSTEM_ARCH: x86
-- Including CPU backend
-- x86 detected
-- Adding CPU backend variant ggml-cpu: -march=native 
-- Unable to find cublas_v2.h in either "/usr/local/cuda/include" or "/usr/math_libs/include"
-- CUDA Toolkit found
CMake Error at /usr/share/cmake-3.22/Modules/CMakeDetermineCompilerId.cmake:726 (message):
  Compiling the CUDA compiler identification source file
  "CMakeCUDACompilerId.cu" failed.

  Compiler: /usr/local/cuda/bin/nvcc

  Build flags:

  Id flags: --keep;--keep-dir;tmp;-gencode=arch=compute_61,code=sm_61 -v

  

  The output was:

  1

  nvcc fatal : Unsupported gpu architecture 'compute_61'

  

  

Call Stack (most recent call first):
  /usr/share/cmake-3.22/Modules/CMakeDetermineCompilerId.cmake:6 (CMAKE_DETERMINE_COMPILER_ID_BUILD)
  /usr/share/cmake-3.22/Modules/CMakeDetermineCompilerId.cmake:48 (__determine_compiler_id_test)
  /usr/share/cmake-3.22/Modules/CMakeDetermineCUDACompiler.cmake:298 (CMAKE_DETERMINE_COMPILER_ID)
  ggml/src/ggml-cuda/CMakeLists.txt:59 (enable_language)


-- Configuring incomplete, errors occurred!
See also "/home/falinux/src/llama.cpp/build/CMakeFiles/CMakeOutput.log".
See also "/home/falinux/src/llama.cpp/build/CMakeFiles/CMakeError.log".

 

+

Is b9500 too old for this.. or does it fail because this is the Vulkan build?

$ ../../llama-b9500/llama-cli -m gemma-4-12b-it-Q4_0.gguf --model-draft gemma-4-12B-it-MTP-Q8_0.gguf --spec-type draft-mtp --spec-draft-n-max 4 -ngl 999 -fa on --verbose

 

0.19.888.319 E llama_model_load: error loading model: unknown model architecture: 'gemma4-assistant'
0.19.888.322 E llama_model_load_from_file_impl: failed to load model
0.19.888.324 E srv    load_model: failed to load draft model, 'gemma-4-12B-it-MTP-Q8_0.gguf'

 

With b9553 it runs.

1080 ti 11GB / -sm none

$ ../../llama-b9553/llama-cli -m gemma-4-12b-it-Q4_0.gguf --model-draft gemma-4-12B-it-MTP-Q8_0.gguf --spec-type draft-mtp --spec-draft-n-max 4   -ngl 999 -fa on -sm none

 

Q4_0
[ Prompt: 48.6 t/s | Generation: 42.6 t/s ]
[ Prompt: 231.5 t/s | Generation: 36.6 t/s ]
[ Prompt: 241.1 t/s | Generation: 34.0 t/s ]

UD_Q2_K_XL
[ Prompt: 5.0 t/s | Generation: 21.1 t/s ]
[ Prompt: 80.7 t/s | Generation: 29.2 t/s ]
[ Prompt: 45.0 t/s | Generation: 24.4 t/s ]

 

1080 ti 11GB / -sm layer

$ ../../llama-b9553/llama-cli -m gemma-4-12b-it-Q4_0.gguf --model-draft gemma-4-12B-it-MTP-Q8_0.gguf --spec-type draft-mtp --spec-draft-n-max 4   -ngl 999 -fa on 

 

Q4_0
[ Prompt: 66.8 t/s | Generation: 28.5 t/s ]

[ Prompt: 126.1 t/s | Generation: 19.3 t/s ]
[ Prompt: 88.2 t/s | Generation: 16.3 t/s ]

UD_Q2_K_XL
[ Prompt: 36.5 t/s | Generation: 24.6 t/s ]
[ Prompt: 32.1 t/s | Generation: 17.1 t/s ]
[ Prompt: 47.3 t/s | Generation: 12.6 t/s ]  (crashed once)

 

 

>>>>> For reference >>>>>

Hardware: 1080 ti, -sm none

gemma-4 12B it Q4_0.gguf Reading Generation 25 tokens 0.9s 27.94 t/s
gemma-4 12B it Q4_0.gguf Reading Generation 255 tokens 8.9s 28.78 t/s
gemma-4 12B it Q4_0.gguf Reading Generation 1,404 tokens 55s 25.45 t/s

gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 29 tokens 1.2s 23.71 t/s
gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 373 tokens 16s 22.28 t/s
gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 806 tokens 37s 21.34 t/s (crashed)


Hardware: 1080 ti, -sm layer

gemma-4 12B it Q4_0.gguf Reading Generation 25 tokens 0.8s 31.04 t/s
gemma-4 12B it Q4_0.gguf Reading Generation 265 tokens 9.0s 29.60 t/s
gemma-4 12B it Q4_0.gguf Reading Generation 1,340 tokens 54s 24.43 t/s

gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 31 tokens 1.3s 24.16 t/s
gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 263 tokens 11s 23.70 t/s
gemma-4 12B it UD Q2_K_XL.gguf Reading Generation 620 tokens 29s 20.70 t/s (crashed)

2026.06.04 - [Program usage/AI programs] - gemma 12b, tesla t4 16GB / 1080 ti 11GB * 2

<<<< For reference <<<<
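As a rough cross-check of the MTP speedup, the -sm none generation rates with MTP can be averaged and compared against the -sm none reference figures. This is only a sketch: the runs used different prompts and output lengths, and the reference numbers came from a different measurement, so the percentages are indicative at best. The helper `speedup_percent` is made up for this comparison.

```python
from statistics import mean

# Generation t/s on the 1080 ti with -sm none, copied from the figures above.
baseline = {
    "Q4_0": [27.94, 28.78, 25.45],
    "UD_Q2_K_XL": [23.71, 22.28, 21.34],
}
with_mtp = {
    "Q4_0": [42.6, 36.6, 34.0],
    "UD_Q2_K_XL": [21.1, 29.2, 24.4],
}


def speedup_percent(base, new):
    """Relative change of the mean rate, in percent."""
    return (mean(new) / mean(base) - 1.0) * 100.0


for quant in baseline:
    print(f"{quant}: {mean(baseline[quant]):.2f} -> "
          f"{mean(with_mtp[quant]):.2f} t/s "
          f"({speedup_percent(baseline[quant], with_mtp[quant]):+.1f}%)")
```

Averaged this way the gain for Q4_0 is smaller than in the best single run (42.6 vs. 27.94 t/s), and the Q2_K_XL gain is smaller still, so the speedup depends heavily on the run.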

Posted by 구차니

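The quantization type probably matters, so should I download something like bf16 and try that?

For now.. it is marginally better than the 1080. To make use of the tensor cores I'd probably have to download a different one, sigh..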

gemma-4-E4B-it-Q4_K_M.gguf Reading Generation 10 tokens 0.2s 61.57 t/s
gemma-4-E4B-it-Q4_K_M.gguf Reading Generation 929 tokens 16s 56.72 t/s
gemma-4-E4B-it-Q4_K_M.gguf Reading Generation 3,597 tokens 1min 8s 52.70 t/s

 

Meanwhile, 8 GB vs. 11 GB doesn't sound like much of a difference, yet it limits which models can be loaded quite a bit.

 

 

Ugh.

With so little memory I can't even try anything.

Not much use, yet too good to throw away T_T

 

Posted by 구차니

Simply put, vision encoders such as CLIP / SigLIP are the things that analyze an image and match it to text.

Then conversely there should be vision decoders as well.

[Link: https://huggingface.co/docs/transformers/v4.15.0/model_doc/visionencoderdecoder]

 

I've heard of BERT somewhere as well; it is a text encoder, though, not a vision encoder.

LDM used a BERT encoder, whereas Stable Diffusion uses the CLIP text encoder released by OpenAI.

[Link: https://velog.io/@hskhyl/Generative-AI4-imagestable-Diffusion-평가]

 

Vision encoders such as SigLIP

[Link: https://wikidocs.net/blog/@jaehong/17175/]

 

Sigmoid Loss for Language Image Pre-Training
SigLIP proposes replacing the loss function used in CLIP with a simple pairwise sigmoid loss. This gives better zero-shot classification accuracy on ImageNet.

[Link: https://huggingface.co/papers/2303.15343]

  [Link: https://huggingface.co/docs/transformers/ko/model_doc/siglip]
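To make "pairwise sigmoid loss" concrete, here is a toy pure-Python sketch: every (image, text) pair in a batch is treated as an independent binary decision, positive on the diagonal and negative elsewhere. It assumes the embeddings are already L2-normalized; the function names are made up, and the default temperature 10 and bias -10 follow my recollection of the paper's initial values, so treat them as assumptions.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def siglip_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over all image/text pairs in a batch."""
    n = len(image_embs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0  # matching pairs sit on the diagonal
            logit = temperature * dot(image_embs[i], text_embs[j]) + bias
            total += log_sigmoid(label * logit)
    return -total / n
```

Unlike CLIP's softmax-based contrastive loss, no normalization across the batch is needed here, because each pair contributes its own term.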

 

CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs.

[Link: https://huggingface.co/papers/2103.00020]

    [Link: https://huggingface.co/docs/transformers/ko/model_doc/clip]

 

[Link: https://huggingface.co/mhbkb/stable-diffusion-base-2.0-clip_1]

 

[Link: https://dy120.tistory.com/15]

 

 

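Anyway, to sum up..

when Stable Diffusion does txt2img,

CLIP is the part that produces the embedding vector from the text prompt,

and the image is then drawn by progressively removing noise; that seems to be the overall principle.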

Posted by 구차니

 

Force the chat template (the most common fix)
# or llama3, qwen, mistral, etc. instead of chatml
./llama-cli -m model.gguf \
  --chat-template chatml \
  -p "<|im_start|>system\nYou are a friendly AI<|im_end|>"

Embed the chat template explicitly at GGUF conversion time (recent llama.cpp)
# or llama3-1, etc. instead of chatml
python convert_hf_to_gguf.py ./MyModel \
  --outfile mymodel.gguf \
  --chat-template chatml

[Link: https://x.com/i/grok/share/1f9e9bbccc264a9cbde32f7a95fdb601]
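For reference, this is roughly what a ChatML template expands a message list into, written out by hand. A minimal sketch: `to_chatml` is a hypothetical helper that only covers the plain `<|im_start|>role ... <|im_end|>` layout and ignores model-specific extras such as tool calls.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Leave an open assistant turn for the model to continue.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


print(to_chatml([
    {"role": "system", "content": "You are a friendly AI"},
    {"role": "user", "content": "Hello"},
]))
```

Passing a hand-built string like this through -p is what the first command above does; with --chat-template the tool is expected to do the wrapping itself.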

 

The conversion script's help output shows nothing of the sort. Was the option dropped? (b9500)

$ python3 convert_hf_to_gguf.py --help
usage: convert_hf_to_gguf.py [-h] [--vocab-only] [--outfile OUTFILE] [--outtype {f32,f16,bf16,q8_0,tq1_0,tq2_0,auto}] [--bigendian] [--use-temp-file] [--no-lazy]
                             [--model-name MODEL_NAME] [--verbose] [--split-max-tensors SPLIT_MAX_TENSORS] [--split-max-size SPLIT_MAX_SIZE] [--dry-run]
                             [--no-tensor-first-split] [--metadata METADATA] [--print-supported-models] [--remote] [--mmproj] [--mtp] [--no-mtp]
                             [--mistral-format] [--disable-mistral-community-chat-template] [--sentence-transformers-dense-modules] [--fuse-gate-up-exps]
                             [--fp8-as-q8]
                             [model]

Convert a huggingface model to a GGML compatible file

positional arguments:
  model                 directory containing model file or huggingface repository ID (if --remote)

options:
  -h, --help            show this help message and exit
  --vocab-only          extract only the vocab
  --outfile OUTFILE     path to write to; default: based on input. {ftype} will be replaced by the outtype.
  --outtype {f32,f16,bf16,q8_0,tq1_0,tq2_0,auto}
                        output format - use f32 for float32, f16 for float16, bf16 for bfloat16, q8_0 for Q8_0, tq1_0 or tq2_0 for ternary, and auto for the
                        highest-fidelity 16-bit float type
  --bigendian           model is executed on big endian machine
  --use-temp-file       use the tempfile library while processing (helpful when running out of memory, process killed)
  --no-lazy             use more RAM by computing all outputs before writing (use in case lazy evaluation is broken)
  --model-name MODEL_NAME
                        name of the model
  --verbose             increase output verbosity
  --split-max-tensors SPLIT_MAX_TENSORS
                        max tensors in each split
  --split-max-size SPLIT_MAX_SIZE
                        max size per split N(M|G)
  --dry-run             only print out a split plan and exit, without writing any new files
  --no-tensor-first-split
                        do not add tensors to the first split (disabled by default)
  --metadata METADATA   Specify the path for an authorship metadata override file
  --print-supported-models
                        Print the supported models
  --remote              (Experimental) Read safetensors file remotely without downloading to disk. Config and tokenizer files will still be downloaded. To use
                        this feature, you need to specify Hugging Face model repo name instead of a local directory. For example:
                        'HuggingFaceTB/SmolLM2-1.7B-Instruct'. Note: To access gated repo, set HF_TOKEN environment variable to your Hugging Face token.
  --mmproj              Export multimodal projector (mmproj) for vision models. This will only work on some vision models. An 'mmproj-' prefix will be added to
                        the output file name.
  --mtp                 Export only the multi-token prediction (MTP) head as a separate GGUF, suitable for use as a speculative draft. An 'mtp-' prefix will be
                        added to the output file name.
  --no-mtp              Exclude the multi-token prediction (MTP) head from the converted GGUF. Pair with --mtp on a second run to publish trunk and MTP as two
                        files. Note: the split form duplicates embeddings, but even though the bundled default is more space-efficient overall, this allows
                        differing quantization which may be more performant.
  --mistral-format      Whether the model is stored following the Mistral format.
  --disable-mistral-community-chat-template
                        Whether to disable usage of Mistral community chat templates. If set, use the Mistral official `mistral-common` library for tokenization
                        and detokenization of Mistral models. Using `mistral-common` ensure correctness and zero-day support of tokenization for models converted
                        from the Mistral format but requires to manually setup the tokenization server.
  --sentence-transformers-dense-modules
                        Whether to include sentence-transformers dense modules. It can be used for sentence-transformers models, like google/embeddinggemma-300m.
                        Default these modules are not included.
  --fuse-gate-up-exps   Fuse gate_exps and up_exps tensors into a single gate_up_exps tensor for MoE models.
  --fp8-as-q8           Store tensors dequantized from FP8 as Q8_0 instead of BF16/F16.

 

With the CLI it looks like --chat-template might do the trick.. I'll have to try it again.
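
A rough sketch of what that retry could look like, written as a dry run that only prints the command instead of launching the model. The model path (./model.gguf), the template name (chatml), the prompt, and the helper name build_llama_cli_cmd are all placeholders made up for illustration. -m, -p, -st and -n come from the help output; --chat-template is assumed to accept a built-in template name, so check the full --help of your build before relying on it.

```shell
# Dry-run helper (hypothetical name): prints a llama-cli command line that
# forces a chat template instead of relying on the one stored in the GGUF.
#   $1 = model path, $2 = chat template name, $3 = prompt
build_llama_cli_cmd() {
  # -st exits after a single turn; -n 256 caps the number of generated tokens.
  printf './llama-cli -m %s --chat-template %s -st -n 256 -p "%s"\n' "$1" "$2" "$3"
}

# Placeholder values; swap in the real model file and template name.
build_llama_cli_cmd ./model.gguf chatml "Hello"
```

If the printed line looks right, paste it into the shell to actually run it. Printing first makes it easy to compare a few template names side by side before spending time on a model load.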

$ ./llama-cli --help
----- common params -----

-h,    --help, --usage                  print usage and exit
--version                               show version and build info
-cl,   --cache-list                     show list of models in cache
--completion-bash                       print source-able bash completion script for llama.cpp
-t,    --threads N                      number of CPU threads to use during generation (default: -1)
                                        (env: LLAMA_ARG_THREADS)
-tb,   --threads-batch N                number of threads to use during batch and prompt processing (default:
                                        same as --threads)
-C,    --cpu-mask M                     CPU affinity mask: arbitrarily long hex. Complements cpu-range
                                        (default: "")
-Cr,   --cpu-range lo-hi                range of CPUs for affinity. Complements --cpu-mask
--cpu-strict <0|1>                      use strict CPU placement (default: 0)
--prio N                                set process/thread priority : low(-1), normal(0), medium(1), high(2),
                                        realtime(3) (default: 0)
--poll <0...100>                        use polling level to wait for work (0 - no polling, default: 50)
-Cb,   --cpu-mask-batch M               CPU affinity mask: arbitrarily long hex. Complements cpu-range-batch
                                        (default: same as --cpu-mask)
-Crb,  --cpu-range-batch lo-hi          ranges of CPUs for affinity. Complements --cpu-mask-batch
--cpu-strict-batch <0|1>                use strict CPU placement (default: same as --cpu-strict)
--prio-batch N                          set process/thread priority : 0-normal, 1-medium, 2-high, 3-realtime
                                        (default: 0)
--poll-batch <0|1>                      use polling to wait for work (default: same as --poll)
-c,    --ctx-size N                     size of the prompt context (default: 0, 0 = loaded from model)
                                        (env: LLAMA_ARG_CTX_SIZE)
-n,    --predict, --n-predict N         number of tokens to predict (default: -1, -1 = infinity)
                                        (env: LLAMA_ARG_N_PREDICT)
-b,    --batch-size N                   logical maximum batch size (default: 2048)
                                        (env: LLAMA_ARG_BATCH)
-ub,   --ubatch-size N                  physical maximum batch size (default: 512)
                                        (env: LLAMA_ARG_UBATCH)
--keep N                                number of tokens to keep from the initial prompt (default: 0, -1 =
                                        all)
--swa-full                              use full-size SWA cache (default: false)
                                        [(more
                                        info)](https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
                                        (env: LLAMA_ARG_SWA_FULL)
-fa,   --flash-attn [on|off|auto]       set Flash Attention use ('on', 'off', or 'auto', default: 'auto')
                                        (env: LLAMA_ARG_FLASH_ATTN)
-p,    --prompt PROMPT                  prompt to start generation with; for system message, use -sys
--perf, --no-perf                       whether to enable internal libllama performance timings (default:
                                        false)
                                        (env: LLAMA_ARG_PERF)
-f,    --file FNAME                     a file containing the prompt (default: none)
-bf,   --binary-file FNAME              binary file containing the prompt (default: none)
-e,    --escape, --no-escape            whether to process escapes sequences (\n, \r, \t, \', \", \\)
                                        (default: true)
--rope-scaling {none,linear,yarn}       RoPE frequency scaling method, defaults to linear unless specified by
                                        the model
                                        (env: LLAMA_ARG_ROPE_SCALING_TYPE)
--rope-scale N                          RoPE context scaling factor, expands context by a factor of N
                                        (env: LLAMA_ARG_ROPE_SCALE)
--rope-freq-base N                      RoPE base frequency, used by NTK-aware scaling (default: loaded from
                                        model)
                                        (env: LLAMA_ARG_ROPE_FREQ_BASE)
--rope-freq-scale N                     RoPE frequency scaling factor, expands context by a factor of 1/N
                                        (env: LLAMA_ARG_ROPE_FREQ_SCALE)
--yarn-orig-ctx N                       YaRN: original context size of model (default: 0 = model training
                                        context size)
                                        (env: LLAMA_ARG_YARN_ORIG_CTX)
--yarn-ext-factor N                     YaRN: extrapolation mix factor (default: -1.00, 0.0 = full
                                        interpolation)
                                        (env: LLAMA_ARG_YARN_EXT_FACTOR)
--yarn-attn-factor N                    YaRN: scale sqrt(t) or attention magnitude (default: -1.00)
                                        (env: LLAMA_ARG_YARN_ATTN_FACTOR)
--yarn-beta-slow N                      YaRN: high correction dim or alpha (default: -1.00)
                                        (env: LLAMA_ARG_YARN_BETA_SLOW)
--yarn-beta-fast N                      YaRN: low correction dim or beta (default: -1.00)
                                        (env: LLAMA_ARG_YARN_BETA_FAST)
-kvo,  --kv-offload, -nkvo, --no-kv-offload
                                        whether to enable KV cache offloading (default: enabled)
                                        (env: LLAMA_ARG_KV_OFFLOAD)
--repack, -nr, --no-repack              whether to enable weight repacking (default: enabled)
                                        (env: LLAMA_ARG_REPACK)
--no-host                               bypass host buffer allowing extra buffers to be used
                                        (env: LLAMA_ARG_NO_HOST)
-ctk,  --cache-type-k TYPE              KV cache data type for K
                                        allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
                                        (default: f16)
                                        (env: LLAMA_ARG_CACHE_TYPE_K)
-ctv,  --cache-type-v TYPE              KV cache data type for V
                                        allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
                                        (default: f16)
                                        (env: LLAMA_ARG_CACHE_TYPE_V)
-dt,   --defrag-thold N                 KV cache defragmentation threshold (DEPRECATED)
                                        (env: LLAMA_ARG_DEFRAG_THOLD)
-np,   --parallel N                     number of parallel sequences to decode (default: 1)
                                        (env: LLAMA_ARG_N_PARALLEL)
--rpc SERVERS                           comma-separated list of RPC servers (host:port)
                                        (env: LLAMA_ARG_RPC)
--mlock                                 force system to keep model in RAM rather than swapping or compressing
                                        (env: LLAMA_ARG_MLOCK)
--mmap, --no-mmap                       whether to memory-map model. (if mmap disabled, slower load but may
                                        reduce pageouts if not using mlock) (default: enabled)
                                        (env: LLAMA_ARG_MMAP)
-dio,  --direct-io, -ndio, --no-direct-io
                                        use DirectIO if available. (default: disabled)
                                        (env: LLAMA_ARG_DIO)
--numa TYPE                             attempt optimizations that help on some NUMA systems
                                        - distribute: spread execution evenly over all nodes
                                        - isolate: only spawn threads on CPUs on the node that execution
                                        started on
                                        - numactl: use the CPU map provided by numactl
                                        if run without this previously, it is recommended to drop the system
                                        page cache before using this
                                        see https://github.com/ggml-org/llama.cpp/issues/1437
                                        (env: LLAMA_ARG_NUMA)
-dev,  --device <dev1,dev2,..>          comma-separated list of devices to use for offloading (none = don't
                                        offload)
                                        use --list-devices to see a list of available devices
                                        (env: LLAMA_ARG_DEVICE)
--list-devices                          print list of available devices and exit
-ot,   --override-tensor <tensor name pattern>=<buffer type>,...
                                        override tensor buffer type
                                        (env: LLAMA_ARG_OVERRIDE_TENSOR)
-cmoe, --cpu-moe                        keep all Mixture of Experts (MoE) weights in the CPU
                                        (env: LLAMA_ARG_CPU_MOE)
-ncmoe, --n-cpu-moe N                   keep the Mixture of Experts (MoE) weights of the first N layers in the
                                        CPU
                                        (env: LLAMA_ARG_N_CPU_MOE)
-ngl,  --gpu-layers, --n-gpu-layers N   max. number of layers to store in VRAM, either an exact number,
                                        'auto', or 'all' (default: auto)
                                        (env: LLAMA_ARG_N_GPU_LAYERS)
-sm,   --split-mode {none,layer,row,tensor}
                                        how to split the model across multiple GPUs, one of:
                                        - none: use one GPU only
                                        - layer (default): split layers and KV across GPUs (pipelined)
                                        - row: split weight across GPUs by rows (parallelized)
                                        - tensor: split weights and KV across GPUs (parallelized,
                                        EXPERIMENTAL)
                                        (env: LLAMA_ARG_SPLIT_MODE)
-ts,   --tensor-split N0,N1,N2,...      fraction of the model to offload to each GPU, comma-separated list of
                                        proportions, e.g. 3,1
                                        (env: LLAMA_ARG_TENSOR_SPLIT)
-mg,   --main-gpu INDEX                 the GPU to use for the model (with split-mode = none), or for
                                        intermediate results and KV (with split-mode = row) (default: 0)
                                        (env: LLAMA_ARG_MAIN_GPU)
-fit,  --fit [on|off]                   whether to adjust unset arguments to fit in device memory ('on' or
                                        'off', default: 'on')
                                        (env: LLAMA_ARG_FIT)
-fitt, --fit-target MiB0,MiB1,MiB2,...
                                        target margin per device for --fit, comma-separated list of values,
                                        single value is broadcast across all devices, default: 1024
                                        (env: LLAMA_ARG_FIT_TARGET)
-fitc, --fit-ctx N                      minimum ctx size that can be set by --fit option, default: 4096
                                        (env: LLAMA_ARG_FIT_CTX)
--check-tensors                         check model tensor data for invalid values (default: false)
--override-kv KEY=TYPE:VALUE,...        advanced option to override model metadata by key. to specify multiple
                                        overrides, either use comma-separated values.
                                        types: int, float, bool, str. example: --override-kv
                                        tokenizer.ggml.add_bos_token=bool:false,tokenizer.ggml.add_eos_token=bool:false
--op-offload, --no-op-offload           whether to offload host tensor operations to device (default: true)
--lora FNAME                            path to LoRA adapter (use comma-separated values to load multiple
                                        adapters)
--lora-scaled FNAME:SCALE,...           path to LoRA adapter with user defined scaling (format:
                                        FNAME:SCALE,...)
                                        note: use comma-separated values
--control-vector FNAME                  add a control vector
                                        note: use comma-separated values to add multiple control vectors
--control-vector-scaled FNAME:SCALE,...
                                        add a control vector with user defined scaling SCALE
                                        note: use comma-separated values (format: FNAME:SCALE,...)
--control-vector-layer-range START END
                                        layer range to apply the control vector(s) to, start and end inclusive
-m,    --model FNAME                    model path to load
                                        (env: LLAMA_ARG_MODEL)
-mu,   --model-url MODEL_URL            model download url (default: unused)
                                        (env: LLAMA_ARG_MODEL_URL)
-dr,   --docker-repo [<repo>/]<model>[:quant]
                                        Docker Hub model repository. repo is optional, default to ai/. quant
                                        is optional, default to :latest.
                                        example: gemma3
                                        (default: unused)
                                        (env: LLAMA_ARG_DOCKER_REPO)
-hf,   -hfr, --hf-repo <user>/<model>[:quant]
                                        Hugging Face model repository; quant is optional, case-insensitive,
                                        default to Q4_K_M, or falls back to the first file in the repo if
                                        Q4_K_M doesn't exist.
                                        mmproj is also downloaded automatically if available. to disable, add
                                        --no-mmproj
                                        example: ggml-org/GLM-4.7-Flash-GGUF:Q4_K_M
                                        (default: unused)
                                        (env: LLAMA_ARG_HF_REPO)
-hff,  --hf-file FILE                   Hugging Face model file. If specified, it will override the quant in
                                        --hf-repo (default: unused)
                                        (env: LLAMA_ARG_HF_FILE)
-hfv,  -hfrv, --hf-repo-v <user>/<model>[:quant]
                                        Hugging Face model repository for the vocoder model (default: unused)
                                        (env: LLAMA_ARG_HF_REPO_V)
-hffv, --hf-file-v FILE                 Hugging Face model file for the vocoder model (default: unused)
                                        (env: LLAMA_ARG_HF_FILE_V)
-hft,  --hf-token TOKEN                 Hugging Face access token (default: value from HF_TOKEN environment
                                        variable)
                                        (env: HF_TOKEN)
--log-disable                           Log disable
--log-file FNAME                        Log to file
                                        (env: LLAMA_ARG_LOG_FILE)
--log-colors [on|off|auto]              Set colored logging ('on', 'off', or 'auto', default: 'auto')
                                        'auto' enables colors when output is to a terminal
                                        (env: LLAMA_ARG_LOG_COLORS)
-v,    --verbose, --log-verbose         Set verbosity level to infinity (i.e. log all messages, useful for
                                        debugging)
--offline                               Offline mode: forces use of cache, prevents network access
                                        (env: LLAMA_ARG_OFFLINE)
-lv,   --verbosity, --log-verbosity N   Set the verbosity threshold. Messages with a higher verbosity will be
                                        ignored. Values:
                                         - 0: generic output
                                         - 1: error
                                         - 2: warning
                                         - 3: info
                                         - 4: trace (more info)
                                         - 5: debug
                                        (default: 1)
                                        
                                        (env: LLAMA_ARG_LOG_VERBOSITY)
--log-prefix, --no-log-prefix           Enable prefix in log messages
                                        (env: LLAMA_ARG_LOG_PREFIX)
--log-timestamps, --no-log-timestamps   Enable timestamps in log messages
                                        (env: LLAMA_ARG_LOG_TIMESTAMPS)
--spec-draft-type-k, -ctkd, --cache-type-k-draft TYPE
                                        KV cache data type for K for the draft model
                                        allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
                                        (default: f16)
                                        (env: LLAMA_ARG_SPEC_DRAFT_CACHE_TYPE_K)
--spec-draft-type-v, -ctvd, --cache-type-v-draft TYPE
                                        KV cache data type for V for the draft model
                                        allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
                                        (default: f16)
                                        (env: LLAMA_ARG_SPEC_DRAFT_CACHE_TYPE_V)


----- sampling params -----

--samplers SAMPLERS                     samplers that will be used for generation in the order, separated by
                                        ';'
                                        (default:
                                        penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature)
-s,    --seed SEED                      RNG seed (default: -1, use random seed for -1)
--sampler-seq, --sampling-seq SEQUENCE
                                        simplified sequence for samplers that will be used (default:
                                        edskypmxt)
--ignore-eos                            ignore end of stream token and continue generating (implies
                                        --logit-bias EOS-inf)
--temp, --temperature N                 temperature (default: 0.80)
--top-k N                               top-k sampling (default: 40, 0 = disabled)
                                        (env: LLAMA_ARG_TOP_K)
--top-p N                               top-p sampling (default: 0.95, 1.0 = disabled)
--min-p N                               min-p sampling (default: 0.05, 0.0 = disabled)
--top-nsigma, --top-n-sigma N           top-n-sigma sampling (default: -1.00, -1.0 = disabled)
--xtc-probability N                     xtc probability (default: 0.00, 0.0 = disabled)
--xtc-threshold N                       xtc threshold (default: 0.10, 1.0 = disabled)
--typical, --typical-p N                locally typical sampling, parameter p (default: 1.00, 1.0 = disabled)
--repeat-last-n N                       last n tokens to consider for penalize (default: 64, 0 = disabled, -1
                                        = ctx_size)
--repeat-penalty N                      penalize repeat sequence of tokens (default: 1.00, 1.0 = disabled)
--presence-penalty N                    repeat alpha presence penalty (default: 0.00, 0.0 = disabled)
--frequency-penalty N                   repeat alpha frequency penalty (default: 0.00, 0.0 = disabled)
--dry-multiplier N                      set DRY sampling multiplier (default: 0.00, 0.0 = disabled)
--dry-base N                            set DRY sampling base value (default: 1.75)
--dry-allowed-length N                  set allowed length for DRY sampling (default: 2)
--dry-penalty-last-n N                  set DRY penalty for the last n tokens (default: -1, 0 = disable, -1 =
                                        context size)
--dry-sequence-breaker STRING           add sequence breaker for DRY sampling, clearing out default breakers
                                        ('\n', ':', '"', '*') in the process; use "none" to not use any
                                        sequence breakers
--adaptive-target N                     adaptive-p: select tokens near this probability (valid range 0.0 to
                                        1.0; negative = disabled) (default: -1.00)
                                        [(more info)](https://github.com/ggml-org/llama.cpp/pull/17927)
--adaptive-decay N                      adaptive-p: decay rate for target adaptation over time. lower values
                                        are more reactive, higher values are more stable.
                                        (valid range 0.0 to 0.99) (default: 0.90)
--dynatemp-range N                      dynamic temperature range (default: 0.00, 0.0 = disabled)
--dynatemp-exp N                        dynamic temperature exponent (default: 1.00)
--mirostat N                            use Mirostat sampling.
                                        Top K, Nucleus and Locally Typical samplers are ignored if used.
                                        (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
--mirostat-lr N                         Mirostat learning rate, parameter eta (default: 0.10)
--mirostat-ent N                        Mirostat target entropy, parameter tau (default: 5.00)
-l,    --logit-bias TOKEN_ID(+/-)BIAS   modifies the likelihood of token appearing in the completion,
                                        i.e. `--logit-bias 15043+1` to increase likelihood of token ' Hello',
                                        or `--logit-bias 15043-1` to decrease likelihood of token ' Hello'
--grammar GRAMMAR                       BNF-like grammar to constrain generations (see samples in grammars/
                                        dir)
--grammar-file FNAME                    file to read grammar from
-j,    --json-schema SCHEMA             JSON schema to constrain generations (https://json-schema.org/), e.g.
                                        `{}` for any JSON object
                                        For schemas w/ external $refs, use --grammar +
                                        example/json_schema_to_grammar.py instead
-jf,   --json-schema-file FILE          File containing a JSON schema to constrain generations
                                        (https://json-schema.org/), e.g. `{}` for any JSON object
                                        For schemas w/ external $refs, use --grammar +
                                        example/json_schema_to_grammar.py instead
-bs,   --backend-sampling               enable backend sampling (experimental) (default: disabled)
                                        (env: LLAMA_ARG_BACKEND_SAMPLING)


----- speculative params -----

--spec-draft-hf, -hfd, -hfrd, --hf-repo-draft <user>/<model>[:quant]
                                        Same as --hf-repo, but for the draft model (default: unused)
                                        (env: LLAMA_ARG_SPEC_DRAFT_HF_REPO)
--spec-draft-threads, -td, --threads-draft N
                                        number of threads to use during generation (default: same as
                                        --threads)
--spec-draft-threads-batch, -tbd, --threads-batch-draft N
                                        number of threads to use during batch and prompt processing (default:
                                        same as --threads-draft)
--spec-draft-cpu-mask, -Cd, --cpu-mask-draft M
                                        Draft model CPU affinity mask. Complements cpu-range-draft (default:
                                        same as --cpu-mask)
--spec-draft-cpu-range, -Crd, --cpu-range-draft lo-hi
                                        Ranges of CPUs for affinity. Complements --cpu-mask-draft
--spec-draft-cpu-strict, --cpu-strict-draft <0|1>
                                        Use strict CPU placement for draft model (default: same as
                                        --cpu-strict)
--spec-draft-prio, --prio-draft N       set draft process/thread priority : 0-normal, 1-medium, 2-high,
                                        3-realtime (default: 0)
--spec-draft-poll, --poll-draft <0|1>   Use polling to wait for draft model work (default: same as --poll)
--spec-draft-cpu-mask-batch, -Cbd, --cpu-mask-batch-draft M
                                        Draft model CPU affinity mask. Complements cpu-range-draft (default:
                                        same as --cpu-mask)
--spec-draft-cpu-strict-batch, --cpu-strict-batch-draft <0|1>
                                        Use strict CPU placement for draft model (default: --cpu-strict-draft)
--spec-draft-prio-batch, --prio-batch-draft N
                                        set draft process/thread priority : 0-normal, 1-medium, 2-high,
                                        3-realtime (default: 0)
--spec-draft-poll-batch, --poll-batch-draft <0|1>
                                        Use polling to wait for draft model work (default: --poll-draft)
--spec-draft-override-tensor, -otd, --override-tensor-draft <tensor name pattern>=<buffer type>,...
                                        override tensor buffer type for draft model
--spec-draft-cpu-moe, -cmoed, --cpu-moe-draft
                                        keep all Mixture of Experts (MoE) weights in the CPU for the draft
                                        model
                                        (env: LLAMA_ARG_SPEC_DRAFT_CPU_MOE)
--spec-draft-n-cpu-moe, --spec-draft-ncmoe, -ncmoed, --n-cpu-moe-draft N
                                        keep the Mixture of Experts (MoE) weights of the first N layers in the
                                        CPU for the draft model
                                        (env: LLAMA_ARG_SPEC_DRAFT_N_CPU_MOE)
--spec-draft-n-max N                    number of tokens to draft for speculative decoding (default: 3)
                                        (env: LLAMA_ARG_SPEC_DRAFT_N_MAX)
--spec-draft-n-min N                    minimum number of draft tokens to use for speculative decoding
                                        (default: 0)
                                        (env: LLAMA_ARG_SPEC_DRAFT_N_MIN)
--spec-draft-p-split, --draft-p-split P
                                        speculative decoding split probability (default: 0.10)
                                        (env: LLAMA_ARG_SPEC_DRAFT_P_SPLIT)
--spec-draft-p-min, --draft-p-min P     minimum speculative decoding probability (greedy) (default: 0.00)
                                        (env: LLAMA_ARG_SPEC_DRAFT_P_MIN)
--spec-draft-backend-sampling, --no-spec-draft-backend-sampling
                                        offload draft sampling to the backend (default: enabled)
                                        (env: LLAMA_ARG_SPEC_DRAFT_BACKEND_SAMPLING)
--spec-draft-device, -devd, --device-draft <dev1,dev2,..>
                                        comma-separated list of devices to use for offloading the draft model
                                        (none = don't offload)
                                        use --list-devices to see a list of available devices
--spec-draft-ngl, -ngld, --gpu-layers-draft, --n-gpu-layers-draft N
                                        max. number of draft model layers to store in VRAM, either an exact
                                        number, 'auto', or 'all' (default: auto)
                                        (env: LLAMA_ARG_N_GPU_LAYERS_DRAFT)
--spec-draft-model, -md, --model-draft FNAME
                                        draft model for speculative decoding (default: unused)
                                        (env: LLAMA_ARG_SPEC_DRAFT_MODEL)
--spec-type none,draft-simple,draft-eagle3,draft-mtp,ngram-simple,ngram-map-k,ngram-map-k4v,ngram-mod,ngram-cache
                                        comma-separated list of types of speculative decoding to use (default:
                                        none)
                                        
                                        (env: LLAMA_ARG_SPEC_TYPE)
--spec-ngram-mod-n-min N                minimum number of ngram tokens to use for ngram-based speculative
                                        decoding (default: 48)
--spec-ngram-mod-n-max N                maximum number of ngram tokens to use for ngram-based speculative
                                        decoding (default: 64)
--spec-ngram-mod-n-match N              ngram-mod lookup length (default: 24)
--spec-ngram-simple-size-n N            ngram size N for ngram-simple speculative decoding, length of lookup
                                        n-gram (default: 12)
--spec-ngram-simple-size-m N            ngram size M for ngram-simple speculative decoding, length of draft
                                        m-gram (default: 48)
--spec-ngram-simple-min-hits N          minimum hits for ngram-simple speculative decoding (default: 1)
--spec-ngram-map-k-size-n N             ngram size N for ngram-map-k speculative decoding, length of lookup
                                        n-gram (default: 12)
--spec-ngram-map-k-size-m N             ngram size M for ngram-map-k speculative decoding, length of draft
                                        m-gram (default: 48)
--spec-ngram-map-k-min-hits N           minimum hits for ngram-map-k speculative decoding (default: 1)
--spec-ngram-map-k4v-size-n N           ngram size N for ngram-map-k4v speculative decoding, length of lookup
                                        n-gram (default: 12)
--spec-ngram-map-k4v-size-m N           ngram size M for ngram-map-k4v speculative decoding, length of draft
                                        m-gram (default: 48)
--spec-ngram-map-k4v-min-hits N         minimum hits for ngram-map-k4v speculative decoding (default: 1)
--draft, --draft-n, --draft-max N       the argument has been removed. use --spec-draft-n-max or
                                        --spec-ngram-mod-n-max
                                        (env: LLAMA_ARG_DRAFT_MAX)
--draft-min, --draft-n-min N            the argument has been removed. use --spec-draft-n-min or
                                        --spec-ngram-mod-n-min
                                        (env: LLAMA_ARG_DRAFT_MIN)


----- example-specific params -----

--verbose-prompt                        print a verbose prompt before generation (default: false)
--display-prompt, --no-display-prompt   whether to print prompt at generation (default: true)
-co,   --color [on|off|auto]            Colorize output to distinguish prompt and user input from generations
                                        ('on', 'off', or 'auto', default: 'auto')
                                        'auto' enables colors when output is to a terminal
-ctxcp, --ctx-checkpoints, --swa-checkpoints N
                                        max number of context checkpoints to create per slot (default: 32)
                                        [(more info)](https://github.com/ggml-org/llama.cpp/pull/15293)
                                        (env: LLAMA_ARG_CTX_CHECKPOINTS)
-cram, --cache-ram N                    set the maximum cache size in MiB (default: 8192, -1 - no limit, 0 -
                                        disable)
                                        [(more info)](https://github.com/ggml-org/llama.cpp/pull/16391)
                                        (env: LLAMA_ARG_CACHE_RAM)
--context-shift, --no-context-shift     whether to use context shift on infinite text generation (default:
                                        disabled)
                                        (env: LLAMA_ARG_CONTEXT_SHIFT)
-sys,  --system-prompt PROMPT           system prompt to use with model (if applicable, depending on chat
                                        template)
--show-timings, --no-show-timings       whether to show timing information after each response (default: true)
                                        (env: LLAMA_ARG_SHOW_TIMINGS)
-sysf, --system-prompt-file FNAME       a file containing the system prompt (default: none)
-r,    --reverse-prompt PROMPT          halt generation at PROMPT, return control in interactive mode
-sp,   --special                        special tokens output enabled (default: false)
-cnv,  --conversation, -no-cnv, --no-conversation
                                        whether to run in conversation mode:
                                        - does not print special tokens and suffix/prefix
                                        - interactive mode is also enabled
                                        (default: auto enabled if chat template is available)
-st,   --single-turn                    run conversation for a single turn only, then exit when done
                                        will not be interactive if first turn is predefined with --prompt
                                        (default: false)
-mli,  --multiline-input                allows you to write or paste multiple lines without ending each in '\'
--warmup, --no-warmup                   whether to perform warmup with an empty run (default: enabled)
-mm,   --mmproj FILE                    path to a multimodal projector file. see tools/mtmd/README.md
                                        note: if -hf is used, this argument can be omitted
                                        (env: LLAMA_ARG_MMPROJ)
-mmu,  --mmproj-url URL                 URL to a multimodal projector file. see tools/mtmd/README.md
                                        (env: LLAMA_ARG_MMPROJ_URL)
--mmproj-auto, --no-mmproj, --no-mmproj-auto
                                        whether to use multimodal projector file (if available), useful when
                                        using -hf (default: enabled)
                                        (env: LLAMA_ARG_MMPROJ_AUTO)
--mmproj-offload, --no-mmproj-offload   whether to enable GPU offloading for multimodal projector (default:
                                        enabled)
                                        (env: LLAMA_ARG_MMPROJ_OFFLOAD)
--image, --audio FILE                   path to an image or audio file. use with multimodal models, use
                                        comma-separated values for multiple files
--image-min-tokens N                    minimum number of tokens each image can take, only used by vision
                                        models with dynamic resolution (default: read from model)
                                        (env: LLAMA_ARG_IMAGE_MIN_TOKENS)
--image-max-tokens N                    maximum number of tokens each image can take, only used by vision
                                        models with dynamic resolution (default: read from model)
                                        (env: LLAMA_ARG_IMAGE_MAX_TOKENS)
--chat-template-kwargs STRING           sets additional params for the json template parser, must be a valid
                                        json object string, e.g. '{"key1":"value1","key2":"value2"}'
                                        (env: LLAMA_ARG_CHAT_TEMPLATE_KWARGS)
--jinja, --no-jinja                     whether to use jinja template engine for chat (default: enabled)
                                        (env: LLAMA_ARG_JINJA)
--reasoning-format FORMAT               controls whether thought tags are allowed and/or extracted from the
                                        response, and in which format they're returned; one of:
                                        - none: leaves thoughts unparsed in `message.content`
                                        - deepseek: puts thoughts in `message.reasoning_content`
                                        - deepseek-legacy: keeps `<think>` tags in `message.content` while
                                        also populating `message.reasoning_content`
                                        (default: auto)
                                        (env: LLAMA_ARG_THINK)
-rea,  --reasoning [on|off|auto]        Use reasoning/thinking in the chat ('on', 'off', or 'auto', default:
                                        'auto' (detect from template))
                                        (env: LLAMA_ARG_REASONING)
--reasoning-budget N                    token budget for thinking: -1 for unrestricted, 0 for immediate end,
                                        N>0 for token budget (default: -1)
                                        (env: LLAMA_ARG_THINK_BUDGET)
--reasoning-budget-message MESSAGE      message injected before the end-of-thinking tag when reasoning budget
                                        is exhausted (default: none)
                                        (env: LLAMA_ARG_THINK_BUDGET_MESSAGE)
--chat-template JINJA_TEMPLATE          set custom jinja chat template (default: template taken from model's
                                        metadata)
                                        if suffix/prefix are specified, template will be disabled
                                        only commonly used templates are accepted (unless --jinja is set
                                        before this flag):
                                        list of built-in templates:
                                        bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml,
                                        command-r, deepseek, deepseek-ocr, deepseek2, deepseek3, exaone-moe,
                                        exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite,
                                        granite-4.0, granite-4.1, grok-2, hunyuan-dense, hunyuan-moe,
                                        hunyuan-vl, kimi-k2, llama2, llama2-sys, llama2-sys-bos,
                                        llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1,
                                        mistral-v3, mistral-v3-tekken, mistral-v7, mistral-v7-tekken, monarch,
                                        openchat, orion, pangu-embedded, phi3, phi4, rwkv-world, seed_oss,
                                        smolvlm, solar-open, vicuna, vicuna-orca, yandex, zephyr
                                        (env: LLAMA_ARG_CHAT_TEMPLATE)
--chat-template-file JINJA_TEMPLATE_FILE
                                        set custom jinja chat template file (default: template taken from
                                        model's metadata)
                                        if suffix/prefix are specified, template will be disabled
                                        only commonly used templates are accepted (unless --jinja is set
                                        before this flag):
                                        list of built-in templates:
                                        bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml,
                                        command-r, deepseek, deepseek-ocr, deepseek2, deepseek3, exaone-moe,
                                        exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite,
                                        granite-4.0, granite-4.1, grok-2, hunyuan-dense, hunyuan-moe,
                                        hunyuan-vl, kimi-k2, llama2, llama2-sys, llama2-sys-bos,
                                        llama2-sys-strip, llama3, llama4, megrez, minicpm, mistral-v1,
                                        mistral-v3, mistral-v3-tekken, mistral-v7, mistral-v7-tekken, monarch,
                                        openchat, orion, pangu-embedded, phi3, phi4, rwkv-world, seed_oss,
                                        smolvlm, solar-open, vicuna, vicuna-orca, yandex, zephyr
                                        (env: LLAMA_ARG_CHAT_TEMPLATE_FILE)
--skip-chat-parsing, --no-skip-chat-parsing
                                        force a pure content parser, even if a Jinja template is specified;
                                        model will output everything in the content section, including any
                                        reasoning and/or tool calls (default: disabled)
                                        (env: LLAMA_ARG_SKIP_CHAT_PARSING)
--simple-io                             use basic IO for better compatibility in subprocesses and limited
                                        consoles
--gpt-oss-20b-default                   use gpt-oss-20b (note: can download weights from the internet)
--gpt-oss-120b-default                  use gpt-oss-120b (note: can download weights from the internet)
--vision-gemma-4b-default               use Gemma 3 4B QAT (note: can download weights from the internet)
--vision-gemma-12b-default              use Gemma 3 12B QAT (note: can download weights from the internet)
--spec-default                          enable default speculative decoding config
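
The `--reasoning-format` entry in the help output above says that `deepseek` puts thoughts in `message.reasoning_content`, while `none` leaves them unparsed in `message.content`. Below is a minimal Python sketch of reading both fields on the client side. It assumes the response has the same `choices[0].message` shape as the response shown earlier in this post. The helper name `split_reasoning` and the sample JSON are made up for illustration and are not real server output.

```python
import json


def split_reasoning(response_text):
    """Return (reasoning, answer) from a chat completion JSON string.

    Hypothetical helper: reads choices[0].message and treats a missing
    reasoning_content field as an empty string.
    """
    message = json.loads(response_text)["choices"][0]["message"]
    reasoning = message.get("reasoning_content") or ""
    answer = message.get("content") or ""
    return reasoning, answer


# Hand-written sample, not captured from a real server.
sample = json.dumps({
    "choices": [{
        "finish_reason": "stop",
        "index": 0,
        "message": {
            "role": "assistant",
            "reasoning_content": "The user wants a greeting.",
            "content": "Hello!",
        },
    }]
})

reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```

Treating a missing `reasoning_content` as an empty string lets the same client code work whether or not the server separates the thoughts from the answer.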

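According to the help output, `--chat-template-kwargs` must be a valid JSON object string, and that is easy to get wrong when quoting by hand in a shell. The sketch below serializes a dict and shell-quotes the result. The `llama-server` binary name and the helper name `kwargs_args` are assumptions for illustration, and `key1`/`key2` are just the placeholder keys from the help text, not real template parameters.

```python
import json
import shlex


def kwargs_args(kwargs):
    """Build the --chat-template-kwargs argument pair from a dict."""
    if not isinstance(kwargs, dict):
        raise TypeError("chat template kwargs must be a JSON object (dict)")
    # Compact separators reproduce the style of the example in the help text.
    return ["--chat-template-kwargs", json.dumps(kwargs, separators=(",", ":"))]


# Assumed binary name; key1/key2 are the placeholder keys from the help text.
argv = ["llama-server"] + kwargs_args({"key1": "value1", "key2": "value2"})

# shlex.join adds the single quotes a POSIX shell needs around the JSON.
print(shlex.join(argv))
```

The help output also lists the `LLAMA_ARG_CHAT_TEMPLATE_KWARGS` environment variable for this option, which should sidestep shell quoting of the command line altogether.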
 

[Link: https://www.sktenterprise.com/bizInsight/blogDetail/dev/10236]

[Link: https://huggingface.co/google/gemma-1.1-2b-it]

[Link: https://huggingface.co/google/gemma-4-12B-it]

[Link: https://huggingface.co/google/gemma-4-E4B-it]

Posted by 구차니