According to the Google AI summary,
Ollama can pool VRAM across multiple GPUs to run models too large for a single card.
Ollama supports multiple GPUs (both NVIDIA and AMD) by automatically splitting model layers across available VRAM, allowing users to run large models that exceed the capacity of a single card.
VRAM Aggregation and Usage
- Total Capacity: Ollama sums the total VRAM across all detected GPUs (e.g., two 16GB GPUs behave as 32GB).
- Layer Splitting: If a model cannot fit into one GPU, Ollama splits its layers across the available GPUs; any layers that still don't fit spill over to system RAM, trading GPU speed for slow CPU inference.
- Example Scenario: A 30GB Q4_K_M model (e.g., Llama 3 70B) can be split across two 16GB or two 24GB GPUs, loading roughly 50% on each.
- KV Cache: A portion of VRAM is reserved for the context window (KV cache), which can take several GBs depending on context length.
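To check how a loaded model was actually placed, `ollama ps` reports the GPU/CPU split. A minimal sketch (the model name is just an example; substitute one you have pulled, and expect the percentages to depend on your hardware):

```shell
# Load an example model, then inspect where its layers landed.
ollama run llama3:8b "hello" >/dev/null

# The PROCESSOR column shows the placement, e.g. "100% GPU" when the
# model fits entirely in VRAM, or something like "25%/75% CPU/GPU"
# when part of it spilled over to system RAM.
ollama ps
```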
Multi-GPU Performance Considerations
- Optimal Use: If the model fits into a single GPU, Ollama will typically use only one card for the best performance, as splitting increases data transfer between cards.
- Performance Penalty: Partial offloading (spreading a model across both GPU and CPU) can cause performance to degrade by 5–20x.
- Mixing GPUs: You can mix different GPUs (e.g., a 3090 24GB + 1660 6GB), but the overall performance might be bottlenecked by the slower card.
- Multi-Instance Setup: To maximize performance on multiple GPUs, you can run multiple instances of Ollama on different ports, each assigned to a specific GPU (e.g., using CUDA_VISIBLE_DEVICES).
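The multi-instance setup above can be sketched as two `ollama serve` processes, each pinned to one GPU via `CUDA_VISIBLE_DEVICES` and bound to its own port via `OLLAMA_HOST`. Ports, GPU IDs, and the model name here are illustrative:

```shell
# Instance 1: GPU 0 on the default port 11434
CUDA_VISIBLE_DEVICES=0 OLLAMA_HOST=127.0.0.1:11434 ollama serve &

# Instance 2: GPU 1 on port 11435
CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11435 ollama serve &

# Point a client at a specific instance with the same variable:
OLLAMA_HOST=127.0.0.1:11435 ollama run llama3:8b "hello"
```

Each instance then serves requests independently, so two clients can run in parallel without the cross-GPU transfer overhead of layer splitting.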
How to Configure
- NVIDIA: Set CUDA_VISIBLE_DEVICES=0,1 (comma-separated IDs) to enable multi-GPU.
- AMD: Use ROCR_VISIBLE_DEVICES=0,1.
- Automatic Split: By default, Ollama handles the split automatically, but you can set PARAMETER num_gpu 999 in a Modelfile to force all layers to be spread across all available GPUs.
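Putting the configuration steps together, a minimal sketch might look like this (the base model and the variant name are examples, not requirements):

```shell
# Make both GPUs visible to Ollama (NVIDIA; on AMD use ROCR_VISIBLE_DEVICES).
export CUDA_VISIBLE_DEVICES=0,1

# Modelfile: derive from an existing model and request that all layers
# be offloaded to GPU (999 is simply "more layers than any model has").
cat > Modelfile <<'EOF'
FROM llama3:70b
PARAMETER num_gpu 999
EOF

# Build the variant and run it.
ollama create llama3-70b-allgpu -f Modelfile
ollama run llama3-70b-allgpu "hello"
```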
VRAM Requirements by Model Size
- Entry-level (3-4GB VRAM): 3-4B models (Q4_K_M).
- Mid-range (6-8GB VRAM): 7-9B models (Q4_K_M).
- High-end (10-12GB VRAM): 12-14B models (Q4_K_M).
- Dual GPU Setup (16-24GB VRAM): 22-35B models (Gemma 3 27B, Qwen3 32B).
- Workstation (48GB+ VRAM): 70B+ models (Llama 3.3 70B).
So this is where the dual GPU setup comes in:
anyway, with two 24GB cards for 48GB total, a 70B model can reportedly be run.
Should I buy a second card, switch to an SLI/CrossFire-capable motherboard, upgrade the PSU... and so on?
Can I use multiple GPUs with Ollama for larger models? Yes, Ollama supports multi-GPU configurations for NVIDIA and AMD cards. For NVIDIA, set CUDA_VISIBLE_DEVICES to comma-separated GPU IDs to distribute model layers across multiple GPUs. This enables running 70B models on dual 24GB GPUs (48GB total) that wouldn't fit on a single card. For AMD GPUs, use ROCR_VISIBLE_DEVICES with the same approach to leverage combined VRAM across multiple cards.
[링크 : https://localllm.in/blog/ollama-vram-requirements-for-local-llms]