According to the Google AI summary,
Ollama can pool VRAM across multiple GPUs to run models too large for a single card.
Ollama supports multiple GPUs (both NVIDIA and AMD) by automatically splitting model layers across available VRAM, allowing users to run large models that exceed the capacity of a single card.
VRAM Aggregation and Usage
- Total Capacity: Ollama sums the total VRAM across all detected GPUs (e.g., two 16GB GPUs behave as 32GB).
- Layer Splitting: If a model cannot fit into one GPU, Ollama splits its layers across the available GPUs; any layers that still don't fit spill over to system RAM, trading GPU speed for slow CPU inference.
- Example Scenario: A 30GB Q4_K_M model (e.g., Llama 3 70B) can be split across two 16GB or two 24GB GPUs, loading roughly 50% on each.
- KV Cache: A portion of VRAM is reserved for the context window (KV cache), which can take several GBs depending on context length.
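To check how a loaded model was actually placed, `ollama ps` reports the GPU/CPU split. A minimal sketch (the model name is just an example; substitute one you have pulled, and expect the percentages to depend on your hardware):

```shell
# Load an example model, then inspect where its layers landed.
ollama run llama3:8b "hello" >/dev/null

# The PROCESSOR column shows the placement, e.g. "100% GPU" when the
# model fits entirely in VRAM, or something like "25%/75% CPU/GPU"
# when part of it spilled over to system RAM.
ollama ps
```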
Multi-GPU Performance Considerations
- Optimal Use: If the model fits into a single GPU, Ollama will typically use only one card for the best performance, as splitting increases data transfer between cards.
- Performance Penalty: Partial offloading (spreading a model across both GPU and CPU) can cause performance to degrade by 5–20x.
- Mixing GPUs: You can mix different GPUs (e.g., a 3090 24GB + 1660 6GB), but the overall performance might be bottlenecked by the slower card.
- Multi-Instance Setup: To maximize performance on multiple GPUs, you can run multiple instances of Ollama on different ports, each assigned to a specific GPU (e.g., using CUDA_VISIBLE_DEVICES).
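The multi-instance setup above can be sketched as two `ollama serve` processes, each pinned to one GPU via `CUDA_VISIBLE_DEVICES` and bound to its own port via `OLLAMA_HOST`. Ports, GPU IDs, and the model name here are illustrative:

```shell
# Instance 1: GPU 0 on the default port 11434
CUDA_VISIBLE_DEVICES=0 OLLAMA_HOST=127.0.0.1:11434 ollama serve &

# Instance 2: GPU 1 on port 11435
CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11435 ollama serve &

# Point a client at a specific instance with the same variable:
OLLAMA_HOST=127.0.0.1:11435 ollama run llama3:8b "hello"
```

Each instance then serves requests independently, so two clients can run in parallel without the cross-GPU transfer overhead of layer splitting.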
How to Configure
- NVIDIA: Set CUDA_VISIBLE_DEVICES=0,1 (comma-separated IDs) to enable multi-GPU.
- AMD: Use ROCR_VISIBLE_DEVICES=0,1.
- Automatic Split: By default, Ollama handles the split automatically, but you can set PARAMETER num_gpu 999 in a Modelfile to force all layers to be spread across all available GPUs.
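Putting the configuration steps together, a minimal sketch might look like this (the base model and the variant name are examples, not requirements):

```shell
# Make both GPUs visible to Ollama (NVIDIA; on AMD use ROCR_VISIBLE_DEVICES).
export CUDA_VISIBLE_DEVICES=0,1

# Modelfile: derive from an existing model and request that all layers
# be offloaded to GPU (999 is simply "more layers than any model has").
cat > Modelfile <<'EOF'
FROM llama3:70b
PARAMETER num_gpu 999
EOF

# Build the variant and run it.
ollama create llama3-70b-allgpu -f Modelfile
ollama run llama3-70b-allgpu "hello"
```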
VRAM Requirements by Model Size
- Entry-level (3-4GB VRAM): 3-4B models (Q4_K_M).
- Mid-range (6-8GB VRAM): 7-9B models (Q4_K_M).
- High-end (10-12GB VRAM): 12-14B models (Q4_K_M).
- Dual GPU Setup (16-24GB VRAM): 22-35B models (Gemma 3 27B, Qwen3 32B).
- Workstation (48GB+ VRAM): 70B+ models (Llama 3.3 70B).
So this is where the dual GPU setup comes in:
anyway, with two 24GB cards for 48GB total, a 70B model can reportedly be run.
Should I buy a second card, switch to an SLI/CrossFire-capable motherboard, upgrade the PSU... and so on?
Can I use multiple GPUs with Ollama for larger models? Yes, Ollama supports multi-GPU configurations for NVIDIA and AMD cards. For NVIDIA, set CUDA_VISIBLE_DEVICES to comma-separated GPU IDs to distribute model layers across multiple GPUs. This enables running 70B models on dual 24GB GPUs (48GB total) that wouldn't fit on a single card. For AMD GPUs, use ROCR_VISIBLE_DEVICES with the same approach to leverage combined VRAM across multiple cards.
[링크 : https://localllm.in/blog/ollama-vram-requirements-for-local-llms]