vllm install and run failed -> effectively gave up

2026. 5. 24. 17:56

To run it, the docs say to do this:

vllm serve google/gemma-4-E4B-it \
--max-model-len <n_of_tokens> # up to 131072

[Link : https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html]

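Man.. that is ridiculously big. By the way, will it download straight from Hugging Face?

And I guess the gguf was small because it was quantized. The one I had been using was Q4_K_M and came to about 4.7 GB,

but model.safetensors is a full 16 GB. Wow.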

[Link : https://huggingface.co/google/gemma-4-E4B]
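
The size gap can be sanity-checked with simple arithmetic. This is only a rough sketch under two assumptions that are not from the post: the 16 GB safetensors file stores 16-bit weights, and Q4_K_M averages roughly 4.8 bits per weight. The helper name `weights_gb` is made up for illustration.

```python
# Rough weight-size estimate: parameters x bits per weight.
# ASSUMPTIONS (not from the post): the 16 GB file holds 16-bit weights,
# and Q4_K_M averages about 4.8 bits per weight.

def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Return the approximate size in GB (10^9 bytes) of the stored weights."""
    return n_params * bits_per_weight / 8 / 1e9

# Back out a parameter count from the 16 GB safetensors file.
n_params = 16e9 / 2  # 2 bytes per 16-bit weight

fp16_gb = weights_gb(n_params, 16)
q4_gb = weights_gb(n_params, 4.8)
print(f"16-bit: {fp16_gb:.1f} GB, ~4.8-bit: {q4_gb:.1f} GB")
```

Under those assumptions the quantized estimate lands near the 4.7 GB I remember, so the difference looks like plain quantization rather than a different model size. The gguf and the safetensors file may not contain exactly the same set of tensors, so treat this as a ballpark only.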

vllm: a 1080 Ti? That's junk food! Ptui!

$ vllm serve google/gemma-4-E4B-it
(APIServer pid=52696) INFO 05-24 22:20:19 [utils.py:306]
(APIServer pid=52696) INFO 05-24 22:20:19 [utils.py:306]        █     █     █▄   ▄█
(APIServer pid=52696) INFO 05-24 22:20:19 [utils.py:306]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.21.0
(APIServer pid=52696) INFO 05-24 22:20:19 [utils.py:306]   █▄█▀ █     █     █     █  model   google/gemma-4-E4B-it
(APIServer pid=52696) INFO 05-24 22:20:19 [utils.py:306]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=52696) INFO 05-24 22:20:19 [utils.py:306]
(APIServer pid=52696) INFO 05-24 22:20:19 [utils.py:240] non-default args: {'model_tag': 'google/gemma-4-E4B-it', 'model': 'google/gemma-4-E4B-it'}
(APIServer pid=52696) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
config.json: 5.14kB [00:00, 17.7MB/s]
processor_config.json: 1.69kB [00:00, 1.70MB/s]
(APIServer pid=52696) INFO 05-24 22:20:29 [model.py:568] Resolved architecture: Gemma4ForConditionalGeneration
(APIServer pid=52696) WARNING 05-24 22:20:29 [model.py:1982] Your device 'NVIDIA GeForce GTX 1080 Ti' (with compute capability 6.1) doesn't support torch.bfloat16. Falling back to torch.float16 for compatibility.
(APIServer pid=52696) WARNING 05-24 22:20:29 [model.py:2035] Casting torch.bfloat16 to torch.float16.
(APIServer pid=52696) INFO 05-24 22:20:29 [model.py:1697] Using max model len 131072
(APIServer pid=52696) INFO 05-24 22:20:29 [config.py:101] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.
(APIServer pid=52696) INFO 05-24 22:20:29 [vllm.py:886] Asynchronous scheduling is enabled.
(APIServer pid=52696) INFO 05-24 22:20:29 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
tokenizer_config.json: 2.10kB [00:00, 2.05MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32.2M/32.2M [00:02<00:00, 14.8MB/s]
chat_template.jinja: 17.3kB [00:00, 12.6MB/s]
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 208/208 [00:00<00:00, 1.68MB/s]
(EngineCore pid=52767) INFO 05-24 22:21:24 [core.py:109] Initializing a V1 LLM engine (v0.21.0) with config: model='google/gemma-4-E4B-it', speculative_config=None, tokenizer='google/gemma-4-E4B-it', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=google/gemma-4-E4B-it, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native']), enable_flashinfer_autotune=False, moe_backend='auto')
(EngineCore pid=52767) /home/minimonk/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:371: UserWarning: Found GPU0 NVIDIA GeForce GTX 1080 Ti which is of compute capability (CC) 6.1.
(EngineCore pid=52767) The following list shows the CCs this version of PyTorch was built for and the hardware CCs it supports:
(EngineCore pid=52767) - 7.5 which supports hardware CC >=7.5,<8.0
(EngineCore pid=52767) - 8.0 which supports hardware CC >=8.0,<9.0 except {8.7}
(EngineCore pid=52767) - 8.6 which supports hardware CC >=8.6,<9.0 except {8.7}
(EngineCore pid=52767) - 9.0 which supports hardware CC >=9.0,<10.0
(EngineCore pid=52767) - 10.0 which supports hardware CC >=10.0,<11.0 except {10.1}
(EngineCore pid=52767) - 12.0 which supports hardware CC >=12.0,<13.0
(EngineCore pid=52767) Please follow the instructions at https://pytorch.org/get-started/locally/ to install a PyTorch release that supports one of these CUDA versions: 12.6
(EngineCore pid=52767)   _warn_unsupported_code(d, device_cc, code_ccs)
(EngineCore pid=52767) /home/minimonk/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:371: UserWarning: Found GPU1 NVIDIA GeForce GTX 1080 Ti which is of compute capability (CC) 6.1.
(EngineCore pid=52767) The following list shows the CCs this version of PyTorch was built for and the hardware CCs it supports:
(EngineCore pid=52767) - 7.5 which supports hardware CC >=7.5,<8.0
(EngineCore pid=52767) - 8.0 which supports hardware CC >=8.0,<9.0 except {8.7}
(EngineCore pid=52767) - 8.6 which supports hardware CC >=8.6,<9.0 except {8.7}
(EngineCore pid=52767) - 9.0 which supports hardware CC >=9.0,<10.0
(EngineCore pid=52767) - 10.0 which supports hardware CC >=10.0,<11.0 except {10.1}
(EngineCore pid=52767) - 12.0 which supports hardware CC >=12.0,<13.0
(EngineCore pid=52767) Please follow the instructions at https://pytorch.org/get-started/locally/ to install a PyTorch release that supports one of these CUDA versions: 12.6
(EngineCore pid=52767)   _warn_unsupported_code(d, device_cc, code_ccs)
(EngineCore pid=52767) /home/minimonk/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:489: UserWarning:
(EngineCore pid=52767) NVIDIA GeForce GTX 1080 Ti with CUDA capability sm_61 is not compatible with the current PyTorch installation.
(EngineCore pid=52767) The current PyTorch install supports CUDA capabilities sm_75 sm_80 sm_86 sm_90 sm_100 sm_120.
(EngineCore pid=52767) If you want to use the NVIDIA GeForce GTX 1080 Ti GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
(EngineCore pid=52767)
(EngineCore pid=52767)   queued_call()
(EngineCore pid=52767) INFO 05-24 22:21:30 [parallel_state.py:1410] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.40.238:47913 backend=nccl
(EngineCore pid=52767) INFO 05-24 22:21:30 [parallel_state.py:1723] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=52767) WARNING 05-24 22:21:31 [topk_topp_sampler.py:61] FlashInfer top-p/top-k sampling not supported on compute capability 6.1; falling back to PyTorch-native sampler. Set VLLM_USE_FLASHINFER_SAMPLER=0 to silence.
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] EngineCore failed to start.
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] Traceback (most recent call last):
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1114, in run_engine_core
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]     return func(*args, **kwargs)
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 880, in __init__
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]     super().__init__(
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 118, in __init__
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]     self.model_executor = executor_class(vllm_config)
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]     return func(*args, **kwargs)
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 109, in __init__
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]     self._init_executor()
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 60, in _init_executor
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]     self.driver_worker.init_device()
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/worker_base.py", line 317, in init_device
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]     self.worker.init_device()  # type: ignore
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]     return func(*args, **kwargs)
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 330, in init_device
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]     self.model_runner = GPUModelRunnerV1(self.vllm_config, self.device)
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 629, in __init__
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]     self.input_batch = InputBatch(
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_input_batch.py", line 171, in __init__
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]     self.block_table = MultiGroupBlockTable(
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 267, in __init__
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]     self.block_tables = [
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 268, in <listcomp>
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]     BlockTable(
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 70, in __init__
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]     self.block_table = self._make_buffer(
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 218, in _make_buffer
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]     return CpuGpuBuffer(
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/utils.py", line 120, in __init__
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]     self.gpu = torch.zeros_like(self.cpu, device=device)
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] Search for `cudaErrorNoKernelImageForDevice' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140]
(EngineCore pid=52767) Process EngineCore:
(EngineCore pid=52767) Traceback (most recent call last):
(EngineCore pid=52767)   File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=52767)     self.run()
(EngineCore pid=52767)   File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore pid=52767)     self._target(*self._args, **self._kwargs)
(EngineCore pid=52767)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1144, in run_engine_core
(EngineCore pid=52767)     raise e
(EngineCore pid=52767)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1114, in run_engine_core
(EngineCore pid=52767)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=52767)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=52767)     return func(*args, **kwargs)
(EngineCore pid=52767)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 880, in __init__
(EngineCore pid=52767)     super().__init__(
(EngineCore pid=52767)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 118, in __init__
(EngineCore pid=52767)     self.model_executor = executor_class(vllm_config)
(EngineCore pid=52767)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=52767)     return func(*args, **kwargs)
(EngineCore pid=52767)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 109, in __init__
(EngineCore pid=52767)     self._init_executor()
(EngineCore pid=52767)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 60, in _init_executor
(EngineCore pid=52767)     self.driver_worker.init_device()
(EngineCore pid=52767)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/worker_base.py", line 317, in init_device
(EngineCore pid=52767)     self.worker.init_device()  # type: ignore
(EngineCore pid=52767)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=52767)     return func(*args, **kwargs)
(EngineCore pid=52767)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 330, in init_device
(EngineCore pid=52767)     self.model_runner = GPUModelRunnerV1(self.vllm_config, self.device)
(EngineCore pid=52767)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 629, in __init__
(EngineCore pid=52767)     self.input_batch = InputBatch(
(EngineCore pid=52767)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_input_batch.py", line 171, in __init__
(EngineCore pid=52767)     self.block_table = MultiGroupBlockTable(
(EngineCore pid=52767)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 267, in __init__
(EngineCore pid=52767)     self.block_tables = [
(EngineCore pid=52767)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 268, in <listcomp>
(EngineCore pid=52767)     BlockTable(
(EngineCore pid=52767)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 70, in __init__
(EngineCore pid=52767)     self.block_table = self._make_buffer(
(EngineCore pid=52767)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 218, in _make_buffer
(EngineCore pid=52767)     return CpuGpuBuffer(
(EngineCore pid=52767)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/utils.py", line 120, in __init__
(EngineCore pid=52767)     self.gpu = torch.zeros_like(self.cpu, device=device)
(EngineCore pid=52767) torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device
(EngineCore pid=52767) Search for `cudaErrorNoKernelImageForDevice' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore pid=52767) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore pid=52767) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore pid=52767) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore pid=52767)
[rank0]:[W524 22:21:32.199264489 ProcessGroupNCCL.cpp:1575] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=52696) Traceback (most recent call last):
(APIServer pid=52696)   File "/home/minimonk/.local/bin/vllm", line 8, in <module>
(APIServer pid=52696)     sys.exit(main())
(APIServer pid=52696)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/entrypoints/cli/main.py", line 92, in main
(APIServer pid=52696)     args.dispatch_function(args)
(APIServer pid=52696)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=52696)     uvloop.run(run_server(args))
(APIServer pid=52696)   File "/home/minimonk/.local/lib/python3.10/site-packages/uvloop/__init__.py", line 69, in run
(APIServer pid=52696)     return loop.run_until_complete(wrapper())
(APIServer pid=52696)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=52696)   File "/home/minimonk/.local/lib/python3.10/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=52696)     return await main
(APIServer pid=52696)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 693, in run_server
(APIServer pid=52696)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=52696)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 707, in run_server_worker
(APIServer pid=52696)     async with build_async_engine_client(
(APIServer pid=52696)   File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
(APIServer pid=52696)     return await anext(self.gen)
(APIServer pid=52696)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=52696)     async with build_async_engine_client_from_engine_args(
(APIServer pid=52696)   File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
(APIServer pid=52696)     return await anext(self.gen)
(APIServer pid=52696)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=52696)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=52696)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 217, in from_vllm_config
(APIServer pid=52696)     return cls(
(APIServer pid=52696)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 146, in __init__
(APIServer pid=52696)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=52696)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=52696)     return func(*args, **kwargs)
(APIServer pid=52696)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=52696)     return AsyncMPClient(*client_args)
(APIServer pid=52696)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=52696)     return func(*args, **kwargs)
(APIServer pid=52696)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 900, in __init__
(APIServer pid=52696)     super().__init__(
(APIServer pid=52696)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=52696)     with launch_core_engines(
(APIServer pid=52696)   File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
(APIServer pid=52696)     next(self.gen)
(APIServer pid=52696)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 1128, in launch_core_engines
(APIServer pid=52696)     wait_for_engine_startup(
(APIServer pid=52696)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 1187, in wait_for_engine_startup
(APIServer pid=52696)     raise RuntimeError(
(APIServer pid=52696) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Has Pascal been dropped from the supported hardware entirely?

[Link : https://docs.vllm.ai/en/latest/features/quantization/]
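
The PyTorch warning in the log lists which compute capabilities the installed build targets and which hardware range each one covers. As a minimal sketch of that rule, assuming the simplified reading "same major version, device minor at least the built minor" and ignoring the listed exceptions such as {8.7} and any PTX forward compatibility, the check looks like this. The helper names `parse_cc`, `covered_by` and `is_supported` are hypothetical.

```python
import re

def parse_cc(tag: str) -> tuple[int, int]:
    """Turn 'sm_61' or '6.1' into (6, 1); the last digit is the minor version."""
    digits = re.sub(r"\D", "", tag)
    return int(digits[:-1]), int(digits[-1])

def covered_by(device: tuple[int, int], built: tuple[int, int]) -> bool:
    """Simplified rule: same major version, device minor >= built minor."""
    return device[0] == built[0] and device[1] >= built[1]

def is_supported(device_tag: str, built_tags: list[str]) -> bool:
    """True if any architecture the build targets covers the device."""
    device = parse_cc(device_tag)
    return any(covered_by(device, parse_cc(tag)) for tag in built_tags)

# Architecture list copied from the warning in the log.
BUILT = ["sm_75", "sm_80", "sm_86", "sm_90", "sm_100", "sm_120"]

print(is_supported("sm_61", BUILT))  # GTX 1080 Ti
print(is_supported("sm_89", BUILT))  # an 8.9 device, for comparison
```

On a real machine the inputs would come from `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`. With nothing in the 6.x range in the list, the first CUDA kernel launch has no image to run, which matches the `no kernel image is available` error.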

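There is also a fork like this, but it says it has been superseded by pascal-pkgs-ci below.

[Link : https://github.com/cduk/vllm-pascal]

Should I try it with Docker..

[Link : https://github.com/sasha0552/pascal-pkgs-ci]

[Link : https://github.com/vllm-project/vllm/issues/19542]

Would it work to just swap in the Docker image above? It would probably be good to change the volume from the local cache path to HF_HOME.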

docker run -itd --name gemma4 \
    --ipc=host \
    --network host \
    --shm-size 16G \
    --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    vllm/vllm-openai:latest \
        --model google/gemma-4-31B-it \
        --tensor-parallel-size 2 \
        --max-model-len 32768 \
        --gpu-memory-utilization 0.90 \
        --host 0.0.0.0 \
        --port 8000

[Link : https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html#docker-deployment]
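
For the volume, here is a small sketch of picking the host side of the `-v` mapping from HF_HOME, falling back to the default `~/.cache/huggingface` when the variable is unset. The function name `hf_volume_arg` is made up for illustration; the container-side path is the one used in the recipe command.

```python
import os

# Container-side cache path, as used in the docker commands in this post.
CONTAINER_CACHE = "/root/.cache/huggingface"

def hf_volume_arg(env=None) -> str:
    """Return the value for `docker run -v`, preferring HF_HOME when it is set."""
    env = os.environ if env is None else env
    host_dir = env.get("HF_HOME") or os.path.expanduser("~/.cache/huggingface")
    return f"{host_dir}:{CONTAINER_CACHE}"

# Example with HF_HOME pointing at a separate disk.
print(hf_volume_arg({"HF_HOME": "/mnt/huggingface"}))
```

Passing the environment as a dict keeps the function easy to try without touching the real shell environment; calling it with no argument reads `os.environ`.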

If Docker fails to recognize the graphics card, as below,

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]

apparently you just install nvidia-container-toolkit and restart Docker.

$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
$ sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
$ sudo systemctl restart docker

[Link : https://bluecolorsky.tistory.com/110]

docker run -itd --name gemma4 \
    --ipc=host \
    --network host \
    --gpus all \
    -v /mnt/huggingface:/root/.cache/huggingface \
    ghcr.io/sasha0552/vllm \
        --model google/gemma-4-e4b-it \
        --tensor-parallel-size 2 \
        --max-model-len 131072 \
        --gpu-memory-utilization 0.90 \
        --host 0.0.0.0 \
        --port 8000

[Link : https://github.com/sasha0552/pascal-pkgs-ci/pkgs/container/vllm]

Ah, whatever, lol

$ docker ps -a
CONTAINER ID   IMAGE                    COMMAND                  CREATED          STATUS                      PORTS     NAMES
fdd1ba582924   ghcr.io/sasha0552/vllm   "python3 -m vllm.ent…"   42 seconds ago   Exited (1) 26 seconds ago             gemma4

[Link : https://data-newbie.tistory.com/m/1012]

[Link : https://coding-review.tistory.com/m/608]

2026.05.25

docker run -itd \
  --name gemma4 \
  --ipc=host \
  --network host \
  --gpus all \
  -v /mnt/huggingface:/root/.cache/huggingface \
  ghcr.io/sasha0552/vllm

It seems to run without problems, and yet

why is qwen3-0.6B being mentioned?
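
A likely explanation, judging from the `non-default args: {}` line in the startup log: this time nothing was passed after the image name, so no `--model` reached the server and it presumably fell back to a built-in default. That Qwen/Qwen3-0.6B is that default is my inference from the log, not something I verified. A hedged sketch of assembling the command so the model is always explicit (the helper `build_cmd` is hypothetical, and whether this image's vLLM build can load Gemma 4 at all is a separate question):

```python
import shlex

def build_cmd(image, model=None, name="gemma4", cache="/mnt/huggingface"):
    """Assemble a `docker run` argv; arguments after the image go to the entrypoint."""
    cmd = [
        "docker", "run", "-itd", "--name", name,
        "--ipc=host", "--network", "host", "--gpus", "all",
        "-v", f"{cache}:/root/.cache/huggingface",
        image,
    ]
    if model is not None:
        # Without this, the server starts with its own default model.
        cmd += ["--model", model]
    return cmd

cmd = build_cmd("ghcr.io/sasha0552/vllm", model="google/gemma-4-e4b-it")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be handed to `subprocess.run(cmd, check=True)`.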

INFO 05-25 05:00:09 [__init__.py:241] Automatically detected platform cuda.
(APIServer pid=1) INFO 05-25 05:00:11 [api_server.py:1873] vLLM API server version 999.999.999
(APIServer pid=1) INFO 05-25 05:00:11 [utils.py:326] non-default args: {}
(APIServer pid=1) INFO 05-25 05:00:19 [__init__.py:742] Resolved architecture: Qwen3ForCausalLM
(APIServer pid=1) WARNING 05-25 05:00:19 [__init__.py:2828] Your device 'NVIDIA GeForce GTX 1080 Ti' (with compute capability 6.1) doesn't support torch.bfloat16. Falling back to torch.float16 for compatibility.
(APIServer pid=1) WARNING 05-25 05:00:19 [__init__.py:2879] Casting torch.bfloat16 to torch.float16.
(APIServer pid=1) INFO 05-25 05:00:19 [__init__.py:1774] Using max model len 40960
(APIServer pid=1) WARNING 05-25 05:00:19 [arg_utils.py:1806] Compute Capability < 8.0 is not supported by the V1 Engine. Falling back to V0.
(APIServer pid=1) WARNING 05-25 05:00:19 [arg_utils.py:1580] Chunked prefill is enabled by default for models with max_model_len > 32K. Chunked prefill might not work with some features or models. If you encounter any issues, please disable by launching with --enable-chunked-prefill=False.
(APIServer pid=1) INFO 05-25 05:00:20 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1) INFO 05-25 05:00:20 [api_server.py:295] Started engine process with PID 36
INFO 05-25 05:00:24 [__init__.py:241] Automatically detected platform cuda.
INFO 05-25 05:00:25 [llm_engine.py:222] Initializing a V0 LLM engine (v999.999.999) with config: model='Qwen/Qwen3-0.6B', speculative_config=None, tokenizer='Qwen/Qwen3-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-0.6B, enable_prefix_caching=None, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{"enable_fusion":false,"enable_noop":false},"max_capture_size":256,"local_cache_dir":null}, use_cached_outputs=True,
INFO 05-25 05:00:28 [cuda.py:374] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 05-25 05:00:28 [cuda.py:419] Using XFormers backend.
INFO 05-25 05:00:28 [parallel_state.py:1134] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 05-25 05:00:28 [model_runner.py:1080] Starting to load model Qwen/Qwen3-0.6B...
INFO 05-25 05:00:29 [weight_utils.py:296] Using model weights format ['*.safetensors']
INFO 05-25 05:00:29 [weight_utils.py:349] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.47it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.47it/s]

INFO 05-25 05:00:29 [default_loader.py:267] Loading weights took 0.32 seconds
INFO 05-25 05:00:30 [model_runner.py:1112] Model loading took 1.1201 GiB and 1.275699 seconds
INFO 05-25 05:00:31 [worker.py:296] Memory profiling takes 1.07 seconds
INFO 05-25 05:00:31 [worker.py:296] the current vLLM instance can use total_gpu_memory (10.90GiB) x gpu_memory_utilization (0.90) = 9.81GiB
INFO 05-25 05:00:31 [worker.py:296] model weights take 1.12GiB; non_torch_memory takes 0.04GiB; PyTorch activation peak memory takes 1.39GiB; the rest of the memory reserved for KV Cache is 7.26GiB.
INFO 05-25 05:00:31 [executor_base.py:114] # cuda blocks: 4247, # CPU blocks: 2340
INFO 05-25 05:00:31 [executor_base.py:119] Maximum concurrency for 40960 tokens per request: 1.66x
INFO 05-25 05:00:34 [model_runner.py:1383] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:12<00:00,  2.85it/s]
INFO 05-25 05:00:46 [model_runner.py:1535] Graph capturing finished in 12 secs, took 0.19 GiB
INFO 05-25 05:00:46 [llm_engine.py:417] init engine (profile, create kv cache, warmup model) took 16.41 seconds
(APIServer pid=1) INFO 05-25 05:00:46 [api_server.py:1679] Supported_tasks: ['generate']
(APIServer pid=1) WARNING 05-25 05:00:46 [__init__.py:1658] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1) INFO 05-25 05:00:46 [serving_responses.py:124] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(APIServer pid=1) INFO 05-25 05:00:47 [serving_chat.py:135] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(APIServer pid=1) INFO 05-25 05:00:47 [serving_completion.py:77] Using default completion sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(APIServer pid=1) INFO 05-25 05:00:47 [api_server.py:1948] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:36] Available routes are:
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /docs, Methods: HEAD, GET
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /redoc, Methods: HEAD, GET
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /health, Methods: GET
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /load, Methods: GET
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /ping, Methods: POST
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /ping, Methods: GET
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /version, Methods: GET
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /v1/embeddings, Methods: POST
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /pooling, Methods: POST
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /classify, Methods: POST
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /score, Methods: POST
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /v1/score, Methods: POST
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /v1/audio/translations, Methods: POST
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /rerank, Methods: POST
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /v1/rerank, Methods: POST
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /v2/rerank, Methods: POST
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /metrics, Methods: GET
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.
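
As a sanity check on the memory numbers in the worker.py and executor_base.py lines of the log above, here is a rough sketch that redoes the arithmetic. All figures are copied from that log. The one value not in the log is the KV block size of 16 tokens, which is an assumption on my part (I believe it is the vLLM default on CUDA), and the function names are made up for illustration.

```python
# Figures copied from the log above (GiB unless noted).
TOTAL_GPU_MEMORY = 10.90
GPU_MEMORY_UTILIZATION = 0.90
MODEL_WEIGHTS = 1.12
NON_TORCH_MEMORY = 0.04
ACTIVATION_PEAK = 1.39
NUM_CUDA_BLOCKS = 4247
MAX_SEQ_LEN = 40960

# ASSUMPTION: tokens per KV cache block; not printed anywhere in the log.
BLOCK_SIZE = 16


def kv_cache_gib():
    """Memory left for the KV cache after weights and activations."""
    budget = TOTAL_GPU_MEMORY * GPU_MEMORY_UTILIZATION
    return budget - MODEL_WEIGHTS - NON_TORCH_MEMORY - ACTIVATION_PEAK


def max_concurrency():
    """How many full-length requests fit into the KV cache at once."""
    cached_tokens = NUM_CUDA_BLOCKS * BLOCK_SIZE
    return cached_tokens / MAX_SEQ_LEN


if __name__ == "__main__":
    print(round(TOTAL_GPU_MEMORY * GPU_MEMORY_UTILIZATION, 2))  # 9.81
    print(round(kv_cache_gib(), 2))                             # 7.26
    print(round(max_concurrency(), 2))                          # 1.66
```

With the assumed block size the result matches the "1.66x" line in the log, so on this 11 GB card the 0.6B model leaves roughly 7 GiB for the KV cache.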

I tried a build meant for Pascal (P40), but it still does not work. Argh.. I give up!

[Link : https://github.com/uaysk/vllm-pascal]

$ vllm serve google/gemma-4-E4B-it
(APIServer pid=65016) INFO 05-25 21:22:17 [utils.py:306]
(APIServer pid=65016) INFO 05-25 21:22:17 [utils.py:306]        █     █     █▄   ▄█
(APIServer pid=65016) INFO 05-25 21:22:17 [utils.py:306]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.21.0
(APIServer pid=65016) INFO 05-25 21:22:17 [utils.py:306]   █▄█▀ █     █     █     █  model   google/gemma-4-E4B-it
(APIServer pid=65016) INFO 05-25 21:22:17 [utils.py:306]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=65016) INFO 05-25 21:22:17 [utils.py:306]
(APIServer pid=65016) INFO 05-25 21:22:17 [utils.py:240] non-default args: {'model_tag': 'google/gemma-4-E4B-it', 'model': 'google/gemma-4-E4B-it'}
(APIServer pid=65016) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
config.json: 5.14kB [00:00, 4.19MB/s]
processor_config.json: 1.69kB [00:00, 6.85MB/s]
(APIServer pid=65016) INFO 05-25 21:22:19 [model.py:568] Resolved architecture: Gemma4ForConditionalGeneration
(APIServer pid=65016) WARNING 05-25 21:22:19 [model.py:1982] Your device 'NVIDIA GeForce GTX 1080 Ti' (with compute capability 6.1) doesn't support torch.bfloat16. Falling back to torch.float16 for compatibility.
(APIServer pid=65016) WARNING 05-25 21:22:19 [model.py:2035] Casting torch.bfloat16 to torch.float16.
(APIServer pid=65016) INFO 05-25 21:22:19 [model.py:1697] Using max model len 131072
(APIServer pid=65016) INFO 05-25 21:22:19 [config.py:101] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.
(APIServer pid=65016) INFO 05-25 21:22:19 [vllm.py:886] Asynchronous scheduling is enabled.
(APIServer pid=65016) INFO 05-25 21:22:19 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
tokenizer_config.json: 2.10kB [00:00, 2.10MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32.2M/32.2M [00:02<00:00, 11.8MB/s]
chat_template.jinja: 17.3kB [00:00, 13.0MB/s]
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 208/208 [00:00<00:00, 1.66MB/s]
(EngineCore pid=65068) INFO 05-25 21:23:14 [core.py:109] Initializing a V1 LLM engine (v0.21.0) with config: model='google/gemma-4-E4B-it', speculative_config=None, tokenizer='google/gemma-4-E4B-it', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=google/gemma-4-E4B-it, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 
'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native']), enable_flashinfer_autotune=False, moe_backend='auto')
(EngineCore pid=65068) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(EngineCore pid=65068) /home/minimonk/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:371: UserWarning: Found GPU0 NVIDIA GeForce GTX 1080 Ti which is of compute capability (CC) 6.1.
(EngineCore pid=65068) The following list shows the CCs this version of PyTorch was built for and the hardware CCs it supports:
(EngineCore pid=65068) - 7.5 which supports hardware CC >=7.5,<8.0
(EngineCore pid=65068) - 8.0 which supports hardware CC >=8.0,<9.0 except {8.7}
(EngineCore pid=65068) - 8.6 which supports hardware CC >=8.6,<9.0 except {8.7}
(EngineCore pid=65068) - 9.0 which supports hardware CC >=9.0,<10.0
(EngineCore pid=65068) - 10.0 which supports hardware CC >=10.0,<11.0 except {10.1}
(EngineCore pid=65068) - 12.0 which supports hardware CC >=12.0,<13.0
(EngineCore pid=65068) Please follow the instructions at https://pytorch.org/get-started/locally/ to install a PyTorch release that supports one of these CUDA versions: 12.6
(EngineCore pid=65068)   _warn_unsupported_code(d, device_cc, code_ccs)
(EngineCore pid=65068) /home/minimonk/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:371: UserWarning: Found GPU1 NVIDIA GeForce GTX 1080 Ti which is of compute capability (CC) 6.1.
(EngineCore pid=65068) The following list shows the CCs this version of PyTorch was built for and the hardware CCs it supports:
(EngineCore pid=65068) - 7.5 which supports hardware CC >=7.5,<8.0
(EngineCore pid=65068) - 8.0 which supports hardware CC >=8.0,<9.0 except {8.7}
(EngineCore pid=65068) - 8.6 which supports hardware CC >=8.6,<9.0 except {8.7}
(EngineCore pid=65068) - 9.0 which supports hardware CC >=9.0,<10.0
(EngineCore pid=65068) - 10.0 which supports hardware CC >=10.0,<11.0 except {10.1}
(EngineCore pid=65068) - 12.0 which supports hardware CC >=12.0,<13.0
(EngineCore pid=65068) Please follow the instructions at https://pytorch.org/get-started/locally/ to install a PyTorch release that supports one of these CUDA versions: 12.6
(EngineCore pid=65068)   _warn_unsupported_code(d, device_cc, code_ccs)
(EngineCore pid=65068) /home/minimonk/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:489: UserWarning:
(EngineCore pid=65068) NVIDIA GeForce GTX 1080 Ti with CUDA capability sm_61 is not compatible with the current PyTorch installation.
(EngineCore pid=65068) The current PyTorch install supports CUDA capabilities sm_75 sm_80 sm_86 sm_90 sm_100 sm_120.
(EngineCore pid=65068) If you want to use the NVIDIA GeForce GTX 1080 Ti GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
(EngineCore pid=65068)
(EngineCore pid=65068)   queued_call()
(EngineCore pid=65068) INFO 05-25 21:23:20 [parallel_state.py:1410] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.40.238:39589 backend=nccl
(EngineCore pid=65068) INFO 05-25 21:23:20 [parallel_state.py:1723] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=65068) WARNING 05-25 21:23:21 [topk_topp_sampler.py:61] FlashInfer top-p/top-k sampling not supported on compute capability 6.1; falling back to PyTorch-native sampler. Set VLLM_USE_FLASHINFER_SAMPLER=0 to silence.
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] EngineCore failed to start.
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] Traceback (most recent call last):
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1114, in run_engine_core
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]     return func(*args, **kwargs)
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 880, in __init__
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]     super().__init__(
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 118, in __init__
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]     self.model_executor = executor_class(vllm_config)
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]     return func(*args, **kwargs)
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 109, in __init__
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]     self._init_executor()
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 60, in _init_executor
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]     self.driver_worker.init_device()
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/worker_base.py", line 317, in init_device
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]     self.worker.init_device()  # type: ignore
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]     return func(*args, **kwargs)
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 330, in init_device
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]     self.model_runner = GPUModelRunnerV1(self.vllm_config, self.device)
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 629, in __init__
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]     self.input_batch = InputBatch(
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_input_batch.py", line 171, in __init__
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]     self.block_table = MultiGroupBlockTable(
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 267, in __init__
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]     self.block_tables = [
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 268, in <listcomp>
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]     BlockTable(
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 70, in __init__
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]     self.block_table = self._make_buffer(
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 218, in _make_buffer
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]     return CpuGpuBuffer(
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/utils.py", line 120, in __init__
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]     self.gpu = torch.zeros_like(self.cpu, device=device)
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] Search for `cudaErrorNoKernelImageForDevice' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140]
(EngineCore pid=65068) Process EngineCore:
(EngineCore pid=65068) Traceback (most recent call last):
(EngineCore pid=65068)   File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=65068)     self.run()
(EngineCore pid=65068)   File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore pid=65068)     self._target(*self._args, **self._kwargs)
(EngineCore pid=65068)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1144, in run_engine_core
(EngineCore pid=65068)     raise e
(EngineCore pid=65068)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1114, in run_engine_core
(EngineCore pid=65068)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=65068)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=65068)     return func(*args, **kwargs)
(EngineCore pid=65068)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 880, in __init__
(EngineCore pid=65068)     super().__init__(
(EngineCore pid=65068)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 118, in __init__
(EngineCore pid=65068)     self.model_executor = executor_class(vllm_config)
(EngineCore pid=65068)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=65068)     return func(*args, **kwargs)
(EngineCore pid=65068)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 109, in __init__
(EngineCore pid=65068)     self._init_executor()
(EngineCore pid=65068)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 60, in _init_executor
(EngineCore pid=65068)     self.driver_worker.init_device()
(EngineCore pid=65068)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/worker_base.py", line 317, in init_device
(EngineCore pid=65068)     self.worker.init_device()  # type: ignore
(EngineCore pid=65068)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=65068)     return func(*args, **kwargs)
(EngineCore pid=65068)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 330, in init_device
(EngineCore pid=65068)     self.model_runner = GPUModelRunnerV1(self.vllm_config, self.device)
(EngineCore pid=65068)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 629, in __init__
(EngineCore pid=65068)     self.input_batch = InputBatch(
(EngineCore pid=65068)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_input_batch.py", line 171, in __init__
(EngineCore pid=65068)     self.block_table = MultiGroupBlockTable(
(EngineCore pid=65068)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 267, in __init__
(EngineCore pid=65068)     self.block_tables = [
(EngineCore pid=65068)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 268, in <listcomp>
(EngineCore pid=65068)     BlockTable(
(EngineCore pid=65068)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 70, in __init__
(EngineCore pid=65068)     self.block_table = self._make_buffer(
(EngineCore pid=65068)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 218, in _make_buffer
(EngineCore pid=65068)     return CpuGpuBuffer(
(EngineCore pid=65068)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/utils.py", line 120, in __init__
(EngineCore pid=65068)     self.gpu = torch.zeros_like(self.cpu, device=device)
(EngineCore pid=65068) torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device
(EngineCore pid=65068) Search for `cudaErrorNoKernelImageForDevice' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore pid=65068) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore pid=65068) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore pid=65068) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore pid=65068)
[rank0]:[W525 21:23:22.743241268 ProcessGroupNCCL.cpp:1575] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=65016) Traceback (most recent call last):
(APIServer pid=65016)   File "/home/minimonk/.local/bin/vllm", line 8, in <module>
(APIServer pid=65016)     sys.exit(main())
(APIServer pid=65016)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/entrypoints/cli/main.py", line 92, in main
(APIServer pid=65016)     args.dispatch_function(args)
(APIServer pid=65016)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=65016)     uvloop.run(run_server(args))
(APIServer pid=65016)   File "/home/minimonk/.local/lib/python3.10/site-packages/uvloop/__init__.py", line 69, in run
(APIServer pid=65016)     return loop.run_until_complete(wrapper())
(APIServer pid=65016)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=65016)   File "/home/minimonk/.local/lib/python3.10/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=65016)     return await main
(APIServer pid=65016)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 693, in run_server
(APIServer pid=65016)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=65016)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 707, in run_server_worker
(APIServer pid=65016)     async with build_async_engine_client(
(APIServer pid=65016)   File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
(APIServer pid=65016)     return await anext(self.gen)
(APIServer pid=65016)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=65016)     async with build_async_engine_client_from_engine_args(
(APIServer pid=65016)   File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
(APIServer pid=65016)     return await anext(self.gen)
(APIServer pid=65016)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=65016)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=65016)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 217, in from_vllm_config
(APIServer pid=65016)     return cls(
(APIServer pid=65016)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 146, in __init__
(APIServer pid=65016)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=65016)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=65016)     return func(*args, **kwargs)
(APIServer pid=65016)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=65016)     return AsyncMPClient(*client_args)
(APIServer pid=65016)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=65016)     return func(*args, **kwargs)
(APIServer pid=65016)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 900, in __init__
(APIServer pid=65016)     super().__init__(
(APIServer pid=65016)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=65016)     with launch_core_engines(
(APIServer pid=65016)   File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
(APIServer pid=65016)     next(self.gen)
(APIServer pid=65016)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 1128, in launch_core_engines
(APIServer pid=65016)     wait_for_engine_startup(
(APIServer pid=65016)   File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 1187, in wait_for_engine_startup
(APIServer pid=65016)     raise RuntimeError(
(APIServer pid=65016) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
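
Looking at the PyTorch warnings, the root cause seems to be the installed PyTorch wheel rather than vLLM itself: it ships kernels for sm_75 and newer, while the 1080 Ti is sm_61, hence "no kernel image is available". Below is a small sketch of that check, using the arch list copied from the warning. The matching rule (same major version, built minor not higher than the device's) is a simplification that ignores the exceptions the warning lists (8.7, 10.1), and `parse_sm` / `is_supported` are hypothetical helper names.

```python
# Arch list copied from the PyTorch warning in the log above.
BUILT_FOR = ["sm_75", "sm_80", "sm_86", "sm_90", "sm_100", "sm_120"]


def parse_sm(arch):
    """Turn 'sm_75' into (7, 5) and 'sm_100' into (10, 0)."""
    digits = arch[len("sm_"):]
    return int(digits[:-1]), int(digits[-1])


def is_supported(device_cc, arch_list):
    """Simplified rule: same major version, built minor <= device minor."""
    major, minor = device_cc
    for arch in arch_list:
        if not arch.startswith("sm_"):
            continue  # skip non-binary entries such as 'compute_90'
        built_major, built_minor = parse_sm(arch)
        if built_major == major and built_minor <= minor:
            return True
    return False


if __name__ == "__main__":
    print(is_supported((6, 1), BUILT_FOR))  # GTX 1080 Ti -> False
    print(is_supported((8, 6), BUILT_FOR))  # an sm_86 card -> True
```

On a machine with PyTorch installed, the hard-coded values could presumably be replaced with `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability(0)`.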


Posted by 구차니

구차니의 잡동사니 모음
