Installed vLLM, failed to run it -> effectively gave up
The docs say to run it like this:
| vllm serve google/gemma-4-E4B-it \
  --max-model-len <n_of_tokens> # up to 131072 |
[Link: https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html]
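Wow, this thing is huge. Does it download straight from Hugging Face, by the way?
So GGUF files are small because they are quantized. The one I had been using was Q4_K_M, about 4.7 GB,
while model.safetensors is a whopping 16 GB. Wow.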

[Link: https://huggingface.co/google/gemma-4-E4B]
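The size gap is roughly what quantization would predict. A back-of-the-envelope sketch in Python, assuming the 16 GB safetensors file stores 16-bit weights and that both files hold about the same number of parameters. Neither assumption is confirmed here (the 4.7 GB GGUF may well be a different model), and the function name is made up for illustration:

```python
def effective_bits_per_weight(quant_size_gb: float,
                              full_size_gb: float,
                              full_bits: float = 16.0) -> float:
    """Estimate the average bits per weight of a quantized file.

    Assumes both files store the same number of parameters and that the
    unquantized file uses `full_bits` bits per weight. File metadata and
    tensors left unquantized are ignored, so this is only a rough figure.
    """
    if quant_size_gb <= 0 or full_size_gb <= 0:
        raise ValueError("sizes must be positive")
    return full_bits * quant_size_gb / full_size_gb


if __name__ == "__main__":
    # Sizes taken from the notes above: 4.7 GB GGUF vs 16 GB safetensors.
    bits = effective_bits_per_weight(4.7, 16.0)
    print(f"about {bits:.1f} bits per weight")
```

With the numbers above this lands at about 4.7 bits per weight, which is in the range one would expect from a 4-bit quantization.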
vllm : "A 1080 Ti? That's junk food! Ptooey!"
| $ vllm serve google/gemma-4-E4B-it (APIServer pid=52696) INFO 05-24 22:20:19 [utils.py:306] (APIServer pid=52696) INFO 05-24 22:20:19 [utils.py:306] █ █ █▄ ▄█ (APIServer pid=52696) INFO 05-24 22:20:19 [utils.py:306] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.21.0 (APIServer pid=52696) INFO 05-24 22:20:19 [utils.py:306] █▄█▀ █ █ █ █ model google/gemma-4-E4B-it (APIServer pid=52696) INFO 05-24 22:20:19 [utils.py:306] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀ (APIServer pid=52696) INFO 05-24 22:20:19 [utils.py:306] (APIServer pid=52696) INFO 05-24 22:20:19 [utils.py:240] non-default args: {'model_tag': 'google/gemma-4-E4B-it', 'model': 'google/gemma-4-E4B-it'} (APIServer pid=52696) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads. config.json: 5.14kB [00:00, 17.7MB/s] processor_config.json: 1.69kB [00:00, 1.70MB/s] (APIServer pid=52696) INFO 05-24 22:20:29 [model.py:568] Resolved architecture: Gemma4ForConditionalGeneration (APIServer pid=52696) WARNING 05-24 22:20:29 [model.py:1982] Your device 'NVIDIA GeForce GTX 1080 Ti' (with compute capability 6.1) doesn't support torch.bfloat16. Falling back to torch.float16 for compatibility. (APIServer pid=52696) WARNING 05-24 22:20:29 [model.py:2035] Casting torch.bfloat16 to torch.float16. (APIServer pid=52696) INFO 05-24 22:20:29 [model.py:1697] Using max model len 131072 (APIServer pid=52696) INFO 05-24 22:20:29 [config.py:101] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence. (APIServer pid=52696) INFO 05-24 22:20:29 [vllm.py:886] Asynchronous scheduling is enabled. 
(APIServer pid=52696) INFO 05-24 22:20:29 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native']) tokenizer_config.json: 2.10kB [00:00, 2.05MB/s] tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32.2M/32.2M [00:02<00:00, 14.8MB/s] chat_template.jinja: 17.3kB [00:00, 12.6MB/s] generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 208/208 [00:00<00:00, 1.68MB/s] (EngineCore pid=52767) INFO 05-24 22:21:24 [core.py:109] Initializing a V1 LLM engine (v0.21.0) with config: model='google/gemma-4-E4B-it', speculative_config=None, tokenizer='google/gemma-4-E4B-it', skip_t okenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size= 1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=Fal se, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_tr aces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_itera tion_details=False), seed=0, 
served_model_name=google/gemma-4-E4B-it, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_w ith_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm ::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_ cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'en coder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_as serts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups ': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 3 20, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 
'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native']), enable_flashinfer_autotune=False, moe_backend='auto') (EngineCore pid=52767) /home/minimonk/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:371: UserWarning: Found GPU0 NVIDIA GeForce GTX 1080 Ti which is of compute capability (CC) 6.1. (EngineCore pid=52767) The following list shows the CCs this version of PyTorch was built for and the hardware CCs it supports: (EngineCore pid=52767) - 7.5 which supports hardware CC >=7.5,<8.0 (EngineCore pid=52767) - 8.0 which supports hardware CC >=8.0,<9.0 except {8.7} (EngineCore pid=52767) - 8.6 which supports hardware CC >=8.6,<9.0 except {8.7} (EngineCore pid=52767) - 9.0 which supports hardware CC >=9.0,<10.0 (EngineCore pid=52767) - 10.0 which supports hardware CC >=10.0,<11.0 except {10.1} (EngineCore pid=52767) - 12.0 which supports hardware CC >=12.0,<13.0 (EngineCore pid=52767) Please follow the instructions at https://pytorch.org/get-started/locally/ to install a PyTorch release that supports one of these CUDA versions: 12.6 (EngineCore pid=52767) _warn_unsupported_code(d, device_cc, code_ccs) (EngineCore pid=52767) /home/minimonk/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:371: UserWarning: Found GPU1 NVIDIA GeForce GTX 1080 Ti which is of compute capability (CC) 6.1. 
(EngineCore pid=52767) The following list shows the CCs this version of PyTorch was built for and the hardware CCs it supports: (EngineCore pid=52767) - 7.5 which supports hardware CC >=7.5,<8.0 (EngineCore pid=52767) - 8.0 which supports hardware CC >=8.0,<9.0 except {8.7} (EngineCore pid=52767) - 8.6 which supports hardware CC >=8.6,<9.0 except {8.7} (EngineCore pid=52767) - 9.0 which supports hardware CC >=9.0,<10.0 (EngineCore pid=52767) - 10.0 which supports hardware CC >=10.0,<11.0 except {10.1} (EngineCore pid=52767) - 12.0 which supports hardware CC >=12.0,<13.0 (EngineCore pid=52767) Please follow the instructions at https://pytorch.org/get-started/locally/ to install a PyTorch release that supports one of these CUDA versions: 12.6 (EngineCore pid=52767) _warn_unsupported_code(d, device_cc, code_ccs) (EngineCore pid=52767) /home/minimonk/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:489: UserWarning: (EngineCore pid=52767) NVIDIA GeForce GTX 1080 Ti with CUDA capability sm_61 is not compatible with the current PyTorch installation. (EngineCore pid=52767) The current PyTorch install supports CUDA capabilities sm_75 sm_80 sm_86 sm_90 sm_100 sm_120. (EngineCore pid=52767) If you want to use the NVIDIA GeForce GTX 1080 Ti GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/ (EngineCore pid=52767) (EngineCore pid=52767) queued_call() (EngineCore pid=52767) INFO 05-24 22:21:30 [parallel_state.py:1410] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.40.238:47913 backend=nccl (EngineCore pid=52767) INFO 05-24 22:21:30 [parallel_state.py:1723] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A (EngineCore pid=52767) WARNING 05-24 22:21:31 [topk_topp_sampler.py:61] FlashInfer top-p/top-k sampling not supported on compute capability 6.1; falling back to PyTorch-native sampler. Set VLLM_USE_FLASHINF ER_SAMPLER=0 to silence. 
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] EngineCore failed to start. (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] Traceback (most recent call last): (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1114, in run_engine_core (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs) (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] return func(*args, **kwargs) (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 880, in __init__ (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] super().__init__( (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 118, in __init__ (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] self.model_executor = executor_class(vllm_config) (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] return func(*args, **kwargs) (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 109, in __init__ (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] self._init_executor() (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 60, in _init_executor (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] 
self.driver_worker.init_device() (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/worker_base.py", line 317, in init_device (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] self.worker.init_device() # type: ignore (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] return func(*args, **kwargs) (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 330, in init_device (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] self.model_runner = GPUModelRunnerV1(self.vllm_config, self.device) (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 629, in __init__ (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] self.input_batch = InputBatch( (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_input_batch.py", line 171, in __init__ (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] self.block_table = MultiGroupBlockTable( (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 267, in __init__ (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] self.block_tables = [ (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 268, in <listcomp> (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] BlockTable( (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File 
"/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 70, in __init__ (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] self.block_table = self._make_buffer( (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 218, in _make_buffer (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] return CpuGpuBuffer( (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/utils.py", line 120, in __init__ (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] self.gpu = torch.zeros_like(self.cpu, device=device) (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] Search for `cudaErrorNoKernelImageForDevice' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] For debugging consider passing CUDA_LAUNCH_BLOCKING=1 (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. 
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] (EngineCore pid=52767) Process EngineCore: (EngineCore pid=52767) Traceback (most recent call last): (EngineCore pid=52767) File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap (EngineCore pid=52767) self.run() (EngineCore pid=52767) File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run (EngineCore pid=52767) self._target(*self._args, **self._kwargs) (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1144, in run_engine_core (EngineCore pid=52767) raise e (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1114, in run_engine_core (EngineCore pid=52767) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs) (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=52767) return func(*args, **kwargs) (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 880, in __init__ (EngineCore pid=52767) super().__init__( (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 118, in __init__ (EngineCore pid=52767) self.model_executor = executor_class(vllm_config) (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=52767) return func(*args, **kwargs) (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 109, in __init__ (EngineCore pid=52767) self._init_executor() (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 60, in _init_executor (EngineCore pid=52767) self.driver_worker.init_device() (EngineCore pid=52767) File 
"/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/worker_base.py", line 317, in init_device (EngineCore pid=52767) self.worker.init_device() # type: ignore (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=52767) return func(*args, **kwargs) (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 330, in init_device (EngineCore pid=52767) self.model_runner = GPUModelRunnerV1(self.vllm_config, self.device) (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 629, in __init__ (EngineCore pid=52767) self.input_batch = InputBatch( (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_input_batch.py", line 171, in __init__ (EngineCore pid=52767) self.block_table = MultiGroupBlockTable( (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 267, in __init__ (EngineCore pid=52767) self.block_tables = [ (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 268, in <listcomp> (EngineCore pid=52767) BlockTable( (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 70, in __init__ (EngineCore pid=52767) self.block_table = self._make_buffer( (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 218, in _make_buffer (EngineCore pid=52767) return CpuGpuBuffer( (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/utils.py", line 120, in __init__ (EngineCore pid=52767) self.gpu = torch.zeros_like(self.cpu, device=device) (EngineCore pid=52767) torch.AcceleratorError: CUDA error: no kernel image is available for 
execution on the device (EngineCore pid=52767) Search for `cudaErrorNoKernelImageForDevice' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. (EngineCore pid=52767) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. (EngineCore pid=52767) For debugging consider passing CUDA_LAUNCH_BLOCKING=1 (EngineCore pid=52767) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. (EngineCore pid=52767) [rank0]:[W524 22:21:32.199264489 ProcessGroupNCCL.cpp:1575] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch. org/docs/stable/distributed.html#shutdown (function operator()) (APIServer pid=52696) Traceback (most recent call last): (APIServer pid=52696) File "/home/minimonk/.local/bin/vllm", line 8, in <module> (APIServer pid=52696) sys.exit(main()) (APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/entrypoints/cli/main.py", line 92, in main (APIServer pid=52696) args.dispatch_function(args) (APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd (APIServer pid=52696) uvloop.run(run_server(args)) (APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/uvloop/__init__.py", line 69, in run (APIServer pid=52696) return loop.run_until_complete(wrapper()) (APIServer pid=52696) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete (APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/uvloop/__init__.py", line 48, in wrapper (APIServer pid=52696) return await main (APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 693, in run_server (APIServer pid=52696) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs) 
(APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 707, in run_server_worker (APIServer pid=52696) async with build_async_engine_client( (APIServer pid=52696) File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__ (APIServer pid=52696) return await anext(self.gen) (APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client (APIServer pid=52696) async with build_async_engine_client_from_engine_args( (APIServer pid=52696) File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__ (APIServer pid=52696) return await anext(self.gen) (APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args (APIServer pid=52696) async_llm = AsyncLLM.from_vllm_config( (APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 217, in from_vllm_config (APIServer pid=52696) return cls( (APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 146, in __init__ (APIServer pid=52696) self.engine_core = EngineCoreClient.make_async_mp_client( (APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (APIServer pid=52696) return func(*args, **kwargs) (APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client (APIServer pid=52696) return AsyncMPClient(*client_args) (APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (APIServer pid=52696) return func(*args, **kwargs) (APIServer pid=52696) File 
"/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 900, in __init__ (APIServer pid=52696) super().__init__( (APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 535, in __init__ (APIServer pid=52696) with launch_core_engines( (APIServer pid=52696) File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__ (APIServer pid=52696) next(self.gen) (APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 1128, in launch_core_engines (APIServer pid=52696) wait_for_engine_startup( (APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 1187, in wait_for_engine_startup (APIServer pid=52696) raise RuntimeError( (APIServer pid=52696) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {} |
Has Pascal been dropped from the supported hardware altogether?
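The log above already answers part of this: the installed PyTorch build lists sm_75 sm_80 sm_86 sm_90 sm_100 sm_120, and the 1080 Ti is sm_61. A minimal sketch of that compatibility check, assuming the simplified rule spelled out in the warning (a build for X.Y covers devices with the same major version and a minor version of at least Y) and ignoring the listed exceptions such as 8.7 and 10.1 as well as any PTX fallback. On a real machine the inputs would come from `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`; the helper names here are hypothetical:

```python
from typing import Iterable, Tuple

Capability = Tuple[int, int]


def parse_arch(arch: str) -> Capability:
    """Turn 'sm_86' into (8, 6) and 'sm_120' into (12, 0)."""
    suffix = arch.split("_", 1)[1]
    # Keep digits only, so a variant suffix such as 'sm_90a' still parses.
    digits = "".join(ch for ch in suffix if ch.isdigit())
    return int(digits[:-1]), int(digits[-1])


def is_supported(device_cc: Capability, arch_list: Iterable[str]) -> bool:
    """Simplified check: same major version, built minor <= device minor."""
    for arch in arch_list:
        if not arch.startswith("sm_"):
            continue  # skip PTX entries such as 'compute_90'
        major, minor = parse_arch(arch)
        if major == device_cc[0] and minor <= device_cc[1]:
            return True
    return False


if __name__ == "__main__":
    # Architecture list copied from the PyTorch warning in the log.
    built_for = "sm_75 sm_80 sm_86 sm_90 sm_100 sm_120".split()
    print("GTX 1080 Ti (6.1) supported:", is_supported((6, 1), built_for))
```

Under this rule nothing in the list covers major version 6, which matches the `no kernel image is available` error.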

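[Link: https://docs.vllm.ai/en/latest/features/quantization/]
There is also a fork like this one, but it is reportedly being replaced by pascal-pkgs-ci below.
[Link: https://github.com/cduk/vllm-pascal]
Maybe I should try Docker instead..
[Link: https://github.com/sasha0552/pascal-pkgs-ci]
[Link: https://github.com/vllm-project/vllm/issues/19542]
Maybe I can just point the command at the Docker image above? For the volume, it would probably be better to switch from the local cache to HF_HOME.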
| docker run -itd --name gemma4 \
  --ipc=host \
  --network host \
  --shm-size 16G \
  --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model google/gemma-4-31B-it \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 \
  --port 8000 |
[Link: https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html#docker-deployment]
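For the volume, here is a small sketch of how the host side of the `-v` mapping could follow `HF_HOME`. It assumes the usual Hugging Face convention that the cache root is `$HF_HOME` and falls back to `~/.cache/huggingface` when the variable is unset (other variables such as `XDG_CACHE_HOME` are ignored here). The function name is made up for illustration:

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Container-side cache path, as used in the docker command above.
CONTAINER_CACHE = "/root/.cache/huggingface"


def hf_volume_arg(env: Optional[Mapping[str, str]] = None) -> str:
    """Build the value for docker's -v flag from HF_HOME (or the default)."""
    env = os.environ if env is None else env
    hf_home = env.get("HF_HOME")
    if hf_home:
        host_dir = Path(hf_home)
    else:
        # Assumed default location when HF_HOME is not set.
        host_dir = Path.home() / ".cache" / "huggingface"
    return f"{host_dir}:{CONTAINER_CACHE}"


if __name__ == "__main__":
    print("-v", hf_volume_arg())
```

With `HF_HOME=/mnt/huggingface` exported, this yields `/mnt/huggingface:/root/.cache/huggingface`, so already-downloaded models on that disk are reused inside the container.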
If Docker cannot see the graphics card and fails like this,
| docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]] |
installing nvidia-container-toolkit and restarting Docker is supposed to fix it.
| $ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) &&
    curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - &&
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
$ sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
$ sudo systemctl restart docker |
[Link: https://bluecolorsky.tistory.com/110]
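Before reinstalling anything, a quick check can tell whether the toolkit is already present. This is a minimal sketch built on an assumption I have not verified against every Docker version: that `--gpus` only works when the `nvidia-container-runtime-hook` binary shipped with nvidia-container-toolkit can be found on the PATH. The function name is hypothetical:

```python
import shutil
from typing import Callable, Optional

# Assumed name of the hook binary installed by nvidia-container-toolkit.
HOOK_BINARY = "nvidia-container-runtime-hook"


def gpu_hook_path(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the path of the NVIDIA container hook, or None if it is missing."""
    return which(HOOK_BINARY)


if __name__ == "__main__":
    path = gpu_hook_path()
    if path is None:
        print("hook not found: install nvidia-container-toolkit, then restart Docker")
    else:
        print(f"hook found at {path}")
```

The `which` parameter exists only so the lookup can be swapped out in a test; in normal use the default `shutil.which` is enough.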
| docker run -itd --name gemma4 \
  --ipc=host \
  --network host \
  --gpus all \
  -v /mnt/huggingface:/root/.cache/huggingface \
  ghcr.io/sasha0552/vllm \
  --model google/gemma-4-e4b-it \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 \
  --port 8000 |
[Link: https://github.com/sasha0552/pascal-pkgs-ci/pkgs/container/vllm]
Ah, whatever, lol
| $ docker ps -a
CONTAINER ID   IMAGE                    COMMAND                  CREATED          STATUS                      PORTS   NAMES
fdd1ba582924   ghcr.io/sasha0552/vllm   "python3 -m vllm.ent…"   42 seconds ago   Exited (1) 26 seconds ago           gemma4 |
[Link: https://data-newbie.tistory.com/m/1012]
[Link: https://coding-review.tistory.com/m/608]
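The STATUS column in the `docker ps -a` output above already carries the exit code. A small sketch that pulls it out of such a status string; the regex and function name are my own, and `docker logs gemma4` would be the usual next step to see why the container died:

```python
import re
from typing import Optional

# Matches status strings such as "Exited (1) 26 seconds ago".
_EXITED = re.compile(r"^Exited \((\d+)\)")


def exit_code_from_status(status: str) -> Optional[int]:
    """Return the exit code from a `docker ps -a` STATUS value, if it exited."""
    match = _EXITED.match(status.strip())
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    code = exit_code_from_status("Exited (1) 26 seconds ago")
    if code:
        # A non-zero code means the server process failed inside the container.
        print(f"container exited with code {code}; check `docker logs gemma4`")
```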
+
2026.05.25
| docker run -itd \
  --name gemma4 \
  --ipc=host \
  --network host \
  --gpus all \
  -v /mnt/huggingface:/root/.cache/huggingface \
  ghcr.io/sasha0552/vllm |
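It seems to run without problems, and yet
why is qwen3-0.6B showing up? (The log says `non-default args: {}`, so this run passed no --model, and the server presumably fell back to its default model.)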
| INFO 05-25 05:00:09 [__init__.py:241] Automatically detected platform cuda. (APIServer pid=1) INFO 05-25 05:00:11 [api_server.py:1873] vLLM API server version 999.999.999 (APIServer pid=1) INFO 05-25 05:00:11 [utils.py:326] non-default args: {} (APIServer pid=1) INFO 05-25 05:00:19 [__init__.py:742] Resolved architecture: Qwen3ForCausalLM (APIServer pid=1) WARNING 05-25 05:00:19 [__init__.py:2828] Your device 'NVIDIA GeForce GTX 1080 Ti' (with compute capability 6.1) doesn't support torch.bfloat16. Falling back to torch.float16 for compatibility. (APIServer pid=1) WARNING 05-25 05:00:19 [__init__.py:2879] Casting torch.bfloat16 to torch.float16. (APIServer pid=1) INFO 05-25 05:00:19 [__init__.py:1774] Using max model len 40960 (APIServer pid=1) WARNING 05-25 05:00:19 [arg_utils.py:1806] Compute Capability < 8.0 is not supported by the V1 Engine. Falling back to V0. (APIServer pid=1) WARNING 05-25 05:00:19 [arg_utils.py:1580] Chunked prefill is enabled by default for models with max_model_len > 32K. Chunked prefill might not work with some features or models. If you encounter any issues, please disable by launching with --enable-chunked-prefill=False. (APIServer pid=1) INFO 05-25 05:00:20 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048. (APIServer pid=1) INFO 05-25 05:00:20 [api_server.py:295] Started engine process with PID 36 INFO 05-25 05:00:24 [__init__.py:241] Automatically detected platform cuda. 
INFO 05-25 05:00:25 [llm_engine.py:222] Initializing a V0 LLM engine (v999.999.999) with config: model='Qwen/Qwen3-0.6B', speculative_config=None, tokenizer='Qwen/Qwen3-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-0.6B, enable_prefix_caching=None, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{"enable_fusion":false,"enable_noop":false},"max_capture_size":256,"local_cache_dir":null}, use_cached_outputs=True, INFO 05-25 05:00:28 [cuda.py:374] Cannot use FlashAttention-2 backend for Volta and Turing GPUs. INFO 05-25 05:00:28 [cuda.py:419] Using XFormers backend. 
INFO 05-25 05:00:28 [parallel_state.py:1134] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0 INFO 05-25 05:00:28 [model_runner.py:1080] Starting to load model Qwen/Qwen3-0.6B... INFO 05-25 05:00:29 [weight_utils.py:296] Using model weights format ['*.safetensors'] INFO 05-25 05:00:29 [weight_utils.py:349] No model.safetensors.index.json found in remote. Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s] Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 3.47it/s] Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 3.47it/s] INFO 05-25 05:00:29 [default_loader.py:267] Loading weights took 0.32 seconds INFO 05-25 05:00:30 [model_runner.py:1112] Model loading took 1.1201 GiB and 1.275699 seconds INFO 05-25 05:00:31 [worker.py:296] Memory profiling takes 1.07 seconds INFO 05-25 05:00:31 [worker.py:296] the current vLLM instance can use total_gpu_memory (10.90GiB) x gpu_memory_utilization (0.90) = 9.81GiB INFO 05-25 05:00:31 [worker.py:296] model weights take 1.12GiB; non_torch_memory takes 0.04GiB; PyTorch activation peak memory takes 1.39GiB; the rest of the memory reserved for KV Cache is 7.26GiB. INFO 05-25 05:00:31 [executor_base.py:114] # cuda blocks: 4247, # CPU blocks: 2340 INFO 05-25 05:00:31 [executor_base.py:119] Maximum concurrency for 40960 tokens per request: 1.66x INFO 05-25 05:00:34 [model_runner.py:1383] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage. 
Capturing CUDA graph shapes: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:12<00:00, 2.85it/s] INFO 05-25 05:00:46 [model_runner.py:1535] Graph capturing finished in 12 secs, took 0.19 GiB INFO 05-25 05:00:46 [llm_engine.py:417] init engine (profile, create kv cache, warmup model) took 16.41 seconds (APIServer pid=1) INFO 05-25 05:00:46 [api_server.py:1679] Supported_tasks: ['generate'] (APIServer pid=1) WARNING 05-25 05:00:46 [__init__.py:1658] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`. (APIServer pid=1) INFO 05-25 05:00:46 [serving_responses.py:124] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95} (APIServer pid=1) INFO 05-25 05:00:47 [serving_chat.py:135] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95} (APIServer pid=1) INFO 05-25 05:00:47 [serving_completion.py:77] Using default completion sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95} (APIServer pid=1) INFO 05-25 05:00:47 [api_server.py:1948] Starting vLLM API server 0 on http://0.0.0.0:8000 (APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:36] Available routes are: (APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /openapi.json, Methods: HEAD, GET (APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /docs, Methods: HEAD, GET (APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /docs/oauth2-redirect, Methods: HEAD, GET (APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /redoc, Methods: HEAD, GET (APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /health, Methods: GET (APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /load, Methods: GET 
(APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /ping, Methods: POST (APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /ping, Methods: GET (APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /tokenize, Methods: POST (APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /detokenize, Methods: POST (APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /v1/models, Methods: GET (APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /version, Methods: GET (APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /v1/responses, Methods: POST (APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /v1/responses/{response_id}, Methods: GET (APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /v1/responses/{response_id}/cancel, Methods: POST (APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /v1/chat/completions, Methods: POST (APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /v1/completions, Methods: POST (APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /v1/embeddings, Methods: POST (APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /pooling, Methods: POST (APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /classify, Methods: POST (APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /score, Methods: POST (APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /v1/score, Methods: POST (APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /v1/audio/transcriptions, Methods: POST (APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /v1/audio/translations, Methods: POST (APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /rerank, Methods: POST (APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /v1/rerank, Methods: POST (APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /v2/rerank, Methods: POST (APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /scale_elastic_ep, Methods: 
POST (APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /is_scaling_elastic_ep, Methods: POST (APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /invocations, Methods: POST (APIServer pid=1) INFO 05-25 05:00:47 [launcher.py:44] Route: /metrics, Methods: GET (APIServer pid=1) INFO: Started server process [1] (APIServer pid=1) INFO: Waiting for application startup. (APIServer pid=1) INFO: Application startup complete. |
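For the record, the memory numbers in the Qwen3-0.6B log above add up. Here is a minimal sketch that redoes the arithmetic with the values copied from the log. The block size of 16 tokens is my assumption (it is not printed in this log), chosen because it reproduces the logged 1.66x; the function names are made up for illustration.

```python
# Values copied from the vLLM log above (GiB unless noted).
TOTAL_GPU_GIB = 10.90
GPU_MEMORY_UTILIZATION = 0.90
WEIGHTS_GIB = 1.12
NON_TORCH_GIB = 0.04
ACTIVATION_PEAK_GIB = 1.39
CUDA_BLOCKS = 4247
MAX_MODEL_LEN = 40960

# Assumption: 16 tokens per KV cache block (not shown in the log).
BLOCK_SIZE_TOKENS = 16


def kv_cache_budget_gib() -> float:
    """Memory left for the KV cache after weights and activations."""
    usable = TOTAL_GPU_GIB * GPU_MEMORY_UTILIZATION
    return usable - WEIGHTS_GIB - NON_TORCH_GIB - ACTIVATION_PEAK_GIB


def max_concurrency() -> float:
    """How many full-length requests fit in the KV cache at once."""
    kv_tokens = CUDA_BLOCKS * BLOCK_SIZE_TOKENS
    return kv_tokens / MAX_MODEL_LEN


print(f"usable:      {TOTAL_GPU_GIB * GPU_MEMORY_UTILIZATION:.2f} GiB")
print(f"KV cache:    {kv_cache_budget_gib():.2f} GiB")
print(f"concurrency: {max_concurrency():.2f}x")
```

So on an 11 GB card, a 0.6B model leaves about 7 GiB for the KV cache, which is why this small model starts fine.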
Tried again with the build aimed at Pascal (P40) GPUs, but it still doesn't work. Argh.. giving up!
[Link : https://github.com/uaysk/vllm-pascal]
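The log below ends with `CUDA error: no kernel image is available for execution on the device`. The PyTorch warnings in that same log say the installed build supports sm_75, sm_80, sm_86, sm_90, sm_100, and sm_120, while the GTX 1080 Ti is sm_61. Here is a rough sketch of that compatibility check, using the rule the warning itself prints (same major version, built minor not newer than the device). It is a simplification: it ignores the exceptions the warning lists (8.7, 10.1) and any PTX fallback, and the helper names are hypothetical.

```python
import re


def parse_arch(arch: str) -> tuple[int, int]:
    """Turn a name such as 'sm_75' or 'sm_120' into (major, minor)."""
    match = re.fullmatch(r"sm_(\d{2,})[a-z]?", arch)
    if match is None:
        raise ValueError(f"unrecognized arch name: {arch!r}")
    number = match.group(1)
    # The last digit is the minor version, the rest is the major version.
    return int(number[:-1]), int(number[-1])


def has_kernel_image(device_cc: tuple[int, int], built_archs: list[str]) -> bool:
    """Simplified check: is there a binary kernel this device can run?"""
    major, minor = device_cc
    for arch in built_archs:
        built_major, built_minor = parse_arch(arch)
        if built_major == major and built_minor <= minor:
            return True
    return False


# Arch list copied from the PyTorch warning in the log.
BUILT_ARCHS = ["sm_75", "sm_80", "sm_86", "sm_90", "sm_100", "sm_120"]
GTX_1080_TI = (6, 1)  # compute capability 6.1, per the log

print(has_kernel_image(GTX_1080_TI, BUILT_ARCHS))  # False
print(has_kernel_image((8, 6), BUILT_ARCHS))       # True
```

On a real machine the two inputs can be read with `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability(0)`. If the check comes back False, no vLLM flag will help; the PyTorch build itself has to include kernels for the card.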
| $ vllm serve google/gemma-4-E4B-it (APIServer pid=65016) INFO 05-25 21:22:17 [utils.py:306] (APIServer pid=65016) INFO 05-25 21:22:17 [utils.py:306] █ █ █▄ ▄█ (APIServer pid=65016) INFO 05-25 21:22:17 [utils.py:306] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.21.0 (APIServer pid=65016) INFO 05-25 21:22:17 [utils.py:306] █▄█▀ █ █ █ █ model google/gemma-4-E4B-it (APIServer pid=65016) INFO 05-25 21:22:17 [utils.py:306] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀ (APIServer pid=65016) INFO 05-25 21:22:17 [utils.py:306] (APIServer pid=65016) INFO 05-25 21:22:17 [utils.py:240] non-default args: {'model_tag': 'google/gemma-4-E4B-it', 'model': 'google/gemma-4-E4B-it'} (APIServer pid=65016) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads. config.json: 5.14kB [00:00, 4.19MB/s] processor_config.json: 1.69kB [00:00, 6.85MB/s] (APIServer pid=65016) INFO 05-25 21:22:19 [model.py:568] Resolved architecture: Gemma4ForConditionalGeneration (APIServer pid=65016) WARNING 05-25 21:22:19 [model.py:1982] Your device 'NVIDIA GeForce GTX 1080 Ti' (with compute capability 6.1) doesn't support torch.bfloat16. Falling back to torch.float16 for compatibility. (APIServer pid=65016) WARNING 05-25 21:22:19 [model.py:2035] Casting torch.bfloat16 to torch.float16. (APIServer pid=65016) INFO 05-25 21:22:19 [model.py:1697] Using max model len 131072 (APIServer pid=65016) INFO 05-25 21:22:19 [config.py:101] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence. (APIServer pid=65016) INFO 05-25 21:22:19 [vllm.py:886] Asynchronous scheduling is enabled. 
(APIServer pid=65016) INFO 05-25 21:22:19 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native']) tokenizer_config.json: 2.10kB [00:00, 2.10MB/s] tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32.2M/32.2M [00:02<00:00, 11.8MB/s] chat_template.jinja: 17.3kB [00:00, 13.0MB/s] generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 208/208 [00:00<00:00, 1.66MB/s] (EngineCore pid=65068) INFO 05-25 21:23:14 [core.py:109] Initializing a V1 LLM engine (v0.21.0) with config: model='google/gemma-4-E4B-it', speculative_config=None, tokenizer='google/gemma-4-E4B-it', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=google/gemma-4-E4B-it, 
enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 512, 
'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native']), enable_flashinfer_autotune=False, moe_backend='auto') (EngineCore pid=65068) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads. (EngineCore pid=65068) /home/minimonk/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:371: UserWarning: Found GPU0 NVIDIA GeForce GTX 1080 Ti which is of compute capability (CC) 6.1. (EngineCore pid=65068) The following list shows the CCs this version of PyTorch was built for and the hardware CCs it supports: (EngineCore pid=65068) - 7.5 which supports hardware CC >=7.5,<8.0 (EngineCore pid=65068) - 8.0 which supports hardware CC >=8.0,<9.0 except {8.7} (EngineCore pid=65068) - 8.6 which supports hardware CC >=8.6,<9.0 except {8.7} (EngineCore pid=65068) - 9.0 which supports hardware CC >=9.0,<10.0 (EngineCore pid=65068) - 10.0 which supports hardware CC >=10.0,<11.0 except {10.1} (EngineCore pid=65068) - 12.0 which supports hardware CC >=12.0,<13.0 (EngineCore pid=65068) Please follow the instructions at https://pytorch.org/get-started/locally/ to install a PyTorch release that supports one of these CUDA versions: 12.6 (EngineCore pid=65068) _warn_unsupported_code(d, device_cc, code_ccs) (EngineCore pid=65068) /home/minimonk/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:371: UserWarning: Found GPU1 NVIDIA GeForce GTX 1080 Ti which is of compute capability (CC) 6.1. 
(EngineCore pid=65068) The following list shows the CCs this version of PyTorch was built for and the hardware CCs it supports: (EngineCore pid=65068) - 7.5 which supports hardware CC >=7.5,<8.0 (EngineCore pid=65068) - 8.0 which supports hardware CC >=8.0,<9.0 except {8.7} (EngineCore pid=65068) - 8.6 which supports hardware CC >=8.6,<9.0 except {8.7} (EngineCore pid=65068) - 9.0 which supports hardware CC >=9.0,<10.0 (EngineCore pid=65068) - 10.0 which supports hardware CC >=10.0,<11.0 except {10.1} (EngineCore pid=65068) - 12.0 which supports hardware CC >=12.0,<13.0 (EngineCore pid=65068) Please follow the instructions at https://pytorch.org/get-started/locally/ to install a PyTorch release that supports one of these CUDA versions: 12.6 (EngineCore pid=65068) _warn_unsupported_code(d, device_cc, code_ccs) (EngineCore pid=65068) /home/minimonk/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:489: UserWarning: (EngineCore pid=65068) NVIDIA GeForce GTX 1080 Ti with CUDA capability sm_61 is not compatible with the current PyTorch installation. (EngineCore pid=65068) The current PyTorch install supports CUDA capabilities sm_75 sm_80 sm_86 sm_90 sm_100 sm_120. (EngineCore pid=65068) If you want to use the NVIDIA GeForce GTX 1080 Ti GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/ (EngineCore pid=65068) (EngineCore pid=65068) queued_call() (EngineCore pid=65068) INFO 05-25 21:23:20 [parallel_state.py:1410] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.40.238:39589 backend=nccl (EngineCore pid=65068) INFO 05-25 21:23:20 [parallel_state.py:1723] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A (EngineCore pid=65068) WARNING 05-25 21:23:21 [topk_topp_sampler.py:61] FlashInfer top-p/top-k sampling not supported on compute capability 6.1; falling back to PyTorch-native sampler. Set VLLM_USE_FLASHINFER_SAMPLER=0 to silence. 
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] EngineCore failed to start. (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] Traceback (most recent call last): (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1114, in run_engine_core (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs) (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] return func(*args, **kwargs) (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 880, in __init__ (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] super().__init__( (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 118, in __init__ (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] self.model_executor = executor_class(vllm_config) (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] return func(*args, **kwargs) (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 109, in __init__ (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] self._init_executor() (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 60, in _init_executor (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] 
self.driver_worker.init_device() (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/worker_base.py", line 317, in init_device (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] self.worker.init_device() # type: ignore (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] return func(*args, **kwargs) (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 330, in init_device (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] self.model_runner = GPUModelRunnerV1(self.vllm_config, self.device) (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 629, in __init__ (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] self.input_batch = InputBatch( (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_input_batch.py", line 171, in __init__ (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] self.block_table = MultiGroupBlockTable( (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 267, in __init__ (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] self.block_tables = [ (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 268, in <listcomp> (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] BlockTable( (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] File 
"/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 70, in __init__ (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] self.block_table = self._make_buffer( (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 218, in _make_buffer (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] return CpuGpuBuffer( (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/utils.py", line 120, in __init__ (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] self.gpu = torch.zeros_like(self.cpu, device=device) (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] Search for `cudaErrorNoKernelImageForDevice' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] For debugging consider passing CUDA_LAUNCH_BLOCKING=1 (EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. 
(EngineCore pid=65068) ERROR 05-25 21:23:21 [core.py:1140] (EngineCore pid=65068) Process EngineCore: (EngineCore pid=65068) Traceback (most recent call last): (EngineCore pid=65068) File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap (EngineCore pid=65068) self.run() (EngineCore pid=65068) File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run (EngineCore pid=65068) self._target(*self._args, **self._kwargs) (EngineCore pid=65068) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1144, in run_engine_core (EngineCore pid=65068) raise e (EngineCore pid=65068) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1114, in run_engine_core (EngineCore pid=65068) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs) (EngineCore pid=65068) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=65068) return func(*args, **kwargs) (EngineCore pid=65068) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 880, in __init__ (EngineCore pid=65068) super().__init__( (EngineCore pid=65068) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 118, in __init__ (EngineCore pid=65068) self.model_executor = executor_class(vllm_config) (EngineCore pid=65068) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=65068) return func(*args, **kwargs) (EngineCore pid=65068) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 109, in __init__ (EngineCore pid=65068) self._init_executor() (EngineCore pid=65068) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 60, in _init_executor (EngineCore pid=65068) self.driver_worker.init_device() (EngineCore pid=65068) File 
"/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/worker_base.py", line 317, in init_device (EngineCore pid=65068) self.worker.init_device() # type: ignore (EngineCore pid=65068) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=65068) return func(*args, **kwargs) (EngineCore pid=65068) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 330, in init_device (EngineCore pid=65068) self.model_runner = GPUModelRunnerV1(self.vllm_config, self.device) (EngineCore pid=65068) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 629, in __init__ (EngineCore pid=65068) self.input_batch = InputBatch( (EngineCore pid=65068) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_input_batch.py", line 171, in __init__ (EngineCore pid=65068) self.block_table = MultiGroupBlockTable( (EngineCore pid=65068) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 267, in __init__ (EngineCore pid=65068) self.block_tables = [ (EngineCore pid=65068) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 268, in <listcomp> (EngineCore pid=65068) BlockTable( (EngineCore pid=65068) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 70, in __init__ (EngineCore pid=65068) self.block_table = self._make_buffer( (EngineCore pid=65068) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 218, in _make_buffer (EngineCore pid=65068) return CpuGpuBuffer( (EngineCore pid=65068) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/utils.py", line 120, in __init__ (EngineCore pid=65068) self.gpu = torch.zeros_like(self.cpu, device=device) (EngineCore pid=65068) torch.AcceleratorError: CUDA error: no kernel image is available for 
execution on the device (EngineCore pid=65068) Search for `cudaErrorNoKernelImageForDevice' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. (EngineCore pid=65068) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. (EngineCore pid=65068) For debugging consider passing CUDA_LAUNCH_BLOCKING=1 (EngineCore pid=65068) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. (EngineCore pid=65068) [rank0]:[W525 21:23:22.743241268 ProcessGroupNCCL.cpp:1575] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) (APIServer pid=65016) Traceback (most recent call last): (APIServer pid=65016) File "/home/minimonk/.local/bin/vllm", line 8, in <module> (APIServer pid=65016) sys.exit(main()) (APIServer pid=65016) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/entrypoints/cli/main.py", line 92, in main (APIServer pid=65016) args.dispatch_function(args) (APIServer pid=65016) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd (APIServer pid=65016) uvloop.run(run_server(args)) (APIServer pid=65016) File "/home/minimonk/.local/lib/python3.10/site-packages/uvloop/__init__.py", line 69, in run (APIServer pid=65016) return loop.run_until_complete(wrapper()) (APIServer pid=65016) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete (APIServer pid=65016) File "/home/minimonk/.local/lib/python3.10/site-packages/uvloop/__init__.py", line 48, in wrapper (APIServer pid=65016) return await main (APIServer pid=65016) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 693, in run_server (APIServer pid=65016) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs) 
(APIServer pid=65016) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 707, in run_server_worker (APIServer pid=65016) async with build_async_engine_client( (APIServer pid=65016) File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__ (APIServer pid=65016) return await anext(self.gen) (APIServer pid=65016) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client (APIServer pid=65016) async with build_async_engine_client_from_engine_args( (APIServer pid=65016) File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__ (APIServer pid=65016) return await anext(self.gen) (APIServer pid=65016) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args (APIServer pid=65016) async_llm = AsyncLLM.from_vllm_config( (APIServer pid=65016) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 217, in from_vllm_config (APIServer pid=65016) return cls( (APIServer pid=65016) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 146, in __init__ (APIServer pid=65016) self.engine_core = EngineCoreClient.make_async_mp_client( (APIServer pid=65016) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (APIServer pid=65016) return func(*args, **kwargs) (APIServer pid=65016) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client (APIServer pid=65016) return AsyncMPClient(*client_args) (APIServer pid=65016) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (APIServer pid=65016) return func(*args, **kwargs) (APIServer pid=65016) File 
"/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 900, in __init__ (APIServer pid=65016) super().__init__( (APIServer pid=65016) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 535, in __init__ (APIServer pid=65016) with launch_core_engines( (APIServer pid=65016) File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__ (APIServer pid=65016) next(self.gen) (APIServer pid=65016) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 1128, in launch_core_engines (APIServer pid=65016) wait_for_engine_startup( (APIServer pid=65016) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 1187, in wait_for_engine_startup (APIServer pid=65016) raise RuntimeError( (APIServer pid=65016) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {} |