$ vllm serve google/gemma-4-E4B-it (APIServer pid=52696) INFO 05-24 22:20:19 [utils.py:306] (APIServer pid=52696) INFO 05-24 22:20:19 [utils.py:306] █ █ █▄ ▄█ (APIServer pid=52696) INFO 05-24 22:20:19 [utils.py:306] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.21.0 (APIServer pid=52696) INFO 05-24 22:20:19 [utils.py:306] █▄█▀ █ █ █ █ model google/gemma-4-E4B-it (APIServer pid=52696) INFO 05-24 22:20:19 [utils.py:306] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀ (APIServer pid=52696) INFO 05-24 22:20:19 [utils.py:306] (APIServer pid=52696) INFO 05-24 22:20:19 [utils.py:240] non-default args: {'model_tag': 'google/gemma-4-E4B-it', 'model': 'google/gemma-4-E4B-it'} (APIServer pid=52696) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads. config.json: 5.14kB [00:00, 17.7MB/s] processor_config.json: 1.69kB [00:00, 1.70MB/s] (APIServer pid=52696) INFO 05-24 22:20:29 [model.py:568] Resolved architecture: Gemma4ForConditionalGeneration (APIServer pid=52696) WARNING 05-24 22:20:29 [model.py:1982] Your device 'NVIDIA GeForce GTX 1080 Ti' (with compute capability 6.1) doesn't support torch.bfloat16. Falling back to torch.float16 for compatibility. (APIServer pid=52696) WARNING 05-24 22:20:29 [model.py:2035] Casting torch.bfloat16 to torch.float16. (APIServer pid=52696) INFO 05-24 22:20:29 [model.py:1697] Using max model len 131072 (APIServer pid=52696) INFO 05-24 22:20:29 [config.py:101] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence. (APIServer pid=52696) INFO 05-24 22:20:29 [vllm.py:886] Asynchronous scheduling is enabled. 
(APIServer pid=52696) INFO 05-24 22:20:29 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native']) tokenizer_config.json: 2.10kB [00:00, 2.05MB/s] tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32.2M/32.2M [00:02<00:00, 14.8MB/s] chat_template.jinja: 17.3kB [00:00, 12.6MB/s] generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 208/208 [00:00<00:00, 1.68MB/s] (EngineCore pid=52767) INFO 05-24 22:21:24 [core.py:109] Initializing a V1 LLM engine (v0.21.0) with config: model='google/gemma-4-E4B-it', speculative_config=None, tokenizer='google/gemma-4-E4B-it', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, 
served_model_name=google/gemma-4-E4B-it, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 
'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native']), enable_flashinfer_autotune=False, moe_backend='auto') (EngineCore pid=52767) /home/minimonk/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:371: UserWarning: Found GPU0 NVIDIA GeForce GTX 1080 Ti which is of compute capability (CC) 6.1. (EngineCore pid=52767) The following list shows the CCs this version of PyTorch was built for and the hardware CCs it supports: (EngineCore pid=52767) - 7.5 which supports hardware CC >=7.5,<8.0 (EngineCore pid=52767) - 8.0 which supports hardware CC >=8.0,<9.0 except {8.7} (EngineCore pid=52767) - 8.6 which supports hardware CC >=8.6,<9.0 except {8.7} (EngineCore pid=52767) - 9.0 which supports hardware CC >=9.0,<10.0 (EngineCore pid=52767) - 10.0 which supports hardware CC >=10.0,<11.0 except {10.1} (EngineCore pid=52767) - 12.0 which supports hardware CC >=12.0,<13.0 (EngineCore pid=52767) Please follow the instructions at https://pytorch.org/get-started/locally/ to install a PyTorch release that supports one of these CUDA versions: 12.6 (EngineCore pid=52767) _warn_unsupported_code(d, device_cc, code_ccs) (EngineCore pid=52767) /home/minimonk/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:371: UserWarning: Found GPU1 NVIDIA GeForce GTX 1080 Ti which is of compute capability (CC) 6.1. 
(EngineCore pid=52767) The following list shows the CCs this version of PyTorch was built for and the hardware CCs it supports: (EngineCore pid=52767) - 7.5 which supports hardware CC >=7.5,<8.0 (EngineCore pid=52767) - 8.0 which supports hardware CC >=8.0,<9.0 except {8.7} (EngineCore pid=52767) - 8.6 which supports hardware CC >=8.6,<9.0 except {8.7} (EngineCore pid=52767) - 9.0 which supports hardware CC >=9.0,<10.0 (EngineCore pid=52767) - 10.0 which supports hardware CC >=10.0,<11.0 except {10.1} (EngineCore pid=52767) - 12.0 which supports hardware CC >=12.0,<13.0 (EngineCore pid=52767) Please follow the instructions at https://pytorch.org/get-started/locally/ to install a PyTorch release that supports one of these CUDA versions: 12.6 (EngineCore pid=52767) _warn_unsupported_code(d, device_cc, code_ccs) (EngineCore pid=52767) /home/minimonk/.local/lib/python3.10/site-packages/torch/cuda/__init__.py:489: UserWarning: (EngineCore pid=52767) NVIDIA GeForce GTX 1080 Ti with CUDA capability sm_61 is not compatible with the current PyTorch installation. (EngineCore pid=52767) The current PyTorch install supports CUDA capabilities sm_75 sm_80 sm_86 sm_90 sm_100 sm_120. (EngineCore pid=52767) If you want to use the NVIDIA GeForce GTX 1080 Ti GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/ (EngineCore pid=52767) (EngineCore pid=52767) queued_call() (EngineCore pid=52767) INFO 05-24 22:21:30 [parallel_state.py:1410] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.40.238:47913 backend=nccl (EngineCore pid=52767) INFO 05-24 22:21:30 [parallel_state.py:1723] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A (EngineCore pid=52767) WARNING 05-24 22:21:31 [topk_topp_sampler.py:61] FlashInfer top-p/top-k sampling not supported on compute capability 6.1; falling back to PyTorch-native sampler. Set VLLM_USE_FLASHINFER_SAMPLER=0 to silence. 
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] EngineCore failed to start. (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] Traceback (most recent call last): (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1114, in run_engine_core (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs) (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] return func(*args, **kwargs) (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 880, in __init__ (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] super().__init__( (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 118, in __init__ (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] self.model_executor = executor_class(vllm_config) (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] return func(*args, **kwargs) (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 109, in __init__ (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] self._init_executor() (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 60, in _init_executor (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] 
self.driver_worker.init_device() (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/worker_base.py", line 317, in init_device (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] self.worker.init_device() # type: ignore (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] return func(*args, **kwargs) (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 330, in init_device (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] self.model_runner = GPUModelRunnerV1(self.vllm_config, self.device) (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 629, in __init__ (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] self.input_batch = InputBatch( (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_input_batch.py", line 171, in __init__ (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] self.block_table = MultiGroupBlockTable( (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 267, in __init__ (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] self.block_tables = [ (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 268, in <listcomp> (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] BlockTable( (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File 
"/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 70, in __init__ (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] self.block_table = self._make_buffer( (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 218, in _make_buffer (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] return CpuGpuBuffer( (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/utils.py", line 120, in __init__ (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] self.gpu = torch.zeros_like(self.cpu, device=device) (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] Search for `cudaErrorNoKernelImageForDevice' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] For debugging consider passing CUDA_LAUNCH_BLOCKING=1 (EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. 
(EngineCore pid=52767) ERROR 05-24 22:21:32 [core.py:1140] (EngineCore pid=52767) Process EngineCore: (EngineCore pid=52767) Traceback (most recent call last): (EngineCore pid=52767) File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap (EngineCore pid=52767) self.run() (EngineCore pid=52767) File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run (EngineCore pid=52767) self._target(*self._args, **self._kwargs) (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1144, in run_engine_core (EngineCore pid=52767) raise e (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1114, in run_engine_core (EngineCore pid=52767) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs) (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=52767) return func(*args, **kwargs) (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 880, in __init__ (EngineCore pid=52767) super().__init__( (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 118, in __init__ (EngineCore pid=52767) self.model_executor = executor_class(vllm_config) (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=52767) return func(*args, **kwargs) (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 109, in __init__ (EngineCore pid=52767) self._init_executor() (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 60, in _init_executor (EngineCore pid=52767) self.driver_worker.init_device() (EngineCore pid=52767) File 
"/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/worker_base.py", line 317, in init_device (EngineCore pid=52767) self.worker.init_device() # type: ignore (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=52767) return func(*args, **kwargs) (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 330, in init_device (EngineCore pid=52767) self.model_runner = GPUModelRunnerV1(self.vllm_config, self.device) (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 629, in __init__ (EngineCore pid=52767) self.input_batch = InputBatch( (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/gpu_input_batch.py", line 171, in __init__ (EngineCore pid=52767) self.block_table = MultiGroupBlockTable( (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 267, in __init__ (EngineCore pid=52767) self.block_tables = [ (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 268, in <listcomp> (EngineCore pid=52767) BlockTable( (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 70, in __init__ (EngineCore pid=52767) self.block_table = self._make_buffer( (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/worker/block_table.py", line 218, in _make_buffer (EngineCore pid=52767) return CpuGpuBuffer( (EngineCore pid=52767) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/utils.py", line 120, in __init__ (EngineCore pid=52767) self.gpu = torch.zeros_like(self.cpu, device=device) (EngineCore pid=52767) torch.AcceleratorError: CUDA error: no kernel image is available for 
execution on the device (EngineCore pid=52767) Search for `cudaErrorNoKernelImageForDevice' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. (EngineCore pid=52767) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. (EngineCore pid=52767) For debugging consider passing CUDA_LAUNCH_BLOCKING=1 (EngineCore pid=52767) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. (EngineCore pid=52767) [rank0]:[W524 22:21:32.199264489 ProcessGroupNCCL.cpp:1575] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) (APIServer pid=52696) Traceback (most recent call last): (APIServer pid=52696) File "/home/minimonk/.local/bin/vllm", line 8, in <module> (APIServer pid=52696) sys.exit(main()) (APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/entrypoints/cli/main.py", line 92, in main (APIServer pid=52696) args.dispatch_function(args) (APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd (APIServer pid=52696) uvloop.run(run_server(args)) (APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/uvloop/__init__.py", line 69, in run (APIServer pid=52696) return loop.run_until_complete(wrapper()) (APIServer pid=52696) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete (APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/uvloop/__init__.py", line 48, in wrapper (APIServer pid=52696) return await main (APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 693, in run_server (APIServer pid=52696) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs) 
(APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 707, in run_server_worker (APIServer pid=52696) async with build_async_engine_client( (APIServer pid=52696) File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__ (APIServer pid=52696) return await anext(self.gen) (APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client (APIServer pid=52696) async with build_async_engine_client_from_engine_args( (APIServer pid=52696) File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__ (APIServer pid=52696) return await anext(self.gen) (APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args (APIServer pid=52696) async_llm = AsyncLLM.from_vllm_config( (APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 217, in from_vllm_config (APIServer pid=52696) return cls( (APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 146, in __init__ (APIServer pid=52696) self.engine_core = EngineCoreClient.make_async_mp_client( (APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (APIServer pid=52696) return func(*args, **kwargs) (APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client (APIServer pid=52696) return AsyncMPClient(*client_args) (APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (APIServer pid=52696) return func(*args, **kwargs) (APIServer pid=52696) File 
"/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 900, in __init__ (APIServer pid=52696) super().__init__( (APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 535, in __init__ (APIServer pid=52696) with launch_core_engines( (APIServer pid=52696) File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__ (APIServer pid=52696) next(self.gen) (APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 1128, in launch_core_engines (APIServer pid=52696) wait_for_engine_startup( (APIServer pid=52696) File "/home/minimonk/.local/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 1187, in wait_for_engine_startup (APIServer pid=52696) raise RuntimeError( (APIServer pid=52696) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}