
vLLM

curl -LsSf https://astral.sh/uv/install.sh | sh

source $HOME/.cargo/env

sudo apt install python3 python3-pip

uv venv --python 3.12 --seed

source .venv/bin/activate

source .venv/bin/activate && uv pip install "vllm[audio]"

uv pip install vllm --torch-backend=auto

source .venv/bin/activate && vllm serve openai/whisper-large-v3 --host 0.0.0.0 --port 8000 --gpu-memory-utilization 0.85

Log

rechenknecht@rechenknecht:/mnt/c/src/github/wirus_bb_main/vllmWhisper$ source .venv/bin/activate && vllm serve openai/whisper-large-v3 --host 0.0.0.0 --port 8000 --gpu-memory-utilization 0.85
(APIServer pid=11980) INFO 02-12 19:33:39 [utils.py:325]
(APIServer pid=11980) INFO 02-12 19:33:39 [utils.py:325] █ █ █▄ ▄█
(APIServer pid=11980) INFO 02-12 19:33:39 [utils.py:325] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.15.1
(APIServer pid=11980) INFO 02-12 19:33:39 [utils.py:325] █▄█▀ █ █ █ █ model openai/whisper-large-v3
(APIServer pid=11980) INFO 02-12 19:33:39 [utils.py:325] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=11980) INFO 02-12 19:33:39 [utils.py:325]
(APIServer pid=11980) INFO 02-12 19:33:39 [utils.py:261] non-default args: {'model_tag': 'openai/whisper-large-v3', 'api_server_count': 1, 'host': '0.0.0.0', 'model': 'openai/whisper-large-v3', 'gpu_memory_utilization': 0.85}
(APIServer pid=11980) INFO 02-12 19:33:40 [model.py:541] Resolved architecture: WhisperForConditionalGeneration
(APIServer pid=11980) INFO 02-12 19:33:40 [model.py:1561] Using max model len 448
(APIServer pid=11980) INFO 02-12 19:33:40 [model.py:575] Encoder-decoder model detected, disabling mm processor cache.
(APIServer pid=11980) INFO 02-12 19:33:41 [scheduler.py:217] Encoder-decoder models do not support chunked prefill nor prefix caching; disabling both.
(APIServer pid=11980) INFO 02-12 19:33:41 [vllm.py:624] Asynchronous scheduling is enabled.
(APIServer pid=11980) INFO 02-12 19:33:41 [vllm.py:751] Encoder-decoder models do not support FULL_AND_PIECEWISE. Overriding cudagraph_mode to FULL_DECODE_ONLY.
(APIServer pid=11980) WARNING 02-12 19:33:41 [vllm.py:905] No piecewise cudagraph for executing cascade attention. Will fall back to eager execution if a batch runs into cascade attentions.
(APIServer pid=11980) WARNING 02-12 19:33:44 [interface.py:470] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(EngineCore_DP0 pid=12321) INFO 02-12 19:34:53 [core.py:96] Initializing a V1 LLM engine (v0.15.1) with config: model='openai/whisper-large-v3', speculative_config=None, tokenizer='openai/whisper-large-v3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=448, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=openai/whisper-large-v3, enable_prefix_caching=False, enable_chunked_prefill=False, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_DECODE_ONLY: (2, 0)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=12321) INFO 02-12 19:34:56 [parallel_state.py:1212] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.29.9.184:48511 backend=nccl
(EngineCore_DP0 pid=12321) INFO 02-12 19:34:56 [parallel_state.py:1423] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
(EngineCore_DP0 pid=12321) WARNING 02-12 19:34:56 [interface.py:470] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(EngineCore_DP0 pid=12321) INFO 02-12 19:35:05 [gpu_model_runner.py:4033] Starting to load model openai/whisper-large-v3...
(EngineCore_DP0 pid=12321) INFO 02-12 19:35:05 [mm_encoder_attention.py:77] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore_DP0 pid=12321) /mnt/c/src/github/wirus_bb_main/vllmWhisper/.venv/lib/python3.12/site-packages/tvm_ffi/_optional_torch_c_dlpack.py:174: UserWarning: Failed to JIT torch c dlpack extension, EnvTensorAllocator will not be enabled.
(EngineCore_DP0 pid=12321) We recommend installing via pip install torch-c-dlpack-ext
(EngineCore_DP0 pid=12321) warnings.warn(
(EngineCore_DP0 pid=12321) INFO 02-12 19:35:29 [cuda.py:364] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION')
(EngineCore_DP0 pid=12321) INFO 02-12 19:35:29 [cuda.py:364] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'TRITON_ATTN')
(EngineCore_DP0 pid=12321) INFO 02-12 19:35:30 [weight_utils.py:567] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:00<00:01, 1.20it/s]
Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:01<00:00, 1.12it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00, 1.19it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:02<00:00, 1.18it/s]
(EngineCore_DP0 pid=12321)
(EngineCore_DP0 pid=12321) INFO 02-12 19:35:33 [default_loader.py:291] Loading weights took 2.70 seconds
(EngineCore_DP0 pid=12321) INFO 02-12 19:35:33 [gpu_model_runner.py:4130] Model loading took 2.88 GiB memory and 27.944203 seconds
(EngineCore_DP0 pid=12321) INFO 02-12 19:35:34 [gpu_model_runner.py:4958] Encoder cache will be initialized with a budget of 2048 tokens, and profiled with 1 audio items of the maximum feature size.
(EngineCore_DP0 pid=12321) WARNING 02-12 19:35:34 [context.py:472] WhisperProcessor did not return BatchFeature. Make sure to match the behaviour of ProcessorMixin when implementing custom processors.
(EngineCore_DP0 pid=12321) INFO 02-12 19:35:51 [backends.py:812] Using cache directory: /home/rechenknecht/.cache/vllm/torch_compile_cache/6715ce382d/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=12321) INFO 02-12 19:35:51 [backends.py:872] Dynamo bytecode transform time: 16.41 s
(EngineCore_DP0 pid=12321) INFO 02-12 19:35:56 [backends.py:267] Directly load the compiled graph(s) for compile range (1, 2048) from the cache, took 2.089 s
(EngineCore_DP0 pid=12321) INFO 02-12 19:35:56 [monitor.py:34] torch.compile takes 18.50 s in total
(EngineCore_DP0 pid=12321) INFO 02-12 19:35:57 [gpu_worker.py:356] Available KV cache memory: 10.14 GiB
(EngineCore_DP0 pid=12321) INFO 02-12 19:35:57 [kv_cache_utils.py:1307] GPU KV cache size: 33,216 tokens
(EngineCore_DP0 pid=12321) INFO 02-12 19:35:57 [kv_cache_utils.py:1312] Maximum concurrency for 448 tokens per request: 26.62x
Capturing CUDA graphs (decode, FULL): 100%|██████████████████████████████████████████| 35/35 [00:02<00:00, 13.85it/s]
(EngineCore_DP0 pid=12321) INFO 02-12 19:36:00 [gpu_model_runner.py:5063] Graph capturing finished in 3 secs, took 0.22 GiB
(EngineCore_DP0 pid=12321) INFO 02-12 19:36:00 [core.py:272] init engine (profile, create kv cache, warmup model) took 26.28 seconds
(EngineCore_DP0 pid=12321) INFO 02-12 19:36:00 [vllm.py:624] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=12321) WARNING 02-12 19:36:00 [vllm.py:905] No piecewise cudagraph for executing cascade attention. Will fall back to eager execution if a batch runs into cascade attentions.
(APIServer pid=11980) INFO 02-12 19:36:01 [api_server.py:665] Supported tasks: ['transcription']
(APIServer pid=11980) INFO 02-12 19:36:04 [speech_to_text.py:142] Warming up audio preprocessing libraries...
(APIServer pid=11980) INFO 02-12 19:36:30 [speech_to_text.py:178] Audio preprocessing warmup completed in 26.92s
(APIServer pid=11980) INFO 02-12 19:36:30 [speech_to_text.py:205] Warming up multimodal input processor...
(APIServer pid=11980) INFO 02-12 19:36:35 [speech_to_text.py:238] Input processor warmup completed in 4.75s
(APIServer pid=11980) INFO 02-12 19:36:36 [speech_to_text.py:142] Warming up audio preprocessing libraries...
(APIServer pid=11980) INFO 02-12 19:36:36 [speech_to_text.py:178] Audio preprocessing warmup completed in 0.00s
(APIServer pid=11980) INFO 02-12 19:36:36 [speech_to_text.py:205] Warming up multimodal input processor...
(APIServer pid=11980) WARNING 02-12 19:36:36 [context.py:472] WhisperProcessor did not return BatchFeature. Make sure to match the behaviour of ProcessorMixin when implementing custom processors.
(APIServer pid=11980) INFO 02-12 19:36:36 [speech_to_text.py:238] Input processor warmup completed in 0.00s
(APIServer pid=11980) INFO 02-12 19:36:36 [api_server.py:946] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:38] Available routes are:
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /pause, Methods: POST
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /resume, Methods: POST
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /is_paused, Methods: GET
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=11980) INFO 02-12 19:36:36 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=11980) INFO: Started server process [11980]
(APIServer pid=11980) INFO: Waiting for application startup.
(APIServer pid=11980) INFO: Application startup complete.
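Once "Application startup complete." appears, a quick sanity check against two of the routes listed above (assuming the server is reachable on localhost:8000) confirms that the API is up and serving the expected model:

curl -s http://localhost:8000/health

curl -s http://localhost:8000/v1/models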

Client

curl -s http://localhost:8000/v1/audio/transcriptions \
  -F "model=openai/whisper-large-v3" \
  -F "file=@/path/to/audio.wav" \
  -F "language=en" \
  -F "response_format=verbose_json"

curl -s http://rechenknecht:8000/v1/audio/transcriptions \
  -F "model=openai/whisper-medium" \
  -F "file=@file_8—ff54563f-5bc5-4e08-bd4c-487c0f058fc6.ogg" \
  -F "language=de" \
  -F "response_format=verbose_json"
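To pull just the transcript out of the JSON response, the output can be piped through jq (a minimal sketch: it assumes jq is installed, uses a placeholder file name, and relies on the top-level "text" field of the OpenAI-compatible transcription response):

curl -s http://rechenknecht:8000/v1/audio/transcriptions -F "model=openai/whisper-medium" -F "file=@recording.ogg" -F "language=de" -F "response_format=verbose_json" | jq -r '.text'

With verbose_json the response should also carry per-segment details alongside the plain text.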

vLLM

curl -LsSf https://astral.sh/uv/install.sh | sh

source $HOME/.cargo/env

sudo apt install python3 python3-pip

uv venv --python 3.12 --seed

source .venv/bin/activate

uv pip install "vllm[audio]" --torch-backend=auto

Start Server

vllm serve openai/whisper-large-v3 --host 0.0.0.0 --port 8000 --gpu-memory-utilization 0.85

Access from Local Network (WSL)

1. Get Windows host IP

ipconfig.exe | grep "IPv4" | head -1

2. Allow port 8000 in Windows Firewall (run in PowerShell as Administrator)

New-NetFirewallRule -DisplayName "VLLM Server" -Direction Inbound -LocalPort 8000 -Protocol TCP -Action Allow
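3. Verify from another machine on the network (a quick check against the /health route from the server log; 192.168.1.50 is a placeholder for the Windows host IP from step 1)

curl -s -o /dev/null -w "%{http_code}\n" http://192.168.1.50:8000/health

A 200 response means the port is reachable. Depending on the WSL networking mode, an additional port forward from the Windows host to the WSL address may still be required.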

Test Transcription

Convert audio if needed (WebM/Opus to WAV)

sudo apt install ffmpeg

ffmpeg -i input.webm -ar 16000 -ac 1 output.wav -y
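To convert a whole folder of recordings in one go, the same ffmpeg call can be wrapped in a loop (a small sketch assuming the .webm files sit in the current directory):

for f in *.webm; do ffmpeg -i "$f" -ar 16000 -ac 1 "${f%.webm}.wav" -y; done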

curl -s http://localhost:8000/v1/audio/transcriptions \
  -F "model=openai/whisper-large-v3" \
  -F "file=@/path/to/audio.wav" \
  -F "language=en" \
  -F "response_format=verbose_json"
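Several converted files can be transcribed in a loop as well (a sketch with placeholder paths; jq and the top-level "text" field are assumed as above):

for f in *.wav; do echo "== $f"; curl -s http://localhost:8000/v1/audio/transcriptions -F "model=openai/whisper-large-v3" -F "file=@$f" -F "language=en" -F "response_format=json" | jq -r '.text'; done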

source .venv/bin/activate && vllm serve openai/whisper-medium --host 0.0.0.0 --port 8000 --gpu-memory-utilization 0.85


vllm serve openai/whisper-medium --host 0.0.0.0 --port 8000

| Flag | Effect |
| --- | --- |
| --uvicorn-log-level debug | HTTP-layer debugging (shows HTTP access logs) |
| --enable-log-requests | Logs incoming API requests |
| --enable-log-outputs | Logs model responses |
| --disable-log-stats | Disables performance stats |

source .venv/bin/activate && vllm serve openai/whisper-medium --host 0.0.0.0 --port 8000 --gpu-memory-utilization 0.85 --uvicorn-log-level debug --enable-log-requests
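Independent of request logging, the /metrics route from the startup log exposes Prometheus-style counters, which helps when deciding whether --disable-log-stats is acceptable for a given setup:

curl -s http://localhost:8000/metrics | head -n 20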

Links

https://developers.openai.com/cookbook/articles/gpt-oss/run-vllm/

https://docs.vllm.ai/en/latest/serving/openai_compatible_server/
