
Whisper-server does not support Cuda? #3095


Open
chrisspen opened this issue Apr 29, 2025 · 2 comments


@chrisspen

I compiled whisper with CUDA support, but whisper-server is painfully slow.

If I run the standalone whisper binary on a server with 50 CUDA cores, whisper is able to transcribe jfk.wav in under a second.

However, if I try the same thing on a GPU with 2000 CUDA cores, it takes whisper 17 seconds, apparently because whisper-server does not support CUDA, even if you compiled it with CUDA support.

Is this correct? I'm not seeing any GPU options in whisper-server --help

danbev added a commit to danbev/whisper.cpp that referenced this issue Apr 30, 2025
This commit adds the command line option `--no-gpu` to the server
example's print usage function.

The motivation for this is that the option is available and can be set,
but it is not displayed in the usage message.

Refs: ggml-org#3095
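
For context, the change described in this commit message is roughly of the following shape. This is only an illustrative sketch: the server example's actual print-usage function, its name, and the flag spacing are assumptions rather than the real code.

#include <cstdio>

// Illustrative sketch only: a usage printer that now also mentions --no-gpu.
// The function name and flag layout are assumptions about the server example.
static void print_usage_sketch(bool use_gpu) {
    fprintf(stderr, "usage: whisper-server [options]\n\n");
    fprintf(stderr, "  -ng,       --no-gpu        [%-7s] disable GPU inference\n",
            use_gpu ? "false" : "true");
}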
@danbev
Collaborator

danbev commented Apr 30, 2025

Is this correct? I'm not seeing any GPU options in whisper-server --help

Yes, it seems that while it is possible to pass the --no-gpu command-line argument to whisper-server, it is not currently displayed in the usage output for whisper-server.

But by default use_gpu will be true, and you should see something like the following when starting the server:

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_init_with_params_no_state: use gpu    = 1  <---------------------------
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
register_backend: registered backend CUDA (1 devices)
register_device: registered device CUDA0 (NVIDIA GeForce RTX 4070)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (12th Gen Intel(R) Core(TM) i7-1260P)
whisper_init_with_params_no_state: devices    = 2
whisper_init_with_params_no_state: backends   = 2
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:        CUDA0 total size =   147.37 MB
whisper_model_load: model size    =  147.37 MB
whisper_backend_init_gpu: using CUDA0 backend  <---------------------------
whisper_init_state: kv self size  =    6.29 MB
whisper_init_state: kv cross size =   18.87 MB
whisper_init_state: kv pad  size  =    3.15 MB
whisper_init_state: compute buffer (conv)   =   17.22 MB
whisper_init_state: compute buffer (encode) =   85.86 MB
whisper_init_state: compute buffer (cross)  =    4.65 MB
whisper_init_state: compute buffer (decode) =   97.27 MB

whisper server listening at http://127.0.0.1:8080

When you start the server on the GPU with 2000 CUDA cores, does the output look something like the above, with a using CUDA0 backend line?

And if you pass --no-gpu you should see something like this:

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_init_with_params_no_state: use gpu    = 0  <---------------------------
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
register_backend: registered backend CUDA (1 devices)
register_device: registered device CUDA0 (NVIDIA GeForce RTX 4070)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (12th Gen Intel(R) Core(TM) i7-1260P)
whisper_init_with_params_no_state: devices    = 2
whisper_init_with_params_no_state: backends   = 2
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:          CPU total size =   147.37 MB
whisper_model_load: model size    =  147.37 MB
whisper_backend_init_gpu: no GPU found  <---------------------------
whisper_init_state: kv self size  =    6.29 MB
whisper_init_state: kv cross size =   18.87 MB
whisper_init_state: kv pad  size  =    3.15 MB
whisper_init_state: compute buffer (conv)   =   16.26 MB
whisper_init_state: compute buffer (encode) =   85.86 MB
whisper_init_state: compute buffer (cross)  =    4.65 MB
whisper_init_state: compute buffer (decode) =   96.35 MB

whisper server listening at http://127.0.0.1:8080
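
For reference, the same switch at the C API level is just the use_gpu field of whisper_context_params, which is what the server's --no-gpu ends up toggling as far as I can tell. A minimal sketch, assuming a CUDA build and using the model path from the logs above as an example:

#include <cstdio>
#include "whisper.h"

int main() {
    // use_gpu defaults to true, so a CUDA build should pick the GPU automatically.
    struct whisper_context_params cparams = whisper_context_default_params();

    // Uncommenting this is the API-level equivalent of passing --no-gpu:
    // cparams.use_gpu = false;

    struct whisper_context * ctx =
        whisper_init_from_file_with_params("models/ggml-base.en.bin", cparams);
    if (ctx == nullptr) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    // With the GPU enabled, the init log printed here should contain
    // "use gpu = 1" and "whisper_backend_init_gpu: using CUDA0 backend".

    whisper_free(ctx);
    return 0;
}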

@peardox

peardox commented May 1, 2025

They really should make device selection an option.

I'm doing that in my Delphi binding.

There is a small issue if you give it multiple GPU devices to choose from, though. It appears to use whichever comes first. If I give it both Vulkan and CUDA, then the order they're supplied in becomes relevant.

I could have a fast Vulkan device and a 1050. The default code picks CUDA first (just because it happens to be ordered alphabetically, except for CPU), so without any fiddling it would pick the 1050.
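
For what it's worth, whisper_context_params also exposes a gpu_device index (it shows up as gpu_device = 0 in the logs above), so a binding can at least request a specific device. A rough sketch; whether this index chooses between different backend types (Vulkan vs CUDA) or only between devices of one backend is exactly the ordering question above, so treat that part as an assumption:

#include "whisper.h"

// Sketch: initialize whisper on a specific GPU device index instead of device 0.
// The index value is an assumption about the machine; adjust it to your setup.
struct whisper_context * init_on_device(const char * model_path, int device) {
    struct whisper_context_params cparams = whisper_context_default_params();
    cparams.use_gpu    = true;
    cparams.gpu_device = device; // e.g. 1 to skip the default device 0

    return whisper_init_from_file_with_params(model_path, cparams);
}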

ggerganov pushed a commit that referenced this issue May 1, 2025
This commit adds the command line option `--no-gpu` to the server
example's print usage function.

The motivation for this is that the option is available and can be set,
but it is not displayed in the usage message.

Refs: #3095