Whisper-server does not support Cuda? #3095
This commit adds the command line option `--no-gpu` to the server example's print-usage function. The motivation for this is that the option is available and can be set, but it is not displayed in the usage message. Refs: ggml-org#3095
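For illustration, a minimal sketch of what such a usage line could look like (the struct name, field name, and column formatting here are assumptions for the sketch, not the repository's verbatim code):

```cpp
#include <cstdio>

// Hypothetical sketch of the added usage line; the real server example's
// print-usage function and formatting may differ.
struct server_params { bool use_gpu = true; };

static void print_no_gpu_usage(const server_params & params) {
    // Follows the usual usage-line style: flag, current default, description.
    fprintf(stderr, "  -ng,       --no-gpu [%-7s] disable GPU inference\n",
            params.use_gpu ? "false" : "true");
}
```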
Yes, it seems that while it is possible to pass the `--no-gpu` option, it is not displayed in the usage message. But by default the GPU is used:

```
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_init_with_params_no_state: use gpu = 1 <---------------------------
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw = 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
register_backend: registered backend CUDA (1 devices)
register_device: registered device CUDA0 (NVIDIA GeForce RTX 4070)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (12th Gen Intel(R) Core(TM) i7-1260P)
whisper_init_with_params_no_state: devices = 2
whisper_init_with_params_no_state: backends = 2
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs = 99
whisper_model_load: CUDA0 total size = 147.37 MB
whisper_model_load: model size = 147.37 MB
whisper_backend_init_gpu: using CUDA0 backend <---------------------------
whisper_init_state: kv self size = 6.29 MB
whisper_init_state: kv cross size = 18.87 MB
whisper_init_state: kv pad size = 3.15 MB
whisper_init_state: compute buffer (conv) = 17.22 MB
whisper_init_state: compute buffer (encode) = 85.86 MB
whisper_init_state: compute buffer (cross) = 4.65 MB
whisper_init_state: compute buffer (decode) = 97.27 MB
whisper server listening at http://127.0.0.1:8080
```

When you start the server on the GPU with 2000 CUDA cores, does the output look something like the above, where there is a `whisper_backend_init_gpu: using CUDA0 backend` line? And if you pass `--no-gpu`, the output should look like this:

```
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_init_with_params_no_state: use gpu = 0 <---------------------------
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw = 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
register_backend: registered backend CUDA (1 devices)
register_device: registered device CUDA0 (NVIDIA GeForce RTX 4070)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (12th Gen Intel(R) Core(TM) i7-1260P)
whisper_init_with_params_no_state: devices = 2
whisper_init_with_params_no_state: backends = 2
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs = 99
whisper_model_load: CPU total size = 147.37 MB
whisper_model_load: model size = 147.37 MB
whisper_backend_init_gpu: no GPU found <---------------------------
whisper_init_state: kv self size = 6.29 MB
whisper_init_state: kv cross size = 18.87 MB
whisper_init_state: kv pad size = 3.15 MB
whisper_init_state: compute buffer (conv) = 16.26 MB
whisper_init_state: compute buffer (encode) = 85.86 MB
whisper_init_state: compute buffer (cross) = 4.65 MB
whisper_init_state: compute buffer (decode) = 96.35 MB
whisper server listening at http://127.0.0.1:8080
```
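The `use gpu`, `gpu_device`, and `flash attn` lines in these logs correspond to fields of `whisper_context_params` in whisper.h. As a sketch of how a front end such as the server maps `--no-gpu` onto that API (the helper function itself is illustrative, not the server's actual code):

```cpp
#include "whisper.h"

// Illustrative helper: --no-gpu simply clears use_gpu in the context
// parameters before the model is loaded.
static struct whisper_context * load_model(const char * model_path, bool no_gpu) {
    struct whisper_context_params cparams = whisper_context_default_params();
    cparams.use_gpu    = !no_gpu; // printed above as "use gpu = 0/1"
    cparams.gpu_device = 0;       // printed above as "gpu_device = 0"
    return whisper_init_from_file_with_params(model_path, cparams);
}
```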
They really should make device selection an option. I'm doing that in my Delphi binding. There is a small issue if you give it multiple GPU devices to choose from, though: it appears to use whichever comes first. If I give it both Vulkan and CUDA, then the order they're supplied in becomes relevant. I could have a fast Vulkan device and a 1050. The default code picks CUDA first (just because it happens to be ordered alphabetically, except for CPU), so without any fiddling it would pick the 1050. A sketch of enumerating the device list is below.
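If anyone wants to experiment, ggml's backend registry exposes the device list whose ordering is at issue here. A small enumeration sketch, assuming the registry API in ggml-backend.h (verify the exact names against your ggml revision):

```cpp
#include "ggml-backend.h"
#include <cstdio>

// Sketch: list the registered devices in registration order, i.e. the order
// that matters for the default selection described above.
int main() {
    const size_t n = ggml_backend_dev_count();
    for (size_t i = 0; i < n; ++i) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);
        printf("device %zu: %s - %s\n", i,
               ggml_backend_dev_name(dev),
               ggml_backend_dev_description(dev));
    }
    return 0;
}
```

An explicit device option could then map a user-supplied index or name onto this list instead of taking the first entry.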
I compiled whisper with CUDA support, but running whisper-server shows it's painfully slow.
If I run standalone whisper on a server with 50 CUDA cores, it's able to transcribe jfk.wav in under a second.
However, if I try the same thing through whisper-server on a GPU with 2000 CUDA cores, it takes 17 seconds, apparently because whisper-server does not support CUDA, even if you compiled it with CUDA support.
Is this correct? I'm not seeing any GPU options in `whisper-server --help`.