Failure when selecting non-zero GPU index with multiple GPUs #3107

Open
Masa-tam opened this issue May 1, 2025 · 1 comment

Masa-tam commented May 1, 2025

When using whisper.cpp version 1.7.5 in an environment with multiple GPUs, specifying a gpu_device other than 0 causes an assertion failure and the program terminates.
The failure originates in make_buft_list, which creates a list of device-buffer pairs.
The problem is in the fallback logic that is meant to use GPU 0 when the specified gpu_device is greater than 0 and that GPU is not present in the environment.
For example, in an environment with two GPUs and gpu_device == 1, make_buft_list registers device-buffer pairs for both GPU 0 and GPU 1, even though GPU 1 exists.
Device selection correctly chooses GPU 1, but buffer selection only compares the device type with the buffer's device type and does not consider the actual device ID.
Consequently, it selects the buffer associated with GPU 0, which is the first entry in the buffer list.
This mismatch between the execution device (GPU 1) and the buffer's device (GPU 0) triggers the assertion failure.
As a very crude hack, inserting the following code at line 1406 of whisper.cpp, right after the GPU loop finishes, allows the program to work correctly:

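        // Crude workaround: if more than one GPU entry was registered,
        // drop the first one (GPU 0) so only the requested GPU remains.
        if (buft_list.size() > 1)
        {
            buft_list.erase(buft_list.begin());
        }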

This code forcibly removes the entry for GPU 0 from buft_list if both the GPU 0 entry and the specified GPU entry exist.
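
A less invasive alternative to erasing the first entry might be to register only the requested GPU in the first place and fall back to the first GPU only when the requested index does not exist. The sketch below is not whisper.cpp code: Device, BufferType, BuftList and make_gpu_buft_list are hypothetical stand-ins for the ggml handles, assumed purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-ins for ggml's device and buffer-type handles.
enum class DevType { CPU, GPU };

struct Device {
    DevType type;
    int     id;  // index among devices of the same type
};

struct BufferType {
    const Device * device;  // device that owns this buffer type
};

using BuftList = std::vector<std::pair<const Device *, const BufferType *>>;

// Builds the GPU part of the list with a single entry: the requested GPU,
// or the first GPU found if the requested index does not exist.
// Assumption: bufts[i] is the buffer type of devs[i].
BuftList make_gpu_buft_list(const std::vector<Device> & devs,
                            const std::vector<BufferType> & bufts,
                            int gpu_device) {
    assert(devs.size() == bufts.size());

    BuftList    list;
    bool        have_fallback = false;
    std::size_t fallback      = 0;
    int         cnt           = 0;  // number of GPUs seen so far

    for (std::size_t i = 0; i < devs.size(); ++i) {
        if (devs[i].type != DevType::GPU) {
            continue;
        }
        if (cnt == gpu_device) {
            // Requested GPU found: register it alone and stop.
            list.emplace_back(&devs[i], &bufts[i]);
            return list;
        }
        if (!have_fallback) {
            // Remember the first GPU in case the requested one is missing.
            have_fallback = true;
            fallback      = i;
        }
        ++cnt;
    }

    if (have_fallback) {
        list.emplace_back(&devs[fallback], &bufts[fallback]);
    }
    return list;
}
```

With devices {CPU, GPU 0, GPU 1} and gpu_device == 1, the list holds only the GPU 1 pair, so a lookup that takes the first GPU entry cannot pick up GPU 0's buffer. If no GPU with the requested index exists, the list holds the first GPU instead.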


peardox commented May 1, 2025

A lot of the code that checks devices never checks for null pointers

e.g. whisper_backend_init_gpu

    if (params.use_gpu) {
        for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
            ggml_backend_dev_t dev_cur = ggml_backend_dev_get(i);
            // skip entries for which no device handle was returned
            if (dev_cur == nullptr) {
                continue;
            }
            if (ggml_backend_dev_type(dev_cur) == GGML_BACKEND_DEVICE_TYPE_GPU) {

I added the nullptr check with the continue in a PR, so you won't have it. It is possible, for example, to get a null back from the dev_get call.

There's stuff like this all over the place, especially in ggml,
e.g. pass this a nullptr and you get a core dump:

ggml_backend_dev_t ggml_backend_buft_get_device(ggml_backend_buffer_type_t buft) {
    return buft->device;
}

I was told off for fixing some of those... :(
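
For illustration, here are two ways an accessor like the one above could guard against a null handle. This is only a sketch under assumptions: Device and BufferType are hypothetical stand-ins for the ggml types, and both function names are made up here, not part of ggml.

```cpp
#include <cassert>

// Hypothetical stand-ins for the ggml device and buffer-type structs.
struct Device {
    int id;
};

struct BufferType {
    Device * device;
};

// Variant 1: tolerate a null handle and report "no device" to the caller,
// who then has to check the result.
Device * buft_get_device_or_null(const BufferType * buft) {
    return buft != nullptr ? buft->device : nullptr;
}

// Variant 2: treat a null handle as a programming error and fail loudly
// at the call site (the check is only active when NDEBUG is not defined).
Device * buft_get_device_checked(const BufferType * buft) {
    assert(buft != nullptr && "buffer type must not be null");
    return buft->device;
}
```

Which variant fits depends on whether a null handle is considered a valid input or a caller bug. The first moves the check to every caller, the second turns a silent crash into a diagnosable assertion.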
