
[TensorRT ExecutionProvider] Cannot infer the model on a GPU device with an ID other than 0 #21276

Open
dat58 opened this issue Jul 8, 2024 · 6 comments


dat58 commented Jul 8, 2024

Describe the issue

In a scenario where multiple GPU devices are available, selecting the TensorrtExecutionProvider with device_id = 0 lets the model infer perfectly. However, with any other device_id (not equal to 0), an error is thrown during inference:

[2024-07-06 07:03:24   ERROR] 1: [reformat.cpp::executeCutensor::332] Error Code 1: CuTensor (Internal cuTensor permutate execute failed)
[2024-07-06 07:03:24   ERROR] 1: [checkMacros.cpp::catchCudaError::181] Error Code 1: Cuda Runtime (invalid resource handle)

A similar problem occurred with the CudaExecutionProvider a few years ago and was resolved in issue #1815 (I have tested that fix, and the CUDA EP works correctly). It is possible that a similar issue exists in the TensorrtExecutionProvider.

To reproduce

Tested on both TensorRT versions:

  • tensorrt@8.6.1.6-1+cuda11.8
  • tensorrt@10.0.1.6-1+cuda11.8
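A minimal sketch of the session setup that triggers the error (ort 2.0.0-rc.2 Rust binding; the full snippet and a containerized repro are in my comments below):

// Minimal sketch: any device_id other than 0 triggers the error at inference time.
let provider = ort::TensorRTExecutionProvider::default()
    .with_device_id(1) // works when this is 0, fails otherwise
    .build();
let session = ort::Session::builder()?
    .with_execution_providers([provider])?
    .commit_from_file("model.onnx")?;
// `inputs` is built elsewhere from the request batch
let outputs = session.run_async(inputs)?.await?; // the cuTensor/CUDA runtime errors appear here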

Urgency

No response

Platform

Linux

OS Version

Ubuntu 22.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.16.3 | 1.17.3 | 1.18.1

ONNX Runtime API

Rust (binding over the C API)

Architecture

X64

Execution Provider

TensorRT

Execution Provider Library Version

CUDA 11.8

@github-actions github-actions bot added ep:CUDA issues related to the CUDA execution provider ep:TensorRT issues related to TensorRT execution provider labels Jul 8, 2024
@yufenglee yufenglee removed the ep:CUDA issues related to the CUDA execution provider label Jul 9, 2024

chilo-ms commented Jul 9, 2024

I tried running ResNet50 on a device other than 0 with the TRT EP, and inference runs successfully.

Can you run inference on a device other than 0 with the CUDA EP?
Could you share the script/code to help repro the issue?

It would also help if you turn on verbose logging and share the log.
You should see a line with the TensorRT provider options, like the one below:

...
2024-07-09 16:31:10.283704157 [V:onnxruntime:Default, tensorrt_execution_provider.cc:1660 TensorrtExecutionProvider] [TensorRT EP] TensorRT provider options: device_id: 2, trt_max_partition_iterations: 1000, trt_min_subgraph_size: 1, trt_max_workspace_size: 1073741824, trt_fp16_enable: 0, trt_int8_enable: 0, trt_int8_calibration_cache_name: , int8_calibration_cache_available: 0, trt_int8_use_native_tensorrt_calibration_table: 0, trt_dla_enable: 0, trt_dla_core: 0, trt_dump_subgraphs: 0, trt_engine_cache_enable: 0, trt_weight_stripped_engine_enable: 0, trt_onnx_model_folder_path: , trt_cache_path: , trt_global_cache_path: , trt_engine_decryption_enable: 0, trt_engine_decryption_lib_path: , trt_force_sequential_engine_build: 0, trt_context_memory_sharing_enable: 0, trt_layer_norm_fp32_fallback: 0, trt_build_heuristics_enable: 0, trt_sparsity_enable: 0, trt_builder_optimization_level: 3, trt_auxiliary_streams: -1, trt_tactic_sources: , trt_profile_min_shapes: , trt_profile_max_shapes: , trt_profile_opt_shapes: , trt_cuda_graph_enable: 0, trt_dump_ep_context_model: 0, trt_ep_context_file_path: , trt_ep_context_embed_mode: 0, trt_cache_prefix: , trt_engine_hw_compatible: 0
....
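For the CUDA EP check, something like the following should be enough (rough sketch using the ort Rust builder; I'm assuming the CUDA EP exposes the same device-id option there, and the model path is a placeholder):

// Rough sketch: same session setup, but with the CUDA EP pinned to a non-zero device.
let cuda = ort::CUDAExecutionProvider::default()
    .with_device_id(1)
    .build();
let session = ort::Session::builder()?
    .with_execution_providers([cuda])?
    .commit_from_file("model.onnx")?; // placeholder path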

dat58 commented Jul 11, 2024

I am using the ort@2.0.0-rc.2 Rust binding over the C API (since I am not familiar with C code). Below is the simple code snippet I have used:

let model_path = "warehouse/model.onnx";

// Global thread pool for the ONNX Runtime environment
let global_options = ort::EnvironmentGlobalThreadPoolOptions {
    intra_op_parallelism: Some(16),
    ..Default::default()
};
let _ = ort::init().with_global_thread_pool(global_options).commit()?;

// TensorRT EP pinned to GPU 1, with an engine cache and explicit optimization profiles
let provider = ort::TensorRTExecutionProvider::default()
    .with_device_id(1)
    .with_engine_cache(true)
    .with_engine_cache_path("warehouse".to_owned())
    .with_profile_min_shapes("images:1x3x256x256".to_owned())
    .with_profile_opt_shapes("images:32x3x256x256".to_owned())
    .with_profile_max_shapes("images:64x3x256x256".to_owned())
    .with_max_partition_iterations(10)
    .with_max_workspace_size(2 * 1024 * 1024 * 1024)
    .build();

let session = Session::builder()?
    .with_optimization_level(ort::GraphOptimizationLevel::Level3)?
    .with_execution_providers([provider])?
    .commit_from_file(model_path)?;

Inference with

let outputs = session.run_async(inputs)?.await?;

I encountered the error mentioned above. However, when I tried running it with the CUDA Execution Provider (EP), it worked perfectly fine.


yf711 commented Jul 12, 2024

Hi @dat58, do other models (like ResNet50) work with your script with the TensorRT EP and gpu_id=1?

Btw, I saw you previously posted issue pykeio/ort#226 with gpu_id=0 and the same error type. Does that issue still happen on a single GPU? If you already fixed it, what did you do to fix it?


dat58 commented Jul 12, 2024

Hi @yf711, the ort issue was my mistake: I copied the code incorrectly there, and it should have said gpu_id != 0. I have run multiple tests and found that if I append the CUDA EP after TensorRT (providers = [TensorRTExecutionProvider, CudaExecutionProvider]), model inference succeeds.
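In builder terms, the configuration that works looks roughly like this (sketch based on my snippet above; the CUDA EP is registered with the same device id):

// Workaround sketch: register the CUDA EP (same device) after the TensorRT EP.
let trt = ort::TensorRTExecutionProvider::default()
    .with_device_id(1)
    .build();
let cuda = ort::CUDAExecutionProvider::default()
    .with_device_id(1)
    .build();
let session = Session::builder()?
    .with_optimization_level(ort::GraphOptimizationLevel::Level3)?
    .with_execution_providers([trt, cuda])?
    .commit_from_file(model_path)?;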

@github-actions github-actions bot added the ep:CUDA issues related to the CUDA execution provider label Jul 12, 2024

yf711 commented Jul 12, 2024

> Hi @yf711, the ort issue was my mistake: I copied the code incorrectly there, and it should have said gpu_id != 0. I have run multiple tests and found that if I append the CUDA EP after TensorRT (providers = [TensorRTExecutionProvider, CudaExecutionProvider]), model inference succeeds.

Thanks for sharing. Did you enable both the TRT and CUDA EPs in your multi-GPU script as well? If so, you are welcome to share your script file and model (without your IP) as a bundle to help repro this issue.

Btw, are all of your GPUs the same architecture? Is it possible that an existing engine cache was generated on gpu_id:0 and then consumed on gpu_id:1, which has a different architecture and is not compatible? I am not sure whether the Rust binding would allow that.
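If the engine cache could be involved, one quick way to rule it out is to either disable it or keep a per-device cache directory while testing (sketch reusing the builder options from your snippet; the paths are just illustrations):

// Option 1: disable the engine cache entirely while debugging.
let provider_no_cache = ort::TensorRTExecutionProvider::default()
    .with_device_id(1)
    .with_engine_cache(false)
    .build();

// Option 2: give each device its own cache directory so gpu 0 and gpu 1 never share engines.
let provider_per_device = ort::TensorRTExecutionProvider::default()
    .with_device_id(1)
    .with_engine_cache(true)
    .with_engine_cache_path("warehouse/trt_cache_gpu1".to_owned())
    .build();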


dat58 commented Jul 13, 2024

Reproducing the issue is a bit involved, so I have written a small Rust project to reproduce it; please download it from the shared source. The shared folder contains three files: Dockerfile, run.sh, and projects.zip.

The projects.zip file contains two projects. The trtsample project is used to start the HTTP server, receive incoming requests, and simulate model inference. The loadtest project is used to send a large number of requests to the HTTP server.

To create the environment and run the Rust project, use the Dockerfile provided.

docker build -t rustai:ort .

And execute run.sh to start a container.

bash run.sh

Extract projects.zip to your $HOME folder.

After following these steps, you will have mounted the projects directory inside the container.

Now, you must enter the ort container.

docker exec -it ort bash

And then follow these scenarios:

Scenario 1: Start the server using the CPU EP to verify that the server was configured correctly.

# terminal 1
cd /projects/trtsample
cargo run --release

# terminal 2
cd /projects/loadtest
bash run.sh

The HTTP server should be running successfully.

Scenario 2: Start the server using the TENSORRT EP.

# terminal 1
cd /projects/trtsample
cargo run --release -F tensorrt

# terminal 2
cd /projects/loadtest
bash run.sh

In my test, the HTTP server panics immediately after the loadtest starts. I have set the default GPU_ID = 1 in /projects/trtsample/src/main.rs, so it is expected to panic. However, if you change GPU_ID to 0, the loadtest runs successfully.
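For reference, the device selection in /projects/trtsample/src/main.rs is essentially the session setup from my earlier comment parameterized on a GPU_ID constant, roughly (simplified; the actual file is in projects.zip):

// Simplified sketch of the device selection in trtsample (see projects.zip for the real code).
const GPU_ID: i32 = 1; // change to 0 and the loadtest passes

let provider = ort::TensorRTExecutionProvider::default()
    .with_device_id(GPU_ID)
    .with_engine_cache(true)
    .with_engine_cache_path("warehouse".to_owned())
    .build();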

Scenario 3: Start the server using the TENSORRT EP + CUDA EP.

# terminal 1
cd /projects/trtsample
cargo run --release -F tensorrt_cuda

# terminal 2
cd /projects/loadtest
bash run.sh

Set GPU_ID = 1 before running this test to observe that, with the CUDA EP appended, the TensorRT EP issue no longer occurs.

NOTE: My machine is equipped with 8 NVIDIA RTX 4090 GPUs, and the driver version is 550.90.07.

