
[TensorRT ExecutionProvider] Cannot infer the model on a GPU device with an ID other than 0 #21276

Open
dat58 opened this issue Jul 8, 2024 · 6 comments


dat58 commented Jul 8, 2024

Describe the issue

In a scenario where multiple GPU devices are available, selecting the TensorrtExecutionProvider with device_id = 0 lets the model infer perfectly. However, with any other device_id (not equal to 0), an error is thrown during inference:

[2024-07-06 07:03:24   ERROR] 1: [reformat.cpp::executeCutensor::332] Error Code 1: CuTensor (Internal cuTensor permutate execute failed)
[2024-07-06 07:03:24   ERROR] 1: [checkMacros.cpp::catchCudaError::181] Error Code 1: Cuda Runtime (invalid resource handle)

A similar problem occurred with the CudaExecutionProvider a few years ago and was resolved in issue #1815 (I have tested that fix, and the CUDA EP works correctly). It is possible that a similar issue exists in the TensorrtExecutionProvider.

To reproduce

Tested on both TensorRT versions:

  • tensorrt@8.6.1.6-1+cuda11.8
  • tensorrt@10.0.1.6-1+cuda11.8
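A minimal sketch of the session setup that triggers the error (ort 2.0.0-rc.2 Rust binding; the full snippet and a containerized repro are in my comments below):

// Minimal sketch: any device_id other than 0 triggers the error at inference time.
let provider = ort::TensorRTExecutionProvider::default()
    .with_device_id(1) // works when this is 0, fails otherwise
    .build();
let session = ort::Session::builder()?
    .with_execution_providers([provider])?
    .commit_from_file("model.onnx")?;
// `inputs` is built elsewhere from the request batch
let outputs = session.run_async(inputs)?.await?; // the cuTensor/CUDA runtime errors appear here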

Urgency

No response

Platform

Linux

OS Version

Ubuntu 22.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.16.3 | 1.17.3 | 1.18.1

ONNX Runtime API

Rust (binding over the C API)

Architecture

X64

Execution Provider

TensorRT

Execution Provider Library Version

CUDA 11.8

@github-actions github-actions bot added ep:CUDA issues related to the CUDA execution provider ep:TensorRT issues related to TensorRT execution provider labels Jul 8, 2024
@yufenglee yufenglee removed the ep:CUDA issues related to the CUDA execution provider label Jul 9, 2024

chilo-ms commented Jul 9, 2024

I tried running ResNet50 on a device other than 0 with the TRT EP, and inference runs successfully.

Can you run inference on a device other than 0 with the CUDA EP?
Could you share the script/code to help repro the issue?

It would also help if you turn on verbose logging and share the log.
You should see a line with the TensorRT provider options, like the one below:

...
2024-07-09 16:31:10.283704157 [V:onnxruntime:Default, tensorrt_execution_provider.cc:1660 TensorrtExecutionProvider] [TensorRT EP] TensorRT provider options: device_id: 2, trt_max_partition_iterations: 1000, trt_min_subgraph_size: 1, trt_max_workspace_size: 1073741824, trt_fp16_enable: 0, trt_int8_enable: 0, trt_int8_calibration_cache_name: , int8_calibration_cache_available: 0, trt_int8_use_native_tensorrt_calibration_table: 0, trt_dla_enable: 0, trt_dla_core: 0, trt_dump_subgraphs: 0, trt_engine_cache_enable: 0, trt_weight_stripped_engine_enable: 0, trt_onnx_model_folder_path: , trt_cache_path: , trt_global_cache_path: , trt_engine_decryption_enable: 0, trt_engine_decryption_lib_path: , trt_force_sequential_engine_build: 0, trt_context_memory_sharing_enable: 0, trt_layer_norm_fp32_fallback: 0, trt_build_heuristics_enable: 0, trt_sparsity_enable: 0, trt_builder_optimization_level: 3, trt_auxiliary_streams: -1, trt_tactic_sources: , trt_profile_min_shapes: , trt_profile_max_shapes: , trt_profile_opt_shapes: , trt_cuda_graph_enable: 0, trt_dump_ep_context_model: 0, trt_ep_context_file_path: , trt_ep_context_embed_mode: 0, trt_cache_prefix: , trt_engine_hw_compatible: 0
....
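For the CUDA EP check, something like the following should be enough (rough sketch using the ort Rust builder; I'm assuming the CUDA EP exposes the same device-id option there, and the model path is a placeholder):

// Rough sketch: same session setup, but with the CUDA EP pinned to a non-zero device.
let cuda = ort::CUDAExecutionProvider::default()
    .with_device_id(1)
    .build();
let session = ort::Session::builder()?
    .with_execution_providers([cuda])?
    .commit_from_file("model.onnx")?; // placeholder path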

dat58 commented Jul 11, 2024

I am using the ort@2.0.0-rc.2 Rust binding over the C API (since I am not familiar with C code). Below is the simple code snippet I have used:

let model_path = "warehouse/model.onnx";

// Global thread pool for the ONNX Runtime environment
let global_options = ort::EnvironmentGlobalThreadPoolOptions {
    intra_op_parallelism: Some(16),
    ..Default::default()
};
let _ = ort::init().with_global_thread_pool(global_options).commit()?;

// TensorRT EP pinned to GPU 1, with an engine cache and explicit optimization profiles
let provider = ort::TensorRTExecutionProvider::default()
    .with_device_id(1)
    .with_engine_cache(true)
    .with_engine_cache_path("warehouse".to_owned())
    .with_profile_min_shapes("images:1x3x256x256".to_owned())
    .with_profile_opt_shapes("images:32x3x256x256".to_owned())
    .with_profile_max_shapes("images:64x3x256x256".to_owned())
    .with_max_partition_iterations(10)
    .with_max_workspace_size(2 * 1024 * 1024 * 1024)
    .build();

let session = Session::builder()?
    .with_optimization_level(ort::GraphOptimizationLevel::Level3)?
    .with_execution_providers([provider])?
    .commit_from_file(model_path)?;

Inference with

let outputs = session.run_async(inputs)?.await?;

I encountered the error mentioned above. However, when I tried running it with the CUDA Execution Provider (EP), it worked perfectly fine.


yf711 commented Jul 12, 2024

Hi @dat58, do other models (like ResNet50) work with your script with the TensorRT EP and gpu_id=1?

Btw, I saw you previously posted issue pykeio/ort#226 with gpu_id=0 and the same error type. Does that issue still happen on a single GPU? If you already fixed it, what did you do to fix it?


dat58 commented Jul 12, 2024

Hi @yf711, the ort issue was my mistake: I copied the code incorrectly there, and it should have said gpu_id != 0. I have run multiple tests and found that if I append the CUDA EP after TensorRT (providers = [TensorRTExecutionProvider, CudaExecutionProvider]), model inference succeeds.
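In builder terms, the configuration that works looks roughly like this (sketch based on my snippet above; the CUDA EP is registered with the same device id):

// Workaround sketch: register the CUDA EP (same device) after the TensorRT EP.
let trt = ort::TensorRTExecutionProvider::default()
    .with_device_id(1)
    .build();
let cuda = ort::CUDAExecutionProvider::default()
    .with_device_id(1)
    .build();
let session = Session::builder()?
    .with_optimization_level(ort::GraphOptimizationLevel::Level3)?
    .with_execution_providers([trt, cuda])?
    .commit_from_file(model_path)?;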

@github-actions github-actions bot added the ep:CUDA issues related to the CUDA execution provider label Jul 12, 2024

yf711 commented Jul 12, 2024

> Hi @yf711, the ort issue was my mistake: I copied the code incorrectly there, and it should have said gpu_id != 0. I have run multiple tests and found that if I append the CUDA EP after TensorRT (providers = [TensorRTExecutionProvider, CudaExecutionProvider]), model inference succeeds.

Thanks for sharing. Did you enable both the TRT and CUDA EPs in your multi-GPU script as well? If so, you are welcome to share your script file and model (without your IP) as a bundle to help repro this issue.

Btw, are all of your GPUs the same architecture? Is it possible that an existing engine cache was generated on gpu_id:0 and then consumed on gpu_id:1, which has a different architecture and is not compatible? I am not sure whether the Rust binding would allow that.
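If the engine cache could be involved, one quick way to rule it out is to either disable it or keep a per-device cache directory while testing (sketch reusing the builder options from your snippet; the paths are just illustrations):

// Option 1: disable the engine cache entirely while debugging.
let provider_no_cache = ort::TensorRTExecutionProvider::default()
    .with_device_id(1)
    .with_engine_cache(false)
    .build();

// Option 2: give each device its own cache directory so gpu 0 and gpu 1 never share engines.
let provider_per_device = ort::TensorRTExecutionProvider::default()
    .with_device_id(1)
    .with_engine_cache(true)
    .with_engine_cache_path("warehouse/trt_cache_gpu1".to_owned())
    .build();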


dat58 commented Jul 13, 2024

Reproducing the issue is a bit involved, so I have written a small Rust project to reproduce it; please download it from the shared source. The shared folder contains three files: Dockerfile, run.sh, and projects.zip.

The projects.zip file contains two projects. The trtsample project is used to start the HTTP server, receive incoming requests, and simulate model inference. The loadtest project is used to send a large number of requests to the HTTP server.

To create the environment and run the Rust project, use the Dockerfile provided.

docker build -t rustai:ort .

And execute run.sh to start a container.

bash run.sh

Extract projects.zip to your $HOME folder.

After following these steps, you will have mounted the projects directory inside the container.

Now, you must enter the ort container.

docker exec -it ort bash

And then follow these scenarios:

Scenario 1: Start the server using the CPU EP to verify that the server was configured correctly.

# terminal 1
cd /projects/trtsample
cargo run --release

# terminal 2
cd /projects/loadtest
bash run.sh

The HTTP server should be running successfully.

Scenario 2: Start the server using the TENSORRT EP.

# terminal 1
cd /projects/trtsample
cargo run --release -F tensorrt

# terminal 2
cd /projects/loadtest
bash run.sh

In my test, the HTTP server panics immediately after the loadtest starts. I have set the default GPU_ID = 1 in /projects/trtsample/src/main.rs, so it is expected to panic. However, if you change GPU_ID to 0, the loadtest runs successfully.
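For reference, the device selection in /projects/trtsample/src/main.rs is essentially the session setup from my earlier comment parameterized on a GPU_ID constant, roughly (simplified; the actual file is in projects.zip):

// Simplified sketch of the device selection in trtsample (see projects.zip for the real code).
const GPU_ID: i32 = 1; // change to 0 and the loadtest passes

let provider = ort::TensorRTExecutionProvider::default()
    .with_device_id(GPU_ID)
    .with_engine_cache(true)
    .with_engine_cache_path("warehouse".to_owned())
    .build();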

Scenario 3: Start the server using the TENSORRT EP + CUDA EP.

# terminal 1
cd /projects/trtsample
cargo run --release -F tensorrt_cuda

# terminal 2
cd /projects/loadtest
bash run.sh

Set GPU_ID = 1 before running this test to observe that, with the CUDA EP appended, the TensorRT EP issue no longer occurs.

NOTE: My machine is equipped with 8 NVIDIA RTX 4090 GPUs, and the driver version is 550.90.07.

