
Running on a RTX5060Ti #67

@MovLab2

Description


It runs very slowly with `--acceleration none`. When I use:

(personalive) PS D:\PersonaLive> python inference_online.py --acceleration tensorrt

host: 0.0.0.0
port: 7860
reload: False
mode: default
max_queue_size: 0
timeout: 0.0
safety_checker: False
taesd: True
ssl_certfile: None
ssl_keyfile: None
debug: False
acceleration: tensorrt
engine_dir: engines
config_path: ./configs/prompts/personalive_online.yaml

C:\Users\movla.conda\envs\personalive\lib\site-packages\diffusers\models\dual_transformer_2d.py:20: FutureWarning: DualTransformer2DModel is deprecated and will be removed in version 0.29. Importing DualTransformer2DModel from diffusers.models.dual_transformer_2d is deprecated and this will be removed in a future version. Please use from diffusers.models.transformers.dual_transformer_2d import DualTransformer2DModel, instead.
deprecate("DualTransformer2DModel", "0.29", deprecation_message)

Some weights of the model checkpoint were not used when initializing UNet2DConditionModel:
['conv_norm_out.weight, conv_norm_out.bias, conv_out.weight, conv_out.bias']
Loading engine from file ./pretrained_weights/tensorrt/unet_work.engine...
[EngineModel] Detected dynamic shape for d00: (1, -1, 320) -> Using Max Profile: (1, 16384, 320)
[EngineModel] Detected dynamic shape for d01: (1, -1, 320) -> Using Max Profile: (1, 16384, 320)
[EngineModel] Detected dynamic shape for d10: (1, -1, 640) -> Using Max Profile: (1, 4096, 640)
[EngineModel] Detected dynamic shape for d11: (1, -1, 640) -> Using Max Profile: (1, 4096, 640)
[EngineModel] Detected dynamic shape for d20: (1, -1, 1280) -> Using Max Profile: (1, 1024, 1280)
[EngineModel] Detected dynamic shape for d21: (1, -1, 1280) -> Using Max Profile: (1, 1024, 1280)
[EngineModel] Detected dynamic shape for m: (1, -1, 1280) -> Using Max Profile: (1, 256, 1280)
[EngineModel] Detected dynamic shape for u10: (1, -1, 1280) -> Using Max Profile: (1, 1024, 1280)
[EngineModel] Detected dynamic shape for u11: (1, -1, 1280) -> Using Max Profile: (1, 1024, 1280)
[EngineModel] Detected dynamic shape for u12: (1, -1, 1280) -> Using Max Profile: (1, 1024, 1280)
[EngineModel] Detected dynamic shape for u20: (1, -1, 640) -> Using Max Profile: (1, 4096, 640)
[EngineModel] Detected dynamic shape for u21: (1, -1, 640) -> Using Max Profile: (1, 4096, 640)
[EngineModel] Detected dynamic shape for u22: (1, -1, 640) -> Using Max Profile: (1, 4096, 640)
[EngineModel] Detected dynamic shape for u30: (1, -1, 320) -> Using Max Profile: (1, 16384, 320)
[EngineModel] Detected dynamic shape for u31: (1, -1, 320) -> Using Max Profile: (1, 16384, 320)
[EngineModel] Detected dynamic shape for u32: (1, -1, 320) -> Using Max Profile: (1, 16384, 320)
Failed to enable xformers: Refer to https://github.com/facebookresearch/xformers for more information on how to install xformers
D:\PersonaLive\inference_online.py:240: DeprecationWarning:
on_event is deprecated, use lifespan event handlers instead.

    Read more about it in the
    [FastAPI docs for Lifespan Events](https://fastapi.tiangolo.com/advanced/events/).

@self.app.on_event("shutdown")
init done
INFO: Started server process [15440]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:7860 (Press CTRL+C to quit)

I get this output:

[screenshot]

alternating with this:

[screenshot]

every second or so. Any ideas?

Update: I got it running on the RTX 5060 Ti with `python inference_online.py --acceleration tensorrt`, but it is too slow for any real work.

[screenshot]

Summary of Fixes for RTX 5060 Ti (PersonaLive TensorRT)
Root Cause
PyTorch's ONNX tracer was forcing FP32 during autocast blocks while model weights were FP16, causing dtype mismatches at BatchNorm and Conv layers.
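Not from the thread itself, just a minimal sketch of the failure mode described above (the layer and shapes are arbitrary): an FP16-weight layer fed an FP32 tensor raises the classic FloatTensor/HalfTensor mismatch at call time.

```python
import torch

# FP16 weights, as the exported model had...
conv = torch.nn.Conv2d(3, 8, kernel_size=3).half()
x = torch.randn(1, 3, 16, 16)  # ...but the tracer kept this input FP32

try:
    conv(x)
except RuntimeError as e:
    print(f"dtype mismatch: {e}")
```

Forcing the model and every floating-point input to the same dtype before tracing, as the fixes below do, removes this mismatch.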

Files Modified (5 files)

| File | Change |
| --- | --- |
| torch2trt.py | Force model to FP16 + eval mode; set `auto_cast=False` in the export call |
| onnx_export.py | Autocast manager set to `torch.float16`; cast only floating-point tensors |
| framed_models.py | Removed `torch.amp.autocast` block (no longer needed) |
| FAN_feature_extractor.py | Removed `x = x.float()` from `FAN_SA.forward` |
| webcam/util.py | Fixed `array_to_image` to handle FP16 value ranges ([-1, 1] → [0, 255]) |
Key Code Changes

1. torch2trt.py

```python
model = model.to(device=device, dtype=torch.float16)
model.eval()
export_onnx(..., auto_cast=False, dtype=torch.float16)
```
2. onnx_export.py

```python
with torch.autocast("cuda", dtype=torch.float16):  # was missing dtype
```

and

```python
if torch.is_floating_point(tensor):
    inputs[key] = tensor.to(dtype=dtype)
```
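A standalone sketch of that pattern (the function name and input keys here are mine, not PersonaLive's): cast only floating-point tensors so integer inputs such as timestep ids keep their dtype.

```python
import torch

def cast_floating_inputs(inputs, dtype=torch.float16):
    """Cast floating-point tensors to `dtype`; leave integer tensors untouched."""
    return {
        key: tensor.to(dtype=dtype) if torch.is_floating_point(tensor) else tensor
        for key, tensor in inputs.items()
    }

inputs = {"latents": torch.randn(1, 4), "timestep": torch.tensor([10])}
casted = cast_floating_inputs(inputs)
print(casted["latents"].dtype, casted["timestep"].dtype)  # torch.float16 torch.int64
```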
3. framed_models.py

REMOVED this entire block:

```python
with torch.amp.autocast('cuda', dtype=torch.float16):
    new_motion_hidden_states = self.motion_encoder(motion)
```

CHANGED to:

```python
new_motion_hidden_states = self.motion_encoder(motion)
```
4. FAN_feature_extractor.py

REMOVED:

```python
x = x.float()
```
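For context (my own toy tensors, not the FAN code): an unconditional `.float()` silently upcasts, so everything downstream of it sees FP32 even when the rest of the pipeline runs FP16, which is exactly the mismatch removed here.

```python
import torch

x = torch.randn(2, 3, dtype=torch.float16)  # FP16 activations from the pipeline
y = x.float()                               # unconditional upcast to FP32
print(x.dtype, y.dtype)  # torch.float16 torch.float32
```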

5. webcam/util.py

```python
def array_to_image(image_array, normalize=True):
    # Check the [0, 1] range first so it isn't swallowed by the wider [-1, 1] branch
    if image_array.min() >= 0.0 and image_array.max() <= 1.0:
        image_array = image_array * 255.0
    elif image_array.min() >= -1.0 and image_array.max() <= 1.0:
        image_array = (image_array + 1.0) / 2.0 * 255.0
    image_array = np.clip(image_array, 0, 255).astype(np.uint8)
    return Image.fromarray(image_array)
```
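To illustrate the [-1, 1] → [0, 255] mapping with throwaway values (not actual PersonaLive output):

```python
import numpy as np

x = np.array([-1.0, 0.0, 1.0], dtype=np.float16)  # FP16 model output in [-1, 1]
y = np.clip((x.astype(np.float32) + 1.0) / 2.0 * 255.0, 0, 255).astype(np.uint8)
print(y)  # [  0 127 255]
```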
Environment Requirement
CUDA Toolkit 12.8.2 (provides nvcc.exe)

Result

| Before | After |
| --- | --- |
| Build failed: FloatTensor vs HalfTensor | ✅ Build succeeds |
| TensorRT engine would not build | ✅ Engine builds in ~6 minutes |
| Output pixelated or black | ✅ Normal output |
Commands to Run

```powershell
conda activate personalive
python torch2trt.py
python inference_online.py --acceleration tensorrt
```

I hope this helps someone.
