Hunyuan Video Framepack #11428
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thank you @a-r-r-o-w. For some reason I'm getting an error.
Replaced … with … and … with …
Also, bitsandbytes seems to crash a 4090, or I'm doing something wrong:
import torch
from diffusers import BitsAndBytesConfig, HunyuanVideoFramepackPipeline, HunyuanVideoFramepackTransformer3DModel
from diffusers.utils import export_to_video, load_image
from transformers import SiglipImageProcessor, SiglipVisionModel
nf4_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = HunyuanVideoFramepackTransformer3DModel.from_pretrained(
"lllyasviel/FramePackI2V_HY",
quantization_config=nf4_config,
torch_dtype=torch.bfloat16,
)
feature_extractor = SiglipImageProcessor.from_pretrained(
"lllyasviel/flux_redux_bfl", subfolder="feature_extractor"
)
image_encoder = SiglipVisionModel.from_pretrained(
"lllyasviel/flux_redux_bfl", subfolder="image_encoder", torch_dtype=torch.float16
)
pipe = HunyuanVideoFramepackPipeline.from_pretrained(
"hunyuanvideo-community/HunyuanVideo",
transformer=transformer,
feature_extractor=feature_extractor,
image_encoder=image_encoder,
torch_dtype=torch.float16,
)
#pipe.vae.enable_tiling()
#pipe.enable_model_cpu_offload()
pipe.to("cuda")
image = load_image("https://i.ibb.co/35CWK8rv/pinguin.png")
output = pipe(
image=image,
prompt="A penguin dancing in the snow",
height=832,
width=480,
num_frames=91,
num_inference_steps=30,
guidance_scale=9.0,
generator=torch.Generator().manual_seed(0),
).frames[0]
export_to_video(output, "output.mp4", fps=30)
@nitinmukesh The latest commit should fix the problem you're facing with …
@tin2tin Tested your code with the latest commit and it seems to run for both full-CUDA and CPU offload. Could you try again? (attached: output.mp4)
@nitinmukesh Unable to get the sequential offloading to work at the moment. It fails in the output projection layer of attention in SigLIP's pooling head:
Reproducer:
import torch
from diffusers import HunyuanVideoFramepackPipeline, HunyuanVideoFramepackTransformer3DModel
from diffusers.utils import export_to_video, load_image
from transformers import SiglipImageProcessor, SiglipVisionModel
transformer = HunyuanVideoFramepackTransformer3DModel.from_pretrained("lllyasviel/FramePackI2V_HY", torch_dtype=torch.bfloat16)
feature_extractor = SiglipImageProcessor.from_pretrained("lllyasviel/flux_redux_bfl", subfolder="feature_extractor")
image_encoder = SiglipVisionModel.from_pretrained("lllyasviel/flux_redux_bfl", subfolder="image_encoder", torch_dtype=torch.float16)
pipe = HunyuanVideoFramepackPipeline.from_pretrained("hunyuanvideo-community/HunyuanVideo", transformer=transformer, feature_extractor=feature_extractor, image_encoder=image_encoder, torch_dtype=torch.float16)
pipe.enable_sequential_cpu_offload()
image = load_image("inputs/penguin.png")
output = pipe(
image=image,
prompt="A penguin dancing in the snow",
height=832,
width=480,
num_frames=31,
num_inference_steps=2,
guidance_scale=9.0,
generator=torch.Generator().manual_seed(0),
).frames[0]
export_to_video(output, "output.mp4", fps=30)
Stack trace: …
I'm not sure what to look into specifically, since it could be an unhandled case in accelerate, or the layer implementation may not be compatible with sequential offloading. Sequential offloading requires the forward method to be invoked for the pre-forward hook to move the weights to the correct device, but, as can be seen in the torch implementation (src/diffusers/models/embeddings.py, line 1689 at 58431f1), …
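To make the failure mode concrete, here is a minimal, self-contained illustration (not the actual accelerate or diffusers code; the hook mechanics are deliberately simplified) of why a layer whose weights are read directly, instead of through its forward call, breaks sequential offloading:
import torch
import torch.nn as nn
import torch.nn.functional as F

def attach_offload_hook(module: nn.Module, device: torch.device):
    # Simplified stand-in for accelerate's sequential offloading: weights are
    # moved to the GPU right before forward and back to the CPU right after.
    original_forward = module.forward
    def hooked_forward(*args, **kwargs):
        module.to(device)
        out = original_forward(*args, **kwargs)
        module.to("cpu")
        return out
    module.forward = hooked_forward

proj = nn.Linear(8, 8)  # stand-in for the attention output projection
attach_offload_hook(proj, torch.device("cuda"))

x = torch.randn(1, 8, device="cuda")
y = proj(x)  # fine: forward is invoked, so the hook moves the weights

# Failure mode: the weight is used directly without calling proj.forward,
# so the hook never fires and the weight is still on the CPU.
y = F.linear(x, proj.weight, proj.bias)  # raises a device-mismatch error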
cc @SunMarc
It seems to be doing the inference with tiling and CPU offload. I forgot to change the output path to something I can find on my computer, so I need to rerun it.
Would recommend installing and using imageio instead of opencv, since the latter is planned to be deprecated soon.
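As a reference, a minimal sketch of writing the frames with imageio directly (assumes imageio and imageio-ffmpeg are installed, and that output is the list of PIL images returned by the pipeline in the examples above):
import imageio
import numpy as np

# `output` is assumed to be the list of PIL images produced by the pipeline.
with imageio.get_writer("output.mp4", fps=30) as writer:
    for frame in output:
        writer.append_data(np.asarray(frame))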
Yes, it is working here too: 8 minutes and 51 seconds. Thank you! (attached: output.mp4)
The FLF2V example is producing a video with bitsandbytes too (loading: 16 GB, inference: 12 GB, 4 minutes and 43 seconds):
import torch
from diffusers import (
BitsAndBytesConfig,
HunyuanVideoFramepackPipeline,
HunyuanVideoFramepackTransformer3DModel,
)
from diffusers.utils import export_to_video, load_image
from transformers import SiglipImageProcessor, SiglipVisionModel
nf4_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = HunyuanVideoFramepackTransformer3DModel.from_pretrained(
"lllyasviel/FramePackI2V_HY",
quantization_config=nf4_config,
torch_dtype=torch.bfloat16,
)
feature_extractor = SiglipImageProcessor.from_pretrained(
"lllyasviel/flux_redux_bfl", subfolder="feature_extractor",
)
image_encoder = SiglipVisionModel.from_pretrained(
"lllyasviel/flux_redux_bfl", subfolder="image_encoder", torch_dtype=torch.float16
)
pipe = HunyuanVideoFramepackPipeline.from_pretrained(
"hunyuanvideo-community/HunyuanVideo",
transformer=transformer,
feature_extractor=feature_extractor,
image_encoder=image_encoder,
torch_dtype=torch.float16,
)
pipe.vae.enable_tiling()
pipe.enable_model_cpu_offload()
prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."
first_image = load_image(
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png"
)
last_image = load_image(
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png"
)
output = pipe(
image=first_image,
last_image=last_image,
prompt=prompt,
height=512,
width=512,
num_frames=91,
num_inference_steps=30,
guidance_scale=9.0,
generator=torch.Generator().manual_seed(0),
).frames[0]
export_to_video(output, "C:/Users/peter/Downloads/output.mp4", fps=30)
(attached: output.mp4)
Thank you for supporting model offload.
@a-r-r-o-w Tried to improve the image and prompt for testing, in case you need it later for a feature demonstration: (attached: bird_output.mp4) And a little prince: (attached: prince_output.mp4)
@a-r-r-o-w Apparently, it is also possible to implement t2v (lllyasviel/FramePack#266) and Hunyuan LoRA support (https://github.com/colinurbs/FramePack-Studio/blob/main/diffusers_helper/lora_utils.py), but maybe that would be better for later patches?
@tin2tin The T2V addition sounds good! I'll add that in a follow-up PR once this is merged. LoRA should work already, I think.
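For anyone who wants to try a LoRA right away, a minimal sketch, assuming the Framepack pipeline exposes the standard diffusers LoRA-loading interface and reusing the pipe object from the examples above (the repository id and weight file name below are placeholders, not a real checkpoint):
# Placeholder repo id and file name: substitute a real HunyuanVideo LoRA checkpoint.
pipe.load_lora_weights(
    "your-username/your-hunyuan-lora",
    weight_name="pytorch_lora_weights.safetensors",
    adapter_name="hunyuan_lora",
)
pipe.set_adapters(["hunyuan_lora"], adapter_weights=[0.8])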
thanks! looking great!
I left one comment on the transformer.
(Resolved review threads on src/diffusers/models/transformers/transformer_hunyuan_video_framepack.py, one marked as outdated.)
FramePack F1 released: …
@yiyixuxu Addressed review comments. Can you take another look?
thanks!
(Resolved review thread on src/diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video_framepack.py.)
Are the memory handling and speed improvements of FramePack included in this patch? I noticed huge speed differences between running FramePack standalone (roughly 1 minute per second of video) and via diffusers (roughly 10 minutes per second of video) on a 4090.
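To compare the two setups on equal terms, a small timing sketch around just the pipeline call (this reuses the pipe, image, and call arguments from the I2V example above; it is only a suggestion and not part of the PR):
import time

start = time.perf_counter()
output = pipe(
    image=image,
    prompt="A penguin dancing in the snow",
    height=832,
    width=480,
    num_frames=91,
    num_inference_steps=30,
    guidance_scale=9.0,
    generator=torch.Generator().manual_seed(0),
).frames[0]
elapsed = time.perf_counter() - start
seconds_of_video = len(output) / 30  # frames saved at 30 fps
print(f"{elapsed:.1f} s total, {elapsed / seconds_of_video:.1f} s per second of video")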
(Resolved review thread on src/diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video_framepack.py.)
@tin2tin 1 sec of video as in 31 frames generated and saved at 30 fps? I can't seem to reproduce it taking 10 minutes on a 4090.
Will test some more.
This PR adds support for Framepack: https://github.com/lllyasviel/FramePack
I2V example (attached: output.mp4)
FLF2V example (credits: lllyasviel/FramePack#167; attached: output.mp4)
TODO: