Note: This operation is very VRAM-intensive. When generating a short video, once the ControlNet computations were loaded it used around 16 GB of VRAM. If your VRAM is insufficient, it is recommended to use the ComfyUI method instead for better results.
A1111 with AnimateDiff
The installation process here is similar to ComfyUI. Please install the following extensions:
Additionally, it is highly recommended to install FreeU:
AnimateDiff Models
You can download the motion models and motion LoRA from here:
Additionally, here are a few recommended motion models for generating animations:
These model files need to be placed in this folder:
extensions/sd-webui-animatediff/model
If you download a motion LoRA, it should be placed here:
extensions/sd-webui-animatediff/model/Lora
Once downloaded, if you need to use ControlNet, please put the required files in this location:
models/ControlNet
Of course, your main model needs to be placed in models/Stable-diffusion, and if you have a VAE, it should be placed in models/VAE.
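Putting the paths above together, the relevant folders relative to your WebUI root look roughly like this:

```
<your WebUI root>/
├── extensions/
│   └── sd-webui-animatediff/
│       └── model/           # motion models
│           └── Lora/        # motion LoRA
└── models/
    ├── ControlNet/          # ControlNet models
    ├── Stable-diffusion/    # main checkpoint
    └── VAE/                 # VAE (optional)
```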
AnimateDiff Interface Overview
After completing all installations, you will have an AnimateDiff user interface.
- Enable AnimateDiff: Check this to enable the AnimateDiff feature.
- Motion module: Select the motion model here.
- Save format: Choose the output format. You can also check TXT if you want to save the text information.
- Number of frames: Default is 0. When using Video source or Video path, it is calculated automatically; otherwise it is derived from Context batch size. Avoid using a number close to Context batch size, as mentioned in issue #213.
- FPS: Frames per second.
- Display loop number: If outputting as GIF, this indicates how many times it should loop. Default is 0 for continuous looping.
- Context batch size: How many frames AnimateDiff processes at once. The default of 16 is a good value. Note that different motion models may limit the maximum value of this number.
- Closed loop: Attempts to create a looping animation. There are four modes, and it has no effect when Number of frames is less than or equal to Context batch size.
  - N: No closed loop. The only option that takes effect when Number of frames (other than 0) is smaller than Context batch size.
  - R-P: Reduces the number of looping contexts and does not use Prompt Travel for interpolation.
  - R+P: Reduces the number of looping contexts and uses Prompt Travel for interpolation.
  - A: Connects the frames at the beginning to the frames at the end to create a looping animation, using Prompt Travel for interpolation.
- Stride: Hard to explain briefly; it is mainly used to keep AnimateDiff time-synchronized across frames. Default is 1. You can refer to the original author's explanation in WebUI Parameters. As in ComfyUI, this setting seems to have no effect in vid2vid.
- Overlap: How many frames are kept as overlap between neighboring contexts when AnimateDiff processes the images. Default is -1, which uses Context batch size / 4. This setting only takes effect when Number of frames is greater than Context batch size, or Number of frames is 0 (see the sketch right after this list).
- Frame Interpolation: Default is OFF. If set to FILM, Deforum is used to interpolate video frames after AnimateDiff finishes.
- Interp X: When Frame Interpolation = FILM, X output frames are interpolated from each input frame. This stretches the whole video, so you may need to speed it up after output.
- Video Source: Upload the video here.
- Video Path: If no video is uploaded, you can enter the path of the video here. You need to place the frames of the video in this folder.
- Move motion module to CPU (default if lowvram): If VRAM is low, move the motion model to system memory.
- Remove motion module from any memory: Remove the motion model from all memory.
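To get a feel for how Context batch size and Overlap interact, here is a rough sketch in Python. It is only a mental model built from the descriptions above, not the extension's actual scheduling code:

```python
# Rough mental model of how frames are split into overlapping context windows.
# Illustration only, NOT the sd-webui-animatediff extension's real scheduler.
def context_windows(number_of_frames, context_batch_size=16, overlap=-1):
    if overlap == -1:                    # default: Context batch size / 4
        overlap = context_batch_size // 4
    step = context_batch_size - overlap  # how far each window advances
    windows = []
    start = 0
    while start < number_of_frames:
        end = min(start + context_batch_size, number_of_frames)
        windows.append((start, end))
        start += step
    return windows

# 50 frames with the defaults: windows of 16 frames that advance 12 frames at a time.
print(context_windows(50))
# [(0, 16), (12, 28), (24, 40), (36, 50), (48, 50)]
```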
After uploading the video, some of the above parameters will be automatically populated.
ControlNet
Next, we need to prepare two ControlNets for use:
- OpenPose
- IPAdapter
Here, I am using IPAdapter with the ip-adapter-plus_sd15 model. The weight is set to 0.7 to avoid excessive interference with the output.
Additionally, I prepared the same number of OpenPose skeleton images as there are frames in the uploaded video and placed them in the /output/openpose folder for this ControlNet to read.
If you don't have a tool to output a skeleton map, you can use the ControlNet preprocessor to help you. However, the prerequisite is that you still need to export all frames of the video as images.
You can use a tool like FFmpeg to export all frames of the video as images, place them in the Batch input folder, and then use the DWPose or a related OpenPose preprocessor for processing.
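For example, a command along these lines will dump every frame of the source video as a numbered PNG (adjust the input file name and output folder to your own setup):

ffmpeg -i input.mp4 output/frame_%05d.png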
In my case, I'm using a skeleton map directly here to save some VRAM and time since preprocessing also takes a while.
Configuring the Text2Image
Pay attention to the Batch size setting; let me explain a bit here:

- A1111 differs from ComfyUI here: ComfyUI treats the entire process as one Batch Count, so in ComfyUI you need to set the Batch size equal to the number of frames read, or the animation will not be smooth.
- In A1111, it is based on the Number of frames read by the AnimateDiff plugin and on the ControlNet OpenPose source you prepared. In this article's example, that gives you 50 drawing steps.
- If your Batch size / Batch Count are set to 1, it means all T2I work is done only 50 times.
- For the video in this article, with Batch size set to 4, the T2I process generates 50 x 4 = 200 images in total. Because AnimateDiff has modified the i2i batch, AnimateDiff will use these 200 images for its computation in the image-processing step (whether this is the actual behavior would require a careful read of the source code; for now it's my personal speculation).
- According to the official GitHub, changing Batch size currently has no effect; support may be added in the future. The official statement:
You do not need to change batch size at all when you are using this extension.
We are currently developing an approach to support batch size on WebUI in the near future.
Keep the Seed set to a fixed value, because you wouldn't want to output images with completely different styles.
ADetailer
Finally, we use ADetailer to fix the face.
I reduced ADetailer's Inpaint denoising strength to 0.25 to ensure the face is not excessively redrawn, which would cause flickering.
Generation
As for prompts: since we're using IPAdapter, we can take the IPAdapter reference image and use PNG Info to recover its prompts, then adjust them from there.
Finally, let's take a look at the output results.
These three videos were processed separately, and the output FPS is set to 16. The output settings differ as follows:

- Frame Interpolation = NO, Batch Size = 1
- Frame Interpolation = FILM, Batch Size = 4
- Frame Interpolation = FILM, Batch Size = 4, Stride = 4
You will notice that when using a higher Batch Size, the overall smoothness of the animation significantly improves.
Batch Size
When you are not using vid2vid but generating a GIF directly from text, your Batch Size determines the number of frames in the GIF (not Batch Count). This is also explained in the Batch Size section of the author's documentation.
If you are using ControlV2V, moderately increasing Batch Size will give good results. You can experiment with it.
Frame Interpolation = FILM / Interp X
With Frame Interpolation = FILM enabled, I used the default value of Interp X = 10 for this article's data. The original video's Number of frames is 50. In the end, the generated GIF contains a total of 491 frames.
In other words, with this option enabled, roughly 10 frames are interpolated for each frame of the original video. As a result, the final output GIF becomes a slow-motion video.
The actual code operation is as follows:
film_in_between_frames_count = calculate_frames_to_add(len(frame_list), params.interp_x)
So, the original video of about 3.2 seconds is extended to approximately 30.2 seconds after this interpolation process. You will need to use other tools to shorten the entire video length back to the original duration.
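As a sanity check on the numbers above, the frame count and durations work out roughly like this. The "insert (Interp X - 1) frames between each pair of frames" formula is my assumption, not the plugin's actual calculate_frames_to_add implementation:

```python
# Back-of-the-envelope check of the FILM numbers reported in this article.
frames = 50      # Number of frames read from the source video
interp_x = 10    # Frame Interpolation = FILM, Interp X = 10
fps = 16

total_frames = frames + (frames - 1) * (interp_x - 1)  # 50 + 49 * 9 = 491
print(total_frames)            # 491 frames, matching the GIF above
print(frames / fps)            # ~3.1 s source duration
print(total_frames / fps)      # ~30.7 s interpolated duration
print(frames / total_frames)   # ~0.1, the setpts factor used below to speed it back up
```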
If you have FFmpeg installed, you can use setpts to speed up the video. In the case of this article's video, this command can be used to shorten it to the same speed as the original:
ffmpeg -i input.mp4 -filter:v "setpts=0.1*PTS" output.mp4
Additional Settings
If you encounter issues when running with --xformers, you can adjust the AnimateDiff settings; switching to sdp may solve the problem.
Additionally, because vid2vid runs through the t2i pipeline here, enabling this setting may fix problems with image generation, and it also brings some performance optimization.
Conclusion
The operation of AnimateDiff in A1111 is not significantly different from ComfyUI. The only difference is that A1111 has packaged the intermediate connections, saving some time.
For friends with tight VRAM, it's still better to use ComfyUI. Otherwise, you can reduce the length of the video (total frames) each time and then use video editing software to concatenate the files at the end.