Understanding Tasks in Diffusers: Part 1
by: Aritra Roy Gosthipaty and Ritwik Raha



In this tutorial, you will learn the basics of different tasks and the corresponding models and pipelines inside Hugging Face Diffusers.

This lesson is the 1st of a 3-part series on Understanding Tasks in Diffusers:

  1. Understanding Tasks in Diffusers: Part 1 (this tutorial)
  2. Understanding Tasks in Diffusers: Part 2
  3. Understanding Tasks in Diffusers: Part 3

To learn how to make the most of your diffusion pipeline and work with the latest models, just keep reading.


Understanding Tasks in Diffusers: Part 1

Welcome to your guide for image generation with Diffusers, a state-of-the-art toolkit for exploring and harnessing the power of diffusion models. In this tutorial, we will look at tasks inside the Hugging Face Diffusers Library.

What will we cover?

  • Unconditional Image Generation
  • Text-to-Image Generation
  • Image-to-Image Generation

In a previous tutorial, we looked at how to get started with the diffusers library and understand pipelines, models, and schedulers. Here, we will make use of what we previously learned and understand some fundamental tasks in the Diffusers library.


Configuring Your Development Environment

To follow this guide, you need to have the diffusers and accelerate libraries installed on your system.

Luckily, diffusers is pip-installable:

$ pip install diffusers
$ pip install accelerate

Setup and Imports

A pipeline is an end-to-end class that provides a quick and easy way to use a diffusion system for inference by bundling independently trained models and schedulers together.

Hugging Face: Diffusers Overview

With diffusers and accelerate installed, we import the necessary libraries for the pipelines.

This includes the following:

  • torch: be sure you have torch>2.0 installed
  • DiffusionPipeline, AutoPipelineForText2Image, and AutoPipelineForImage2Image: from the diffusers library
  • load_image and make_image_grid: as image utilities

import torch
from diffusers import (
    DiffusionPipeline,
    AutoPipelineForText2Image,
    AutoPipelineForImage2Image,
)
from diffusers.utils import load_image, make_image_grid

# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# A seeded Generator makes the sampling in our pipelines reproducible.
generator = torch.Generator(device).manual_seed(31)

We set the device to “cuda” if a CUDA-capable GPU is available and fall back to “cpu” otherwise.

A torch Generator object enables reproducibility in a pipeline by setting a manual seed.
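
As a minimal sketch of what reproducibility means here: two generators seeded with the same value should produce identical images on the same hardware and library versions. The pipeline referenced in the comments is any of the pipelines we load later in this tutorial.

# Minimal sketch: identically seeded generators give reproducible results.
gen_a = torch.Generator(device).manual_seed(31)
gen_b = torch.Generator(device).manual_seed(31)

# Assuming `pipeline` is any of the pipelines loaded later in this tutorial,
# these two calls should produce pixel-identical images on the same setup:
# image_a = pipeline(generator=gen_a).images[0]
# image_b = pipeline(generator=gen_b).images[0]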


Unconditional Image Generation

Imagine conjuring up images out of thin air without even a single word as a guide. That’s unconditional image generation, a technique in which diffusion models paint pictures based solely on what they learned from their training data.

Unconditional image generation generates images that resemble a random sample from the images in the training data. This is because the denoising process in the diffusion model is not guided by a text prompt or any other parameters.

Here, we use the DiffusionPipeline class to load the google/ddpm-celebahq-256 model, which was trained on images of celebrity faces.

model_id = "google/ddpm-celebahq-256"
pipeline = DiffusionPipeline.from_pretrained(model_id).to(device)
images = pipeline(generator=generator).images

We use the make_image_grid utility function to show the generated image.

make_image_grid(images, rows=1, cols=1)
Figure 1: A generated celebrity face (source: image by the author).
pipeline.to("cpu")
del pipeline
torch.cuda.empty_cache()

In the lines above, we move the pipeline to the CPU and then delete it. This frees GPU memory so that other operations can run and other models can be loaded as required.
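
Because we repeat this offload-and-delete pattern after every pipeline in this tutorial, you could wrap it in a small helper. The free_pipeline function below is a hypothetical convenience of ours, not part of the diffusers API; for clarity, we will keep writing the three lines out explicitly in the rest of the tutorial.

import gc

def free_pipeline(pipe):
    # Move the weights off the GPU and release cached GPU memory.
    # The caller should still drop its own reference (del pipeline) so that
    # Python can garbage-collect the model weights.
    pipe.to("cpu")
    gc.collect()
    torch.cuda.empty_cache()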


Text-to-Image Generation

When you think of diffusion models, text-to-image is usually one of the first things that come to mind. A text-to-image model generates an image based on the data it was trained on while being guided by a text prompt.

Here, we use the AutoPipelineForText2Image to load the stabilityai/stable-diffusion-xl-base-1.0 model.

Stable Diffusion XL (SDXL) is a much larger variant of the Stable Diffusion models. It uses a two-stage process that makes the generated image even more detailed. One can even make use of micro-conditionings within the model to generate fine-grained images.
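
As a hedged sketch of what those micro-conditionings look like: at the time of writing, the SDXL text-to-image pipeline accepts size-related arguments such as original_size, target_size, and crops_coords_top_left. The call below assumes the SDXL pipeline, prompt, and generator defined in the next code blocks, and the values are purely illustrative.

# Hedged sketch: micro-conditioning an SDXL text-to-image call.
# original_size / target_size hint at the apparent resolution of the output,
# and crops_coords_top_left = (0, 0) asks for a well-centered composition.
images_conditioned = pipeline(
    prompt,
    original_size=(1024, 1024),
    target_size=(1024, 1024),
    crops_coords_top_left=(0, 0),
    generator=generator,
).images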

Let us know if you are interested in a complete overview of Stable Diffusion models.

model_id = "stabilityai/stable-diffusion-xl-base-1.0"
pipeline = AutoPipelineForText2Image.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    variant="fp16"
).to(device)

We can now pass our prompt and the generator we created during setup to the pipeline we just defined.

prompt = "Ernest Hemingway in watercolour, cold color palette, muted colors, detailed, 8k"
images = pipeline(
    prompt,
    generator=generator
).images

Finally, once the generation is complete, visualize the image using the make_image_grid utility.

make_image_grid(images, rows=1, cols=1)
Figure 2: Ernest Hemingway generated from a text; he would have been proud! (source: image by the author).
pipeline.to("cpu")
del pipeline
torch.cuda.empty_cache()

Again, we offload the pipeline and delete it to preserve resources.


Specifying Parameters

It is often helpful to specify additional parameters that let us control exactly how the generated images turn out.

  • height: indicating the height of the generated image
  • width: indicating the width of the generated image
  • guidance_scale: indicating how much the prompt influences image generation; a lower guidance_scale gives the model more creative freedom, while a higher guidance_scale makes it follow the prompt to a ‘T’
  • negative_prompt: indicates objects or visuals to steer away from while generating

Here, we load the runwayml/stable-diffusion-v1-5 model and pass these parameters to the pipeline.
model_id = "runwayml/stable-diffusion-v1-5"
pipeline = AutoPipelineForText2Image.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    variant="fp16",
).to(device)
prompt = "Ernest Hemingway in watercolor, warm color palette, few colors, detailed, 8k"
negative_prompt = "ugly, deformed, disfigured, poor details, bad anatomy"
images = pipeline(
    prompt,
    height=512,
    width=512,
    guidance_scale=7,
    negative_prompt=negative_prompt,
    generator=generator,
).images

We visualize the image after it has finished generation.

make_image_grid(images, rows=1, cols=1)
Figure 3: Same image but with a different prompt and parameters (source: image by the author).
pipeline.to("cpu")
del pipeline
torch.cuda.empty_cache()

Image-to-Image Generation

Image-to-image is another application similar to text-to-image, but we can also pass in an input image in addition to the prompt.

The initial image is encoded to latent space, and noise is added to it.

The latent diffusion model starts with this “noisy” version of the image. It analyzes the noisy latents and the provided prompt (if any) to understand what the final image should look like. Based on this understanding, the model predicts the specific “noise” that was added during the forward (noising) process. It then subtracts this predicted noise from the noisy latents, giving us a cleaner, more defined latent representation.

Finally, a decoder transforms the new latent representation back into an image.
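
The function below is a conceptual sketch of that encode, noise, denoise, decode loop. It is not the pipeline’s exact internals, and it assumes you already have the vae, unet, scheduler, and text embeddings that an image-to-image pipeline bundles together (classifier-free guidance is omitted for brevity).

def img2img_sketch(vae, unet, scheduler, image_tensor, text_embeddings,
                   strength=0.5, num_inference_steps=50):
    # 1. Encode the initial image into latent space.
    latents = vae.encode(image_tensor).latent_dist.sample()
    latents = latents * vae.config.scaling_factor

    # 2. Add noise up to a timestep determined by `strength`.
    scheduler.set_timesteps(num_inference_steps)
    start = max(int(num_inference_steps * strength), 1)
    timesteps = scheduler.timesteps[-start:]
    noise = torch.randn_like(latents)
    latents = scheduler.add_noise(latents, noise, timesteps[:1])

    # 3. Denoise step by step, guided by the text embeddings.
    for t in timesteps:
        noise_pred = unet(latents, t, encoder_hidden_states=text_embeddings).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample

    # 4. Decode the final latents back into an image tensor.
    return vae.decode(latents / vae.config.scaling_factor).sample

In practice, AutoPipelineForImage2Image handles all of this for us. Here, we load the kandinsky-community/kandinsky-2-2-decoder model.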

model_id = "kandinsky-community/kandinsky-2-2-decoder"
pipeline = AutoPipelineForImage2Image.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
).to(device)

Here, we fetch the initial image from a URL using the load_image utility and then pass both the image and the prompt to the pipeline.

init_image = load_image("https://i.imgur.com/gq6q4CJ.jpg")
prompt = "animated Mona Lisa, detailed, fantasy, cute, adorable, Pixar, Disney, 8k"
images = pipeline(
    prompt,
    image=init_image
).images
make_image_grid([init_image, images[0]], rows=1, cols=2)
Figure 4: A Pixar-style Mona Lisa generated from the prompt (source: image by the author).
pipeline.to("cpu")
del pipeline
torch.cuda.empty_cache()

Stable Diffusion XL (SDXL) Model

Let us see how the same image turns out with an SDXL model with similar parameters.

TL;DR

  • A Stable Diffusion model generates an image
  • It does this by iteratively adding noise to an image and then removing noise to regain the original image
  • In the process, it learns how to remove noise
  • The model that learns this can be conditioned on text
  • It can also be conditioned on other types of images, like depth maps, image masks, and sometimes just patches of colors.

SDXL is a larger and more powerful version of Stable Diffusion v1.5. It can follow a two-stage process: the base model generates an image, and a refiner model takes that image and further enhances its details and quality, as sketched below.
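
Below is a hedged sketch of that base-plus-refiner handoff, adapted from the ensemble pattern in the diffusers documentation. Sharing text_encoder_2 and vae between the two pipelines is an optional memory-saving choice, and the 80/20 denoising split is just illustrative.

# Sketch: chain the SDXL base and refiner models (needs a GPU with enough VRAM).
base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to(device)
refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,  # reuse components to save memory
    vae=base.vae,
    torch_dtype=torch.float16,
    variant="fp16",
).to(device)

prompt = "Ernest Hemingway in watercolour, cold color palette, detailed, 8k"

# The base model handles the first 80% of the denoising steps and hands over
# latents; the refiner finishes the remaining 20% and decodes the image.
latents = base(
    prompt,
    num_inference_steps=40,
    denoising_end=0.8,
    output_type="latent",
).images
refined_image = refiner(
    prompt,
    num_inference_steps=40,
    denoising_start=0.8,
    image=latents,
).images[0]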

For the image-to-image task in this section, we only need the refiner: we load it using the AutoPipelineForImage2Image class and then pass in the prompt and the initial image as a starting point.

model_id = "stabilityai/stable-diffusion-xl-refiner-1.0"
pipeline = AutoPipelineForImage2Image.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    variant="fp16",
).to(device)
init_image = load_image("https://i.imgur.com/gq6q4CJ.jpg")
prompt = "animated Mona Lisa, detailed, fantasy, cute, adorable, Pixar, Disney, 8k"
images = pipeline(prompt, image=init_image, strength=0.5).images

As usual, we visualize the images after the generation is complete.

make_image_grid([init_image, images[0]], rows=1, cols=2)
Figure 5: Same prompt but with a different model (source: image by the author).
pipeline.to("cpu")
del pipeline
torch.cuda.empty_cache()

A Closer Look at Pipeline Parameters

As we saw in this tutorial, one of the factors determining the quality of generated images inside the diffusers library is the pipeline parameters. Let’s take a closer look at how these parameters affect image generation.

  • Strength: strength is one of the most important parameters. It determines how much the generated image resembles the initial image.
  • A higher strength value: gives the model more creative freedom; a value at or close to 1.0 means the initial image is essentially ignored.
  • A lower strength value: means the generated image is more similar to the initial image.

Note: A small caveat to note here is that the strength and num_inference_steps parameters are related. If num_inference_steps is 20 and strength is 0.6, this means adding 12 (20 * 0.6) steps of noise to the initial image and then denoising for 12 steps to get the final image (see the short sketch after this list).

  • Guidance Scale: The guidance_scale parameter is used to control how closely aligned the generated image and text prompt are. A higher guidance_scale value means your generated image is more aligned with the prompt, while a lower guidance_scale value means your generated image deviates more from the prompt.
  • Negative Prompt: A negative prompt conditions the model to not include things in an image, and it can be used to improve image quality or modify an image. We can also use this to be a safety guardrail for the model.
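
To make the note on strength and num_inference_steps concrete, here is the arithmetic as a short sketch (the actual pipeline may round slightly differently):

num_inference_steps = 20
strength = 0.6

# Number of noising steps added to the initial image, which is also the
# number of denoising steps run to produce the final image.
effective_steps = int(num_inference_steps * strength)
print(effective_steps)  # 12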

Let’s see how we can combine these parameters for the SDXL model.

model_id = "stabilityai/stable-diffusion-xl-refiner-1.0"
pipeline = AutoPipelineForImage2Image.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    variant="fp16",
).to(device)

Select and experiment with the parameters to see how they affect the generation.

init_image = load_image("https://i.imgur.com/gq6q4CJ.jpg")
prompt = "animated Mona Lisa, detailed, fantasy, cute, adorable, Pixar, Disney, 8k"
negative_prompt = "ugly, deformed, disfigured, poor details, bad anatomy"

images = pipeline(
    prompt,
    negative_prompt=negative_prompt,
    guidance_scale=8.0,
    strength=0.2,
    num_inference_steps=40,
    image=init_image
).images

Finally, visualize the image.

make_image_grid([init_image, images[0]], rows=1, cols=2)
Figure 6: Experiment with the prompt and parameters to see how the image changes (source: image by the author; Monalisa source: https://commons.wikimedia.org/wiki/Category:Mona_Lisa).

What's next? We recommend PyImageSearch University.

Course information:
84 total classes • 114+ hours of on-demand code walkthrough videos • Last updated: February 2024
★★★★★ 4.84 (128 Ratings) • 16,000+ Students Enrolled

I strongly believe that if you had the right teacher you could master computer vision and deep learning.

Do you think learning computer vision and deep learning has to be time-consuming, overwhelming, and complicated? Or has to involve complex mathematics and equations? Or requires a degree in computer science?

That’s not the case.

All you need to master computer vision and deep learning is for someone to explain things to you in simple, intuitive terms. And that’s exactly what I do. My mission is to change education and how complex Artificial Intelligence topics are taught.

If you're serious about learning computer vision, your next stop should be PyImageSearch University, the most comprehensive computer vision, deep learning, and OpenCV course online today. Here you’ll learn how to successfully and confidently apply computer vision to your work, research, and projects. Join me in computer vision mastery.

Inside PyImageSearch University you'll find:

  • 84 courses on essential computer vision, deep learning, and OpenCV topics
  • 84 Certificates of Completion
  • 114+ hours of on-demand video
  • Brand new courses released regularly, ensuring you can keep up with state-of-the-art techniques
  • Pre-configured Jupyter Notebooks in Google Colab
  • ✓ Run all code examples in your web browser — works on Windows, macOS, and Linux (no dev environment configuration required!)
  • ✓ Access to centralized code repos for all 536+ tutorials on PyImageSearch
  • Easy one-click downloads for code, datasets, pre-trained models, etc.
  • Access on mobile, laptop, desktop, etc.

Click here to join PyImageSearch University


Summary

We are at the point in history where human creativity is no longer limited by skill, perseverance, or generational talent but by imagination, vocabulary, and the power of computing. It is on us to learn how to utilize these tools to the best of our potential.

In this tutorial, we focused on how to leverage different pipelines for simple tasks in the Hugging Face diffusers library. The tasks involved are:

  • Unconditional Image Generation
  • Text-to-Image Generation
  • Image-to-Image Generation

The next two parts of this series will focus on how to do:

  • Image Inpainting
  • Depth to Image
  • ControlNet
  • Some advanced tasks

Did you find value in this tutorial, and do you want us to cover more tutorials on diffusion and stable diffusion in the future?

Let us know in the survey here.


Citation Information

A. R. Gosthipaty and R. Raha. “Understanding Tasks in Diffusers: Part 1,” PyImageSearch, P. Chugh, S. Huot, and K. Kidriavsteva, eds., 2024, https://pyimg.co/5azv1

@incollection{ARG-RR_2024_Understanding-Tasks-in-Diffusers-Part-1,
  author = {Aritra Roy Gosthipaty and Ritwik Raha},
  title = {Understanding Tasks in Diffusers: Part 1},
  booktitle = {PyImageSearch},
  editor = {Puneet Chugh and Susan Huot and Kseniia Kidriavsteva},
  year = {2024},
  url = {https://pyimg.co/5azv1},
}


Unleash the potential of computer vision with Roboflow - Free!

  • Step into the realm of the future by signing up or logging into your Roboflow account. Unlock a wealth of innovative dataset libraries and revolutionize your computer vision operations.
  • Jumpstart your journey by choosing from our broad array of datasets, or benefit from PyimageSearch’s comprehensive library, crafted to cater to a wide range of requirements.
  • Transfer your data to Roboflow in any of the 40+ compatible formats. Leverage cutting-edge model architectures for training, and deploy seamlessly across diverse platforms, including API, NVIDIA, browser, iOS, and beyond. Integrate our platform effortlessly with your applications or your favorite third-party tools.
  • Equip yourself with the ability to train a potent computer vision model in a mere afternoon. With a few images, you can import data from any source via API, annotate images using our superior cloud-hosted tool, kickstart model training with a single click, and deploy the model via a hosted API endpoint. Tailor your process by opting for a code-centric approach, leveraging our intuitive, cloud-based UI, or combining both to fit your unique needs.
  • Embark on your journey today with absolutely no credit card required. Step into the future with Roboflow.

Join Roboflow Now


Join the PyImageSearch Newsletter and Grab My FREE 17-page Resource Guide PDF

Enter your email address below to join the PyImageSearch Newsletter and download my FREE 17-page Resource Guide PDF on Computer Vision, OpenCV, and Deep Learning.
