Diffusion models are a family of generative models, which aim to generate data that matches the "real" distribution as closely as possible. This could be realistic images, audio, video, and on rarer occasions text & language.
The main idea? Well, it may sound a bit confusing, but the model takes real images and gradually adds noise to them (like turning a picture into random static), and then a neural network is trained to reverse that process step by step, turning noisy data back into clean, realistic data. So the flow looks like this:
Real image → Add noise → Train model to reverse → Generate realistic image
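To make the "add noise" step concrete, here is a minimal sketch in Python/NumPy of how an image gets gradually noised during training. This is my own illustration of the standard forward diffusion process, not code from any particular library, and the linear noise schedule is just one common choice:

```python
import numpy as np

def add_noise(image, t, num_steps=1000):
    """Forward diffusion: blend the image with Gaussian noise.

    t=0 returns a nearly clean image; t=num_steps-1 is almost pure static.
    Uses a simple linear noise schedule purely for illustration.
    """
    betas = np.linspace(1e-4, 0.02, num_steps)   # how much noise each step adds
    alpha_bar = np.cumprod(1.0 - betas)[t]       # cumulative fraction of signal kept
    noise = np.random.randn(*image.shape)        # the random static
    return np.sqrt(alpha_bar) * image + np.sqrt(1.0 - alpha_bar) * noise

# Example: a fake 64x64 grayscale "camera trap image", noised halfway
image = np.random.rand(64, 64)
noisy = add_noise(image, t=500)
```

The neural network then learns to predict (and remove) that noise, which is the "Train model to reverse" part of the flow above.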
To learn in depth how diffusion works, I recommend visiting https://poloclub.github.io/diffusion-explainer/ which has interactive parts + an in-depth explanation of the different steps.

While there have been some super cool examples of, for instance, audio generated with diffusion models, this article will focus on image generation and augmentation, and how they can be used in real-world wildlife scenarios.
Diffusion models have seen increasing use in image augmentation and the generation of synthetic animal data. One example from 2024 is the blog post from Sara Olsson, "Improving Camera Traps to Identify Unknown Species with GPT-4o", which explains how synthetic data can be used to fill gaps in training data of animals. Not all animals are equally easy to collect data for, which often results in some species being under-represented.
Another good example is PolarBearWatchdog, by Lars Holst Hansen and Kim Hendrikse. The system was developed for early detection of polar bears in Greenland. I have had quite a few chats with Lars, which also nudged me to write this article. He has been extremely creative in testing how synthetic images, and even videos, work for polar bears: from using Photoshop features to remove and generate different types of backgrounds, to different ways of augmenting images, to recently working a lot with diffusion models. While still keeping their training data from a true distribution, Lars is able to show how convincing these models are; it is almost impossible to tell the difference between a real and a synthetic image with the naked eye.
With their trained model, they have been able to detect polar bears at the Zackenberg Research Station in Greenland. The system attempts to prevent human-animal conflicts by scaring polar bears away from places where humans live, such as the research station. If you want to see some real and synthetic images and videos of polar bears, I recommend you connect with Lars Holst Hansen on LinkedIn 😊
Can you spot the real and the fake image from Lars' LinkedIn posts right away?
Which models are relevant today, and how can you get started with diffusion models? Some examples below.
There are new models released almost DAILY, so this article will soon be "outdated" with regard to some of them. Just in the time since I started this article, Sora 2 has been released, with some fun and amazing examples of video generation.
But here is my best shot of today (5th October 2025).
ChatGPT (Image Gen) by OpenAI:
For image generation, and even augmentation, the easiest way to get started for most people is most likely ChatGPT. Not necessarily because it's the best, but because it's well-known and used by millions. And honestly, it does quite well when it comes to image generation (you can use Sora by OpenAI for video generation).
Say, for example, you need more data of servals: more orientations and different angles.
My prompt examples in this article are kept quite "low level" to simulate how most people would approach image generation models for the first time. For the image underneath, the prompt used is:
Can you keep this image, but add another serval cat walking the opposite direction? Facing the current one, keeping style and lighting
You can also ask the model directly to generate an image from scratch. It may be a good idea to learn how to ask for aspect ratios, to make the output match the image size of a real camera trap image. You can also ask it to add the "footer" section with date and time. As an example, without any additional information (like size ratio):
Can you make a realistic image taken by a trail camera of a serval cat?
Honestly, the completely generated image of the serval looks quite stunning to me. But here is the catch: it often comes out "perfect", and that's rarely the case for real trail camera images.
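If you would rather script this than use the chat UI, the same idea works through OpenAI's API. A minimal sketch, assuming the official openai Python SDK and its image generation endpoint; the model name and size options are the ones I believe are current, so double-check OpenAI's docs:

```python
from openai import OpenAI
import base64

client = OpenAI()  # expects OPENAI_API_KEY in the environment

result = client.images.generate(
    model="gpt-image-1",                       # OpenAI's image generation model
    prompt=("Can you make a realistic image taken by a trail camera "
            "of a serval cat? Landscape orientation, with a footer "
            "showing date and time."),
    size="1536x1024",                          # wide aspect ratio, closer to a trail cam frame
)

# The API returns the image as base64; decode and save it to disk
with open("serval_trailcam.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```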
I would have loved to show an example from the brand-new Sora 2 model for video, but at least for now, I am met with the message "Sora is not available in Denmark yet - Please check back soon as we roll out Sora in new countries".
Google AI Studio
Google is also one of the companies that seems quite active in photo and video generation AI, recently with the photo generation (and augmentation) tool Gemini 2.5 Flash Image (Nano Banana) and their video generation tool Veo 3.
Nano Banana
Nano Banana is quite amazing at editing existing images, like putting sunglasses on a person when you supply the original image. I would say it's not as good as ChatGPT at generating completely new wildlife images from a text prompt. But damn, it's good at modifying existing images while keeping most details of the background and environment: it picks up the shadowing from the original image and correctly maps it onto new items, it can change the "environment" of the image, and I would say it is currently the best model for modifying existing images.
As an example, I have an original image of a sunbear looking down at the ground. But I want an image of it facing the camera. I simply upload the original image and prompt "Please make the sunbear look towards the camera". The result is quite amazing, but those who look very closely will see that the texture of the bear's fur got more "smooth", and I bet those who look even closer will find other small "artifacts". And obviously, with something as simple as "Please add a pink hat to the sunbear", the AI model will do it for you :)
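For the curious, the same edit can be done programmatically. A minimal sketch, assuming Google's google-genai Python SDK; the exact model identifier changes between previews, so verify the current one in Google AI Studio (the file names are just placeholders):

```python
from google import genai
from PIL import Image

client = genai.Client()  # expects GEMINI_API_KEY in the environment

original = Image.open("sunbear_original.jpg")  # hypothetical input file

response = client.models.generate_content(
    model="gemini-2.5-flash-image",            # "Nano Banana"; check the current model id
    contents=["Please make the sunbear look towards the camera", original],
)

# Save the first image part the model returns
for part in response.candidates[0].content.parts:
    if part.inline_data is not None:
        with open("sunbear_edited.png", "wb") as f:
            f.write(part.inline_data.data)
        break
```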
Veo 3
So what about video generation? A video can be split into several frames, giving you the animal in several different orientations. So how well can video generation models make a video of wild animals?
Can we turn this trail camera image of a hippo into a video? - Yes we can!
Prompt: Make the hippo walk to the right (same direction as it's facing) and walk out of the scene
Let's go again. This time, can we make the hippo spin around in a video instead, so we get more angles and positions of this beautiful animal?
Prompt: Can you make the hippo spin around in a circle?
Again, as you see, we get a handheld kind of view, and the hippo never does a full circle. But still, it is impressive what the video generation model can do.
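Once you have a generated clip, turning it into training frames is straightforward. A small sketch using OpenCV; the file names are just placeholders, and keeping only every Nth frame avoids near-duplicate images:

```python
import cv2

video = cv2.VideoCapture("hippo_generated.mp4")  # hypothetical Veo 3 output
frame_idx, saved = 0, 0

while True:
    ok, frame = video.read()
    if not ok:                 # end of the clip
        break
    if frame_idx % 10 == 0:    # keep every 10th frame to avoid near-duplicates
        cv2.imwrite(f"hippo_frame_{saved:03d}.jpg", frame)
        saved += 1
    frame_idx += 1

video.release()
print(f"Saved {saved} frames")
```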
"Free" options
Most video and image generation models are hidden behind a paywall. I have just subscribed to one free month of Google AI Studio, which is still limited. So what are the options if you need to generate tons of videos and images and don't want to spend money?
Well, there are several open-source image and video generation diffusion models you can go for, such as:
- Qwen-Image
- Hunyuan-DiT
- PixArt-Sigma
- Open-Sora
- Hunyuan-Video
- VideoCrafter
The models differ a lot in size and GPU VRAM requirements, and it often takes quite some technical knowledge to clone the GitHub repositories and get started. If you want a simpler UI to get started and play around with free diffusion models, you can install the ComfyUI interface: https://www.comfy.org/download
Here you will have several options, and the UI makes it easy to download and use different models, for example the Wan2.2 video generation model from Alibaba's Tongyi Lab. It requires at least 8 GB (12 GB recommended) of VRAM on your GPU.
"A lion from a wildlife camera (trail camera) walking from one side of the scene to another"
AI generated video of a Lion by Wan2.2
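Alternatively, if you prefer Python over a node-based UI, many open-source models can be run through Hugging Face's diffusers library. A minimal sketch using Stable Diffusion XL (listed further down) as an example, assuming a CUDA GPU with enough VRAM; the prompt is just an illustration:

```python
import torch
from diffusers import DiffusionPipeline

# Downloads the model weights on first run (several GB)
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    prompt=("A serval cat at night captured by an infrared trail camera, "
            "grainy, motion blur, grayscale"),
    num_inference_steps=30,   # fewer steps is faster, more steps is usually cleaner
).images[0]

image.save("serval_sdxl.png")
```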
You should also manage your expectations; there are some smaller image generation models which work fast... but the output may look like this:
Others:
Now, I only selected a few diffusion models to show, but there exist hundreds, if not thousands, by now, such as:
- Midjourney (image and video)
- Adobe Firefly (image, with Photoshop/Illustrator integration)
- Stable Diffusion / SDXL (image; open‑source ecosystem)
- Runway Gen-3/Gen-4 (video; strong motion/camera controls)
- Pika (video; fast, social-friendly)
- Luma Dream Machine (video; natural‑language edits)
- Ideogram (image; excellent text-in-image)
They each have their pluses and minuses, so go explore :)
Limitations, suggestions, and reflections:
As you can probably see, video and image generation diffusion models are sometimes extremely convincing. The synthetic data can be used to fill gaps in training data, which can make a real difference in animal conservation.
Current "foundation models" are trained on a a broad, diverse dataset, which allows them to generate all kind of different types of images and styles, which sometimes becomes a problem when you are focused on a specific type of data. The diffusion models, has limitations, but are all the time improving, one day there may exist one which has been trained on millions of images from trail camera images and videos to make more convincing and realistic images. Currently, there are limitations, IR images, imperfections, occlutions and animal behaviour (such as blurriness from moist and raindrops, flash reflection, vissible deseases, interactions etc.)
Real data is always best for training diverse types of object recognition and classification models, but I strongly believe we will introduce more and more synthetic data to fill gaps, since real datasets will always have uneven distributions of animals; some animals are just more common than others.
I think it's important to make sure we don't overflow datasets with synthetic data, and that we have guidelines to mark synthetic data and differentiate between "real and fake" data. I can see this being an ongoing problem and discussion for the foundation models as well. Before, we were able to scrape the internet for images taken and made by humans, but it's now overflowing with synthetic data, and what happens if you keep introducing more and more synthetic data to train more and newer models? Well, time will tell!
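One possible way to mark synthetic data, purely my own illustrative convention and not an established standard, is a sidecar metadata file per image, which keeps the provenance explicit without touching the pixels:

```python
import json
from pathlib import Path

def tag_synthetic(image_path, model, prompt):
    """Write a sidecar JSON next to the image flagging it as synthetic.

    Purely an illustrative convention, not an established standard.
    """
    meta = {
        "synthetic": True,
        "generator": model,
        "prompt": prompt,
    }
    Path(image_path).with_suffix(".json").write_text(json.dumps(meta, indent=2))

tag_synthetic("serval_trailcam.png", model="gpt-image-1",
              prompt="realistic trail camera image of a serval cat")
```

A convention like this makes it trivial to filter synthetic images out (or cap their proportion) when assembling a training set.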




