Wildlife Tech

The AI future for animal behavior & interaction?

Read on to find out how Vision Language Models (VLMs) can be used to extract species interactions and behaviours from animal images

Hugo Markoff

What are Vision Language Models (VLMs)?

Think of them as the meeting point between a large language model (LLM) and a vision model like a Vision Transformer (ViT). During training, each image is tied to some text: sometimes it’s a correct caption, sometimes a mismatched one, sometimes a question and answer about the image, or short instructions. Over lots of examples, the system learns which text fits which image and how to talk about what’s in the scene.

Take a million photos of polar bears: some standing, some sitting, some waving a paw, some doing something totally different. On the language side, the model learns the words and how they relate: polar bear, paw, waving, snow, and so on, plus which attributes belong to the bear versus the background. On the vision side, a ViT chops each image into small patches (tokens) and uses attention to piece together the whole scene. Simplifying a bit, let’s say one patch might capture fur texture and color that hints “polar bear,” another the face, another the paw. By themselves, each side learns patterns; together, the magic happens when we align them so the visual features line up with the right words.
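
To make the patch idea concrete, here is a tiny sketch of how an image gets split into ViT-style patch tokens. The 224x224 image size and 16x16 patch size are just common defaults I picked for illustration, not something the models above require.

```python
import torch

# Minimal sketch of ViT-style patch tokenization (assumed 224x224 RGB image, 16x16 patches).
image = torch.randn(1, 3, 224, 224)  # dummy image tensor: (batch, channels, height, width)
patch_size = 16

# Cut the image into a 14x14 grid of 16x16 patches...
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
# ...and flatten each patch into one token vector.
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)

print(patches.shape)  # torch.Size([1, 196, 768]) -> 196 patch tokens, each a 768-value vector
# A real ViT then projects each token to the model's embedding size and runs attention over all of them.
```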


There are two common ways this alignment gets used.

  • In one setup, images and text are encoded separately and trained to pull matching pairs together and push mismatches apart. That’s great for things like image searching or labeling new images without extra training (a small sketch of this setup follows the list).
  • In the other setup, the language model directly looks at the visual tokens as it generates each word, letting it describe the image step by step and even reason about what’s happening.
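
The first setup is what models like CLIP do. Here is a minimal sketch using a CLIP checkpoint from Hugging Face transformers to score a few candidate captions against an image; the model name, image path, and captions are example assumptions, not anything specific to the models discussed here.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a publicly available CLIP model (example checkpoint; swap in your own choice).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("polar_bear.jpg")  # hypothetical image path
captions = [
    "a polar bear waving a paw",
    "a polar bear sleeping in the snow",
    "an empty snowy landscape",
]

# Encode the image and all captions together; CLIP scores every caption against the image.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Softmax over the image-text similarity scores gives a rough "which caption fits best".
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.2f}  {caption}")
```

This is the “labeling new images without extra training” trick in practice: no polar-bear-specific fine-tuning, just similarity between the image embedding and each caption embedding.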

When you combine the ViT and the LLM into a VLM, you essentially map those long visual embedding vectors, built from all the patches, to language. The result is a richer, more grounded description of what’s going on in the image. For animal behavior and interaction, that means a system that doesn’t just spot “polar bear,” but can notice the raised paw, infer a gesture, relate it to the snowy context, and explain or respond in plain language. That’s why VLMs are so promising for this field.


Don't take my word for it... let's see some examples! If we, for example, take this image of two anteaters and ask ChatGPT, which uses a VLM to analyze images, something simple like "what is going on in this image", we only have to wait a few seconds to let the magic happen.
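
If you would rather do the same thing programmatically than through the chat interface, the hosted models expose an API. Below is a hedged sketch using the OpenAI Python SDK; the model name, image path, and prompt are my own assumptions, so check the current API documentation before relying on it.

```python
import base64
from openai import OpenAI

# Requires an OPENAI_API_KEY environment variable.
client = OpenAI()

# Encode a local camera trap image as base64 so it can be sent inline (placeholder path).
with open("anteaters_cameratrap.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # example model name; use whichever vision-capable model you have access to
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is going on in this image?"},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```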

Couldn't read the text on the video? I got you ;)

"This image appears to be from a camera trap (trail camera) — timestamped “2025-02-22 2:04:22 AM” and marked with the “Reconyx” logo, which is a known wildlife camera brand.

It shows two animals interacting at night. Based on their body shapes, long snouts, and claws, they seem to be anteaters (likely Tamandua tetradactyla or Myrmecophaga tridactyla).

One anteater (left) is lunging or swiping with its front legs extended, while the other (right) is standing upright in a defensive or playful pose with its forearms raised — a common anteater defensive behavior when threatened.

So, most likely this image captures a confrontation or dominance interaction between two anteaters, possibly males competing for territory or mates."


I don't know about you but... for lack of better words, I just want to say "WOW". Here the VLM solves several problems at once.

  1. It worked as OCR (Optical Character Recognition) and was able to extract the date and time AND the camera brand from the logo.
  2. It was able to determine that it was at night, possibly combining the IR image type with the time.
  3. It did animal CLASSIFICATION and mapped the animals to two possible species of anteaters.
  4. It detected the interaction itself: "two animals interacting at night".
  5. It found behaviors: "confrontation or dominance interaction between two anteaters".

VLM alternatives + open source

ChatGPT is far from the only VLM, but currently (11th of October 2025), at least from my own testing, it seems to be the best of the models I have tried on animal images.

If ChatGPT at scale is not the way forward for you, maybe because of price or configurability, don't worry, there are some free alternatives :)

Qwen3-VL is an open-source VLM which was recently released. I previously used Qwen2.5-VL, which I came across while working with Chen Li on a robotics project, and I tested it on some animal images recently. When working with open-source VLMs like Qwen3-VL, you most certainly want to give the model instructions. These instructions act as "guidelines" for what the VLM should focus on and how it should answer your questions. You can learn a bit more about VLM prompting here: https://towardsdatascience.com/prompting-with-vision-language-models-bdabe00452b7/
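
To show what that looks like in practice, here is a minimal sketch of prompting Qwen2.5-VL (the version I actually tested) with a system instruction, following the pattern from the Qwen model cards. The instruction text and image path are placeholders, and Qwen3-VL ships with its own model classes, so check the model card for the exact names before copying this.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# The system message is where the "guidelines" live: tell the model what to look for
# and how to behave when it is unsure.
messages = [
    {
        "role": "system",
        "content": "You analyze camera trap images. Report the species, the number of "
                   "animals, and any visible behavior or interaction. If you are unsure, "
                   "say so instead of guessing.",
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/camera_trap.jpg"},  # placeholder path
            {"type": "text", "text": "What is going on in this image?"},
        ],
    },
]

# Standard Qwen-VL preprocessing: build the chat prompt and collect the image inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```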

From the testing on Qwen2.5-VL I noticed a few very impressive cases, but as I moved to more grainy, unclear and partially occluded animals, it started to hallucinate. I also noticed that, often, when an animal simply had its neck down towards the ground, it automatically assumed the animal was eating.

Some other options are the LLaVA lineage, InternVL and Pixtral, but there are plenty more! You also have many other closed-source alternatives to ChatGPT, such as Gemini and Claude.


The future of VLMs for animal behavior and interaction

In the robotics world, we see a lot of fine-tuned VLMs, such as RoboPoint https://robo-point.github.io/. While I think it will take some time before we see VLMs built from scratch specifically for animals, I do expect that we will see more fine-tuned models for animals.

And these fine-tuned models are slowly starting to emerge:

Disclaimer: I have not tested any of these models, so I cannot say how they compare to just using Qwen3-VL or another open-source model with some instruction prompting.

I also think that "grounding" work, such as using a CNN model to find the animals in the image, and even classify them, before passing them to a VLM, may be the way forward to more correct results.
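
As a rough illustration of that grounding idea, the sketch below runs an off-the-shelf detector first, crops each detected animal, and only then hands the crops to the VLM with a focused prompt. The ultralytics YOLO checkpoint and file paths are just placeholders I chose for the example; for camera traps you would more likely use MegaDetector or a detector trained on your own species.

```python
from PIL import Image
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")     # generic COCO-trained detector, used here only as an example
image_path = "camera_trap.jpg"    # placeholder path
image = Image.open(image_path)

# Step 1: detect candidate animals and crop them out of the frame.
results = detector(image_path)
crops = []
for box in results[0].boxes.xyxy.tolist():   # each box is [x1, y1, x2, y2]
    x1, y1, x2, y2 = map(int, box)
    crops.append(image.crop((x1, y1, x2, y2)))

# Step 2: send each crop to the VLM with a focused prompt, e.g.
# "This crop contains one animal. What species is it and what is it doing?"
# (reusing whichever VLM call you prefer, like the Qwen2.5-VL sketch above).
print(f"Found {len(crops)} animal crops to pass to the VLM.")
```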


Vision Transformers and eventually Vision Language Models will be a part of my ongoing PhD, so I will most likely have new and updated blog articles coming on the topic :)

Hugo Markoff

About Hugo Markoff

Co-founder of Animal Detect, the dog man.

Want to try Animal Detect?

Start now