Wildlife Tech Tutorial

How Far is This Animal? Depth Estimation for Wildlife Images

Compare stereo, 3D, and AI monocular techniques for gauging animal distance in camera-trap images, with a MegaDetector + Depth Anything demo for wildlife surveys.

Hugo Markoff

I’ve always enjoyed talking to people who truly understand the challenges within wildlife conservation. While my background includes plenty of work on distance estimation with cameras mounted on robots, I never thought about applying similar concepts to wildlife images, until Piotr Tynecki reached out. He asked if I had any plans to share insights on automated distance estimations and compare different methods, pointing out how valuable this could be for the conservation community.

That got me thinking: how can we measure the distance between a camera and an animal in a way that’s accurate, efficient, and practical for wildlife applications?


Why does distance estimation matter?

To be honest, I hadn’t read much about the role of distance estimation in wildlife studies. My knowledge came from digging into some key papers that highlighted its importance, shared by Piotr: Overcoming the Distance Estimation Bottleneck in Estimating Animal Abundance with Camera Traps and A semi-automated camera trap distance sampling approach for population density estimation.

Here are some concepts I took away from the papers:

  • Manual distance estimation methods are often error-prone and labor-intensive, necessitating the development of automated solutions.
  • Automated distance estimation from camera trap data improves detection probability estimates, enabling more robust population density calculations.

As Piotr summed it up:

  • "The objective is to utilize a random encounter model (REM) to calculate sample sizes in the area. To achieve this, we must not only detect animals and predict their species but also measure their speed of movement and the distance from the camera."

It was clear this wasn’t just an interesting topic, it was a critical one for conservation work.


How can we estimate distance from a camera?

There are many ways to estimate distance from images, and advances in deep learning have made it possible to derive depth information from a single 2D image. Let’s explore some methods, from traditional techniques to modern AI-based solutions, and see how they might work in wildlife scenarios.

1. Stereo Cameras (Or Two Cameras)

Two Animal Detect cameras with a known distance between them (26 cm) - example setup

Stereo cameras are one of the most widely used methods for depth estimation, and it’s not hard to see why: they’re simple, effective, and relatively inexpensive compared to more sophisticated systems. Inspired by how human eyes perceive depth, this technique uses two cameras positioned at a known distance from each other to capture slightly different perspectives of the same scene. By analyzing the disparity (the difference in position of objects in the two images), it’s possible to create a depth map and estimate distances.

Let’s take a moment to understand this with an example from human vision. Our eyes are spaced apart by roughly 6.5 cm on average, and this separation allows us to perceive depth in 3D. When you look at an object, each eye captures a slightly different image of it. Your brain fuses these two images together, compares the differences, and calculates the depth of the object relative to your position.

For instance, think about how you estimate distances in daily life. You might look at an object and instinctively compare it to something familiar. If it looks about as tall as a door frame (usually around 2 meters), you can guess it’s likely close to that height. Similarly, when walking, you might estimate each big step to be about 1 meter and use that to approximate distances. Our brains are constantly using these tricks to calculate depth. Stereo cameras work on the same principle but rely on calculations rather than intuition.


How Stereo Cameras Work

In the case of stereo cameras, the two cameras act like your eyes, capturing images from slightly different angles. The distance between the two cameras (known as the baseline) is a critical factor in determining depth. By comparing the positions of objects in the two images, software can calculate a disparity map, which is then used to estimate distances.
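The disparity-to-depth relationship itself is a single line of math: Z = f·B/d, where f is the focal length in pixels, B the baseline, and d the disparity in pixels. Here is a toy sketch with made-up numbers (not from a real calibration) just to show the arithmetic:

```python
# Depth from disparity: Z = f * B / d
# All values below are illustrative, not from an actual calibration.
focal_px = 700.0      # assumed focal length, in pixels
baseline_m = 0.26     # 26 cm baseline, as in the example setup above
disparity_px = 91.0   # hypothetical disparity for one matched pixel

depth_m = focal_px * baseline_m / disparity_px
print(depth_m)  # → 2.0 (meters)
```

Note how depth is inversely proportional to disparity: distant objects produce tiny disparities, which is why stereo accuracy degrades with range.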

To test this, I used a RealSense D435 camera:

Intel RealSense D435 with 5 cm between the left and right cameras

It has a built-in stereo pair. I placed a chair 2 meters away from the camera and measured the estimated depth. The result? It ranged from 1.92 to 2.05 meters, depending on the pixel locations in the depth map. My cat Jenos also made an appearance, and the camera estimated her distance to be around 1.57 meters. The results were surprisingly accurate!

Let’s see what it looks like with some images :)

Yes, grayscale I know... Last image looking blurry? Good! The wall behind does not look "double"? Bad :(

As a depth map this is what we get:

Beautiful arrow showing what the distance was to Jenos (cat) on the selected pixel (359, 388) (x, y)

However, I did notice that certain areas of the image (like the plain wall behind the objects) didn’t produce usable depth estimates. This is because stereo cameras need texture or identifiable features to compare the two images. If there’s nothing for the cameras to “latch onto,” like a uniform-colored wall, the depth map becomes less reliable.

Why Are Stereo Cameras So Popular?

Stereo cameras are often the go-to solution for depth estimation in robotics and other applications because they strike a perfect balance between cost and effectiveness. Unlike 3D cameras or advanced depth sensors, stereo setups don’t require expensive hardware or complex calibration processes. They rely on straightforward principles and can be implemented with relatively affordable equipment.


Challenges in Wildlife Applications

While stereo cameras work well in controlled environments, they present some challenges in wildlife scenarios:

  1. Lack of Stereo Wildlife Cameras: Most commercially available wildlife cameras don’t come with stereo functionality, so researchers would need to use two separate cameras.
  2. Synchronization Issues: For a stereo setup to work, both cameras need to capture images of the same animal at the exact same time. This is difficult to achieve in practice, as animals move unpredictably.
  3. Environmental Limitations: Stereo cameras struggle in low-light conditions or when there’s little texture in the scene, which is common in natural habitats.

DIY Stereo Camera Solution

If you’re determined to use stereo vision for wildlife studies, you can try setting up two cameras with a fixed baseline. For example, you could mount two wildlife cameras a known distance apart and manually synchronize the images. While this is technically feasible, it’s not an ideal solution due to the challenges mentioned above.
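To make the matching idea concrete, here is a minimal, deliberately naive block-matching sketch in pure NumPy. A real setup would use calibrated, rectified images and a library implementation (e.g. OpenCV's StereoSGBM); this sketch just synthesizes a "right" image by shifting a random texture 4 pixels and then recovers that shift as disparity:

```python
import numpy as np

def block_match_disparity(left, right, block=5, max_disp=8):
    """For each pixel, find the horizontal shift of a small patch in the
    right image that best matches the left patch (minimum sum of absolute
    differences). Naive and slow; for illustration only."""
    h, w = left.shape
    half = block // 2
    disp = np.full((h, w), -1, dtype=np.int32)  # -1 marks unmatched border
    for y in range(half, h - half):
        for x in range(half + max_disp, w - half):
            patch = left[y - half:y + half + 1, x - half:x + half + 1]
            costs = [np.abs(patch - right[y - half:y + half + 1,
                                          x - d - half:x - d + half + 1]).sum()
                     for d in range(max_disp)]
            disp[y, x] = int(np.argmin(costs))
    return disp

# Synthetic stereo pair: textured left image, right image shifted 4 px left.
rng = np.random.default_rng(0)
left = rng.random((40, 60))
right = np.roll(left, -4, axis=1)  # object at x in left appears at x-4 in right

disp = block_match_disparity(left, right)
print(int(np.median(disp[disp >= 0])))  # → 4
```

Notice that the synthetic image is pure random texture, so matching works everywhere; on a uniform wall, every candidate shift would have nearly equal cost, which is exactly the failure mode described above.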


2. 3D Cameras (Time-of-Flight, Infrared, etc.)

3D cameras use technologies like Time-of-Flight (ToF) or infrared projection to calculate depth. While these are widely used in robotics, they aren’t practical for wildlife cameras due to cost, power requirements, and environmental constraints.

However, this is the perfect time for me to include some robots! I’ve used 3D vision in several projects, and here is one of my favorites: we used a 3D camera to capture point clouds and estimate the height/depth of stair steps for a robot to climb :)

Point cloud data of a staircase, captured by an Intel RealSense camera. I think the FPS counter in the corner is broken.

For those who want to see a six-legged robot climb stairs: https://www.youtube.com/watch?v=zDJ8LW1Ounk enjoy :D


3. Monocular (Single Camera) Solutions

This is where things get exciting! Monocular depth estimation uses a single image to generate a depth map, powered by deep learning models and CNNs. I was impressed by the results of tools like the Distance Estimation software from the paper “Overcoming the Distance Estimation Bottleneck in Estimating Animal Abundance with Camera Traps” shared here. They have a GitHub repository where you can get a depth estimation program to try out yourself: GitHub Link

I tried to test this out myself… but…

After struggling to get the Linux version working on Ubuntu (possibly due to OS compatibility issues), and failing with Python 3.12 (possibly too new a version), I switched to Windows and got it running. However, I found the process overwhelming when the program started and showed me this:

Distance Estimation program for wildlife images - can be found by clicking the GitHub link above

So instead, I decided to build my own pipeline.


My Depth Estimation Pipeline

Here’s the simplified workflow I developed:

  1. Object Detection: Use MegaDetectorV5 to identify animals or humans in the image with the highest confidence.
  2. Depth Map Generation: Generate a depth map using a Depth Anything model.
  3. Depth Conversion: Convert the depth map values into real-world measurements based on at least one known distance.
  4. Distance Estimation: Calculate the distance to the detected object using the center pixels of the bounding box.
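The core of steps 3 and 4 can be sketched in a few lines. The models themselves (MegaDetectorV5 for the bounding box, Depth Anything for the relative depth map) are stubbed out with a synthetic depth map below, and scaling the whole map by one known distance is a simplification (monocular relative depth is often only defined up to an affine transform), so treat this as an illustration of the idea rather than my exact pipeline:

```python
import numpy as np

# Stand-in for a Depth Anything output: a relative (unitless) depth map.
# In the real pipeline this would come from the model, not from linspace.
rel_depth = np.tile(np.linspace(1.0, 5.0, 200), (100, 1))

def scale_depth(rel_depth, ref_xy, known_distance_m):
    """Step 3: convert a relative depth map to meters using one pixel whose
    real-world distance was measured (e.g. a marker placed in the scene)."""
    x, y = ref_xy
    return rel_depth * (known_distance_m / rel_depth[y, x])

def distance_to_detection(depth_m, box):
    """Step 4: read the metric depth at the center pixel of a detection
    bounding box given as (x_min, y_min, x_max, y_max)."""
    cx = (box[0] + box[2]) // 2
    cy = (box[1] + box[3]) // 2
    return float(depth_m[cy, cx])

# A marker at pixel (50, 40) was measured at 2.0 m from the camera:
depth_m = scale_depth(rel_depth, (50, 40), 2.0)
# Distance to a (hypothetical) detection box:
print(round(distance_to_detection(depth_m, (120, 20, 180, 80)), 1))  # → 4.0
```

Using the center pixel of the box is a pragmatic choice; averaging depth over the animal's mask (or the lower part of the box, near the ground contact) would be more robust against the box including background pixels.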

What does it look like?

Visual steps my pipeline goes through to estimate Animal Distance from the camera

Results: While testing in my garage, the pipeline produced depth estimates comparable to those from a stereo camera setup. I haven’t validated the accuracy outdoors yet (no measuring tape on hand!), but I’ll share my code and a demo in the next post so you can try it yourself.

Result image from my pipeline, with distances to a known object and the Fox

NOTE! Keep in mind that this was a weekend experiment in how some of the newest tools can be used. I’ve left out several points, such as handling multiple animal detections, using several known distances, and real-life tests and results.


Other Creative Solutions

If you’re looking for a simple yet effective setup, consider combining cheap sensors with cameras:

  • Attach ultrasonic or IR sensors to a plank (e.g., 3 meters long) spaced 1 meter apart.
  • Log distance data from the sensors alongside images to estimate speed and movement.

Pros:

  • Affordable and easy to set up.

Cons:

  • Not visually discreet, which could disturb wildlife.
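Turning such a sensor log into a speed estimate is then simple arithmetic. A minimal sketch, assuming 1 m spacing between adjacent sensors and timestamps in seconds (the sensors and log format here are hypothetical):

```python
SENSOR_SPACING_M = 1.0  # assumed spacing between adjacent sensors on the plank

def speed_from_triggers(t_first_s, t_second_s):
    """Estimate ground speed from the time between two adjacent sensors
    firing as the animal walks past the plank."""
    dt = t_second_s - t_first_s
    if dt <= 0:
        raise ValueError("second trigger must come after the first")
    return SENSOR_SPACING_M / dt

# An animal trips sensor 1 at t=10.0 s and sensor 2 at t=10.5 s:
print(speed_from_triggers(10.0, 10.5))  # → 2.0 m/s
```

Paired with the camera's timestamps, this gives exactly the movement-speed input the random encounter model (REM) mentioned earlier needs.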

Takeaways

Depth estimation for wildlife images is challenging, especially with the limitations of existing camera traps. While traditional methods like stereo cameras or 3D imaging aren’t always practical, deep learning models offer a promising alternative. By automating distance estimation, we can improve density calculations, detection probabilities, and overall wildlife conservation efforts.

Let me know your thoughts or if you have experiences to share! Stay tuned for my next post with a GitHub link to my pipeline. Let’s innovate together for wildlife conservation. 😊


About Hugo Markoff

Co-founder of Animal Detect, the dog man.