This is the first in a series of articles I hope to write about the workings and applications of depth and 3D in vision.
Let's motivate ourselves first: why do we need to study and understand depth?
We as humans perceive the world in three dimensions. We use this 3D information in trivial day-to-day tasks, such as picking something up from a table or not bumping into furniture while moving around the house, as well as in skilful tasks like throwing a basketball into the hoop or driving through heavy traffic.
But our eyes seem to give us only a 2D image of the world around us. Yet you somehow know how far to stretch your hand to reach for the TV remote, and you cannot explain how you knew it. When an expert cannot explain their own expertise, intelligence is at play. This is why you see a plethora of machine learning techniques applied to vision problems, and why depth estimation is a fundamental problem in computer vision.
Mathematically, intelligence is studied as pattern recognition, and our brains are remarkably good at it. From here on we'll discuss the high-level patterns our brain recognises and uses as cues to estimate depth.
Binocular Cues
Retinal Disparity — Due to the gap between our eyes, the images seen by the left eye and the right eye differ slightly. (Close and open each eye alternately to confirm this.) The gap causes each eye to form an image from a slightly different perspective, and our brain has evolved to fuse the two and infer depth from their differences.
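To see how this disparity maps to depth, here is a minimal sketch of the textbook triangulation formula for a rectified stereo pair. The focal length, baseline, and disparity values below are illustrative assumptions, not measurements:

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Triangulate depth from the disparity between two rectified views.

    For a rectified stereo pair, a point at depth Z projects into the two
    images with a horizontal offset (disparity) d = f * B / Z, so
    Z = f * B / d. Larger disparity means a closer point.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# A point shifted by 70 px between views, seen by a camera with f = 700 px
# and a 6.5 cm baseline (roughly the human inter-pupillary distance):
z = depth_from_disparity(700.0, 0.065, 70.0)  # about 0.65 m
```

This is exactly the computation a stereo matching pipeline performs per pixel once correspondences are found.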
This effect is leveraged to make the stereoscopic 3D movies you watch with glasses. A stereoscopic camera has two lenses set horizontally apart, effectively capturing footage from two perspectives. Editors then use video-processing software to combine the recorded footage so that it can be viewed with anaglyphic (red/blue) glasses.
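The combination step itself is simple channel mixing. A minimal sketch with NumPy, assuming both views are standard RGB arrays of the same shape (the function name and layout are illustrative):

```python
import numpy as np

def make_anaglyph(left_rgb: np.ndarray, right_rgb: np.ndarray) -> np.ndarray:
    """Combine a stereo pair into a simple red/cyan anaglyph.

    The red channel comes from the left view, while green and blue come
    from the right view, so coloured glasses route one perspective to
    each eye.
    """
    anaglyph = right_rgb.copy()          # keep green and blue from the right image
    anaglyph[..., 0] = left_rgb[..., 0]  # take red from the left image
    return anaglyph
```

Real pipelines apply colour-balancing on top of this, but the channel routing is the core idea.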
Convergence — Both our eyes always look at the same point in 3D space, and they converge when we shift our gaze to objects closer to us. (Slowly move your forefinger closer to your nose while focusing on it; you can feel the muscles under your eyes working.) The degree of this convergence or divergence tells us how far an object is.
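The geometry behind this cue is plain triangulation: the two eyes and the fixated point form an isosceles triangle. A small sketch, where the inter-pupillary distance and angle are illustrative assumptions:

```python
import math

def distance_from_vergence(baseline_m: float, vergence_deg: float) -> float:
    """Estimate the distance to the fixated point from the vergence angle.

    Assuming both eyes rotate symmetrically to fixate a point straight
    ahead, half the baseline and the distance Z form a right triangle:
    tan(theta / 2) = (B / 2) / Z, so Z = (B / 2) / tan(theta / 2).
    """
    half_angle = math.radians(vergence_deg) / 2.0
    return (baseline_m / 2.0) / math.tan(half_angle)

# With a 6.5 cm inter-pupillary distance, a vergence angle of roughly
# 7.4 degrees corresponds to fixating a point about half a metre away.
```

As the angle shrinks towards zero the estimate diverges, which matches experience: convergence is only a useful cue for nearby objects.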
Monocular Cues
Sometimes there are simpler patterns the brain can recognise without the effort of two eyes, and other times information from both eyes isn't available. Consider watching a movie: both eyes see the same image because the screen is flat and offers only one perspective, yet you still know the relative depth of objects in the scene (i.e. whether something is behind or in front of another thing).
Relative Height and Size — In the above painting, you can tell that the boats at the bottom are nearer to us and those in the middle (closer to the horizon) are farther away. Likewise, the smaller boats are farther and the larger boats nearer, assuming the boats are roughly the same size.
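Under a pinhole-camera model, this relative-size reasoning becomes a one-line estimate. The focal length and object height below are hypothetical values, not taken from the painting:

```python
def distance_from_apparent_size(focal_px: float,
                                real_height_m: float,
                                image_height_px: float) -> float:
    """Pinhole-camera estimate of distance from apparent size.

    An object of real height H at distance Z projects to an image height
    h = f * H / Z, so Z = f * H / h: the smaller something looks, the
    farther away it must be, provided we know (or assume) its real size.
    """
    return focal_px * real_height_m / image_height_px

# A boat assumed to be 2 m tall that spans 100 px, seen by a camera with
# f = 1000 px, is estimated to be 20 m away.
```

Note the assumption doing the work: just like our brain, the formula needs a prior on the object's true size.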
Light and shadow — If I asked you to point out the closest point on this sphere, you'd easily pinpoint it to a good estimate. Here you used information from the effects of light, such as shadows, specular reflection, colour bleeding, etc.
Sharpness (Blur) — This effect is a consequence of the eye's focusing: objects away from the plane of focus appear blurred. It can be replicated with cameras (known as depth of field, or DoF) and has long been used in film-making and photography to draw the viewer's attention to the subject of the image, effectively serving as a monocular depth cue.
Texture gradient — In scenes like grass or flower fields, objects near us appear in clear detail, while towards the farthest point (the horizon) the texture degrades into what looks like noise. This gradient of detail correlates with depth.
Occlusion — When one object hides part of another, we can tell that the occluding object is in front and the hidden object is behind.
Motion Parallax — Whenever you are in a window seat of a moving vehicle, nearby vehicles seem to whizz past, whereas mountains and other distant objects drift by slowly, and very distant ones like the moon hardly seem to move at all. This difference in the relative speed of motion of different objects tells us how close they are to us.
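The relative speeds in this cue follow directly from geometry: a stationary object directly abeam sweeps across the visual field at an angular rate inversely proportional to its distance. A minimal sketch, with made-up speeds and distances:

```python
def angular_speed(observer_speed_mps: float, distance_m: float) -> float:
    """Angular speed (rad/s) of a stationary object directly abeam.

    For an observer moving at speed v, an object at perpendicular
    distance Z crosses the visual field at omega = v / Z, so nearby
    objects appear to race past while distant ones barely move.
    """
    return observer_speed_mps / distance_m

# From a train moving at 30 m/s: a fence 10 m away versus a hill 3 km away.
fence = angular_speed(30.0, 10.0)   # 3.0 rad/s, a blur past the window
hill = angular_speed(30.0, 3000.0)  # 0.01 rad/s, an almost static backdrop
```

Inverting this relationship (distance from observed angular speed) is the basis of parallax-based depth estimation from video.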
Accommodation — The ability of the eye to focus at different distances. Focusing requires certain eye muscles to relax or contract, and the brain uses this muscle movement as a signal to gauge the depth of the object currently in focus.
That's all, folks. I wrote this article so that people solving vision problems can effectively leverage these cues to build better, more robust algorithms. Thanks for reading. Suggestions and feedback are welcome in the comments.