Understanding Convolutional Pose Machines
Looking up human pose estimation online will eventually lead you to one of these papers: OpenPose, Stacked Hourglass Networks, or Convolutional Pose Machines. We will look at CPMs, as they offer a simple and elegant approach.
Pose machines actually predate deep learning: they used random forests as classifiers and handcrafted functions to extract features from images. A pose machine works in stages, somewhat like an RNN, producing a more precise prediction at the end of each stage. From here on, assume a pose machine with T stages.
Stage 1
With an image as input, the stage-1 classifier produces P+1 belief maps, where P is the number of body joints whose locations we must predict, and the +1 accounts for the background class. A belief map is simply a heatmap giving the probability that each pixel contains that particular body joint. These initial belief maps are coarse, often incorrect, and all over the place.
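As a toy illustration (my own, not from the paper), a predicted joint location can be read off a belief map by taking the pixel with the highest probability:

```python
import numpy as np

def belief_map_to_location(belief_map):
    """Return the (row, col) pixel with the highest belief, as plain ints."""
    r, c = np.unravel_index(np.argmax(belief_map), belief_map.shape)
    return int(r), int(c)

# A toy 5x5 belief map whose peak is at row 1, column 3
bmap = np.zeros((5, 5))
bmap[1, 3] = 0.9
print(belief_map_to_location(bmap))  # (1, 3)
```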
Stage 2 to T
At any stage t (≥ 2), the classifier's input is the belief maps from stage t−1 together with new features extracted from the image. This lets the pose machine refine its prediction at every stage.
- x: image feature extractor for stage 1
- g: classifier
- x': feature extractor for stages t ≥ 2 (shared across all of them)
- b: belief maps
- psi: converts belief maps into better features for the next stage's classifier
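Putting the notation together, the inference loop can be sketched as below. Here x, g, x_prime, and psi are placeholders for the learned functions, not the paper's actual layers, and the toy stand-ins at the bottom exist only so the sketch runs end to end:

```python
import numpy as np

def run_pose_machine(image, x, g, x_prime, psi, T):
    """Sketch of the T-stage pose machine loop using the notation above."""
    b = g(x(image))                              # stage 1: b_1 = g(x(image))
    for t in range(2, T + 1):                    # stages 2..T
        feats = x_prime(image)                   # features recomputed from the image
        b = g(np.concatenate([feats, psi(b)]))   # b_t from new features + old beliefs
    return b

# Hypothetical stand-ins: identity feature extractors, a classifier that
# keeps the first 4 values, and an identity psi
identity = lambda a: a
image = np.ones(4)
beliefs = run_pose_machine(image, identity, lambda a: a[:4], identity, identity, T=3)
print(beliefs.shape)  # (4,)
```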
As a simple analogue, you can think of T men standing in a line. The first man is given a photo. He looks at it and writes “I think the head is in the top half of the image” and passes the photo along with the writing to the next guy. The second guy looks at the writing and the image, then writes “I think the head is in the top left quarter of the image” and passes the photo with the writing to the next guy. This process continues for the T men in line and hopefully, the last man’s thought is precise enough for us to pinpoint the head in the photo.
Adding the Convolution to Pose Machines
We simply replace the classifiers and feature extractors of the pose machine with a few layers of convolution and max-pooling. Compare the figure below with the plain pose machine for clarity.
The problem of small receptive field
The stage-1 classifier has only the image as input, and since the kernel size (9x9) is small compared to the image size (368x368), the receptive field is very small. Experiments by the authors showed that increasing the receptive field increases accuracy.
It was observed that the belief maps of robust joints (head and shoulders), although noisy, were consistent and acceptable, whereas joints lower in the kinematic chain (elbows and knees) suffered.
Increasing the receptive field explicitly, by enlarging the kernels (which costs computation) or by adding more pooling layers (which costs accuracy), was rejected in favour of adding more convolutional layers, which implicitly give the later layers a large effective receptive field. How? Information that is spatially far apart in the image but contextually related gets encoded together by convolution as the input moves through the layers. For instance, when predicting the right elbow, the classifier will in effect know the right shoulder's position, enabling it to make a stronger prediction.
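To see the trade-off numerically, here is a small helper (my own, not from the paper) that computes the effective receptive field of a stack of convolution/pooling layers using the standard recurrence:

```python
def receptive_field(layers):
    """Effective receptive field after a stack of layers.
    Each layer is (kernel_size, stride)."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the field by (k-1) * current jump
        jump *= s              # strides multiply the jump between input pixels
    return rf

# Three 9x9 stride-1 convs: the field grows linearly with depth
print(receptive_field([(9, 1)] * 3))              # 25
# A stride-2 pool between two convs grows it faster, but loses resolution
print(receptive_field([(9, 1), (2, 2), (9, 1)]))  # 26
```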
The classifiers at stages t ≥ 2 apply convolutions to the previous stage's belief maps and use the result as an input; this allows the network to encode correlations between joints and produce better predictions. Put simply, the network learns that if the right shoulder is at the top right, there is very little chance that the right elbow is at the bottom left. In this way, the network overcomes the problem of a small receptive field.
The problem of vanishing gradients
The network computes the loss as the L2 distance between the final stage's belief maps and the ground truth, where the ground-truth maps have Gaussians placed over the actual joint locations.
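A minimal sketch of how such a ground-truth map and the L2 loss could be built (the map size and sigma here are illustrative, not the paper's values):

```python
import numpy as np

def gaussian_heatmap(h, w, center, sigma=2.0):
    """Ground-truth belief map: a 2D Gaussian centred on the joint location."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = center
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

gt = gaussian_heatmap(46, 46, center=(20, 30))
pred = np.zeros_like(gt)            # a (deliberately bad) predicted belief map
l2_loss = np.sum((pred - gt) ** 2)  # squared L2 distance to the ground truth
print(gt[20, 30])  # 1.0 at the joint location, falling off around it
```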
Since the network consists of many stages, the gradient updates propagated back to its early stages become vanishingly small or even zero. To circumvent this, we add intermediate supervision: instead of computing the loss only at the end of the network, we compute a loss (in the same manner as above) at every stage.
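In code, intermediate supervision just means summing the per-stage losses; a sketch with toy shapes (the map sizes below are arbitrary):

```python
import numpy as np

def total_loss(stage_beliefs, gt):
    """Intermediate supervision: sum the L2 loss over every stage,
    not just the last one, so gradients reach early stages directly."""
    return sum(np.sum((b - gt) ** 2) for b in stage_beliefs)

gt = np.ones((3, 4, 4))   # P+1 ground-truth maps (toy sizes: 3 maps of 4x4)
stages = [np.zeros_like(gt),        # stage 1: far off -> loss 48
          0.5 * np.ones_like(gt),   # stage 2: closer  -> loss 12
          np.ones_like(gt)]         # stage 3: exact   -> loss 0
print(total_loss(stages, gt))  # 60.0
```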
From the figure above, we can see that without supervision only stage 3 has non-zero gradients, whereas with supervision backpropagation is able to reach stage 1.
I hope this made the contributions of the paper easy to understand. The paper uses a lot of notation and jargon, which may make beginners shy away from reading it. Please do read the paper after you're done with this article; it will definitely boost your understanding.