Understanding Convolutional Pose Machines

Pose Machines

Pose machines predate the deep learning boom: they used random forests as classifiers and handcrafted functions to extract features from images. They work sequentially, a bit like an RNN if you will, producing a more precise prediction at the end of each stage. For what follows, assume a pose machine with T stages.

Stage 1

With an image as input, the classifier produces P+1 belief maps, where P is the number of body joints whose locations we need to predict and the +1 is for the background class. A belief map is just a heatmap giving, for each pixel, the probability that the pixel is that particular body joint. These initial belief maps are coarse, often incorrect and all over the place.
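
To make the shape of this output concrete, here is a minimal NumPy sketch of a stack of belief maps and of reading a joint location off one of them. The sizes (P = 14, 46x46 maps), the channel ordering and the helper name `joint_location` are assumptions for illustration, not values taken from the paper.

```python
import numpy as np

# Assumed sizes for illustration only
P, H, W = 14, 46, 46

# One heatmap per joint plus one for the background class
belief_maps = np.zeros((P + 1, H, W))

# Pretend the classifier is confident that joint 0 sits at row 10, column 20
belief_maps[0, 10, 20] = 1.0

def joint_location(belief_map):
    """Return the (row, col) of the highest-scoring pixel in one belief map."""
    row, col = np.unravel_index(np.argmax(belief_map), belief_map.shape)
    return int(row), int(col)

location = joint_location(belief_maps[0])
```

Taking the argmax is the simplest way to turn a heatmap into a coordinate; with the coarse stage 1 maps described above, that peak can easily land in the wrong place.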

Stage 2 to T

At any stage t ≥ 2, the inputs to the classifier are the belief maps from stage t-1 and new features extracted from the image. This lets the pose machine refine its prediction at every stage.
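
The stage-by-stage flow can be sketched as a plain loop. Everything below is a hypothetical stand-in: `run_pose_machine`, the toy stages and the toy feature extractor are made up so the sketch runs, and they only mimic the data flow, not any real classifier.

```python
import numpy as np

def run_pose_machine(image, stage1, later_stages, extract_features):
    """Run all stages and return the belief maps produced by each one."""
    beliefs = stage1(image)              # stage 1 sees only the image
    history = [beliefs]
    features = extract_features(image)   # image features used by stages >= 2
    for stage in later_stages:
        # stage t sees the image features and the beliefs from stage t-1
        beliefs = stage(features, beliefs)
        history.append(beliefs)
    return history

# Toy stand-ins (assumptions) so that the loop is runnable
rng = np.random.default_rng(0)
P, H, W = 3, 8, 8
toy_stage1 = lambda img: rng.random((P + 1, H, W))
toy_features = lambda img: img.mean(axis=-1)               # shape (H, W)
toy_stage = lambda feats, prev: 0.5 * prev + 0.5 * feats   # broadcasts over joints

image = rng.random((H, W, 3))
history = run_pose_machine(image, toy_stage1, [toy_stage] * 2, toy_features)
```

With T = 3 here, `history` holds one set of belief maps per stage, which is what makes it possible to inspect how the prediction changes from stage to stage.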

pose machines flow chart from the paper

Notations explained

  1. x: image feature extractor for stage 1
  2. g: classifier
  3. x’: feature extractor for stages t ≥ 2; shared by all of those stages
  4. b: belief maps
  5. psi: converts belief maps into better features for next stage classifier
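
Read together, the notation suggests a simple recurrence. The version below is a sketch in this article's simplified notation, not the paper's exact formulation: x and x’ stand for the features produced by the two extractors, and the stage subscript on g is an assumption added here to indicate that each stage has its own classifier.

```latex
b_1 = g_1(x), \qquad
b_t = g_t\bigl(x', \psi(b_{t-1})\bigr) \quad \text{for } 2 \le t \le T
```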

Adding the Convolution to Pose Machines

We just replace the classifiers and feature extractors in pose machines with a few layers of convolution and max-pooling. Compare the figure below with the original pose machine flow chart for clarity.

CPM flow chart from the paper. In the bottom part of the figure, you can see that the effective receptive field increases after multiple convolutions

The problem of small receptive field

The stage 1 classifier has only the image as input. Since its kernels (9x9) are small compared to the image (368x368), a single convolution sees only a tiny patch of the image, so the receptive field is very small. The authors' experiments showed that accuracy improves as the receptive field grows.
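
How stacking layers widens the receptive field can be worked out with a little arithmetic. The helper below is a generic sketch; the layer list is an assumption for illustration (9x9 and 5x5 convolutions with stride 1, interleaved with 2x2 pooling of stride 2), not a verified copy of the authors' network.

```python
def receptive_field(layers):
    """Receptive field (pixels per side) of a stack of (kernel, stride) layers."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the field by (k - 1) * jump
        jump *= stride              # striding spreads later layers' taps further apart
    return rf

# Assumed stage-1-like stack: (kernel size, stride) per layer
assumed_stack = [
    (9, 1), (2, 2),   # conv 9x9, pool 2x
    (9, 1), (2, 2),   # conv 9x9, pool 2x
    (9, 1), (2, 2),   # conv 9x9, pool 2x
    (5, 1), (9, 1),   # conv 5x5, conv 9x9
    (1, 1), (1, 1),   # two 1x1 convs (do not widen the field)
]

single_conv = receptive_field([(9, 1)])      # 9
whole_stack = receptive_field(assumed_stack) # 160 under these assumptions
```

Under these assumptions a single 9x9 convolution sees 9 pixels per side, while the full stack sees 160, which is still well short of the 368-pixel image side.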

The problem of vanishing gradients

The network computes the loss as the L2 distance between the belief maps of the final stage and the ground truth maps, which have a Gaussian peak placed at each annotated joint location.
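
A minimal NumPy sketch of such a target map and of the loss for one joint follows. The map size, the sigma value, the unnormalized Gaussian and the helper names `gaussian_map` and `l2_loss` are assumptions for illustration; the loss here is the summed squared difference.

```python
import numpy as np

def gaussian_map(height, width, center_row, center_col, sigma=1.0):
    """Ground truth heatmap: a Gaussian bump peaking at the annotated joint."""
    rows = np.arange(height)[:, None]
    cols = np.arange(width)[None, :]
    sq_dist = (rows - center_row) ** 2 + (cols - center_col) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def l2_loss(predicted, target):
    """Sum of squared differences between predicted and target belief maps."""
    return float(np.sum((predicted - target) ** 2))

target = gaussian_map(46, 46, center_row=10, center_col=20, sigma=2.0)

perfect = l2_loss(target, target)                # a perfect prediction costs nothing
blank = l2_loss(np.zeros_like(target), target)   # an all-zero prediction is penalized
```

For all P+1 maps the same loss would simply be summed over the channels.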

Histogram of gradient magnitudes. Red: without intermediate supervision; black: with intermediate supervision





Data Science at ShareChat. Ola. IIT Madras.