Introduction
Archive films provide a particularly challenging setting for detecting and
tracking people: the videos are low in quality and lack color, cameras move,
and there are many occlusions and crowded scenes. Face detection is a good
starting point for these films because the people in them show their faces most
of the time.
In this work we show an interesting example of how face detection and face
tracking help each other: (1) detection makes tracking easy by (mostly)
operating at the object level, hence independent of changes in low-level
signals; (2) tracking makes detection easy because most false positives are
isolated and can be removed by inference on temporal coherence. Even simple
algorithms lead to robust detection and tracking results.
Top: single-frame face detection is far from perfect, with a lot of false
detections. Bottom: we track faces and integrate face scores temporally
to remove false positives and recover misses.
Face Tracking by Detection
Face tracking in archive films can be very hard: there are large variations
in face pose and illumination, and (low-level) appearance-based tracking often
fails after only a short distance. By tracking at the object (face) level, however, the
algorithm avoids dealing with these variations: as long as it is a face that
moves smoothly, we can track its location with no trouble.
Of course, there are many difficult situations where face detection does not
work, such as when the person is facing away or seen at an unusual angle, or
when the illumination is bad. In such cases, we can switch to low-level tracking
(using correlation) and continue. Often the "gaps" are short: the face soon
re-appears, and the algorithm can return to the object-level tracking mode.
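The low-level fallback can be illustrated with a small template search. This is a minimal sketch that assumes grayscale frames stored as NumPy arrays and uses normalized cross-correlation over a small search window. The function names, the `radius` parameter, and the choice of normalized correlation are assumptions made for illustration; the page does not specify these details.

```python
import numpy as np

def ncc(patch, template):
    """Normalized cross-correlation between two equally sized gray patches."""
    a = patch.astype(float) - patch.mean()
    b = template.astype(float) - template.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def correlation_track(frame, template, prev_xy, radius=8):
    """Search a window around prev_xy for the best-correlating patch.

    prev_xy is the (x, y) top-left corner of the face box in the previous
    frame; radius (hypothetical parameter) bounds the search in pixels.
    Returns the best (x, y) corner and its correlation score.
    """
    h, w = template.shape
    H, W = frame.shape
    px, py = prev_xy
    best_score, best_xy = -1.0, prev_xy
    for y in range(max(0, py - radius), min(H - h, py + radius) + 1):
        for x in range(max(0, px - radius), min(W - w, px + radius) + 1):
            s = ncc(frame[y:y + h, x:x + w], template)
            if s > best_score:
                best_score, best_xy = s, (x, y)
    return best_xy, best_score
```

A low returned score is a natural signal that the correlation tracker has drifted, which is one reason to return to detection-based tracking as soon as a face is found again.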
Our actual tracking algorithm is simple: use the Viola-Jones detector with a
low "threshold" to find all possible faces in each frame, and track these faces
with a Kalman filter. We use a conservative strategy for initialization, i.e.
every candidate face that does not match an existing track starts a new track.
On the other hand, we use an aggressive greedy data association strategy: we
estimate how "good" the face tracks are, and allow the best tracks to match
candidate faces first.
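The association step described above might look like the following. This is a sketch under stated assumptions, not the implementation used in the paper: the state is a constant-velocity model over the face center, track quality is taken to be the sum of detection scores, and the gating distance `gate` and noise levels `q` and `r` are hypothetical values.

```python
import numpy as np

class FaceTrack:
    """Constant-velocity Kalman filter over the face center (x, y)."""
    F = np.array([[1, 0, 1, 0], [0, 1, 0, 1],
                  [0, 0, 1, 0], [0, 0, 0, 1]], float)   # state transition
    H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], float)   # observe position only

    def __init__(self, xy, score, q=1.0, r=4.0):
        self.x = np.array([xy[0], xy[1], 0.0, 0.0])     # [x, y, vx, vy]
        self.P = np.eye(4) * 10.0
        self.Q = np.eye(4) * q                          # process noise (assumed)
        self.R = np.eye(2) * r                          # measurement noise (assumed)
        self.scores = [score]

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, xy, score):
        y = np.asarray(xy, float) - self.H @ self.x     # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)        # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        self.scores.append(score)

    def quality(self):
        # Hypothetical quality measure: accumulated detector score.
        return sum(self.scores)

def step(tracks, detections, gate=20.0):
    """Advance all tracks by one frame.

    detections is a list of ((x, y), score). The best tracks choose first;
    every unmatched detection starts a new track. Unmatched tracks simply
    coast on their prediction in this sketch.
    """
    unused = list(range(len(detections)))
    for t in sorted(tracks, key=lambda t: t.quality(), reverse=True):
        pred = t.predict()
        if not unused:
            continue
        d = [np.linalg.norm(pred - np.asarray(detections[i][0], float))
             for i in unused]
        j = int(np.argmin(d))
        if d[j] <= gate:
            t.update(*detections[unused.pop(j)])
    return tracks + [FaceTrack(*detections[i]) for i in unused]
```

Letting high-quality tracks choose first means that when two tracks compete for one candidate face, the candidate goes to the track with more accumulated evidence, and the weaker track is left unmatched.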
Face Detection by Tracking
Tracking establishes the correspondence between candidate faces across frames.
There are many false positives from the detection stage, but most false
positives are isolated and form short tracks. True faces form long tracks whose
detections reinforce one another, increasing their likelihood of being a true face.
Enforcing temporal consistency is not trivial. Most previous work uses
simple strategies, e.g. thresholding on the track length. There are several
factors at play: (1) the face confidence score of each face detection; (2) the
track length; (3) the confidence in tracking/temporal correspondence. In
particular, as we are tracking all candidate faces, there are numerous tracks
and we cannot assume the tracking to be 100% correct. Tracking can fail,
especially in the low-level tracking mode over long "gaps" between detections.
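For reference, the simple track-level strategies mentioned above reduce each track to a single statistic. The sketch below assumes a track is represented as a list of per-frame detector scores; the function names and the `min_length` value are hypothetical.

```python
def summarize_track(scores):
    """Simple per-track statistics from per-frame detector scores."""
    return {
        "length": len(scores),
        "mean": sum(scores) / len(scores),
        "max": max(scores),
    }

def filter_by_length(tracks, min_length=3):
    """Keep only tracks with at least min_length detections.

    tracks maps a track id to its list of detector scores.
    """
    return {tid: s for tid, s in tracks.items() if len(s) >= min_length}
```

Each statistic captures only one of the three factors listed above: a length threshold ignores the detector scores, and a score summary ignores how reliable the frame-to-frame links are.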
A systematic way to combine these factors is to use a learned probabilistic
model; in this case we use a one-dimensional conditional random field.
Inference is standard, and training with stochastic gradient descent yields
parameters consistent with intuition. We compare the CRF integration model with several
baselines, such as using track length, using average score, using maximum
score, or using a local (short-range) SVM classifier.
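To make the chain model concrete, here is a minimal sketch of inference in a one-dimensional CRF with a binary label (face or non-face) per detection in a track. The feature design is an assumption for illustration: unary potentials come from a centered detector score, pairwise potentials reward label agreement in proportion to a link confidence, and the weights `w_score` and `w_link` are hypothetical stand-ins for the learned parameters. Training by stochastic gradient descent is not shown.

```python
import numpy as np

def track_potentials(scores, link_conf, w_score=2.0, w_link=1.5):
    """Build log-potentials for one track.

    scores: centered detector scores (positive means face-like), length T.
    link_conf: tracking confidence for each consecutive pair, length T-1.
    Label 0 is non-face, label 1 is face.
    """
    scores = np.asarray(scores, float)
    unary = np.stack([-w_score * scores, w_score * scores], axis=1)
    link = np.asarray(link_conf, float)
    pairwise = np.zeros((len(link), 2, 2))
    pairwise[:, 0, 0] = pairwise[:, 1, 1] = w_link * link  # reward agreement
    return unary, pairwise

def chain_marginals(unary, pairwise):
    """Forward-backward in log space; returns (T, 2) posterior marginals."""
    T = unary.shape[0]
    alpha = np.zeros((T, 2))
    beta = np.zeros((T, 2))
    alpha[0] = unary[0]
    for t in range(1, T):
        m = alpha[t - 1][:, None] + pairwise[t - 1]   # m[i, j]: prev i, cur j
        alpha[t] = unary[t] + np.logaddexp(m[0], m[1])
    for t in range(T - 2, -1, -1):
        m = pairwise[t] + (unary[t + 1] + beta[t + 1])[None, :]
        beta[t] = np.logaddexp(m[:, 0], m[:, 1])
    logp = alpha + beta
    logp -= np.logaddexp(logp[:, 0], logp[:, 1])[:, None]
    return np.exp(logp)
```

In this sketch, a weak detection sitting between two strong detections with confident links receives a high face probability, whereas with zero link confidence each detection is judged by its own score alone.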
Results on Full-length Films
Ground-truth faces are labeled in three full-length films (every 100th frame)
and are used both to train the models and to evaluate (final) detection
performance. The CRF model greatly improves face detection (90% average
precision, 70% recall) over single-frame detection (60% precision, 58% recall).
It also outperforms various baseline models for temporal integration.
Figure panels: (a) Casablanca, temporal vs. static; (b) Casablanca, CRF vs. baselines; (c) Kind Hearts and Coronets; (d) The Great Dictator.
Here are some sample video results. The tracks are selected automatically and show high recall with few false positives.
Casablanca
Coronets
Dictator
References
- Finding People in Archive Films through Tracking.
Xiaofeng Ren, in CVPR '08, Anchorage 2008.
- Face detection and tracking in a video by propagating detection probabilities.
R. Choudhury, C. Schmid, and K. Mikolajczyk, PAMI 25(10), 2003.