
The First IEEE Workshop on 
Face Processing in Video
June 28, 2004, Washington, D.C., USA


Introduction to 
the First IEEE Workshop on Face Processing in Video


What makes face processing in video special 

Since video cameras became affordable and computers became powerful enough to process video in real time, we have seen tremendous interest from both academia and industry in vision-based, human-oriented applications. These applications include public surveillance, information security, biometrics, human-computer interaction, multimedia, immersive and collaborative environments, video conferencing, video coding and annotation, computer games, and entertainment, to name a few.

A task of prime importance in all of these applications is analyzing video data for information about human faces. This involves such problems as face detection, face tracking, and, of course, face recognition. The problem of recognizing faces from video, however, should not be considered a mere extension of recognizing faces in photographs, since there are a few principal differences between the two, in terms of both the nature of the processed data and the approaches used.

On one hand, because of real-time, bandwidth, and environmental constraints, video processing has to deal with much lower resolution and image quality than photograph processing. Even assuming that the lighting conditions are perfect when a video snapshot is taken, which is rarely true, the object of interest may be located too far from the camera or at an angle that makes recognition very difficult. On the other hand, video images can be easily acquired, and they capture the motion of a person. This makes it possible to track people until they are in a position convenient for recognition.

Beyond that, image-based face recognition traditionally belongs to the field of pattern recognition and as such is driven mainly by mathematical principles. By contrast, video-based face recognition can also be approached using neurobiological principles, the study of which may help bring the performance of computer vision systems closer to that of biological vision systems.

The described difference between recognizing faces from photographs and from video can easily be seen in Figure 1, which shows a photograph and a snapshot of a News video program downloaded from the Internet. This figure can also be used to discover how biological vision systems (such as that of the reader of this paper) approach a face recognition problem. For this purpose, the reader is invited to recognize the faces in the figure.


Figure 1. A test for examining how face recognition is performed by biological systems: try to recognize the faces in these images.

(When trying to recognize the faces shown in (a) and (b), we first detect face-looking regions. Then, for the face in photograph (a), we rotate our face (or the page) to align our eyes with the eyes in the photograph, after which we might be able to recognize John Lennon in the last year of his life. This is also the position we would use if we wished to memorize this face. In image (b), a snapshot of a News video program downloaded from the Internet, we can easily locate two faces, but we need to look very closely to see that the two persons are Paul McCartney and Vladimir Putin (the video was taken shortly after the singer's concert on Red Square in May of the previous year). We also note the difference in resolution and quality between the photograph (a) and the video image (b). Face orientation is another factor that makes recognition in video difficult.)
The test illustrates the classification and hierarchy of face processing tasks presented at and covered by this workshop.

As we try to recognize a face in an image or a scene, we notice the following division and hierarchy of face processing tasks. First, we scan the scene to localize the areas where a face may be located (the face segmentation task). Then we approach an area of interest and detect the presence of a face there (the face detection task). Next we follow the face (the tracking task) until it appears in a position convenient for recognition, which, for faces, is the eye-to-eye position (the eye detection and face modeling tasks). Only then do we attempt to assert whether the face is familiar. If it is familiar, we recognize it (the recognition task); if it does not look familiar, we memorize it (the memorization task). These and other face processing tasks are summarized in Figure 2.

While we do not claim that this is the exact order in which humans recognize faces (facial expression and orientation, for example, can be retrieved without first retrieving the face position), this is the order used to organize the papers presented at the workshop.

Figure 2. Categorization and hierarchy of tasks performed in face processing in video. 
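The task hierarchy above can be sketched as a simple per-frame processing loop. The following Python sketch is purely illustrative: the stage functions are placeholders standing in for real segmentation, detection, tracking, and recognition modules, and the frame/region dictionaries are assumed data structures, not part of any system described at the workshop.

```python
# Illustrative sketch of the face processing hierarchy:
# segment -> detect -> track -> recognize-or-memorize.
# All stage functions are hypothetical placeholders; a frame is modeled
# as a dict whose "regions" entry lists candidate face regions.

def segment_faces(frame):
    """Face segmentation: localize candidate face regions in the scene."""
    return frame.get("regions", [])

def detect_face(region):
    """Face detection: confirm that a candidate region contains a face."""
    return region.get("is_face", False)

def track_until_frontal(region):
    """Tracking: report whether the face has reached an eye-to-eye position."""
    return region.get("frontal", False)

def recognize(region, known_faces):
    """Recognition: return the identity if the face is familiar, else None."""
    return known_faces.get(region.get("signature"))

def process_frame(frame, known_faces):
    """Run the full hierarchy on one frame; memorize unfamiliar faces."""
    results = []
    for region in segment_faces(frame):
        if not detect_face(region):
            continue
        if not track_until_frontal(region):
            continue  # not frontal yet; keep tracking in later frames
        identity = recognize(region, known_faces)
        if identity is None:
            # Memorization: store the unfamiliar face for future recognition
            identity = f"person_{len(known_faces)}"
            known_faces[region["signature"]] = identity
        results.append(identity)
    return results
```

On a frame containing one frontal face, the first call memorizes the face and subsequent calls recognize it from its stored signature.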


Thirty papers were selected out of 43 submissions for presentation at the workshop. The papers can now be retrieved from the IEEE Xplore digital library.

As it may be difficult to evaluate the video-based approaches presented in the papers from the video snapshots alone, many authors have also submitted links to actual video demos which can be downloaded from the Internet for viewing. These links, as well as links to the related projects' websites, are available at the workshop's website. A BibTeX file with the list of all workshop papers is also available there.

A summary of the papers can also be found at the workshop's website.

About the workshop logo 

The logo designed for the workshop, which appears as an animated image at the workshop's website, was developed by the workshop's chair to illustrate some peculiarities of processing faces in video. In video, a face is often arbitrarily oriented and captured in low resolution under poor lighting conditions. It can also be blurred by motion. At the same time, video makes it possible to capture facial motion, so that a face can be localized and recognized from blinking, for example.

The canonical face representation, which is the base representation used to memorize and recognize faces from video, is often eye-centered and uses only the central part of the face. Commonly it is also chosen to be of the lowest resolution at which the face is still recognizable. In particular, one of the most frequently used canonical face sizes is 24 x 24 pixels, which allows one to describe the natural symmetry of a human face using 16 equal blocks, with the eyes located at the intersections of the upper blocks and the mouth at the intersection of the lower blocks.

Face recognition on black-and-white images is just as good as recognition on colour images. Moreover, many recognition techniques work on binary features extracted from the face. The logo also shows that the eyes are the most salient features of a human face, immediately capturing the observer's attention, while the hair is not. Finally, it shows that despite the low-resolution, binary representation of the face, humans can still classify it as the face of a man or a woman, and can tell that both images show the same person, even though the age difference between the two images is almost thirty years.
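The eye-centered 24 x 24 block layout can be made concrete with a little arithmetic. The sketch below derives the landmark coordinates from the grid geometry described above; the function and its defaults are illustrative assumptions, not taken from any particular published system.

```python
# Sketch of the canonical eye-centered face layout: a size x size face
# divided into blocks_per_side^2 equal blocks (16 blocks of 6 x 6 pixels
# for the common 24 x 24 case), with the eyes at the intersections of the
# upper blocks and the mouth at the central intersection of the lower blocks.
# Coordinates are (row, col); the function itself is illustrative.

def canonical_landmarks(size=24, blocks_per_side=4):
    """Return approximate eye and mouth positions on the block grid."""
    block = size // blocks_per_side        # 6-pixel blocks for 24 / 4
    # Interior grid lines fall at multiples of the block size: 6, 12, 18.
    left_eye = (block, block)              # (6, 6): upper-left intersection
    right_eye = (block, size - block)      # (6, 18): upper-right intersection
    mouth = (size - block, size // 2)      # (18, 12): central lower intersection
    return {"left_eye": left_eye, "right_eye": right_eye, "mouth": mouth}
```

For the 24 x 24 case this places the eyes 12 pixels apart on the same row, with the mouth centered 12 rows below them, which matches the symmetric 16-block description above.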


Finally, I would like to thank all authors of the submitted papers. With their participation, the First IEEE Workshop on Face Processing in Video has become a real success and an inspiration for future workshops in this new and exciting area of research.


Dmitry O. Gorodnichy, FPIV'04 Program Chair


Copyright 2004