# Probabilistic Fusion for Person Recognition

The second contribution of my Master's thesis was to combine information from face recognition, clothing-based recognition, and speaker recognition (the three recognition modalities) in a probabilistic manner, together with constraints such as the same person not being able to appear twice at the same time. We used Markov Random Fields, which I described almost a year ago when I was just starting to learn what they are! 🙂

Graphical Model for Person Recognition

As seen in the graphical model above, the TV series is first split into scenes, using the special sequences in The Big Bang Theory (TBBT), and then into shots (abrupt camera / view changes). Within each shot, each person track is associated with a random variable vector “ID”, whose entries take values between 0 and 1 depending on the supporting evidence. The node F provides face information (if it exists) and the node C provides clothing information for each person identity track. The speaker information, on the other hand, is captured via a presence node. It says that at least one of the people appearing in the shot should be the speaker, which is true most of the time. The presence node lets us capture any kind of presence information (transcripts, for example), and does not require us to perform speaker detection by analyzing the lip movement of faces.

Finally, within a shot, two identity nodes cannot vote for the same person! This constraint, called the “uniqueness” potential, is captured by the red lines joining each identity node to every other one.

We model all of the above links using energy functions, which are left out here (refer to the paper/thesis for more details). The energy functions push each identity towards the weighted combination of its face and clothing counterparts, while satisfying the constraints imposed by presence (speaker) and uniqueness. The final identities are obtained by energy minimization. The various components can be weighted differently. However, in our experiments we found that varying the weights had an insignificant impact. This is a good thing, since one can choose weights like all equal, or equal for the modalities and double for the penalties, and not experience drastic changes in performance.
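To make the idea concrete, here is a minimal sketch of such an energy for a single shot. It is not the formulation from the paper or thesis: it assumes each track takes one discrete person label (instead of a vector with values between 0 and 1), uses hypothetical names (`shot_energy`, `minimise_energy`), made-up scores, and finds the minimum by brute force, which only works because a shot contains few tracks.

```python
from itertools import product


def shot_energy(labels, face, cloth, speaker,
                w_face=1.0, w_cloth=1.0, w_unique=2.0, w_presence=2.0):
    """Energy of one joint labelling of all tracks in a shot (lower is better).

    All terms and weights are illustrative assumptions, not the paper's.
    """
    energy = 0.0
    # Unary terms: disagreeing with the face or clothing evidence costs energy.
    for track, person in enumerate(labels):
        if face[track] is not None:  # a track may have no detected face
            energy += w_face * (1.0 - face[track][person])
        energy += w_cloth * (1.0 - cloth[track][person])
    # Uniqueness: penalise every pair of tracks that claims the same person.
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if labels[i] == labels[j]:
                energy += w_unique
    # Presence: the speaker should be one of the people visible in the shot.
    if speaker is not None and speaker not in labels:
        energy += w_presence
    return energy


def minimise_energy(face, cloth, speaker, num_people, **weights):
    """Try every labelling and return the one with the lowest energy."""
    candidates = product(range(num_people), repeat=len(cloth))
    return min(candidates,
               key=lambda labels: shot_energy(labels, face, cloth, speaker, **weights))


# Toy shot: two tracks, three possible people (indices 0, 1, 2), person 1 speaks.
face = [[0.7, 0.2, 0.1], None]                 # the second track has no face
cloth = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]]
best = minimise_energy(face, cloth, speaker=1, num_people=3)
unconstrained = minimise_energy(face, cloth, speaker=1, num_people=3,
                                w_unique=0.0, w_presence=0.0)
print(best, unconstrained)
```

In this toy shot, the clothing score alone would label both tracks as person 0. With the uniqueness and presence penalties switched on, the second track moves to person 1, the speaker.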


If the above information is useful to you, please cite
M. Tapaswi, M. Bäuml, and R. Stiefelhagen, “Knock! Knock! Who is it?” Probabilistic Person Identification in TV Series, CVPR 2012. (project page)

# Clothing based Person Recognition in TV Series

As one contribution of my Master's thesis, we developed an approach to the general problem of person identification. In TV series, it is common for actors not to face the camera. They may be seen in profile view, which makes identification quite hard, or even with their back towards the camera, in which case no faces are found at all! Current methods rely on detected faces, so in these cases they cannot identify the person. Nonetheless, we viewers (as humans!) know very well that the person is there, and even who the person is!

Current systems that combine clothing recognition with face recognition usually draw a box below each located face and call that the clothing area. Obviously, if no face is detected, there is no clothing box either, which defeats the whole purpose of locating and identifying people when the face is not visible. To improve on this, we first perform person detection and tracking using poselets.

Another challenge we must tackle is that clothing changes over time in an episode. We thus divide the entire episode into meaningful scenes, during which the actor usually does not change his/her clothing.

Learning Clothing from Faces

The best part of our system is that we do not need to explicitly label the person ID corresponding to the clothing. The first step is to cluster images of people with similar clothing. We do this using agglomerative hierarchical clustering (in Matlab). Next, for person detections that contain detected and recognized faces (some of them do), we use the face recognition results to label clusters with that particular ID. Not all clusters get a label, and the labels that are assigned are not always correct. For the unassigned clusters, each image in the cluster is compared with the ID-assigned clusters, and we associate it with the ID whose clothing is most similar.
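The steps above can be sketched in a few lines of Python. This is only an illustration under several assumptions that the post does not state: clothing descriptors are plain numeric tuples compared with Euclidean distance, clusters are merged with average linkage until a hypothetical `threshold` is exceeded, and a cluster takes the majority face ID of its members. The function names and the toy data are made up.

```python
from collections import Counter


def distance(a, b):
    """Euclidean distance between two clothing descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def cluster_distance(features, c1, c2):
    """Average-linkage distance between two clusters of track indices."""
    total = sum(distance(features[i], features[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative(features, threshold):
    """Merge the two closest clusters until none are closer than threshold."""
    clusters = [[i] for i in range(len(features))]
    while len(clusters) > 1:
        pairs = [(cluster_distance(features, clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        dist, a, b = min(pairs)
        if dist > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def label_tracks(features, face_ids, threshold):
    """Assign an ID to every track from clothing clusters and sparse face IDs."""
    clusters = agglomerative(features, threshold)
    # Label each cluster by the majority vote of its recognized faces, if any.
    cluster_ids = []
    for members in clusters:
        votes = Counter(face_ids[i] for i in members if face_ids[i] is not None)
        cluster_ids.append(votes.most_common(1)[0][0] if votes else None)
    labelled = [k for k, cid in enumerate(cluster_ids) if cid is not None]
    result = [None] * len(features)
    for k, members in enumerate(clusters):
        for i in members:
            if cluster_ids[k] is not None:
                result[i] = cluster_ids[k]
            elif labelled:
                # Unlabelled cluster: use the most similar labelled cluster.
                nearest = min(labelled, key=lambda m: cluster_distance(
                    features, [i], clusters[m]))
                result[i] = cluster_ids[nearest]
    return result


# Toy scene: five person tracks with 2-D "clothing colour" descriptors.
features = [(0.9, 0.1), (0.88, 0.12), (0.1, 0.9), (0.12, 0.88), (0.8, 0.2)]
face_ids = ["sheldon", None, "leonard", "leonard", None]  # None: no face found
ids = label_tracks(features, face_ids, threshold=0.1)
print(ids)
```

Here the second track has no face but shares a cluster with a recognized one, while the last track ends up in a cluster of its own and borrows the ID of the most similar labelled cluster.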

The fascinating part of this process is that the accuracy of clothing-based identification is slightly better than that of the face recognition which was used to assign the IDs in the first place! The conclusion is that the clustering has done a great job.
