As the graph above shows, the TV series is first split into scenes using the special sequences in TBBT, and then into shots (abrupt camera / view changes). Within each shot, each person track is associated with a random variable vector “ID”, taking values between 0 and 1 depending on the supporting evidence. The node F provides face information (if it exists) and the node C provides clothing information to each person identity track. The speaker information, on the other hand, is captured via a presence node, which says that at least one of the people appearing in the shot should be the speaker (true most of the time). This lets us capture any kind of presence information (such as transcripts), and does not require us to perform speaker detection by analyzing the lip movement of faces.
Finally, within a shot, two identity nodes cannot vote for the same person! This constraint, called the “uniqueness” potential, is captured by the red lines joining each identity node to the others.
We model all the above links using energy functions, which are left out here (refer to the paper/thesis for more details). The energy functions pull each identity towards a weighted combination of its face and clothing counterparts, while satisfying the constraints imposed by presence (speaker) and uniqueness. The identities are then obtained by energy minimization. The various components can be weighted differently; however, in our experiments we found that varying the weights had an insignificant impact. This is a good thing, since one can choose weights such as all equal, or equal for modalities and double for penalties, and not experience drastic changes in performance.
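Since the actual energy functions are left out here, the following is only a toy sketch of the idea, not the formulation from the paper. It assumes a squared distance between each identity vector and its face and clothing scores, a squared hinge that asks the speaker's total score over all tracks in the shot to reach 1 (presence), and a product penalty between every pair of tracks for each character (uniqueness). The function name, the parameters, the example scores, and the use of projected gradient descent are all illustrative assumptions.

```python
# Toy sketch only: the energy terms, names, and numbers are assumptions
# made for illustration; they are not taken from the paper.

def minimize_shot_energy(face, clothing, speaker, w_face=1.0, w_cloth=1.0,
                         w_presence=1.0, w_unique=1.0, step=0.1, iters=200):
    """Projected gradient descent on a toy energy for the tracks of one shot.

    face[t]     : per-character face scores for track t, or None if no face
    clothing[t] : per-character clothing scores for track t
    speaker     : index of the character known to be present, or None
    Returns one identity vector per track with entries clipped to [0, 1].
    """
    n_tracks, n_chars = len(clothing), len(clothing[0])
    ids = [list(c) for c in clothing]  # initialise from the clothing scores
    for _ in range(iters):
        grads = []
        for t in range(n_tracks):
            row = []
            for k in range(n_chars):
                # Modalities: pull the identity towards clothing and face.
                g = 2.0 * w_cloth * (ids[t][k] - clothing[t][k])
                if face[t] is not None:
                    g += 2.0 * w_face * (ids[t][k] - face[t][k])
                # Presence: reward mass on the speaker until it sums to 1.
                if k == speaker:
                    total = sum(ids[u][k] for u in range(n_tracks))
                    g -= 2.0 * w_presence * max(0.0, 1.0 - total)
                # Uniqueness: penalise sharing character k with other tracks.
                g += w_unique * sum(ids[u][k] for u in range(n_tracks) if u != t)
                row.append(g)
            grads.append(row)
        # Simultaneous update, then projection back onto [0, 1].
        for t in range(n_tracks):
            for k in range(n_chars):
                ids[t][k] = min(1.0, max(0.0, ids[t][k] - step * grads[t][k]))
    return ids


characters = ["Sheldon", "Leonard", "Penny"]
face = [[0.8, 0.1, 0.1], None]                    # track 1 has no visible face
clothing = [[0.6, 0.3, 0.1], [0.45, 0.40, 0.15]]  # track 1 is ambiguous
ids = minimize_shot_energy(face, clothing, speaker=1)
labels = [characters[max(range(len(v)), key=v.__getitem__)] for v in ids]
print(labels)  # ['Sheldon', 'Leonard']
```

On clothing alone, the second track would be labelled Sheldon; the uniqueness and presence terms move it to Leonard. In this toy example, doubling `w_presence` and `w_unique` gives the same labels.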
Sample images of the method… (read the captions for clarification)
If the above information is useful to you, please cite:
M. Tapaswi, M. Bäuml, and R. Stiefelhagen, “Knock! Knock! Who is it?” Probabilistic Person Identification in TV Series, CVPR 2012. (project page)