Looking at People: The past, the present and the future

Tutorial at CVPR 2012, Providence, Rhode Island, USA


General Information

Thomas B. Moeslund (tbm@create.aau.dk), Aalborg University, Denmark

Leonid Sigal (lsigal@disneyresearch.com), Disney Research, Pittsburgh, USA

Adrian Hilton (A.Hilton@surrey.ac.uk), University of Surrey, UK

Volker Krüger (vok@m-tech.aau.dk), Aalborg University, Denmark

Aaron Bobick, Georgia Tech, USA

Amit Roy Chowdhury, UC Riverside, USA

Jeffrey Cohn, CMU, USA

Rogerio Feris, IBM T.J. Watson Research Center, New York

David Fleet, University of Toronto, Canada

Shaogang Gong, Queen Mary University, UK

Raghuraman Gopalan (on behalf of Rama Chellappa), AT&T Labs-Research, USA

Haowei Liu, Intel Santa Clara

Deva Ramanan, UC Irvine, USA

Fernando De la Torre, CMU, USA

Mohan Trivedi, UC San Diego, USA

Time: June 21st, 2012
Duration: Full-day (~8 hours)
Location: Room 555B

Course Description

Over the course of the last 10-20 years the field of computer vision has been preoccupied with the problem of looking at people. Hundreds, if not thousands, of papers have been published on the subject, spanning face detection, pose estimation, tracking, activity recognition, etc. This tutorial is designed to give an introduction to and assessment of the state of the art in this very active field. The tutorial builds on the book: Visual Analysis of Humans: Looking at People published by Springer in 2012. The book is a collection of chapters written by the top experts in the field; the organizers of the tutorial are also the editors of the upcoming book. The list of contributing authors and content of the book can be found here. The book is intended to serve the dual purpose of being a reference and a tutorial for people entering the field. Because this tutorial is an extension of this idea, it will similarly consist of a series of talks by experts in the corresponding fields. The tutorial will be broken down into 4 parts: (1) detection and tracking, (2) articulated pose estimation and tracking, (3) activity recognition, and (4) applications. In each part we will have 2-3 invited lecturers. Each invited lecturer will give a talk of roughly 35 minutes on a focused subject within the larger context of looking at people. The lectures will be geared towards a general CV audience and will outline the key advances and future challenges in the problems involved. The rough schedule, list of the proposed invited lecturers, and the topics covered are listed below.

Syllabus and Schedule

Below is the syllabus and a rough schedule for the tutorial. 
  • [8:40 - 8:50] Introduction, motivation and welcome remarks by the organizers
    coffee break (30 minutes)
  • [10:30 - 11:40] Articulated pose estimation and tracking
  • [11:40 - 12:15] Activity recognition

lunch (1h 35min)

    coffee break (30 minutes)

Course Materials


Instructor Biographies

Dr. Bobick's research spans a variety of aspects of computer vision. His primary work has focused on video sequences where the imagery varies over time either because of change in camera viewpoint or change in the scene itself. He has published papers addressing many levels of the problem from validating low level optic flow algorithms to constructing multi-representational systems for an autonomous vehicle to the representation and recognition of high level human activities. The current emphasis of his work is on action understanding, where the imagery is of a dynamic scene and the goal is to describe the action or behavior. Three examples are the basic recognition of human movements, natural gesture understanding, and the classification of football plays. Each of these examples requires describing human activity in a manner appropriate for the domain, and developing recognition techniques suitable for those representations.
Recently, Dr. Bobick has also explored the development of interactive environments where advanced sensing modalities provide input based upon the users' actions and, hopefully, intentions. The intriguing element of interactive environments is that the context of the situation can be exploited in the interpretation of the user's behavior. An example of such an environment is the KidsRoom, the world's first interactive narrative play-space for children. The room employed large-scale video and sound to take the children through a fantasy story; all the sensing was accomplished using computer vision. A more current and ambitious project is the Aware Home Research Initiative. The goal of that effort is to impart sufficient perception and interface capabilities to a house such that it can enhance the quality of life of the inhabitants. A domestic setting provides a wealth of contextual information that will be needed to assist in understanding the activities of the people within.

Dr. Roy-Chowdhury leads the Video Computing Group at UCR. His group is studying problems in video analysis with applications in national and homeland security, commercial multimedia and computational biology. The underlying approach of his research is to apply various methods in systems theory, signal processing, machine learning, mathematics and statistics to the analysis of images and videos in order to obtain an understanding of their content. This scientific understanding can lead to machine vision technologies that can provide an automated/semi-automated analysis of the 3D environment from images/videos, analogous to the capabilities of biological visual systems. Currently, the group is focused on multi-agent autonomous camera networks, modeling and recognition of complex behaviors in video, and image-based modeling of biological growth dynamics (specifically in plants). Prof. Roy-Chowdhury is a PI on several grants from the National Science Foundation, Office of Naval Research, Army Research Office, DARPA, and private industries like CISCO and Lockheed-Martin. His recent book on Camera Networks provides an overview of current research in the field. He has served as a program committee member and reviewer in various capacities, organized workshops and special sessions, is an Associate Editor of the IEEE Trans. on Systems, Man and Cybernetics - B and Machine Vision Applications, and a Section Editor of Elsevier’s Electronic Reference on Signal Processing. He has recently co-edited a book on the topic of "Distributed Video Sensor Networks". For more details, please see his CV.

Jeffrey Cohn is Professor of Psychology at the University of Pittsburgh and Adjunct Faculty at the Robotics Institute, Carnegie Mellon University. He received his PhD in psychology from the University of Massachusetts at Amherst.  Dr. Cohn has led interdisciplinary and inter-institutional efforts to develop advanced methods of automatic analysis of facial expression and prosody and applied those tools to research in human emotion, interpersonal processes, social development, and psychopathology.  He co-developed influential databases, Cohn-Kanade, MultiPIE, and Pain Archive, co-edited two recent special issues of Image and Vision Computing on facial expression analysis, and co-chaired the 8th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2008).

Dr. Rogerio Schmidt Feris is currently a research scientist at IBM T.J. Watson Research Center, New York, and an Affiliate Assistant Professor at the University of Washington. He joined IBM in 2006 after receiving a PhD in computer science from the University of California, Santa Barbara. In 2008, he worked as Adjunct Professor at Columbia University. His publications have appeared in major computer vision/graphics conferences and journals, including ICCV, CVPR, SIGGRAPH, and PAMI. He has received several awards, including a recent IBM Master Inventor honor and a prestigious IBM Outstanding Innovation Achievement Award in 2011. For more details, see http://rogerioferis.com

David J Fleet received the PhD in Computer Science from the University of Toronto in 1991. He was on faculty at Queen's University in Kingston from 1991 to 1998, and then Area Manager and Research Scientist at the Palo Alto Research Center (PARC) from 1999 to 2003. In 2004 he joined the University of Toronto as Professor of Computer Science.

His research interests include computer vision, image processing, visual perception, and visual neuroscience. He has published research articles, book chapters and one book on various topics including the estimation of optical flow and stereoscopic disparity, probabilistic methods in motion analysis, modeling appearance in image sequences, motion perception and human stereopsis, hand tracking, human pose tracking, latent variable models, and physics-based models for human motion analysis. In 1996 Dr. Fleet was awarded an Alfred P. Sloan Research Fellowship for his work on computational models of perception. He has won paper awards at ICCV 1999, CVPR 2001, UIST 2003, and BMVC 2009. In 2010 he was awarded the Koenderink Prize for his work with Michael Black and Hedvig Sidenbladh on human pose tracking. He has served as Area Chair for numerous computer vision and machine learning conferences. He was Program Co-chair for the 2003 IEEE Conference on Computer Vision and Pattern Recognition. He will be Program Co-Chair for the 2014 European Conference on Computer Vision. He has been Associate Editor and Associate Editor-in-Chief for IEEE TPAMI, and currently serves on the TPAMI Advisory Board.

Shaogang Gong is Professor of Visual Computation and Head of the Computer Vision Group at Queen Mary University of London. He has published over 250 papers and written 2 books (Visual Analysis of Behaviour: From Pixels to Semantics, Springer, 2011; Dynamic Vision: From Images to Face Recognition, Imperial College Press, 2000). For the last 20 years, he has led numerous UK, EU and US academic, governmental and industrial collaborative projects on developing computer vision systems for public security and safety applications. He served on the UK Government Chief Scientific Adviser Beddington Science Review Steering Panel (2008-2009). He is a founding director and chief scientist of Vision Semantics Limited. He is a Fellow of IEE and BCS, and a member of the UK Computing Research Committee.

Raghuraman Gopalan is a senior member of technical staff at AT&T Labs-Research. He received his Ph.D. in Electrical and Computer Engineering from the University of Maryland, College Park in 2011. His research interests are in computer vision and machine learning, with a specific focus on object recognition and video understanding problems.

Haowei Liu is a research engineer in the Perceptual Computing Group, Intel Santa Clara. He received his PhD degree from the University of Washington in June 2011. During his PhD study he interned at major research organizations, including Intel Lab Seattle and IBM T.J. Watson Research Center. Prior to his PhD study, he was a software design engineer at Microsoft. He holds an MS and BS in Computer Science from University of California, San Diego and National Taiwan University.

Deva Ramanan is an assistant professor of Computer Science and the co-director of the Computational Vision Lab at the University of California at Irvine. Prior to joining UCI, he was a Research Assistant Professor at the Toyota Technological Institute at Chicago (2005-2007). He also held visiting researcher positions in the Robotics Institute at Carnegie Mellon University in 2006 and Microsoft Research in 2008. He received his B.S. degree with distinction in computer engineering from the University of Delaware in 2000, graduating summa cum laude. He received his Ph.D. in Electrical Engineering and Computer Science with a Designated Emphasis in Communication, Computation, and Statistics from UC Berkeley in 2005. His research interests span computer vision, machine learning, and computer graphics, with a focus on the application of understanding people through images and video. His past work focused on articulated tracking, while recent work has focused on object recognition. His work in this area won or received special recognition at the PASCAL Visual Object Class Challenge, 2007-2010, including a Lifetime Achievement Prize in 2010. His work on contextual object modeling won the 2009 David Marr prize. He was awarded an NSF Career Award in 2010. His work is supported by NSF, ONR, DARPA, as well as industrial collaborations with the Intel Science and Technology Center for Visual Computing, Google Research, and Microsoft Research. He serves on the editorial board of the International Journal of Computer Vision (IJCV), is a senior program committee member for the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), and has served on multiple NSF panels for computer vision and machine learning.

Fernando De la Torre is an Associate Research Professor in the Robotics Institute at Carnegie Mellon University. He received his B.Sc. degree in Telecommunications, as well as his M.Sc. and Ph.D. degrees in Electronic Engineering from La Salle School of Engineering at Ramon Llull University, Barcelona, Spain in 1994, 1996, and 2002, respectively. His research interests are in the fields of Computer Vision and Machine Learning. Specifically, he is interested in modeling and recognizing human behavior with a focus on understanding human behavior from multimodal sensors (e.g., video, body sensors). He has done extensive work on facial image analysis (e.g., facial expression recognition, facial feature tracking). In machine learning his interest centers on developing efficient and robust supervised and unsupervised methods to model high-dimensional data. Currently, he is directing the Component Analysis Laboratory (http://ca.cs.cmu.edu) and the Human Sensing Laboratory (http://humansensing.cs.cmu.edu) at Carnegie Mellon University. He has over 100 publications in refereed journals and conferences. He has organized and co-organized several workshops and has given tutorials at international conferences on the use and extensions of Component Analysis.

Mohan Trivedi received his PhD in Electrical Engineering from Utah State University in 1979, after completing undergraduate work in India. At Utah State, he received a Graduate Research Scholarship, and went on to teach at .... He has published extensively and has edited over a dozen volumes including books, special issues, video presentations, and conference proceedings. Trivedi is a recipient of the Pioneer Award and the Meritorious Service Award from the IEEE Computer Society; and the Distinguished Alumnus Award from Utah State University. He is a Fellow of the International Society for Optical Engineering (SPIE). He is a founding member of the Executive Committee of the UC System-wide Digital Media Innovation Program (DiMI). Trivedi is also Editor-in-Chief of Machine Vision & Applications.


Organizer Biographies

Thomas B. Moeslund is Professor of computer vision and head of the Visual Analysis of People Lab at Aalborg University, Denmark. From 2000 to 2003 he acted as a Vision Engineer consultant at the company Thoustrup and Overgaard, Randers, Denmark. Prof. Moeslund's research interests include: Computer vision, Machine vision, Looking at people (human motion capture, gesture recognition, tracking, pose estimation), augmented reality, HCI, computer graphics animations, and multi-modal systems. Prof. Moeslund has been involved in nine national and international research projects, as coordinator, WP leader and researcher. He has published more than 75 peer reviewed journal and conference papers. Awards include a best IEEE paper award, a most cited paper award (from CVIU), and a teacher of the year award. He serves as associate editor/editorial board member of four international journals. He acts as reviewer for all major journals within the field of computer vision and image processing, and has served on a number of program committees. Moreover he has co-chaired five international workshops/tutorials related to human motion analysis. He co-edited a recent CVIU SI on human motion.

Leonid Sigal is a Research Scientist at Disney Research Pittsburgh, in conjunction with Carnegie Mellon University. Prior to this he was a postdoctoral fellow in the Department of Computer Science at University of Toronto. He completed his Ph.D. under the supervision of Prof. Michael J. Black at Brown University in 2008; he received his B.Sc. degrees in Computer Science and Mathematics from Boston University (1999), his M.A. from Boston University (1999), and his M.S. from Brown University (2003). From 1999 to 2001, he worked as a senior vision engineer at Cognex Corporation, where he developed industrial vision applications for pattern analysis and verification. Leonid's research interests mainly lie in the areas of computer vision, machine learning, and computer graphics, but also extend to the bordering fields of psychology and humanoid robotics. He has published more than 30 papers in top venues and journals in computer vision, computer graphics and machine learning (including publications in PAMI, IJCV, CVPR, ICCV, ECCV, NIPS, and ACM SIGGRAPH). His work received the Best Paper Award at the Articulated Motion and Deformable Objects Conference in 2006 (with Prof. Michael J. Black). He acts as reviewer for all major conferences and journals within the fields of computer vision and computer graphics, and has consistently served on program committees for CVPR, ICCV, ECCV, and IJCAI. He co-edited an IJCV special issue on Evaluation of Human Motion and Pose Estimation last year.

Adrian Hilton is Professor of Computer Vision and Graphics and Head of the Visual Media Research Group at the University of Surrey, UK. Over the past decade he has published over 100 refereed journal and international conference research articles on robust computer vision techniques to build models of real world objects from images to meet the requirements of the entertainment and communication industries. His scientific contributions have been recognized by two journal and one conference best paper awards. Innovative contributions of this research, which led to the first commercial hand-held 3D scanner and the first system for capturing animated models of people, have been recognized through two EU IST Awards for Innovation, a DTI Manufacturing Industry Achievement Award and a Computer Graphics World Innovation Award. He currently serves as an area editor for the journal Computer Vision and Image Understanding, and serves on the EPSRC Peer Review College for UK funding applications and the Executive of the IEE Professional Network in Multimedia Communications. He is a Chartered Engineer and member of IEE, IEEE and ACM.

Volker Krüger received his Dipl.-Inf. degree and doctor's degree from Christian-Albrechts-Universität (CAU) Kiel, Germany, in 1997 and 2000, respectively. He was a postdoctoral fellow at the Center for Automation Research at Univ. of Maryland from 2000-2002. Since 2002, Volker Krüger has been Assoc. Prof. at Aalborg University in Denmark. He is with the Computer Vision and Machine Intelligence Lab (CVMI) at the Copenhagen Inst. of Technology (CIT) of Aalborg University. His research focuses on computer vision and robotics based approaches for learning and recognizing human actions and activities.