November 29

Visual Understanding of Human Activity: Towards Ambient Intelligence

Serena Yeung
Stanford University
3:00 pm to 4:00 pm, D463 (Star)

Abstract: A goal of AI has long been intelligent systems that interact with humans to assist us in every aspect of our lives. Half of this story is creating robotic and autonomous agents. The other half is endowing the physical space and environment around us with ambient intelligence. In this talk I will discuss my work on visual understanding of human activity towards the latter goal. I will present recent work along several directions required for ambient intelligence. The first addresses the dense and detailed action labeling needed for full contextual awareness. The second is a reinforcement learning-based approach to learning policies for efficient action detection, an important factor in embedded vision. And the third is a method for learning new concepts from noisy web videos, towards the fast adaptivity needed for constantly evolving smart environments. Finally, I will discuss the transfer of my work from theory into practice, specifically the implementation of an AI-Assisted Smart Hospital where we have equipped units at two partner hospitals with visual sensors, towards enabling ambient intelligence for assistance with clinical care.

Bio: Serena Yeung is a 5th-year PhD student in the Stanford Vision and Learning Lab, advised by Fei-Fei Li and Arnold Milstein. Her research focuses on developing computer vision algorithms for video understanding and human activity recognition. More broadly, she is passionate about using these algorithms to equip physical spaces with ambient intelligence, in particular a Smart Hospital. Serena is a member of the Stanford Partnership in AI-Assisted Care (PAC), a collaboration between the Stanford School of Engineering and School of Medicine. She interned at Facebook AI Research in 2016 and at Google Cloud AI in 2017. She was also a co-instructor for Stanford's CS231n course on Convolutional Neural Networks for Visual Recognition in 2017.

October 31

Learning a Driving Model from Imperfect Demonstrations

Huazhe (Harry) Xu
UC Berkeley
11:30 am to 12:30 pm, 32-D507

Abstract: Robust real-world learning should benefit from both demonstrations and interaction with the environment. Current approaches to learning from demonstration and reward perform supervised learning on expert demonstration data and use reinforcement learning to further improve performance based on reward from the environment. These tasks have divergent losses which are difficult to jointly optimize; further, such methods can be very sensitive to noisy demonstrations. We propose a unified reinforcement learning algorithm, Normalized Actor-Critic (NAC), that effectively normalizes the Q-function, reducing the Q-values of actions unseen in the demonstration data. NAC learns an initial policy network from demonstration and refines the policy in a real environment. Crucially, both learning from demonstration and interactive refinement use exactly the same objective, unlike prior approaches that combine distinct supervised and reinforcement losses. This makes NAC robust to suboptimal demonstration data, since the method is not forced to mimic all of the examples in the dataset. We show that our unified reinforcement learning algorithm can learn robustly and outperform existing baselines when evaluated on several realistic driving games. A sketch of the normalization idea appears below.

Bio: Huazhe (Harry) Xu is a Ph.D. student advised by Prof. Trevor Darrell in the Berkeley Artificial Intelligence Research Lab (BAIR) at the University of California, Berkeley. He received a B.S.E. degree in Electrical Engineering from Tsinghua University in 2016. His research focuses on computer vision, reinforcement learning, and their applications such as autonomous driving.
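To make the "normalized Q-function" idea concrete, here is a minimal PyTorch sketch of a soft-Q-style objective in which the policy is a softmax over Q-values: maximizing the likelihood of a demonstrated action raises its Q-value while the log-sum-exp normalizer pushes down the Q-values of actions not seen in the demonstrations, and the same loss can be applied to environment interaction data. This is an illustrative approximation, not the authors' NAC implementation; the network shape, temperature alpha, and discount gamma are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNet(nn.Module):
    def __init__(self, obs_dim=32, num_actions=9):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                  nn.Linear(128, num_actions))

    def forward(self, obs):
        return self.body(obs)              # Q(s, .) for each discrete action

def nac_style_loss(qnet, obs, act, rew, next_obs, alpha=0.1, gamma=0.99):
    q = qnet(obs)                                      # [B, A]
    v = alpha * torch.logsumexp(q / alpha, dim=1)      # soft value V(s)
    log_pi = (q.gather(1, act[:, None]).squeeze(1) - v) / alpha
    # "Actor" term: maximize likelihood of the taken action under the
    # softmax-over-Q policy; the normalizer v lowers Q of unseen actions.
    actor_loss = -log_pi.mean()
    # "Critic" term: fit the soft value toward a one-step bootstrap target.
    with torch.no_grad():
        next_q = qnet(next_obs)
        target = rew + gamma * alpha * torch.logsumexp(next_q / alpha, dim=1)
    critic_loss = F.mse_loss(v, target)
    return actor_loss + critic_loss

# The same loss applies to demonstration tuples and to tuples collected by
# interacting with the environment, which is the appeal of a unified objective.
qnet = QNet()
obs = torch.randn(16, 32); act = torch.randint(0, 9, (16,))
rew = torch.randn(16); next_obs = torch.randn(16, 32)
nac_style_loss(qnet, obs, act, rew, next_obs).backward()
```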

September 12

Person Search: A New Research Paradigm

Shuang Li
The Chinese University of Hong Kong
11:00 am to 12:00 pm, Tuesday, September 12, 2017, 32-D463 (Star Room)

Abstract: Automatic person search plays a key role in finding missing people and criminal suspects. However, existing methods are based on manually cropped person images, which are unavailable in the real world. Also, there might be only verbal descriptions of a suspect's appearance in many criminal cases. To improve the practicability of person search in real-world applications, we propose two new branches: (i) finding a target person in a gallery of whole scene images and (ii) using natural language descriptions to search for people.

In this talk, I will first present a joint pedestrian detection and identification network for person search from whole scene images. An Online Instance Matching (OIM) loss function is proposed to train the network, which is scalable to datasets with numerous identities; a sketch of this style of loss appears below. Then, I will talk about natural language based person search. A two-stage framework is proposed to solve this problem. The stage-1 network learns to embed textual and visual features with a Cross-Modal Cross-Entropy (CMCE) loss, while the stage-2 network refines the matching results with a latent co-attention mechanism. In stage-2, the spatial attention relates each word to corresponding image regions, while the latent semantic attention aligns different sentence structures to make the matching results more robust to sentence structure variations. The proposed methods produce state-of-the-art results for person search.

Bio: Shuang Li is an M.Phil. student at the Chinese University of Hong Kong, advised by Prof. Xiaogang Wang. She works in the Multimedia Lab with Prof. Xiaoou Tang. Her research interests include computer vision, natural language processing, and deep learning, especially image-text relationships and person re-identification. She was a research intern at Disney Research, Pittsburgh.
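As a rough illustration of an Online Instance Matching-style loss, the sketch below keeps a non-parametric lookup table with one feature per identity, scores a detected person's feature against every table entry with a temperature-scaled softmax, and refreshes the table with a running average instead of gradient descent, which is what keeps the approach scalable to many identities. The dimensions, momentum, and temperature are assumptions, and the unlabeled-identity queue of the full method is omitted; this is not the paper's implementation.

```python
import torch
import torch.nn.functional as F

class OIMTable:
    def __init__(self, num_ids=5000, feat_dim=256, momentum=0.5, temperature=0.1):
        self.lut = F.normalize(torch.randn(num_ids, feat_dim), dim=1)  # one feature per identity
        self.momentum = momentum
        self.temperature = temperature

    def loss(self, feats, ids):
        feats = F.normalize(feats, dim=1)                  # [B, D] person features
        logits = feats @ self.lut.t() / self.temperature   # similarity to every identity
        return F.cross_entropy(logits, ids)

    @torch.no_grad()
    def update(self, feats, ids):
        # Refresh the table entries of the identities seen in this batch.
        feats = F.normalize(feats, dim=1)
        for f, i in zip(feats, ids):
            self.lut[i] = F.normalize(
                self.momentum * self.lut[i] + (1 - self.momentum) * f, dim=0)

# Toy usage: gradients flow back to the feature extractor, not to the table.
table = OIMTable()
feats = torch.randn(8, 256, requires_grad=True)   # features from the detector head
ids = torch.randint(0, 5000, (8,))                # identity labels
table.loss(feats, ids).backward()
table.update(feats.detach(), ids)
```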

August 14

Directional Field Synthesis, Design, and Processing

Amir Vaxman
Utrecht University
3:00 pm to 4:00 pm, Seminar Room G449 (Patil/Kiva)

Abstract: Directional fields on discrete surfaces and in volumes are key components of geometry processing. Many applications make use of such fields, including remeshing, surface parametrization (texture mapping), texture synthesis, and fluid simulation, among many others. I will present the challenges and limitations in the design and analysis of such fields, and focus on novel ways to compute them.

Bio: Amir Vaxman is a universitair docent (assistant professor) in the Virtual Worlds division of the Department of Information and Computing Sciences at Utrecht University, The Netherlands. Before his position at UU, he was a postdoctoral fellow at TU Wien (Vienna) in the Geometric Modeling and Industrial Geometry group, where he also received the Lise Meitner fellowship. He earned his BSc in Computer Engineering and his PhD in Computer Science from the Technion-IIT. His research interests are geometry processing and discrete differential geometry, focusing on directional-field design, unconventional meshes, constrained shape spaces, architectural geometry, and medical applications.

July 06

Quantifying Interpretability of Deep Learning in Visual Recognition

Bolei Zhou
MIT CSAIL
3:00 pm to 4:00 pm, 32-D463 (Star)

Abstract: We propose a general framework called Network Dissection for quantifying the interpretability of latent representations of deep convolutional neural networks (CNNs) by evaluating the alignment between individual hidden units and a set of semantic concepts. Given any CNN model, the proposed method draws on a broad dataset of visual concepts to score the semantics of hidden units at each intermediate convolutional layer. The units with semantics are given labels across a range of objects, parts, scenes, textures, materials, and colors. We use the proposed method to test the hypothesis that the interpretability of units is equivalent to that of random linear combinations of units, and we then apply our method to compare the latent representations of various networks trained to solve different supervised and self-supervised tasks. We further analyze the effect of training iterations, compare networks trained with different initializations, examine the impact of network depth and width, and measure the effect of dropout and batch normalization on the interpretability of deep visual representations. We demonstrate that the proposed method can shed light on characteristics of CNN models and training methods that go beyond measurements of their discriminative power. A sketch of the unit-concept alignment score appears below. The project page is at http://netdissect.csail.mit.edu.

Bio: Bolei Zhou is a fifth-year Ph.D. candidate in the Computer Science and Artificial Intelligence Laboratory at MIT, working with Prof. Antonio Torralba. His research is on computer vision and machine learning, with particular interest in visual scene understanding and network interpretability. He is a recipient of the Facebook Fellowship, the Microsoft Research Asia Fellowship, and the MIT Greater China Fellowship. More details about his research are available at http://people.csail.mit.edu/bzhou/.
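The scoring idea can be illustrated with a short NumPy sketch: threshold one unit's activation maps at a high quantile of its activation distribution, then measure the intersection-over-union between that binary map and a concept's segmentation masks across a dataset of annotated images. The quantile, the toy inputs, and the omission of upsampling to mask resolution are simplifying assumptions; see the project page above for the actual pipeline.

```python
import numpy as np

def unit_concept_iou(activations, concept_masks, quantile=0.995):
    """activations: [N, H, W] responses of one unit; concept_masks: [N, H, W] binary."""
    # Per-unit threshold chosen so only a small fraction of activations exceed it.
    threshold = np.quantile(activations, quantile)
    unit_masks = activations > threshold
    intersection = np.logical_and(unit_masks, concept_masks).sum()
    union = np.logical_or(unit_masks, concept_masks).sum()
    return intersection / max(union, 1)

# Toy example: a unit that fires in the top-left corner vs. a concept located there.
acts = np.random.rand(100, 7, 7)
acts[:, :3, :3] += 2.0
masks = np.zeros((100, 7, 7), dtype=bool)
masks[:, :3, :3] = True
print("IoU:", unit_concept_iou(acts, masks))   # units with high IoU get the concept label
```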

June 12

A picture of the energy landscape of deep neural networks

Pratik Chaudhari
UCLA
4:00 pm to 5:00 pm, 32-D507

Abstract: Stochastic gradient descent (SGD) is the gold standard of optimization in deep learning. It does not, however, exploit the special structure and geometry of the loss functions we wish to optimize, viz. those of deep neural networks. In this talk, we will focus on the geometry of the energy landscape at local minima with an aim of understanding the generalization properties of deep networks.

In practice, optima discovered by SGD have a large proportion of almost-zero eigenvalues in the Hessian, with very few positive or negative eigenvalues. We will first leverage this observation to construct an algorithm named Entropy-SGD that maximizes a local version of the free energy. Such a loss function favors flat regions of the energy landscape, which are robust to perturbations and hence more generalizable, while simultaneously avoiding sharp, poorly generalizable --- although possibly deep --- valleys. We will discuss connections of this algorithm with belief propagation and robust ensemble learning. Furthermore, we will establish a tight connection between such non-convex optimization algorithms and nonlinear partial differential equations. Empirical validation on CNNs and RNNs shows that Entropy-SGD and related algorithms compare favorably to state-of-the-art techniques in terms of both generalization error and training time. A sketch of the Entropy-SGD update appears below.

arXiv: https://arxiv.org/abs/1611.01838, https://arxiv.org/abs/1704.04932

Bio: Pratik Chaudhari is a PhD candidate in Computer Science at UCLA. With his advisor Stefano Soatto, he focuses on optimization algorithms for deep networks. He holds Master's and Engineer's degrees in Aeronautics and Astronautics from MIT, where he worked on stochastic estimation and randomized motion planning algorithms for urban autonomous driving with Emilio Frazzoli.
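Here is a rough PyTorch sketch of the Entropy-SGD idea: an inner stochastic-gradient-Langevin loop explores the neighborhood of the current weights to estimate the gradient of a local free energy, and the outer update moves the weights toward the average of the inner iterates, which biases training toward flat regions. Step sizes, the scope parameter gamma, the noise scale, and the loop lengths are assumptions; see the arXiv papers above for the actual algorithm and hyperparameters.

```python
import torch

def entropy_sgd_step(model, loss_fn, batches, gamma=0.03, lr=0.1,
                     inner_lr=0.1, noise=1e-4, inner_steps=5, alpha=0.75):
    x = [p.detach().clone() for p in model.parameters()]   # current weights
    mu = [p.clone() for p in x]                             # running average of inner iterates
    for i in range(inner_steps):
        inputs, targets = batches[i % len(batches)]
        model.zero_grad()
        loss_fn(model(inputs), targets).backward()
        with torch.no_grad():
            for p, x0, m in zip(model.parameters(), x, mu):
                # SGLD step: data gradient, a pull toward x, and Gaussian noise.
                p -= inner_lr * (p.grad - gamma * (x0 - p))
                p += noise * torch.randn_like(p)
                m.mul_(alpha).add_(p, alpha=1 - alpha)       # update the average
    with torch.no_grad():
        for p, x0, m in zip(model.parameters(), x, mu):
            p.copy_(x0 - lr * gamma * (x0 - m))              # outer update toward the average

# Toy usage with a small network and random data.
model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2))
batches = [(torch.randn(16, 10), torch.randint(0, 2, (16,))) for _ in range(4)]
entropy_sgd_step(model, torch.nn.functional.cross_entropy, batches)
```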

June 06

Domain Adaptation: from Manifold Learning to Deep Learning

Andreas Savakis
Rochester Institute of Technology
11:00 am to 12:00 pm, 32-D463

Abstract: Domain Adaptation (DA) aims to adapt a classification engine from a training (source) dataset to a test (target) dataset. The goal is to remedy the loss in classification performance due to the dataset bias attributed to variations across the training and test datasets. This seminar presents an overview of domain adaptation methods, from manifold learning to deep learning. Popular DA methods on Grassmann manifolds include Geodesic Subspace Sampling (GSS) and the Geodesic Flow Kernel (GFK). Grassmann learning facilitates compact characterization by generating linear subspaces and representing them as points on the manifold. I will discuss robust versions of these methods that combine L1-PCA and Grassmann manifolds to improve DA performance across datasets.

Deep domain adaptation has received significant attention recently. I will present a new domain adaptation approach for deep learning that utilizes Adaptive Batch Normalization to produce a common feature space between domains. Our method then performs label transfer based on subspace alignment and k-means clustering on the feature manifold, transferring labels from the closest source cluster to each target cluster. The proposed manifold-guided label transfer method produces state-of-the-art results for deep adaptation on digit recognition datasets. A sketch of the Adaptive Batch Normalization step appears below.

Bio: Andreas Savakis is Professor of Computer Engineering at Rochester Institute of Technology (RIT) and director of the Real Time Vision and Image Processing Lab. He served as department head of Computer Engineering from 2000 to 2011. He received B.S. (with Highest Honors) and M.S. degrees in Electrical Engineering from Old Dominion University in Virginia, and a Ph.D. in Electrical and Computer Engineering with a Mathematics minor from North Carolina State University in Raleigh, NC. He was a Senior Research Scientist with the Kodak Research Labs before joining RIT. His research interests include domain adaptation, object tracking, expression and activity recognition, change detection, deep learning, and computer vision applications. Prof. Savakis has co-authored over 100 publications and holds 11 U.S. patents. He received the NYSTAR Technology Transfer Award for Economic Impact in 2006, the IEEE Region 1 Award for Outstanding Teaching in 2011, and the best paper award at the International Symposium on Visual Computing (ISVC) in 2013. He is Associate Editor of the IEEE Transactions on Circuits and Systems for Video Technology and the Journal of Electronic Imaging. He co-organized the first International Workshop on Extreme Imaging (http://extremeimaging.csail.mit.edu/) at ICCV 2015, and is Guest Editor of a Special Issue on Extreme Imaging for the IEEE Transactions on Computational Imaging.
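The Adaptive Batch Normalization ingredient can be sketched in a few lines of PyTorch: keep the network weights and the learned BatchNorm affine parameters, but re-estimate the BatchNorm running statistics on unlabeled target-domain data before evaluating on the target domain. The model, the loader name, and the momentum reset are assumptions; this illustrates the general recipe rather than the speaker's exact method, which adds subspace-alignment and k-means label transfer on top.

```python
import torch
import torch.nn as nn

def adapt_batchnorm_stats(model, target_loader, device="cpu"):
    model.to(device)
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)):
            m.reset_running_stats()       # forget source-domain statistics
            m.momentum = None             # cumulative averaging over target batches
    model.train()                         # BN updates its running stats in train mode
    with torch.no_grad():
        for images, _ in target_loader:   # target labels (if any) are ignored
            model(images.to(device))
    model.eval()
    return model

# Toy usage with a small CNN and random "target domain" batches.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
target_loader = [(torch.randn(16, 3, 32, 32), torch.zeros(16)) for _ in range(5)]
adapt_batchnorm_stats(model, target_loader)
```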

April 13

The lifetime of an object - an object's perspective onto interactions

Dima Damen
University of Bristol
3:00 pm to 4:00 pm, 32-D463 (Star)

Abstract: As opposed to the traditional notion of actions and activities in computer vision, where the motion (e.g. jumping) or the goal (e.g. cooking) is the focus, I will argue for an object-centred perspective onto actions and activities, during daily routine or as part of an industrial workflow. I will present approaches for understanding 'what' objects one interacts with, 'how' these objects have been used, and 'when' interactions take place.

The talk will be divided into three parts. In the first part, I will present unsupervised approaches to the automatic discovery of task-relevant objects and their modes of interaction, as well as to automatically providing guidance on using novel objects through a real-time wearable setup. In the second part, I will introduce supervised approaches to two novel problems: action completion - when an action is attempted but not completed - and expertise determination - who is better at task performance and who is best. In the final part, I will discuss work in progress on uncovering labelling ambiguities in object interaction recognition, including ambiguities in defining the temporal boundaries of object interactions and ambiguities in verb semantics.

Bio: Dima Damen is a Lecturer (Assistant Professor) in Computer Vision at the University of Bristol. She received her PhD from the University of Leeds (2009). Dima's research interests are in the automatic understanding of object interactions, actions, and activities using static and wearable visual (and depth) sensors. Dima co-chaired BMVC 2013, is an area chair for BMVC (2014-2017), and is an associate editor of IET Computer Vision. In 2016, Dima was selected as a Nokia Research collaborator. She currently supervises 7 PhD students and 2 postdoctoral researchers.

February 21

Attention and Activities in First Person Vision

Yin Li
College of Computing - Georgia Tech
11:00 am to 12:00 pm, 32-D463

Abstract: Advances in sensor miniaturization, low-power computing, and battery life have enabled the first generation of mainstream wearable cameras. Millions of hours of video have been captured by these devices, creating a record of our daily visual experiences at an unprecedented scale. This has created a major opportunity to develop new capabilities and products based on First Person Vision (FPV)--the automatic analysis of videos captured from wearable cameras. Meanwhile, vision technology is at a tipping point. Major progress has been made over the last few years in both visual recognition and 3D reconstruction. The stage is set for a grand challenge of activity recognition in FPV. My research focuses on understanding naturalistic daily activities of the camera wearer in FPV to advance both computer vision and mobile health.

In the first part of this talk, I will demonstrate that first person video has the unique property of encoding the intentions and goals of the camera wearer. I will introduce a set of first person visual cues that capture the user's intent and can be used to predict their point of gaze and the actions they are performing during activities of daily living. Our methods are demonstrated using a benchmark dataset that I helped to create. In the second part, I will describe a novel approach to measuring children's social behaviors during naturalistic face-to-face interactions with an adult partner who is wearing a camera. I will show that first person video can support fine-grained coding of gaze (differentiating looks to eyes vs. face), which is valuable for autism research. Going further, I will present a method for automatically detecting moments of eye contact. This is joint work with Zhefan Ye, Sarah Edmunds, Dr. Alireza Fathi, Dr. Agata Rozga, and Dr. Wendy Stone.

Bio: Yin Li is currently a doctoral candidate in the School of Interactive Computing at the Georgia Institute of Technology. His research interests lie at the intersection of computer vision and mobile health. Specifically, he creates methods and systems to automatically analyze first person videos, known as First Person Vision (FPV). He has particular interests in recognizing the person's activities and developing FPV for health care applications. He is the co-recipient of the best student paper awards at MobiHealth 2014 and IEEE Face & Gesture 2015. His work has been covered by MIT Tech Review, WIRED UK, and New Scientist.