I am an assistant professor of computer science in the School of Computing and Information Systems (SCIS), Singapore Management University (SMU). Here is my faculty profile. From 2018 to 2019, I was a research fellow working with Prof. Tat-Seng Chua at the National University of Singapore and Prof. Dr. Bernt Schiele at the MPI for Informatics. From 2016 to 2018, I held the Lise Meitner Award Fellowship and worked with Prof. Dr. Bernt Schiele and Prof. Dr. Mario Fritz at the MPI for Informatics. I received my Ph.D. degree from Peking University in 2016, and my thesis was advised by Prof. Hong Liu. In 2014, I visited the research group of Prof. Tatsuya Harada at the University of Tokyo. My research interests are computer vision and machine learning.


  • We are looking for PhD applicants with strong backgrounds in computer science, supported by MOE or AISG. [call][info]
  • We are looking for Postdocs with strong backgrounds in object detection or semantic segmentation.
  • We are looking for local PhD applicants in Industrial Postgraduate Programme (IPP), supported by EDB. [info][call]
  • Three papers respectively about face image clustering, OOD generalization and insufficient data learning are accepted to ECCV '22.
  • Our paper about weakly-supervised semantic segmentation is accepted to CVPR '22.
  • I am awarded "outstanding reviewer" by NeurIPS '21.
  • Two papers respectively about self-supervised learning and class-incremental learning are accepted to NeurIPS '21.
  • Three papers respectively about causal attention, domain adaptation and semantic segmentation are accepted to ICCV '21.
  • Our paper about food image segmentation is accepted to ACM Multimedia '21. [project]
  • I am awarded "Lee Kong Chian Fellow" by SMU.
  • We release a large-scale benchmark for food image segmentation with our pre-trained models using CNN and ViT! [project]
  • I am awarded "outstanding reviewer" by ICLR '21.
  • FoodAI++, a demo of our food image segmentation, is now online. [demo]
  • Two papers respectively about incremental learning and zero-shot learning are accepted to CVPR '21.
  • The 1st workshop of Causality in Vision at CVPR '21. The best paper is awarded a US$1,000 (cash) prize. [homepage]
  • Two papers respectively about semantic segmentation and few-shot learning are accepted to NeurIPS '20.
  • The extended paper of our CVPR'19 work (MTL) is accepted to IEEE Transactions on PAMI.
  • Two papers respectively about semantic segmentation and few-shot learning are accepted to ECCV '20.
  • Our paper about the application of teacher-student networks is accepted to IJCAI '20.
  • We release the code of E3BM (SOTA few-shot learning results and LITTLE overhead costs)! [github]
  • We release the code of Mnemonics Training (SOTA multi-class incremental learning results on ImageNet)! [github]
  • We release the code of VC R-CNN (SOTA image representation on MS-COCO Detection and Open Images)! [github]
  • Two papers respectively about incremental learning (oral presentation) and unsupervised learning are accepted to CVPR '20.
  • We will host the ACM Multimedia Asia '20 conference in Singapore! [homepage]
  • An article about my research is posted on "Research at SMU Nov 2019 Issue". [link]
  • Our paper about semi-supervised few-shot learning is accepted to NeurIPS '19.
  • Our paper about mixed-dish image recognition is accepted to ACM Multimedia '19.
  • Our paper about visual relationship feature augmentation is accepted to BMVC '19.
  • Our paper about few-shot learning is accepted to CVPR '19.

    Ph.D. Students

    Yaoyao Liu
    Since Jun 2018
    (with Bernt Schiele)
    MPI for Informatics yaoyao.liu[at]
    Sicheng Yu
    Since Sep 2019
    (with Jing Jiang)

    Qing Wang
    Since Jan 2021
    (with Chong Wah Ngo)

    Master Students

    Chunhui Bao
    Jan 2020-Dec 2021
    LOH Yi Lin
    Jan 2022-Nov 2022

    Research Fellows/Assistants

    Xin Fu
    Jan 2020-Dec 2020
    Research Assistant
    Beijing Jiaotong University
    Wei Qin
    Nov 2019-May 2021
    Research Assistant
    Hefei University of Technology
    Muhammad Naufal
    Aug 2020-Dec 2020
    Research Student

    Ying Liu
    Aug 2020-Mar 2021
    Research Assistant
    Xiongwei Wu
    Mar 2021-Mar 2022
    (with Ee-Peng Lim)
    Xin Zhao
    Jun 2021-May 2022
    Jilin University

    Harshit Jain
    Aug 2021-Dec 2021
    Research Student
    Fengyun Wang
    Nov 2021-Oct 2022
    Research Assistant
    Nanjing University of Science and Technology
    Ning Han
    Nov 2021-Oct 2022
    (with Ee-Peng Lim)
    Hunan University

    AW Khai Loong
    Jan 2022-May 2022
    Research Assistant

    Selected Conference Publications [Top Venues]


    ECCV2022_CONTEXT Class Is Invariant to Context and Vice Versa: On Learning Invariance for Out-Of-Distribution Generalization
    Jiaxin Qi, Kaihua Tang, Qianru Sun, Xian-Sheng Hua, Hanwang Zhang
    European Conference on Computer Vision 2022, ECCV '22.
    [paper] [code] [appendix]

    Out-Of-Distribution generalization (OOD) is all about learning invariance against environmental changes. If the context in every class is evenly distributed, OOD would be trivial because the context can be easily removed due to an underlying principle: class is invariant to context. However, collecting such a balanced dataset is impractical. Learning on imbalanced data makes the model biased toward context and thus hurts OOD. Therefore, the key to OOD is context balance. We argue that the widely adopted assumption in prior work (that the context bias can be directly annotated or estimated from biased class prediction) renders the context incomplete or even incorrect. In contrast, we point out the ever-overlooked other side of the above principle: context is also invariant to class, which motivates us to consider the classes (which are already labeled) as the varying environments to resolve context bias (without context labels). We implement this idea by minimizing the contrastive loss of intra-class sample similarity while assuring this similarity to be invariant across all classes. On benchmarks with various context biases and domain gaps, we show that a simple re-weighting based classifier equipped with our context estimation achieves state-of-the-art performance. We provide theoretical justifications and source code in the Appendix.

    ECCV2022_ENV Equivariance and Invariance Inductive Bias for Learning from Insufficient Data
    Tan Wang, Qianru Sun, Sugiri Pranata, Karlekar Jayashree, Hanwang Zhang
    European Conference on Computer Vision 2022, ECCV '22.
    [paper] [code] [appendix]

    We are interested in learning robust models from insufficient data, without the need for any externally pre-trained model checkpoints. First, compared to sufficient data, we show why insufficient data renders the model more easily biased to the limited training environments that are usually different from testing. For example, if all the training "swan" samples are "white", the model may wrongly use the "white" environment to represent the intrinsic class "swan". Then, we justify that equivariance inductive bias can retain the class feature while invariance inductive bias can remove the environmental feature, leaving only the class feature that generalizes to any testing environmental changes. To impose them on learning, for equivariance, we demonstrate that any off-the-shelf contrastive-based self-supervised feature learning method can be deployed; for invariance, we propose a class-wise invariant risk minimization (IRM) that efficiently tackles the challenge of missing environmental annotation in conventional IRM. State-of-the-art experimental results on real-world visual benchmarks (NICO and VIPriors ImageNet) validate the great potential of the two inductive biases in reducing training data and parameters significantly.

    ECCV2022_FACE On Mitigating Hard Clusters for Face Clustering
    Yingjie Chen, Huasong Zhong, Chong Chen, Chen Shen, Jianqiang Huang, Tao Wang, Yun Liang, Qianru Sun
    European Conference on Computer Vision 2022, ECCV '22. (Oral Presentation, 2.7%)
    [paper] [code]

    Face clustering is a promising way to scale up face recognition systems using large-scale unlabeled face images. It remains challenging to identify small or sparse face image clusters that we call hard clusters, which is caused by the heterogeneity, i.e., high variations in size and sparsity, of the clusters. Consequently, the conventional way of using a uniform threshold (to identify clusters) often leads to a terrible misclassification for the samples that should belong to hard clusters. We tackle this problem by leveraging the neighborhood information of samples and inferring the cluster memberships (of samples) in a probabilistic way. We introduce two novel modules, Neighborhood-Diffusion-based Density (NDDe) and Transition-Probability-based Distance (TPDi), based on which we can simply apply the standard Density Peak Clustering algorithm with a uniform threshold. Our experiments on multiple benchmarks show that each module contributes to the final performance of our method, and by incorporating them into other advanced face clustering methods, these two modules can boost the performance of these methods to a new state-of-the-art.

    CVPR2022_ReCAM Class Re-Activation Maps for Weakly-Supervised Semantic Segmentation
    Zhaozheng Chen, Tan Wang, Xiongwei Wu, Xian-Sheng Hua, Hanwang Zhang, Qianru Sun
    2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR '22.
    [paper] [code]

    Extracting class activation maps (CAM) is arguably the most standard step of generating pseudo masks for weakly supervised semantic segmentation (WSSS). Yet, we find that the crux of the unsatisfactory pseudo masks is the binary cross-entropy loss (BCE) widely used in CAM. Specifically, due to the sum-over-class pooling nature of BCE, each pixel in CAM may be responsive to multiple classes co-occurring in the same receptive field. To this end, we introduce an embarrassingly simple yet surprisingly effective method: Reactivating the converged CAM with BCE by using softmax cross-entropy loss (SCE), dubbed ReCAM. Given an image, we use CAM to extract the feature pixels of each single class, and use them with the class label to learn another fully-connected layer (after the backbone) with SCE. Once converged, we extract ReCAM in the same way as in CAM.
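
To illustrate the contrast between BCE and SCE described above, here is a minimal sketch in plain Python. It is not the ReCAM implementation: the logits, the helper names (`bce_loss`, `sce_loss`) and the three-class toy setup are assumptions made only for this example. It shows that sigmoid scores let two co-occurring classes both respond strongly, whereas softmax forces them to compete.

```python
import math


def sigmoid(x):
    """Logistic function used by binary cross-entropy."""
    return 1.0 / (1.0 + math.exp(-x))


def bce_loss(logits, targets):
    """Multi-label binary cross-entropy, averaged over classes."""
    total = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(logits)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def sce_loss(logits, target_index):
    """Softmax cross-entropy for a single target class."""
    return -math.log(softmax(logits)[target_index])


# Hypothetical logits for one feature: classes 0 and 1 co-occur, class 2 is absent.
logits = [3.0, 3.0, -2.0]
multi_hot = [1, 1, 0]

bce = bce_loss(logits, multi_hot)  # both present classes score high independently
sce = sce_loss(logits, 0)          # class 0 must now compete with class 1
probs = softmax(logits)
```

With these toy values, BCE is already small because each sigmoid is judged on its own, while SCE still penalizes the feature for responding equally to both classes.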

    ACL2022_TEA Translate-Train Embracing Translationese Artifacts
    Sicheng Yu, Qianru Sun, Hao Zhang, Jing Jiang
    Association for Computational Linguistics, ACL '22.
    [paper] [code]

    Translate-train is a general training approach to multilingual tasks. The key idea is to use the translator of the target language to generate training data to mitigate the gap between the source and target languages. However, its performance is often hampered by the artifacts in the translated texts (translationese). We discover that such artifacts have common patterns in different languages and can be modeled by deep learning, and subsequently propose an approach to conduct translate-train using Translationese Embracing the effect of Artifacts (TEA). TEA learns to mitigate such effect on the training data of a source language (whose original and translationese are both available), and applies the learned module to facilitate the inference on the target language.

    AAAI2022_RED Deconfounded Visual Grounding
    Jianqiang Huang, Yu Qin, Jiaxin Qi, Qianru Sun, Hanwang Zhang
    The 36th AAAI Conference on Artificial Intelligence, AAAI '22. (15%)
    [paper] [code]

    We focus on the confounding bias between language and location in the visual grounding pipeline, where we find that the bias is the major visual reasoning bottleneck. For example, the grounding process is usually a trivial language-location association without visual reasoning, e.g., grounding any language query containing sheep to the nearly central regions, because most queries about sheep have ground-truth locations at the image center. First, we frame the visual grounding pipeline into a causal graph, which shows the causalities among image, query, target location and underlying confounder. Through the causal graph, we know how to break the grounding bottleneck: deconfounded visual grounding. Second, to tackle the challenge that the confounder is unobserved in general, we propose a confounder-agnostic approach called Referring Expression Deconfounder (RED) to remove the confounding bias. Third, we implement RED as a simple language attention, which can be applied in any grounding method.


    NeurIPS2021_IPIRM Self-Supervised Learning Disentangled Group Representation as Feature
    Tan Wang, Zhongqi Yue, Jianqiang Huang, Qianru Sun, Hanwang Zhang
    2021 Conference on Neural Information Processing Systems, NeurIPS '21. (Spotlight Presentation, 3%)
    [paper] [code]

    A good visual representation is an inference map from observations (images) to features (vectors) that faithfully reflects the hidden modularized generative factors (semantics). In this paper, we formulate the notion of "good" representation from a group-theoretic view using Higgins' definition of disentangled representation, and show that existing Self-Supervised Learning (SSL) only disentangles simple augmentation features such as rotation and colorization, thus unable to modularize the remaining semantics. To break the limitation, we propose an iterative SSL algorithm: Iterative Partition-based Invariant Risk Minimization (IP-IRM), which successfully grounds the abstract semantics and the group acting on them into concrete contrastive learning. At each iteration, IP-IRM first partitions the training samples into two subsets that correspond to an entangled group element. Then, it minimizes a subset-invariant contrastive loss, where the invariance guarantees to disentangle the group element. We prove that IP-IRM converges to a fully disentangled representation and show its effectiveness on various benchmarks.

    NeurIPS2021_RMM RMM: Reinforced Memory Management for Class-Incremental Learning
    Yaoyao Liu, Bernt Schiele, Qianru Sun
    2021 Conference on Neural Information Processing Systems, NeurIPS '21.
    [paper] [code]

    Class-Incremental Learning (CIL) trains classifiers under a strict memory budget: in each incremental phase, learning is done for new data, most of which is abandoned to free space for the next phase. The preserved data are exemplars used for replaying. However, existing methods use a static and ad hoc strategy for memory allocation, which is often sub-optimal. In this work, we propose a dynamic memory management strategy that is optimized for the incremental phases and different object classes. We call our method reinforced memory management (RMM), leveraging reinforcement learning. RMM training is not naturally compatible with CIL as the past and future data are strictly non-accessible during the incremental phases. We solve this by training the policy function of RMM on pseudo CIL tasks, e.g., the tasks built on the data of the 0-th phase, and then applying it to target tasks. RMM propagates two levels of actions: Level-1 determines how to split the memory between old and new classes, and Level-2 allocates memory for each specific class. In essence, it is an optimizable and general method for memory management that can be used in any replaying-based CIL method.
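
The two levels of actions can be pictured with a small sketch. This is only an illustration under stated assumptions: in RMM the split ratio and per-class weights come from a learned policy, whereas here they are passed in by hand, and the names `allocate_memory`, `old_ratio` and the class weights are hypothetical.

```python
def _split(budget, weights):
    """Divide an integer budget across classes in proportion to their weights."""
    total = sum(weights.values())
    shares = {c: int(budget * w / total) for c, w in weights.items()}
    # Hand any slots lost to rounding down to the highest-weighted classes.
    leftover = budget - sum(shares.values())
    for c in sorted(weights, key=weights.get, reverse=True)[:leftover]:
        shares[c] += 1
    return shares


def allocate_memory(total_budget, old_ratio, old_weights, new_weights):
    """Level-1 splits the budget between old and new classes;
    Level-2 allocates each part to the individual classes."""
    old_budget = int(round(total_budget * old_ratio))
    new_budget = total_budget - old_budget
    return _split(old_budget, old_weights), _split(new_budget, new_weights)


# Hypothetical actions: 60% of 20 exemplar slots go to old classes.
old_alloc, new_alloc = allocate_memory(
    total_budget=20,
    old_ratio=0.6,
    old_weights={"cat": 1, "dog": 1, "bird": 2},
    new_weights={"car": 1, "bus": 3},
)
```

The allocation always sums to the total budget, so the strict memory constraint of CIL is respected whatever ratio and weights are chosen.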

    ACL2021_COSY COSY: COunterfactual SYntax for Cross-Lingual Understanding
    Sicheng Yu, Hao Zhang, Yulei Niu, Qianru Sun, Jing Jiang
    Association for Computational Linguistics, ACL '21.

    Pre-trained multilingual language models, e.g., multilingual-BERT, are widely used in cross-lingual tasks, yielding state-of-the-art performance. However, such models suffer from a large performance gap between source and target languages, especially in the zero-shot setting, where the models are fine-tuned only on English but tested on other languages for the same task. We tackle this issue by incorporating language-agnostic information, specifically, universal syntax such as dependency relations and POS tags, into language models, based on the observation that universal syntax is transferable across different languages. Our approach, named COunterfactual SYntax (COSY), includes the design of SYntax-aware networks as well as a COunterfactual training method to implicitly force the networks to learn not only the semantics but also the syntax.

    ICCV2021_CaaM Causal Attention for Unbiased Visual Recognition
    Tan Wang, Chang Zhou, Qianru Sun, Hanwang Zhang
    International Conference on Computer Vision, ICCV '21.
    [paper] [code]

    An attention module does not always help deep models learn causal features that are robust in any confounding context, e.g., a foreground object feature is invariant to different backgrounds. This is because the confounders trick the attention to capture spurious correlations that benefit the prediction when the training and testing data are IID, but harm the prediction when the data are OOD. The sole fundamental solution to learn causal attention is by causal intervention, which requires additional annotations of the confounders, e.g., a "dog" model is learned within "grass+dog" and "road+dog" respectively, so the "grass" and "road" contexts will no longer confound the "dog" recognition. However, such annotation is not only prohibitively expensive, but also inherently problematic, as the confounders are elusive in nature. In this paper, we propose a causal attention module (CaaM) that self-annotates the confounders in an unsupervised fashion. In particular, multiple CaaMs can be stacked and integrated in conventional attention CNN and self-attention Vision Transformer. In OOD settings, deep models with CaaM outperform those without it significantly; even in IID settings, the attention localization is also improved by CaaM, showing a great potential in applications that require robust visual saliency.

    ICCV2021_TCM Transporting Causal Mechanisms for Unsupervised Domain Adaptation
    Zhongqi Yue, Qianru Sun, Xian-Sheng Hua, Hanwang Zhang
    International Conference on Computer Vision, ICCV '21. (Oral Presentation, 3%)
    [paper] [code]

    Existing Unsupervised Domain Adaptation (UDA) literature adopts the covariate shift and conditional shift assumptions, which essentially encourage models to learn common features across domains. However, due to the lack of supervision in the target domain, they suffer from the semantic loss: the feature will inevitably lose semantics that are non-discriminative in the source domain but discriminative in the target domain. We use a causal view, transportability theory, to identify that such loss is in fact a confounding effect, which can only be removed by causal intervention. However, the theoretical solution provided by transportability is far from practical for UDA, because it requires the stratification and representation of the unobserved confounder that is the cause of the domain gap. To this end, we propose a practical solution: Transporting Causal Mechanisms (TCM), to identify the confounder stratum and representations by using the domain-invariant disentangled causal mechanisms, which are discovered in an unsupervised fashion.

    ICCV2021_SR Self-Regulation for Semantic Segmentation
    Dong Zhang, Hanwang Zhang, Jinhui Tang, Xian-Sheng Hua, Qianru Sun
    International Conference on Computer Vision, ICCV '21.
    [paper] [code]

    In this paper, we seek reasons for the two major failure cases in Semantic Segmentation (SS): 1) missing small objects or minor object parts, and 2) mislabeling minor parts of large objects as wrong classes. We have an interesting finding that Failure-1 is due to the underuse of detailed features and Failure-2 is due to the underuse of visual contexts. To help the model learn a better trade-off, we introduce several Self-Regulation (SR) losses for training SS neural networks. By “self”, we mean that the losses are from the model per se without using any additional data or supervision. By applying the SR losses, the deep layer features are regulated by the shallow ones to preserve more details; meanwhile, shallow layer classification logits are regulated by the deep ones to capture more semantics. We conduct extensive experiments on both weakly and fully supervised SS tasks, and the results show that our approach consistently surpasses the baselines.

    MM2021_FoodSeg A Large-Scale Benchmark for Food Image Segmentation
    Xiongwei Wu, Xin Fu, Ying Liu, Ee-Peng Lim, Steven C.H. Hoi, Qianru Sun
    The 29th ACM International Conference on Multimedia, ACM MM '21. (Main Track)
    [paper] [project] [challenge]

    Food image segmentation is a critical and indispensable task for developing health-related applications such as estimating food calories and nutrients. Existing food image segmentation models are underperforming for two reasons: (1) there is a lack of high-quality food image datasets with fine-grained ingredient labels and pixel-wise location masks, as the existing datasets either carry coarse ingredient labels or are small in size; and (2) the complex appearance of food makes it difficult to localize and recognize ingredients in food images, e.g., the ingredients may overlap one another in the same image, and the identical ingredient may appear distinctly in different food images. In this work, we build a new food image dataset FoodSeg103 (and its extension FoodSeg154) containing 9,490 images. We annotate these images with 154 ingredient classes and each image has an average of 6 ingredient labels and pixel-wise masks. In addition, we propose a multi-modality pre-training approach called ReLeM that explicitly equips a segmentation model with rich and semantic food knowledge.

    CVPR2021_AANets Adaptive Aggregation Networks for Class-Incremental Learning
    Yaoyao Liu, Bernt Schiele, Qianru Sun
    2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR '21.
    [paper] [supp] [code]

    Class-Incremental Learning (CIL) aims to learn a classification model with the number of classes increasing phase-by-phase. An inherent problem in CIL is the stability-plasticity dilemma between the learning of old and new classes, i.e., high-plasticity models easily forget old classes, but high-stability models are weak at learning new classes. We alleviate this issue by proposing a novel network architecture called Adaptive Aggregation Networks (AANets) in which we explicitly build two types of residual blocks at each residual level (taking ResNet as the baseline architecture): a stable block and a plastic block. We aggregate the output feature maps from these two blocks and then feed the results to the next-level blocks. We adapt the aggregation weights in order to balance these two types of blocks, i.e., to balance stability and plasticity, dynamically.
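
The aggregation step at one residual level can be sketched as a weighted sum of the two block outputs. This is a toy illustration, not the AANets code: the blocks are stand-in functions on a flat list of numbers, and the names `aggregate_level`, `alpha_stable` and `alpha_plastic` are assumptions for this example (in the paper the aggregation weights are adapted during training).

```python
def aggregate_level(x, stable_block, plastic_block, alpha_stable, alpha_plastic):
    """Run both blocks on the same input and mix their outputs element-wise."""
    stable_out = stable_block(x)
    plastic_out = plastic_block(x)
    return [alpha_stable * s + alpha_plastic * p
            for s, p in zip(stable_out, plastic_out)]


# Toy stand-ins for the two residual blocks (hypothetical).
def stable_block(x):
    return [0.5 * v for v in x]


def plastic_block(x):
    return [2.0 * v for v in x]


features = [1.0, 2.0, 4.0]
# A larger stable weight favors old-class knowledge; a larger plastic weight favors new classes.
out = aggregate_level(features, stable_block, plastic_block,
                      alpha_stable=0.75, alpha_plastic=0.25)
```

The result `out` would then be fed to the next level, where the same mixing is repeated with that level's own pair of weights.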

    CVPR2021_GCM Counterfactual Zero-Shot and Open-Set Visual Recognition
    Zhongqi Yue, Tan Wang, Qianru Sun, Xian-Sheng Hua, Hanwang Zhang
    2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR '21.
    [paper] [supp] [code]

    We present a novel counterfactual framework for both Zero-Shot Learning and Open-Set Recognition, whose common challenge is generalizing to the unseen-classes by only training on the seen-classes. Our idea stems from the observation that the generated samples for unseen-classes are often out of the true distribution, which causes severe recognition rate imbalance between the seen-class (high) and unseen-class (low). We show that the key reason is that the generation is not Counterfactual Faithful, and thus we propose a faithful one, whose generation is from the sample-specific counterfactual question: What would the sample look like, if we set its class attribute to a certain class, while keeping its sample attribute unchanged? Thanks to the faithfulness, we can apply the Consistency Rule to perform unseen/seen binary classification, by asking: Would its counterfactual still look like itself? If "yes", the sample is from a certain class, and "no" otherwise.


    NeurIPS2020_CONTA Causal Intervention for Weakly-Supervised Semantic Segmentation
    Dong Zhang, Hanwang Zhang, Jinhui Tang, Xian-Sheng Hua, Qianru Sun
    Neural Information Processing Systems, NeurIPS '20. (Oral Presentation, 1.1%)
    [paper] [code]

    We present a causal inference framework to improve Weakly-Supervised Semantic Segmentation (WSSS). Specifically, we aim to generate better pixel-level pseudo-masks by using only image-level labels -- the most crucial step in WSSS. We attribute the cause of the ambiguous boundaries of pseudo-masks to the confounding context, e.g., the correct image-level classification of "horse" and "person" may be not only due to the recognition of each instance, but also their co-occurrence context, making the model inspection (e.g., CAM) hard to distinguish between the boundaries. Inspired by this, we propose a structural causal model to analyze the causalities among images, contexts, and class labels. Based on it, we develop a new method: Context Adjustment (CONTA), to remove the confounding bias in image-level classification and thus provide better pseudo-masks as ground-truth for the subsequent segmentation model.

    NeurIPS2020_IFSL Interventional Few-Shot Learning
    Zhongqi Yue, Hanwang Zhang, Qianru Sun, Xian-Sheng Hua
    Neural Information Processing Systems, NeurIPS '20.
    [paper] [code]

    We uncover an ever-overlooked deficiency in the prevailing Few-Shot Learning (FSL) methods: the pre-trained knowledge is indeed a confounder that limits the performance. This finding is rooted in our causal assumption: a Structural Causal Model (SCM) for the causalities among the pre-trained knowledge, sample features, and labels. Thanks to it, we propose a novel FSL paradigm: Interventional Few-Shot Learning (IFSL). Specifically, we develop three effective IFSL algorithmic implementations based on the backdoor adjustment, which is essentially a causal intervention towards the SCM of many-shot learning: the upper-bound of FSL in a causal view. It is worth noting that the contribution of IFSL is orthogonal to existing fine-tuning and meta-learning based FSL methods, hence IFSL can improve all of them, achieving a new 1-/5-shot state-of-the-art.
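
For reference, the backdoor adjustment mentioned above has the following general form. The symbol D is an assumption of this note: it stands for a stratification of the confounder (here, the pre-trained knowledge). How D is stratified differs between the three implementations and is not shown.

```latex
P\big(Y \mid do(X)\big) \;=\; \sum_{d} P\big(Y \mid X, D = d\big)\, P(D = d)
```

In words, the prediction is computed within each stratum of the confounder and then averaged with the prior weight of that stratum, so that the confounder can no longer bias the link between X and Y.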

    ECCV2020_FPT Feature Pyramid Transformer
    Dong Zhang, Hanwang Zhang, Jinhui Tang, Meng Wang, Xian-Sheng Hua, Qianru Sun
    European Conference on Computer Vision, ECCV '20.
    [paper] [code]

    Feature interactions across space and scales underpin modern visual recognition systems because they introduce beneficial visual contexts. Conventionally, spatial contexts are passively hidden in the CNN's increasing receptive fields or actively encoded by non-local convolution. Yet, the non-local spatial interactions are not across scales, and thus they fail to capture the non-local contexts of objects (or parts) residing in different scales. To this end, we propose a fully active feature interaction across both space and scales, called Feature Pyramid Transformer. It transforms any feature pyramid into another feature pyramid of the same size but with richer contexts, by using three specially designed transformers in self-level, top-down, and bottom-up interaction fashion. FPT serves as a generic visual backbone with fair computational overhead.

    ECCV2020_E3BM An Ensemble of Epoch-wise Empirical Bayes for Few-shot Learning
    Yaoyao Liu, Bernt Schiele, Qianru Sun
    European Conference on Computer Vision, ECCV '20.
    [paper] [code]

    Few-shot learning aims to train efficient predictive models with a few examples. The lack of training data leads to poor models that perform high-variance or low-confidence predictions. In this paper, we propose to meta-learn the ensemble of epoch-wise empirical Bayes models (E3BM) to achieve robust predictions. "Epoch-wise" means that each training epoch has a Bayes model whose parameters are specifically learned and deployed. "Empirical" means that the hyperparameters, e.g., used for learning and ensembling the epoch-wise models, are generated by hyperprior learners conditional on task-specific data. We introduce four kinds of hyperprior learners by considering inductive vs. transductive, and epoch-dependent vs. epoch-independent, in the paradigm of meta-learning. Our ablation study shows that both "epoch-wise ensemble" and "empirical" encourage high efficiency and robustness in the model performance.

    CVPR2020_Mnemonics Mnemonics Training: Multi-Class Incremental Learning without Forgetting
    Yaoyao Liu, Yuting Su, An-An Liu, Bernt Schiele, Qianru Sun
    The 33rd Conference on Computer Vision and Pattern Recognition, CVPR '20. (Oral Presentation, 4%)
    [paper] [supp.] [video] [code]

    Multi-Class Incremental Learning aims to learn new concepts by incrementally updating a model trained on previous concepts. However, there is an inherent trade-off to effectively learning new concepts without catastrophic forgetting of previous ones. To alleviate this issue, it has been proposed to keep around a few examples of the previous concepts but the effectiveness of this approach heavily depends on the representativeness of these examples. This paper proposes a novel and automatic framework we call mnemonics, where we parameterize exemplars and make them optimizable in an end-to-end manner. We train the framework through bilevel optimizations, i.e., model-level and exemplar-level. We conduct extensive experiments on three MCIL benchmarks. Interestingly and quite intriguingly, the mnemonics exemplars tend to be on the boundaries between classes.

    CVPR2020_VC-RCNN Visual Commonsense R-CNN
    Tan Wang, Jianqiang Huang, Hanwang Zhang, Qianru Sun
    The 33rd Conference on Computer Vision and Pattern Recognition, CVPR '20.
    [paper] [supp.] [video] [code]

    We present a novel unsupervised feature representation learning method, Visual Commonsense Region-based Convolutional Neural Network (VC R-CNN), to serve as an improved visual region encoder for high-level tasks such as captioning and VQA. Given a set of detected object regions in an image (e.g., by Faster R-CNN), like any other unsupervised feature learning methods (e.g., word2vec), the proxy training objective of VC R-CNN is to predict the contextual objects of a region. However, they are fundamentally different: the prediction of VC R-CNN is by causal intervention: P(Y|do(X)), while others are by the conventional likelihood: P(Y|X). This is also the core reason why VC R-CNN can learn "sense-making" knowledge (e.g., a "chair" can be sat on) rather than just common co-occurrences (e.g., a "chair" is likely to exist if a "table" is observed).


    NeurIPS2019_LST Learning to Self-Train for Semi-Supervised Few-Shot Classification
    Xinzhe Li, Qianru Sun, Yaoyao Liu, Shibao Zheng, Tat-Seng Chua, Bernt Schiele
    The 33rd Annual Conference on Neural Information Processing Systems, NeurIPS'19.
    [paper] [slides] [poster] [code]

    Few-shot classification (FSC) is challenging due to the scarcity of labeled training data (e.g., only one labeled data point per class). Meta-learning has been shown to achieve promising results by learning to initialize a classification model for FSC. In this paper we propose a novel semi-supervised meta-learning method called learning to self-train (LST) that leverages unlabeled data and specifically meta-learns how to cherry-pick and label such unsupervised data to further improve performance. To this end, we train the LST model through a large number of semi-supervised few-shot tasks. On each task, we train a few-shot model to predict pseudo labels for unlabeled data, and then iterate the self-training steps on labeled and pseudo-labeled data with each step followed by fine-tuning. We additionally learn a soft weighting network (SWN) to optimize the self-training weights of pseudo labels so that better ones can contribute more to gradient descent optimization.

    Meta-Transfer Learning for Few-Shot Learning
    Qianru Sun, Yaoyao Liu, Tat-Seng Chua, Bernt Schiele
    The 32nd Conference on Computer Vision and Pattern Recognition, CVPR'19.
    [paper] [poster] [code]

    Meta-learning has been proposed as a framework to address the challenging few-shot learning setting. The key idea is to leverage a large number of similar few-shot tasks in order to learn how to adapt a base-learner to a new task for which only a few labeled samples are available. As deep neural networks (DNNs) tend to overfit using a few samples only, meta-learning typically uses shallow neural networks (SNNs), thus limiting its effectiveness. In this paper, we propose a novel few-shot learning method called meta-transfer learning (MTL), which learns to adapt a deep NN for few-shot learning tasks. Specifically, meta refers to training on multiple tasks, and transfer is achieved by learning scaling and shifting functions of DNN weights for each task. In addition, we introduce the hard task (HT) meta-batch scheme as an effective learning curriculum for MTL.
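
    The idea of adapting a network through scaling and shifting while keeping its weights frozen can be sketched on a single linear layer. The parameterization below (one scale per output unit multiplying the frozen weights, one shift added to the frozen bias) is an assumption chosen for illustration, as are the function names and the plain gradient-descent loop; it is not the MTL code.

```python
def forward(x, W, b, scale, shift):
    """y_j = scale_j * (W_j . x) + (b_j + shift_j); W and b stay frozen."""
    return [scale[j] * sum(W[j][i] * x[i] for i in range(len(x))) + b[j] + shift[j]
            for j in range(len(W))]


def mse(data, W, b, scale, shift):
    """Mean over samples of the summed squared error."""
    total = 0.0
    for x, t in data:
        y = forward(x, W, b, scale, shift)
        total += sum((yj - tj) ** 2 for yj, tj in zip(y, t))
    return total / len(data)


def adapt(data, W, b, steps=200, lr=0.3):
    """Learn only the per-unit scale and shift for one task by gradient descent."""
    scale = [1.0] * len(W)   # identity scaling leaves the frozen layer unchanged
    shift = [0.0] * len(W)   # zero shift leaves the frozen bias unchanged
    for _ in range(steps):
        g_scale = [0.0] * len(W)
        g_shift = [0.0] * len(W)
        for x, t in data:
            y = forward(x, W, b, scale, shift)
            for j in range(len(W)):
                pre = sum(W[j][i] * x[i] for i in range(len(x)))
                err = 2.0 * (y[j] - t[j]) / len(data)
                g_scale[j] += err * pre   # d(loss)/d(scale_j)
                g_shift[j] += err         # d(loss)/d(shift_j)
        scale = [s - lr * g for s, g in zip(scale, g_scale)]
        shift = [s - lr * g for s, g in zip(shift, g_shift)]
    return scale, shift
```

    Only two numbers per output unit are learned for each task, which is far fewer than the full weight matrix. This is the intuition for why such an adaptation is less prone to overfitting on a handful of samples.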


    A Hybrid Model for Identity Obfuscation by Face Replacement
    Qianru Sun, Ayush Tewari, Weipeng Xu, Mario Fritz, Christian Theobalt, Bernt Schiele
    The 15th European Conference on Computer Vision, ECCV'18.
    [paper] [decoder code]

    As more and more personal photos are shared and tagged in social media, avoiding privacy risks such as unintended recognition becomes increasingly challenging. We propose a new hybrid approach to obfuscate identities in photos by head replacement. Our approach combines state-of-the-art parametric face synthesis with the latest advances in Generative Adversarial Networks (GANs) for data-driven image synthesis. On the one hand, the parametric part of our method gives us control over the facial parameters and allows for explicit manipulation of the identity. On the other hand, the data-driven aspects allow for adding fine details and overall realism as well as seamless blending into the scene context. In our experiments, we show highly realistic output of our system that improves over the previous state of the art in obfuscation rate while preserving a higher similarity to the original image content.

    Disentangled Person Image Generation
    Liqian Ma, Qianru Sun, Stamatios Georgoulis, Luc Van Gool, Bernt Schiele, Mario Fritz
    2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR'18. (Spotlight Presentation)
    [paper] [code] [project]

    Generating novel, yet realistic, images of persons is a challenging task due to the complex interplay between the different image factors, such as the foreground, background and pose information. In this work, we aim at generating such images based on a novel, two-stage reconstruction pipeline that learns a disentangled representation of the aforementioned image factors and generates novel person images at the same time. First, a multi-branched reconstruction network is proposed to disentangle and encode the three factors into embedding features, which are then combined to re-compose the input image itself. Second, three corresponding mapping functions are learned in an adversarial manner in order to map Gaussian noise to the learned embedding feature space, for each factor, respectively. Using the proposed framework, we can manipulate the foreground, background and pose of the input image, and also sample new embedding features to generate such targeted manipulations, that provide more control over the generation process. Experiments on the Market-1501 and Deepfashion datasets show that our model does not only generate realistic person images with new foregrounds, backgrounds and poses, but also manipulates the generated factors and interpolates the in-between states. Another set of experiments on Market-1501 shows that our model can also be beneficial for the person re-identification task.

    Natural and Effective Obfuscation by Head Inpainting
    Qianru Sun, Liqian Ma, Seong Joon Oh, Luc Van Gool, Bernt Schiele, Mario Fritz
    2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR'18.
    [paper] [decoder code] [PDM code]

    As more and more personal photos are shared online, being able to obfuscate identities in such photos is becoming a necessity for privacy protection. People have largely resorted to blacking out or blurring head regions, but these methods result in poor user experience while being surprisingly ineffective against state-of-the-art person recognizers. In this work, we propose a novel head inpainting obfuscation technique. Generating a realistic head inpainting in social media photos is challenging because subjects appear in diverse activities and head orientations. We thus split the task into two sub-tasks: (1) facial landmark generation from image context (e.g., body pose) for seamless hypothesis of sensible head pose, and (2) facial landmark conditioned head inpainting. We verify that our inpainting method generates realistic person images, while achieving superior obfuscation performance against automatic person recognizers.


    Pose Guided Person Image Generation
    Liqian Ma, Xu Jia, Qianru Sun, Bernt Schiele, Tinne Tuytelaars, Luc Van Gool
    The 31st Annual Conference on Neural Information Processing Systems, NIPS'17.
    [paper] [slides] [code]

    This paper proposes the novel Pose Guided Person Generation Network (PG2), which synthesizes person images in arbitrary poses, based on an image of that person and a novel pose. Our generation framework PG2 utilizes the pose information explicitly and consists of two key stages: pose integration and image refinement. In the first stage, the condition image and the target pose are fed into a U-Net-like network to generate an initial but coarse image of the person with the target pose. The second stage then refines the initial and blurry result by training a U-Net-like generator in an adversarial way. Extensive experimental results on both 128×64 re-identification images and 256×256 fashion photos show that our model generates high-quality person images with convincing details.

    A Domain Based Approach to Social Relation Recognition
    Qianru Sun, Bernt Schiele, Mario Fritz
    2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR'17.
    [paper] [code] [project]

    Social relations are the foundation of human daily life. Developing techniques to analyze such relations from visual data bears great potential to build machines that better understand us and are capable of interacting with us at a social level. Previous investigations have remained partial due to the overwhelming diversity and complexity of the topic and consequently have only focused on a handful of social relations. In this paper, we argue that the domain-based theory from social psychology is a great starting point to systematically approach this problem. The theory provides coverage of all aspects of social relations and is equally concrete and predictive about the visual attributes and behaviors defining the relations included in each domain. We provide the first dataset built on this holistic conceptualization of social life that is composed of a hierarchical label space of social domains and social relations. We also contribute the first models to recognize such domains and relations and find superior performance for attribute-based features. Beyond the encouraging performance of the attribute-based approach, we also find interpretable features that are in accordance with the predictions from social psychology literature. Beyond our findings, we believe that our contributions more tightly interleave visual recognition and social psychology theory, which has the potential to complement the theoretical work in the area with empirical and data-driven models of social life.

    Full Publications

    Awards and Funding

  • Aug 2022, DSO Research Grant (DSO National Laboratories)
  • Oct 2021, Outstanding Reviewer (NeurIPS 2021)
  • Aug 2021, Alibaba Innovative Research Grant (Alibaba Group)
  • Jul 2021, Lee Kong Chian Fellowship (SMU)
  • Mar 2021, Outstanding Reviewer (ICLR 2021)
  • Nov 2020, Young Independent Research Grant (A*STAR)
  • Feb 2020, Alibaba Innovative Research Grant (Alibaba Group)
  • Mar 2016, Lise-Meitner Award for Excellent Women in Computer Science (MPI for Informatics)

    Qianru's Services

  • Pattern Recognition: Associate Editor
  • BMVC'22, AAAI'22, CVPR'22: Area Chair
  • Causality in Vision @ CVPR'21 and @ ECCV'22, Organization Committee
  • ACM MM'21 (Chengdu), Organization Committee (Proceeding Co-Chair)
  • ACM MM Asia'20 (Singapore), Organization Committee (Program Co-Chair)
  • ICML'21-, ICLR'21-, NeurIPS'20-, ECCV'20-, IJCAI'20-, AAAI'20-, CVPR'18-, ICCV'17-: Reviewer
  • IEEE Trans on PAMI/TMM/TCSVT/TIP/NNLS, IJCV, PR, PR Letters, Reviewer
  • Lise Meitner Award (MPII) 2018, Organization Committee

    Qianru's Talks

  • May 2022, Keynote, ICLR 2022 Workshop of Objects, Structure, and Causality (OSC). "Learning Invariance from Insufficient Data" [slides]
  • Jul 2020, CSIAM Big Data & AI Forum. "Learning to Learn" [slides]
  • Jan 2018, ICMR 2018 Tutorial. "Objects, Relationships, and Context in Visual Data" [slides]
  • Dec 2017, DVMM Lab at Columbia University. "Pose Guided Person Image Generation" [slides]
  • Jul 2017, Keynote, CVPR 2017 ODAR Workshop. "Domain Based Social Relation Recognition" [slides]
  • Jul 2017, MPII & Saarland University. "Your Photos Expose Your Social Life" [slides]

    Group Seminars

  • 29 Apr 2022, AW Khai Loong. "Unsupervised Semantic Segmentation" [slides]
  • 22 Apr 2022, Ning Han. "Cross-Modal Video Retrieval" [slides]
  • 8 Apr 2022, Fengyun Wang. "Semantic Segmentation in RGB-D Data" [slides]
  • 25 Mar 2022, Zhaozheng Chen. "Weakly Supervised Semantic Segmentation" [slides]
  • 18 Mar 2022, Zilin Luo. "Class-Incremental Learning" [slides]
  • 11 Mar 2022, Sicheng Yu. "Masked Autoencoders" [slides]
  • 25 Feb 2022, Xin Zhao. "Source-Free Domain Adaptation" [slides]
  • 18 Feb 2022, Qing Wang. "Long-Tailed Recognition" [slides]
  • 11 Feb 2022, Yaoyao Liu. "Decoupling the representation learning and the classifier" [slides]
  • 2021 and before, not public

    Collaborations

  • MReal Lab, Nanyang Technological University [homepage]
  • D2-CVML Group, MPI for Informatics [homepage]
  • Alibaba DAMO Academy [homepage]

    Teaching

  • 2022   CS601 - Introduction to AI (MITB)
  • 2021-2022   CS701 - Deep Learning and Vision (PG)
  • 2020-2021   CS470 - UResearch Projects (UG)
  • 2019-2022   IS111 - Introduction to Programming (UG)
  • 2020   IS112 - Data Management (UG)

    Programmes

    UG [info] Master [info] PhD [info] MITB [info]