Munawar Hayat

Senior Lecturer, ARC DECRA Fellow

Monash University

Biography

I am a Senior Lecturer and ARC DECRA fellow at Faculty of IT, Monash University, Australia. I completed my PhD in Computer Vision from The University of Western Australia (UWA). My PhD thesis received multiple awards including the prestigious Robert Street Prize. My research interests are in computer vision, machine learning, deep learning, and affective computing.

Interests

Computer Vision
Deep Learning
Machine Learning
Affective Computing

Education

PhD in Computer Science, 2015
The University of Western Austraia
Masters in Space Science, 2011
Luleå Tekniska Universitet
BSc in Engineering, 2009
National University of Sciences & Technology

News

I received Dean’s Award for Excellence in Research by an Early Career Academic at FIT, Monash University.
Checkout Restormer: Efficient Transformer for High-Resolution Image Restoration in CVPR 2022.
See our paper Semantic-Aware Domain Generalized Segmentation in CVPR 2022.
Checkout Learning Enriched Features for Fast Image Restoration and Enhancement in IEEE TPAMI.
Checkout Towards Robust and Reproducible Active Learning Using Neural Networks in CVPR 2022.
Awarded ARC DECRA Fellowship 2021-2023 $425,613.
Received funding from Australian Reserch Council on a Discovery Project 2019-2021 $380,000.
See our recent work Deeply Supervised Discriminative Learning for Adversarial Defense in IEEE TPAMI.
See our paper Learning Enriched Features for Real Image Restoration and Enhancement in ECCV 2020.
See our paper titled A Self-supervised Approach for Adversarial Robustness in CVPR 2020.
See our paper titled CycleISP: Real Image Restoration via Improved Data Synthesis in CVPR 2020.
See our paper titled iTAML: An Incremental Task-Agnostic Meta-learning Approach in CVPR 2020.
Checkout Random path selection for continual learning in NeurIPS 2019.
2 Papers accepted in ICCV 2019.
2 Papers accepted in CVPR 2019.

Featured Publications

S. Zamir, A. Arora, S. Khan, M. Hayat, F. Khan, M. Yang

June 2022 Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022)

Restormer: Efficient Transformer for High-Resolution Image Restoration

Since convolutional neural networks (CNNs) perform well at learning generalizable image priors from large-scale data, these models have been extensively applied to image restoration and related tasks. Recently, another class of neural architectures, Transformers, have shown significant performance gains on natural language and high-level vision tasks. While the Transformer model mitigates the shortcomings of CNNs (i.e., limited receptive field and inadaptability to input content), its computational complexity grows quadratically with the spatial resolution, therefore making it infeasible to apply to most image restoration tasks involving high-resolution images. In this work, we propose an efficient Transformer model by making several key designs in the building blocks (multi-head attention and feed-forward network) such that it can capture long-range pixel interactions, while still remaining applicable to large images. Our model, named Restoration Transformer (Restormer), achieves state-of-the-art results on several image restoration tasks, including image deraining, single-image motion deblurring, defocus deblurring (single-image and dual-pixel data), and image denoising (Gaussian grayscale/color denoising, and real image denoising).

PDF Code Slides Video

D. Peng, Y. Lei, M. Hayat, Y. Guo, W. Li

June 2022 Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022)

Semantic-Aware Domain Generalized Segmentation

Deep models trained on source domain lack generalization when evaluated on unseen target domains with different data distributions. The problem becomes even more pronounced when we have no access to target domain samples for adaptation. In this paper, we address domain generalized semantic segmentation, where a segmentation model is trained to be domain-invariant without using any target domain data. Existing approaches to tackle this problem standardize data into a unified distribution. We argue that while such a standardization promotes global normalization, the resulting features are not discriminative enough to get clear segmentation boundaries. To enhance separation between categories while simultaneously promoting domain invariance, we propose a framework including two novel modules: Semantic-Aware Normalization (SAN) and Semantic-Aware Whitening (SAW). Specifically, SAN focuses on category-level center alignment between features from different image styles, while SAW enforces distributed alignment for the already center-aligned features. With the help of SAN and SAW, we encourage both intra-category compactness and inter-category separability. We validate our approach through extensive experiments on widely-used datasets (i.e. GTAV, SYNTHIA, Cityscapes, Mapillary and BDDS). Our approach shows significant improvements over existing state-of-the-art on various backbone networks.

PDF Code

P. Munjal, N. Hayat, M. Hayat, J. Sourati, S. Khan

June 2022 Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022)

Towards Robust and Reproducible Active Learning Using Neural Networks

Active learning (AL) is a promising ML paradigm that has the potential to parse through large unlabeled data and help reduce annotation cost in domains where labeling data can be prohibitive. Recently proposed neural network based AL methods use different heuristics to accomplish this goal. In this study, we demonstrate that under identical experimental settings, different types of AL algorithms (uncertainty based, diversity based, and committee based) produce an inconsistent gain over random sampling baseline. Through a variety of experiments, controlling for sources of stochasticity, we show that variance in performance metrics achieved by AL algorithms can lead to results that are not consistent with the previously reported results. We also found that under strong regularization, AL methods show marginal or no advantage over the random sampling baseline under a variety of experimental conditions. Finally, we conclude with a set of recommendations on how to assess the results using a new AL algorithm to ensure results are reproducible and robust under changes in experimental conditions. We share our codes to facilitate AL evaluations. We believe our findings and recommendations will help advance reproducible research in AL using neural networks.

PDF Code

S. Zamir, A. Arora, S. Khan, M. Hayat, F. Khan, M. Yang, L. Shao

April 2022 IEEE Transactions on Pattern Analysis and Machine Intelligence (IEEE TPAMI)

Learning Enriched Features for Fast Image Restoration and Enhancement

PDF Code

M. Naseer, K. Ranasinghe, S. Khan, M. Hayat, F. Khan, M. Yang

August 2021 Advances in Neural Information Processing Systems (NeurIPs 2021)

Intriguing Properties of Vision Transformers

Vision transformers (ViT) have demonstrated impressive performance across various machine vision tasks. These models are based on multi-head self-attention mechanisms that can flexibly attend to a sequence of image patches to encode contextual cues. An important question is how such flexibility (in attending image-wide context conditioned on a given patch) can facilitate handling nuisances in natural images e.g., severe occlusions, domain shifts, spatial permutations, adversarial and natural perturbations. We systematically study this question via an extensive set of experiments encompassing three ViT families and provide comparisons with a high-performing convolutional neural network (CNN). We show and analyze the following intriguing properties of ViT: (a) Transformers are highly robust to severe occlusions, perturbations and domain shifts, e.g., retain as high as 60% top-1 accuracy on ImageNet even after randomly occluding 80% of the image content. (b) The robust performance to occlusions is not due to a bias towards local textures, and ViTs are significantly less biased towards textures compared to CNNs. When properly trained to encode shape-based features, ViTs demonstrate shape recognition capability comparable to that of human visual system, previously unmatched in the literature. (c) Using ViTs to encode shape representation leads to an interesting consequence of accurate semantic segmentation without pixel-level supervision. (d) Off-the-shelf features from a single ViT model can be combined to create a feature ensemble, leading to high accuracy rates across a range of classification datasets in both traditional and few-shot learning paradigms. We show effective features of ViTs are due to flexible and dynamic receptive fields possible via self-attention mechanisms.

PDF Code Poster Video

M. Naseer, S. Khan, M. Hayat, F. Khan, F. Porikli

August 2021 Proceedings of the IEEE International Conference on Computer Vision (ICCV 2021)

On Generating Transferable Targeted Perturbations

While the untargeted black-box transferability of adversarial perturbations has been extensively studied before, changing an unseen model's decisions to a specific targeted class remains a challenging feat. In this paper, we propose a new generative approach for highly transferable targeted perturbations. We note that the existing methods are less suitable for this task due to their reliance on class-boundary information that changes from one model to another, thus reducing transferability. In contrast, our approach matches the perturbed image `distribution' with that of the target class, leading to high targeted transferability rates. To this end, we propose a new objective function that not only aligns the global distributions of source and target images, but also matches the local neighbourhood structure between the two domains. Based on the proposed objective, we train a generator function that can adaptively synthesize perturbations specific to a given input. Our generative approach is independent of the source or target domain labels, while consistently performs well against state-of-the-art methods on a wide range of attack settings. As an example, we achieve 32.63% target transferability from (an adversarially weak) VGG19BN to (a strong) WideResNet on ImageNet val. set, which is 4x higher than the previous best generative attack and 16x better than instance-specific iterative attack.

PDF Code

K. Ranasinghe, M. Naseer, M. Hayat, S. H. Khan, F. S. Khan

August 2021 Proceedings of the IEEE International Conference on Computer Vision (ICCV 2021)

Orthogonal Projection Loss

Deep neural networks have achieved remarkable performance on a range of classification tasks, with softmax cross-entropy (CE) loss emerging as the de-facto objective function. The CE loss encourages features of a class to have a higher projection score on the true class-vector compared to the negative classes.However, this is a relative constraint and does not explicitly force different class features to be well-separated. Motivated by the observation that ground-truth class representations in CE loss are orthogonal (one-hot encoded vectors), we develop a novel loss function termed “Orthogonal Projection Loss” (OPL) which imposes orthogonality in the feature space. OPL augments the properties of CE loss and directly enforces inter-class separation alongside intra-class clustering in the feature space through orthogonality constraints on the mini-batch level. As compared to other alternatives of CE, OPL offers unique advantages e.g., no additional learnable parameters, does not require careful negative mining and is not sensitive to the batch size. Given the plug-and-play nature of OPL, we evaluate it on a diverse range of tasks including image recognition (CIFAR-100), large-scale classification (ImageNet), domain generalization (PACS) and few-shot learning (miniImageNet, CIFAR-FS, tiered-ImageNet and Meta-dataset) and demonstrate its effectiveness across the board. Furthermore, OPL offers better robustness against practical nuisances such as adversarial attacks and label noise.

PDF Code

S. Khan, M. Naseer, M. Hayat, S. Zamir, F. Khan, M. Shah

January 2021 arXiv preprint arXiv:2101.01169

Transformers in Vision: A Survey

Astounding results from transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. This has led to exciting progress on a number of tasks while requiring minimal inductive biases in the model design. This survey aims to provide a comprehensive overview of the transformer models in the computer vision discipline and assumes little to no prior background in the field. We start with an introduction to fundamental concepts behind the success of transformer models i.e., self-supervision and self-attention. Transformer architectures leverage self-attention mechanisms to encode long-range dependencies in the input domain which makes them highly expressive. Since they assume minimal prior knowledge about the structure of the problem, self-supervision using pretext tasks is applied to pre-train transformer models on large-scale (unlabelled) datasets. The learned representations are then fine-tuned on the downstream tasks, typically leading to excellent performance due to the generalization and expressivity of encoded features. We cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering and visual reasoning), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works.

Munawar Hayat

Senior Lecturer, ARC DECRA Fellow

Biography

Interests

Education

News

Featured Publications

Publications

Contact