
Semantic-Aware Domain Generalized Segmentation

We address domain generalized semantic segmentation through Semantic-Aware Normalization (SAN) and Semantic-Aware Whitening (SAW) modules. Our framework promotes both intra-category compactness and inter-category separability, achieving significant improvements over state-of-the-art methods on widely-used datasets including GTAV, SYNTHIA, and Cityscapes.
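The core idea of normalizing features within each semantic region can be sketched as follows. This is an illustrative re-implementation of the concept only, not the paper's actual SAN module; the function name and tensor layout are assumptions.

```python
import numpy as np

def semantic_aware_normalize(feat, class_map, eps=1e-5):
    """Illustrative sketch (not the paper's exact SAN module): features are
    standardized separately within each semantic region, so activations of
    each category become compact around zero mean / unit variance."""
    # feat: (C, H, W) feature map; class_map: (H, W) integer category labels
    out = feat.copy()
    for c in np.unique(class_map):
        mask = class_map == c               # pixels belonging to this category
        region = feat[:, mask]              # (C, n_pixels_of_category)
        mu = region.mean(axis=1, keepdims=True)
        sigma = region.std(axis=1, keepdims=True)
        out[:, mask] = (region - mu) / (sigma + eps)
    return out
```

Normalizing per category, rather than over the whole map, is what promotes the intra-category compactness described above.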

Towards Robust and Reproducible Active Learning Using Neural Networks

Active learning (AL) is a promising ML paradigm with the potential to parse through large unlabeled datasets and reduce annotation cost in domains where labeling data is prohibitively expensive. Recently proposed neural-network-based AL methods use different heuristics to accomplish this goal. In this study, we demonstrate that, under identical experimental settings, different types of AL algorithms (uncertainty based, diversity based, and committee based) produce inconsistent gains over the random sampling baseline. Through a variety of experiments, controlling for sources of stochasticity, we show that the variance in performance metrics achieved by AL algorithms can lead to results that are inconsistent with those previously reported. We also found that, under strong regularization, AL methods show marginal or no advantage over the random sampling baseline across a variety of experimental conditions. Finally, we conclude with a set of recommendations on how to assess the results of a new AL algorithm so that they are reproducible and robust under changes in experimental conditions. We share our code to facilitate AL evaluations. We believe our findings and recommendations will help advance reproducible research in AL using neural networks.
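To make the comparison concrete, here is a toy sketch of an uncertainty-based acquisition function alongside the random-sampling baseline it is measured against. The function name and interface are illustrative, not from the paper.

```python
import numpy as np

def select_batch(probs, budget, rng=None, strategy="entropy"):
    """Toy comparison of an uncertainty-based AL heuristic against the
    random-sampling baseline (illustrative names, not the paper's code).
    probs: (n_pool, n_classes) predicted probabilities for the unlabeled pool.
    Returns indices of the examples to send for labeling."""
    if strategy == "random":
        rng = rng or np.random.default_rng()
        return rng.choice(len(probs), size=budget, replace=False)
    # entropy-based uncertainty: query the examples the model is least sure about
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return np.argsort(entropy)[-budget:]
```

The study's point is precisely that, once stochasticity and regularization are controlled for, heuristics like the entropy branch above may not consistently beat the random branch.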

D2-Net: Weakly-Supervised Action Localization via Discriminative Embeddings and Denoised Activations

This work proposes a weakly-supervised temporal action localization framework, called D2-Net, which strives to temporally localize actions using video-level supervision. Our main contribution is the introduction of a novel loss formulation, which jointly enhances the discriminability of latent embeddings and robustness of the output temporal class activations with respect to foreground-background noise caused by weak supervision. The proposed formulation comprises a discriminative and a denoising loss term for enhancing temporal action localization. The discriminative term incorporates a classification loss and utilizes a top-down attention mechanism to enhance the separability of latent foreground-background embeddings. The denoising loss term explicitly addresses the foreground-background noise in class activations by simultaneously maximizing intra-video and inter-video mutual information using a bottom-up attention mechanism. As a result, activations in the foreground regions are emphasized whereas those in the background regions are suppressed, thereby leading to more robust predictions. Comprehensive experiments are performed on two benchmarks: THUMOS14 and ActivityNet1.2. Our D2-Net performs favorably in comparison to the existing methods on both datasets, achieving gains as high as 3.6% in terms of mean average precision on THUMOS14.
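The interplay between frame-level attention and video-level supervision described above can be sketched in a few lines. This is a generic illustration of attention-modulated, top-k-pooled class activations common in weakly-supervised localization, not D2-Net's actual loss formulation.

```python
import numpy as np

def video_level_scores(cas, attention, k=2):
    """Sketch of weak video-level supervision in this setting (illustrative,
    not D2-Net's exact formulation): per-frame foreground attention suppresses
    background activations, and top-k pooling yields one score per class.
    cas: (T, n_classes) temporal class activations; attention: (T,) in [0, 1]."""
    weighted = cas * attention[:, None]      # suppress background frames
    top = np.sort(weighted, axis=0)[-k:]     # top-k frames for each class
    return top.mean(axis=0)                  # video-level class scores
```

Because only these pooled scores receive supervision, noisy activations in background frames directly corrupt training, which is the foreground-background noise the denoising loss term targets.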

Intriguing Properties of Vision Transformers

We systematically study the robustness properties of Vision Transformers, revealing their remarkable resilience to occlusions, perturbations, and domain shifts. ViTs demonstrate significantly less texture bias than CNNs, achieve human-level shape recognition capabilities, and enable accurate semantic segmentation without pixel-level supervision through flexible self-attention mechanisms.

On Generating Transferable Targeted Perturbations

We propose a novel generative approach for highly transferable targeted adversarial perturbations. Unlike existing methods that rely on class-boundary information, our approach matches the perturbed image distribution with the target class by aligning both global distributions and local neighborhood structures. Our method achieves 4x higher target transferability than previous best generative attacks and 16x better than instance-specific iterative attacks on ImageNet.

Orthogonal Projection Loss

We propose Orthogonal Projection Loss (OPL), a novel loss function that enforces inter-class separation and intra-class clustering through orthogonality constraints in the feature space. Unlike standard cross-entropy loss, OPL explicitly separates class features while requiring no additional learnable parameters. We demonstrate OPL's effectiveness across diverse tasks including image recognition, domain generalization, and few-shot learning, with improved robustness against adversarial attacks and label noise.
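The orthogonality idea can be sketched directly on a batch of features: same-class pairs are pulled toward cosine similarity 1, different-class pairs toward 0. This is a hypothetical re-implementation of the concept, not necessarily the authors' exact formulation.

```python
import numpy as np

def orthogonal_projection_loss(features, labels):
    """Sketch of an orthogonality-style loss (illustrative, not the paper's
    exact OPL): same-class feature pairs should have cosine similarity 1,
    different-class pairs should be orthogonal (similarity 0)."""
    # L2-normalize so pairwise dot products are cosine similarities
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    same = (labels[:, None] == labels[None, :]).astype(float)
    np.fill_diagonal(same, 0.0)              # ignore trivial self-pairs
    diff = 1.0 - same
    np.fill_diagonal(diff, 0.0)
    # intra-class term: maximize similarity; inter-class term: push |sim| to 0
    s = (same * sim).sum() / max(same.sum(), 1.0)
    d = np.abs(diff * sim).sum() / max(diff.sum(), 1.0)
    return (1.0 - s) + d
```

Note the loss has no learnable parameters of its own, matching the property claimed above; it simply reshapes the geometry of the feature space alongside cross-entropy.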

Multi-Stage Progressive Image Restoration

Image restoration tasks demand a complex balance between spatial details and high-level contextualized information while recovering images. In this paper, we propose a novel synergistic design that can optimally balance these competing goals. Our main proposal is a multi-stage architecture that progressively learns restoration functions for the degraded inputs, thereby breaking down the overall recovery process into more manageable steps.
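The progressive, stage-wise decomposition can be sketched as a chain of residual refinements. The function and its residual formulation are an assumption for illustration; the paper's stages are full neural sub-networks with cross-stage feature exchange.

```python
import numpy as np

def progressive_restore(degraded, stages):
    """Sketch of the multi-stage idea: each stage refines the previous
    output, breaking restoration into smaller steps. `stages` is a list of
    hypothetical per-stage restoration functions mapping an image to a
    residual correction."""
    x = degraded
    outputs = []
    for stage in stages:
        x = x + stage(x)      # each stage predicts a residual refinement
        outputs.append(x)     # intermediate outputs can be supervised too
    return outputs
```

Supervising every stage's output, rather than only the last, is one common way such progressive designs keep the intermediate restoration steps well-behaved.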

Transformers in Vision: A Survey

A comprehensive survey of transformer models in computer vision, covering fundamental concepts of self-attention and self-supervision. We review extensive applications across recognition, generative modeling, multi-modal tasks, video processing, low-level vision, and 3D analysis, providing insights into architectural designs and future research directions.

Synthesizing the Unseen for Zero-shot Object Detection

The existing zero-shot detection approaches project visual features to the semantic domain for seen objects, hoping to map unseen objects to their corresponding semantics during inference. However, since the unseen objects are never observed during training, the detection model is skewed towards seen content, thereby labeling unseen objects as background or as a seen class. In this work, we propose to synthesize visual features for unseen classes, so that the model learns both seen and unseen objects in the visual domain. Consequently, the major challenge becomes: how can unseen objects be accurately synthesized using only their class semantics? Towards this ambitious goal, we propose a novel generative model that uses class semantics not only to generate the features but also to discriminatively separate them. Further, using a unified model, we ensure the synthesized features have high diversity, representing the intra-class differences and variable localization precision of the detected bounding boxes. We test our approach on three object detection benchmarks, PASCAL VOC, MSCOCO, and ILSVRC detection, under both conventional and generalized settings, showing impressive gains over the state-of-the-art methods.
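The basic mechanism of generating diverse class-conditional features from semantics plus noise can be sketched as below. The random linear maps stand in for the paper's learned generative model; all names and dimensions here are hypothetical.

```python
import numpy as np

def synthesize_features(semantics, n_per_class, noise_dim=16, feat_dim=32, rng=None):
    """Minimal sketch of the core idea: synthesize diverse visual features
    for unseen classes from their semantic embeddings. Fixed random linear
    maps stand in for the paper's trained generator (hypothetical weights).
    semantics: (n_classes, sem_dim) class semantic vectors."""
    rng = rng or np.random.default_rng(0)
    sem_dim = semantics.shape[1]
    W_sem = rng.normal(size=(sem_dim, feat_dim))      # stand-in generator weights
    W_noise = rng.normal(size=(noise_dim, feat_dim))
    feats, labels = [], []
    for cls, s in enumerate(semantics):
        z = rng.normal(size=(n_per_class, noise_dim))  # noise drives intra-class diversity
        feats.append(s @ W_sem + z @ W_noise)          # (n_per_class, feat_dim)
        labels.extend([cls] * n_per_class)
    return np.vstack(feats), np.array(labels)
```

Features synthesized this way for unseen classes would then be mixed with real seen-class features to train the detector's classifier, so unseen objects are no longer collapsed into background.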