Generalized Contrastive Learning for Universal Multimodal Retrieval

Despite their consistent performance improvements, cross-modal retrieval models (e.g., CLIP) show degraded performance when retrieval keys fuse image and text modalities. This paper proposes Generalized Contrastive Learning (GCL), a novel loss formulation that improves multimodal retrieval performance, without the burden of curating new datasets, by enforcing contrastive learning across all modalities within a mini-batch.
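A minimal numpy sketch of the core idea as described above: a single InfoNCE-style loss in which every query is contrasted against every key in the mini-batch, regardless of whether a key is an image, text, or fused image-text embedding. The fused keys and temperature value here are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def info_nce_all_modalities(queries, keys, temperature=0.07):
    """Contrast each query against every key in the mini-batch,
    regardless of the key's modality (image, text, or fused)."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature               # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives sit on the diagonal: query i matches key i
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))      # image-query embeddings
txt = rng.normal(size=(4, 8))      # text embeddings
fused = 0.5 * (img + txt)          # toy stand-in for fused image-text keys
loss = info_nce_all_modalities(img, fused)
print(loss > 0)  # True
```

Because the loss treats all keys uniformly, fused-modality keys receive the same gradient signal as single-modality ones, which is the mechanism the abstract credits for the improvement.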

MultiHuman-Testbench: Benchmarking Image Generation for Multiple Humans

Generation of images containing multiple humans, performing complex actions, while preserving their facial identities, is a significant challenge. MultiHuman-Testbench is a novel benchmark comprising 1,800 samples with carefully curated text prompts matched with 5,550 unique human face images, sampled uniformly to ensure diversity across age, ethnic background, and gender. This benchmark enables comprehensive evaluation of multi-human image generation models.

ConsNoTrainLoRA: Data-driven Weight Initialization of Low-rank Adapters using Constraints

Foundation models are pre-trained on large-scale datasets and subsequently fine-tuned using parameter-efficient fine-tuning (PEFT) techniques like low-rank adapters (LoRA). This paper proposes ConsNoTrainLoRA (CNTLoRA), a data-driven weight initialization method that expresses LoRA initialization as a domain shift problem, obtaining a closed-form estimate of LoRA weights that requires no training during initialization.
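One plausible reading of a "closed-form estimate of LoRA weights", sketched below under our own assumptions: given some estimate of the weight shift induced by the domain change, a truncated SVD factors it directly into the LoRA pair (A, B) with no gradient steps. The rank-2 toy shift is fabricated for illustration.

```python
import numpy as np

def cntlora_init(delta_w, rank):
    """Closed-form LoRA init (sketch): factor an estimated weight
    shift delta_w into a low-rank product A @ B via truncated SVD."""
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    a = u[:, :rank] * np.sqrt(s[:rank])         # (out_dim, rank)
    b = np.sqrt(s[:rank])[:, None] * vt[:rank]  # (rank, in_dim)
    return a, b

rng = np.random.default_rng(1)
# toy domain shift that is genuinely rank-2
delta = rng.normal(size=(16, 2)) @ rng.normal(size=(2, 24))
a, b = cntlora_init(delta, rank=2)
err = np.linalg.norm(delta - a @ b) / np.linalg.norm(delta)
print(err < 1e-8)  # True: a rank-2 SVD recovers a rank-2 shift exactly
```

The appeal of such an initialization is that it costs one SVD rather than any training iterations, matching the abstract's "no training during initialization" claim.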

CustomKD: Customizing Large Vision Foundation for Edge Model Improvement via Knowledge Distillation

CustomKD proposes a novel knowledge distillation approach that effectively leverages large vision foundation models (LVFMs) to enhance the performance of edge models (e.g., MobileNetV3). CustomKD customizes the well-generalized features inherent in LVFMs to a given student model, reducing the discrepancy between the two models and achieving state-of-the-art performance in unsupervised domain adaptation and semi-supervised learning scenarios.
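A toy sketch of the feature-customization idea, under our own simplifying assumption that "customizing" is a learned linear adapter from the teacher's feature space into the student's, followed by a feature-alignment loss. The projection and dimensions below are illustrative, not CustomKD's actual architecture.

```python
import numpy as np

def customkd_loss(student_feat, teacher_feat, proj):
    """Sketch: 'customize' teacher features to the student's feature
    space with an assumed linear adapter proj (d_teacher -> d_student),
    then penalize the mismatch with the student's own features."""
    customized = teacher_feat @ proj   # (B, d_student)
    diff = student_feat - customized
    return np.mean(diff ** 2)          # feature-alignment loss

rng = np.random.default_rng(2)
t = rng.normal(size=(8, 64))           # LVFM teacher features
proj = rng.normal(size=(64, 16)) / 8.0 # hypothetical adapter weights
s = t @ proj                           # student already perfectly aligned
print(customkd_loss(s, t, proj))  # 0.0
```

Adapting the teacher toward the student, rather than forcing the small student to mimic raw LVFM features, is what the abstract means by reducing model discrepancy.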

Erasing Undesirable Influence in Diffusion Models

Diffusion models are highly effective at generating high-quality images but pose risks, such as the unintentional generation of NSFW (not safe for work) content. Although various techniques have been proposed to mitigate unwanted influences in diffusion models while preserving overall performance, achieving a balance between these goals remains challenging. In this work, we introduce EraseDiff, an algorithm designed to preserve the utility of the diffusion model on retained data while removing the unwanted information associated with the data to be forgotten.
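The retain/forget trade-off described above can be made concrete with a toy two-term objective. This is our own illustrative framing of machine unlearning in diffusion models, not EraseDiff's exact formulation: keep the standard denoising loss on retained data while penalizing accurate denoising of the data to be forgotten.

```python
import numpy as np

def erase_objective(pred_r, noise_r, pred_f, noise_f, lam=0.1):
    """Toy unlearning objective (illustrative sketch only): reward
    accurate noise prediction on retained data, discourage it on
    forget data. lam is an assumed trade-off weight."""
    keep = np.mean((pred_r - noise_r) ** 2)    # utility on retained data
    forget = np.mean((pred_f - noise_f) ** 2)  # accuracy on forget data
    return keep - lam * forget                 # lower is better

n = np.ones((4, 3))
good = erase_objective(n, n, n + 1.0, n)  # perfect retain, poor forget
bad = erase_objective(n + 1.0, n, n, n)   # the reverse
print(good < bad)  # True: the objective prefers the first regime
```

The difficulty the abstract highlights is visible even here: increasing `lam` forgets more aggressively but risks degrading the retained-data term during joint optimization.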

HyperNet Fields: Efficiently Training Hypernetworks without Ground Truth by Learning Weight Trajectories

Hypernetworks have drawn interest as a way to efficiently adapt large models and to train generative models of neural representations. This work proposes a method to train hypernetworks without any per-sample ground truth by learning a Hypernetwork 'Field' that estimates the entire trajectory of network weight training rather than only its converged state, enabling more efficient and flexible hypernetwork training.

MARLIN: Masked Autoencoder for Facial Video Representation Learning

This paper proposes a self-supervised approach to learning universal facial representations from videos that transfer across a variety of facial analysis tasks, such as Facial Attribute Recognition (FAR), Facial Expression Recognition (FER), DeepFake Detection (DFD), and Lip Synchronization (LS). Our proposed framework, named MARLIN, is a facial video masked autoencoder that learns highly robust and generic facial embeddings from abundantly available, non-annotated, web-crawled facial videos.
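The masked-autoencoder objective underlying this kind of framework can be sketched generically: mask most patch tokens, reconstruct them, and measure error only at the masked positions. The 75% masking ratio and token shapes below are illustrative assumptions, not MARLIN's exact configuration.

```python
import numpy as np

def masked_recon_loss(patches, recon, mask):
    """Generic masked-autoencoder objective: reconstruction error is
    measured only on the masked patch positions."""
    return np.mean((patches - recon)[mask] ** 2)

rng = np.random.default_rng(3)
patches = rng.normal(size=(16, 32))  # 16 patch tokens, 32-dim each
mask = np.arange(16) < 12            # mask the first 12 tokens (~75%)
recon = patches.copy()
recon[mask] += 0.5                   # imperfect reconstruction on masked tokens
print(masked_recon_loss(patches, recon, mask))  # 0.25
```

Restricting the loss to masked positions forces the encoder to infer facial structure from sparse visible context, which is the source of the robust embeddings the abstract describes.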

ProtoCon: Pseudo-label Refinement via Online Clustering and Prototypical Consistency for Efficient Semi-supervised Learning

Confidence-based pseudo-labeling is among the dominant approaches in semi-supervised learning (SSL). It relies on including high-confidence predictions made on unlabeled data as additional targets to train the model. We propose ProtoCon, a novel SSL method aimed at the less-explored, label-scarce SSL setting, where such methods usually underperform. ProtoCon refines the pseudo-labels by leveraging their nearest neighbours' information.
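A minimal numpy sketch of neighbour-based pseudo-label refinement in the spirit of the description above: blend a sample's predicted class distribution with the mean distribution of its nearest neighbours. The mixing weight `alpha` is an assumed hyperparameter, not a value from the paper.

```python
import numpy as np

def refine_pseudo_label(probs, neighbor_probs, alpha=0.8):
    """Sketch: soften a confidence-based pseudo-label by mixing in the
    mean class distribution of the sample's nearest neighbours."""
    refined = alpha * probs + (1 - alpha) * neighbor_probs.mean(axis=0)
    return refined / refined.sum()  # renormalize to a distribution

p = np.array([0.1, 0.8, 0.1])      # model's prediction for one sample
nbrs = np.array([[0.0, 1.0, 0.0],
                 [0.2, 0.7, 0.1]]) # predictions of nearest neighbours
r = refine_pseudo_label(p, nbrs)
print(r.argmax())  # 1
```

When neighbours agree with the model, the refined label sharpens toward the shared class; when they disagree, the label softens, which is exactly when a raw confidence threshold would admit a noisy target.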

Transformer Scale Gate for Semantic Segmentation

Effectively encoding multi-scale contextual information is crucial for accurate semantic segmentation. Existing transformer-based segmentation models combine features across scales without any selection, so features at sub-optimal scales may degrade segmentation outcomes. Leveraging the inherent properties of Vision Transformers, we propose a simple yet effective module, Transformer Scale Gate (TSG), to optimally combine multi-scale features.
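The selection-versus-plain-fusion distinction can be sketched as a learned gate: softmax weights over scales replace an unweighted sum. This is a generic gating sketch under our own assumptions; the actual TSG derives its gate values from Vision Transformer properties rather than free logits.

```python
import numpy as np

def scale_gate(features, gate_logits):
    """Sketch of scale gating: softmax weights over scales, then a
    weighted sum of same-shaped multi-scale feature maps."""
    e = np.exp(gate_logits - gate_logits.max())
    w = e / e.sum()                            # softmax over scales
    return np.tensordot(w, features, axes=1)   # (H, W, C)

rng = np.random.default_rng(4)
feats = rng.normal(size=(3, 8, 8, 4))  # 3 scales, spatially aligned
gated = scale_gate(feats, np.array([0.0, 10.0, 0.0]))
# the middle scale dominates after the softmax
print(np.allclose(gated, feats[1], atol=1e-2))  # True
```

With near-uniform logits the gate degenerates to the unselective averaging the abstract criticizes; the gate's value comes from learning to suppress the sub-optimal scales per region.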

Restormer: Efficient Transformer for High-Resolution Image Restoration

We propose Restormer, an efficient Transformer architecture for high-resolution image restoration that captures long-range pixel interactions while remaining computationally tractable. Our model achieves state-of-the-art results across multiple restoration tasks including deraining, motion deblurring, defocus deblurring, and image denoising.
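Restormer keeps long-range interactions tractable by computing self-attention across channels rather than across pixels, so the attention map is C x C instead of HW x HW. The sketch below shows that shape argument with plain numpy; the normalization details and temperature handling are simplified assumptions, not the exact module.

```python
import numpy as np

def channel_attention(q, k, v, temperature=1.0):
    """Transposed self-attention in the spirit of Restormer's design:
    each channel acts as one 'token', so the attention map is (C, C)
    and its cost grows with C, not with the H*W pixel count."""
    # q, k, v: (C, H*W)
    qn = q / (np.linalg.norm(q, axis=1, keepdims=True) + 1e-8)
    kn = k / (np.linalg.norm(k, axis=1, keepdims=True) + 1e-8)
    attn = qn @ kn.T / temperature              # (C, C), independent of H*W
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)     # softmax over channels
    return attn @ v                             # (C, H*W)

rng = np.random.default_rng(5)
c, h, w = 4, 32, 32
x = rng.normal(size=(c, h * w))
out = channel_attention(x, x, x)
print(out.shape)  # (4, 1024)
```

Doubling the image resolution quadruples H*W but leaves the (C, C) attention map unchanged, which is why this formulation stays tractable for high-resolution restoration.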