We propose a simple dynamic distillation-based approach to leverage unlabeled images from the novel/base dataset. We impose consistency regularization by computing predictions on weakly augmented versions of the unlabeled images with a teacher network and matching them with the student network's predictions on strongly augmented versions of the same images. The parameters of the teacher network are updated as an exponential moving average of the parameters of the student network. We show that the proposed network learns representations that can be easily adapted to the target domain even though it has not been trained with target-specific classes during the pretraining phase.
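The teacher-student consistency scheme described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the `ema_decay` value, the tiny `nn.Linear` stand-in networks, and the KL-divergence form of the consistency loss are all assumptions.

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F

ema_decay = 0.999  # illustrative decay rate, not the paper's value

student = nn.Linear(16, 4)        # stand-in for the student network
teacher = copy.deepcopy(student)  # teacher starts as a copy of the student
for p in teacher.parameters():
    p.requires_grad_(False)       # teacher is never updated by gradients

@torch.no_grad()
def ema_update(teacher, student, decay=ema_decay):
    # teacher <- decay * teacher + (1 - decay) * student
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1 - decay)

def consistency_loss(weak_batch, strong_batch):
    # Teacher predicts on the weakly augmented view; the student's prediction
    # on the strongly augmented view is pulled toward that soft target.
    with torch.no_grad():
        target = F.softmax(teacher(weak_batch), dim=-1)
    log_pred = F.log_softmax(student(strong_batch), dim=-1)
    return F.kl_div(log_pred, target, reduction="batchmean")
```

After each optimizer step on the student, `ema_update` is called so the teacher tracks a smoothed version of the student's weights.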
We conduct a comprehensive study on the transferability of the learned representations of different contrastive approaches under linear evaluation, full-network transfer, and few-shot recognition on 12 downstream datasets from different domains, as well as object detection on MSCOCO and VOC0712. The results show that the contrastive approaches learn representations that are easily transferable to different downstream tasks. We further observe that a joint objective of self-supervised contrastive loss with cross-entropy/supervised-contrastive loss leads to better transferability than the supervised counterparts.
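A joint objective of this kind can be sketched as below. This is a hedged illustration, assuming a SimCLR-style NT-Xent contrastive term; the weighting `alpha`, the temperature, and the function names are hypothetical, not the exact formulation studied in the paper.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    # Simplified NT-Xent (SimCLR-style) contrastive loss over two views.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)              # (2N, d) stacked embeddings
    sim = z @ z.t() / temperature               # pairwise cosine similarities
    n = z1.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool), float("-inf"))
    # Positive for sample i is its other view: i <-> i + n
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

def joint_loss(logits, labels, z1, z2, alpha=0.5):
    # Supervised cross-entropy plus a weighted self-supervised contrastive term.
    return F.cross_entropy(logits, labels) + alpha * nt_xent(z1, z2)
```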
We tackle the problem of cross-domain few-shot learning where there is a large shift between the base and target domains. We propose a simple solution that utilizes unlabeled images from the novel/base dataset by computing a pseudo soft-label from the weakly augmented version of each unlabeled image and matching it with the prediction on the strongly augmented version. Our model outperforms the current state-of-the-art method by 2.7% for 5-shot and 3.6% for 1-shot classification on the BSCD-FSL benchmark.
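The pseudo soft-label matching above can be written as a soft cross-entropy between the two views. A minimal sketch, assuming a temperature-sharpening step `T` that is illustrative rather than taken from the paper:

```python
import torch
import torch.nn.functional as F

def pseudo_label_loss(teacher_logits_weak, student_logits_strong, T=0.5):
    # Sharpen the soft prediction on the weak view into a pseudo soft-label,
    # then train the strong-view prediction toward it (soft cross-entropy).
    with torch.no_grad():
        p = F.softmax(teacher_logits_weak / T, dim=-1)  # sharpened target
    log_q = F.log_softmax(student_logits_strong, dim=-1)
    return -(p * log_q).sum(dim=-1).mean()
```

A lower temperature `T` makes the pseudo-label more peaked, pushing the student toward a confident class; `T = 1` leaves the teacher's distribution unchanged.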
Most existing methods in weakly-supervised action localization rely on a Multiple Instance Learning (MIL) framework to predict the start and end frames of each action category in a video. However, existing MIL-based approaches have a major limitation: they capture only the most discriminative frames of an action, ignoring the full extent of an activity. Moreover, these methods cannot model background activity effectively, which plays an important role in localizing foreground activities. In this paper, we present a novel framework named HAM-Net with a hybrid attention mechanism that includes temporal soft, semi-soft, and hard attentions to address these issues.
We propose a Generative Adversarial Network with a dual-order attention model to detect and localize copy-move forgeries. In the generator, the first-order attention is designed to capture copy-move location information, and the second-order attention exploits more discriminative features for patch co-occurrence. The discriminator network is designed to further ensure more accurate localization results.
It is expensive and time-consuming to annotate both action labels and temporal boundaries of videos. We propose a weakly supervised temporal action localization method that requires only video-level action labels as supervision during training.
Automatic algorithms for tracking and associating passengers and their divested objects at an airport security screening checkpoint could greatly improve checkpoint efficiency, enabling flow analysis, theft detection, line-of-sight maintenance, and risk-based screening. In this paper, we present algorithms for these tracking and association problems and demonstrate their effectiveness in a full-scale physical simulation of an airport security screening checkpoint.