Yaoxin Li, PhD candidate
Mar 18, 2022, 11:30 am, EC4-2101A
Most action recognition methods focus only on action segments to infer the action label and ignore the role of non-action segments. Static contextual components in a video may cause models to overlook motion patterns and lead to representation bias, i.e., learning the presence of certain objects in the context rather than the action patterns themselves for classification. In this work, we suggest that non-action segments from untrimmed videos can be used to improve action recognition performance and reduce representation bias, and we propose a coarse-to-fine-to-finer discriminative learning mechanism to exploit them. First, we attach intermediate branches, termed "booster layers", to the early layers of a network to inject coarse-level discriminative information about action and non-action segments across all videos. For fine-level discrimination, we design a novel contrastive learning objective, applied to the penultimate layer of the model, that enlarges the distinction between different segments of the same video.
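To make the fine-level objective concrete, the sketch below shows one plausible formulation: an InfoNCE-style contrastive loss over penultimate-layer segment embeddings, pulling a video's action segments together while pushing them away from its non-action segments. The function name, embedding shapes, and temperature are illustrative assumptions, not the exact objective used in this work.

    import torch
    import torch.nn.functional as F

    def segment_contrastive_loss(action_emb, nonaction_emb, temperature=0.1):
        """Hypothetical InfoNCE-style loss separating action and non-action
        segment embeddings taken from the same video.

        action_emb:    (Na, D) penultimate-layer features of action segments
        nonaction_emb: (Nn, D) penultimate-layer features of non-action segments
        Assumes Na >= 2 so that positive pairs exist.
        """
        action_emb = F.normalize(action_emb, dim=1)      # cosine similarities
        nonaction_emb = F.normalize(nonaction_emb, dim=1)

        na = action_emb.size(0)
        if na < 2:
            # no positive pairs available; return a zero loss for this video
            return action_emb.new_zeros(())

        # Positives: other action segments; negatives: non-action segments
        pos_sim = action_emb @ action_emb.t() / temperature      # (Na, Na)
        neg_sim = action_emb @ nonaction_emb.t() / temperature   # (Na, Nn)
        eye = torch.eye(na, dtype=torch.bool, device=action_emb.device)

        losses = []
        for i in range(na):
            pos = pos_sim[i][~eye[i]]            # sims to the other action segments
            logits = torch.cat([pos, neg_sim[i]])
            log_prob = logits - torch.logsumexp(logits, dim=0)
            losses.append(-log_prob[: pos.numel()].mean())
        return torch.stack(losses).mean()

    # e.g. 4 action and 6 non-action segment embeddings of dimension 256
    loss = segment_contrastive_loss(torch.randn(4, 256), torch.randn(6, 256))

In practice, such a term would be added to the standard classification loss with a weighting coefficient.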
Finally, we propose a clip matching module for finer-level discrimination that increases the variation within action segments. Our approach is model-agnostic and can be easily integrated into prior action recognition models. Experiments on multiple datasets (ActivityNet, HACS, FineAction) and backbones (TSN, TSM, TANet, TPN, TimeSformer) show consistent and significant improvements (0.33%–4%) over the baselines, with competitive results on the ActivityNet (84.5%) and HACS (90.21%) datasets.
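As a rough illustration of the model-agnostic claim, the sketch below attaches auxiliary "booster" branches to early layers of an arbitrary backbone via forward hooks, each supervised with a coarse action vs. non-action label. The tap points, head design, and backbone choice are assumptions for illustration, not the paper's exact implementation.

    import torch
    import torch.nn as nn

    class BoosterHead(nn.Module):
        """Small auxiliary branch: pool spatial features and classify the
        input segment as action vs. non-action (coarse-level supervision)."""
        def __init__(self, in_channels, num_classes=2):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)
            self.fc = nn.Linear(in_channels, num_classes)

        def forward(self, feat):                     # feat: (B, C, H, W)
            return self.fc(self.pool(feat).flatten(1))

    class BoostedBackbone(nn.Module):
        """Wraps an existing backbone and taps early layers with forward hooks.
        `tap_points` maps a module name to its channel width (assumed known)."""
        def __init__(self, backbone, tap_points):
            super().__init__()
            self.backbone = backbone
            self.heads = nn.ModuleDict()
            self._feats = {}
            for name, channels in tap_points.items():
                self.heads[name] = BoosterHead(channels)
                module = dict(backbone.named_modules())[name]
                module.register_forward_hook(self._save(name))

        def _save(self, name):
            def hook(_module, _inp, out):
                self._feats[name] = out
            return hook

        def forward(self, x):
            logits = self.backbone(x)                # main action prediction
            aux = {n: head(self._feats[n]) for n, head in self.heads.items()}
            return logits, aux                       # aux: booster-branch logits

    # Usage sketch: a TSN-style setup that pushes frames through a 2D backbone
    import torchvision
    net = BoostedBackbone(torchvision.models.resnet50(num_classes=200),
                          tap_points={"layer1": 256, "layer2": 512})
    frames = torch.randn(8, 3, 224, 224)             # 8 frames treated as a batch
    main_logits, booster_logits = net(frames)

Because the booster branches only read intermediate features through hooks, the wrapper leaves the original backbone unchanged, which is what allows the mechanism to be bolted onto existing recognition models.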