cff-version: 1.2.0
message: "If you use this software, please cite it using the metadata from this file."
abstract: "Previous work on long-term video action recognition relies on deep 3D-convolutional models that have a large temporal receptive field (RF). We argue that these models are not always the best choice for temporal modeling in videos. A large temporal receptive field allows the model to encode the exact sub-action order of a video, which causes a performance decrease when test videos have a different sub-action order. In this work, we investigate whether we can improve model robustness to the sub-action order by shrinking the temporal receptive field of action recognition models. For this, we design Video BagNet, a variant of the 3D ResNet-50 model with the temporal receptive field size limited to 1, 9, 17 or 33 frames. We analyze Video BagNet on synthetic and real-world video datasets and experimentally compare models with varying temporal receptive fields. We find that short receptive fields are robust to sub-action order changes, while larger temporal receptive fields are sensitive to the sub-action order. In this repository, we provide our code, including the implementation of Video BagNet."
authors:
  - family-names: Strafforello
    given-names: Ombretta
  - family-names: Liu
    given-names: Xin
  - family-names: van Gemert
    given-names: Jan
    orcid: "https://orcid.org/0000-0002-6913-0482"
  - family-names: Schutte
    given-names: Klamer
    orcid: "https://orcid.org/0000-0002-9954-0685"
title: 'Code underlying the publication: "Video BagNet: short temporal receptive fields increase robustness in long-term action recognition"'
version: 1
identifiers:
  - type: doi
    value: 10.4121/dc5e2fb8-6005-40cd-9afa-ff03c57d0a23.v1
license: CC0-1.0
date-released: 2024-05-24