Model: This system consists of two phases. In phase one, features are extracted from the sampled frames using a pretrained SimCLRv2 model, which was trained on an image dataset for an image classification task. In phase two, the extracted frame-level features pass through a dimensionality-reduction layer and a transformer-based encoder, followed by a detection head that performs human action recognition. The pipeline is trained on ActivityNet (200 action classes) and Kinetics-400 (400 action classes).
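The phase-two architecture described above can be sketched as follows. This is a minimal illustration, assuming a PyTorch implementation; the feature dimension, model width, layer count, and temporal pooling strategy are all assumptions, not details from the model card.

```python
import torch
import torch.nn as nn

class ActionRecognitionHead(nn.Module):
    """Hypothetical sketch of phase two: frame features -> dimensionality
    reduction -> transformer encoder -> classification head.
    All dimensions below are illustrative assumptions."""

    def __init__(self, feat_dim=2048, model_dim=256, num_classes=200,
                 num_layers=4, num_heads=8):
        super().__init__()
        self.reduce = nn.Linear(feat_dim, model_dim)  # dimensionality-reduction layer
        enc_layer = nn.TransformerEncoderLayer(
            d_model=model_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=num_layers)
        self.head = nn.Linear(model_dim, num_classes)  # detection head

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, feat_dim), e.g. SimCLRv2 features
        x = self.reduce(frame_feats)
        x = self.encoder(x)
        x = x.mean(dim=1)    # temporal average pooling over frames (an assumption)
        return self.head(x)  # per-class logits

model = ActionRecognitionHead()
logits = model(torch.randn(2, 16, 2048))  # 2 clips, 16 sampled frames each
print(logits.shape)  # torch.Size([2, 200])
```

In this sketch the SimCLRv2 backbone is treated as a frozen feature extractor, so only the reduction layer, encoder, and head would be trained in phase two.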
Evaluation: The model achieves top-1 and top-5 accuracies of 64.7% and 86.4%, respectively, on ActivityNet.
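Top-1 and top-5 accuracy here have the standard meaning: the fraction of samples whose true class is the highest-scoring prediction, or among the five highest-scoring predictions. A minimal sketch of the computation (written in PyTorch, an assumption; the example logits and labels are invented for illustration):

```python
import torch

def topk_accuracy(logits, labels, k=1):
    # Fraction of samples whose true label is among the k highest-scoring classes.
    topk = logits.topk(k, dim=1).indices             # (N, k) predicted class indices
    hits = (topk == labels.unsqueeze(1)).any(dim=1)  # (N,) per-sample correctness
    return hits.float().mean().item()

logits = torch.tensor([[0.1, 0.7, 0.2],
                       [0.5, 0.3, 0.2],
                       [0.2, 0.2, 0.6]])
labels = torch.tensor([2, 0, 2])
print(topk_accuracy(logits, labels, k=1))  # 0.666... (2 of 3 correct)
print(topk_accuracy(logits, labels, k=2))  # 1.0
```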
License information: The training pipeline was developed at Fraunhofer. The underlying feature-extraction model (SimCLRv2) is released under the Apache License, Version 2.0.
 Chen, T., Kornblith, S., Swersky, K., Norouzi, M., & Hinton, G. (2020). Big Self-Supervised Models are Strong Semi-Supervised Learners. arXiv preprint arXiv:2006.10029.
 Heilbron, F., Escorcia, V., Ghanem, B., & Niebles, J. (2015). ActivityNet: A large-scale video benchmark for human activity understanding. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 961-970).
Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., Suleyman, M., & Zisserman, A. (2017). The Kinetics Human Action Video Dataset. arXiv preprint arXiv:1705.06950.