Video Classification
Classification of a video segment into a single human action (action recognition)

This system predicts a single human action for a given video segment. It currently outputs only one label per video; extending the work could enable multi-label action recognition and temporal localization of actions.
- Input: a video segment; a maximum of 128 frames is sampled from each video.
- Output: a single human action label
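The document states only that at most 128 frames are sampled per video; the sampling strategy below (uniform spacing) is an assumption for illustration:

```python
def sample_frame_indices(num_frames: int, max_samples: int = 128) -> list[int]:
    """Return at most `max_samples` frame indices, spaced uniformly.

    NOTE: uniform sampling is an assumed strategy; the source only
    specifies the 128-frame cap, not how frames are chosen.
    """
    if num_frames <= max_samples:
        return list(range(num_frames))
    step = num_frames / max_samples
    return [int(i * step) for i in range(max_samples)]
```

For a 1000-frame video this yields 128 roughly evenly spaced indices; shorter videos are used in full.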
Model: The system consists of two phases. In phase one, features are extracted from the sampled frames with a pretrained model (SimCLRv2 [1]), which was trained on an image dataset for an image classification task. In phase two, the frame-level features pass through a dimensionality-reduction layer and a transformer-based encoder, followed by a detection head that performs human action recognition. The pipeline is trained on ActivityNet [2] (200 action classes) and Kinetics 400 [3] (400 action classes).
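A minimal PyTorch sketch of the phase-two architecture described above. All dimensions, layer counts, and the mean-pooling step are assumptions; the document specifies only the sequence dimensionality reduction → transformer encoder → detection head:

```python
import torch
import torch.nn as nn

class ActionRecognitionHead(nn.Module):
    """Hypothetical phase-two module: dimensionality reduction,
    transformer encoder, and a classification head over action labels.
    Hyperparameters are illustrative, not the actual trained config."""

    def __init__(self, feat_dim=2048, model_dim=256, num_classes=200,
                 num_layers=2, num_heads=4):
        super().__init__()
        self.reduce = nn.Linear(feat_dim, model_dim)  # dimensionality reduction
        layer = nn.TransformerEncoderLayer(
            d_model=model_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(model_dim, num_classes)  # action-label head

    def forward(self, frame_features):         # (batch, frames, feat_dim)
        x = self.reduce(frame_features)
        x = self.encoder(x)
        x = x.mean(dim=1)                      # pool over the frame axis (assumed)
        return self.head(x)                    # (batch, num_classes)
```

The frame features would come from the frozen SimCLRv2 backbone; only this head would then need to be trained on the video datasets.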
Evaluation: The model achieves a top-1 accuracy of 64.7% and a top-5 accuracy of 86.4% on ActivityNet.
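Top-k accuracy, as reported above, counts a prediction as correct when the true label is among the k highest-scoring classes. A small self-contained sketch:

```python
def topk_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k top-scoring classes.

    scores: list of per-class score lists, one per sample
    labels: list of true class indices
    """
    hits = 0
    for row, label in zip(scores, labels):
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in topk
    return hits / len(labels)
```

Top-5 accuracy is always at least as high as top-1, which matches the reported 86.4% vs. 64.7%.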
License information: The training pipeline was developed at Fraunhofer. The underlying feature-extraction model (SimCLRv2) is released under the Apache License, Version 2.0.
References:
[1] Chen, T., Kornblith, S., Swersky, K., Norouzi, M., & Hinton, G. (2020). Big Self-Supervised Models are Strong Semi-Supervised Learners. arXiv preprint arXiv:2006.10029.
[2] Heilbron, F., Escorcia, V., Ghanem, B., & Niebles, J. (2015). ActivityNet: A large-scale video benchmark for human activity understanding. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 961-970).
[3] Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., Suleyman, M., & Zisserman, A. (2017). The Kinetics Human Action Video Dataset. arXiv preprint arXiv:1705.06950.