Video Classification
Classification of a video segment into a single human action (action recognition)

This system predicts a single human action for a given video segment. It currently outputs only one label per video; extending the work could enable multi-label action recognition and temporal localization of actions.
- Input: a video segment; a maximum of 128 frames is sampled from each video.
- Output: a single human action label
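The document states only that at most 128 frames are sampled per video; the sampling strategy below (uniform spacing) is an assumption for illustration:

```python
def sample_frame_indices(num_frames: int, max_samples: int = 128) -> list[int]:
    """Return at most `max_samples` frame indices, spaced uniformly.

    NOTE: uniform sampling is an assumed strategy; the source only
    specifies the 128-frame cap, not how frames are chosen.
    """
    if num_frames <= max_samples:
        return list(range(num_frames))
    step = num_frames / max_samples
    return [int(i * step) for i in range(max_samples)]
```

For a 1000-frame video this yields 128 roughly evenly spaced indices; shorter videos are used in full.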
Model: The system consists of two phases. In phase one, features are extracted from the sampled frames with a pretrained model (SimCLRv2 [1]), which was trained on an image dataset for an image classification task. In phase two, the frame-level features pass through a dimensionality-reduction layer and a transformer-based encoder, followed by a detection head that performs human action recognition. The pipeline is trained on ActivityNet [2] (200 action classes) and Kinetics 400 [3] (400 action classes).
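A minimal PyTorch sketch of the phase-two architecture described above. All dimensions, layer counts, and the mean-pooling step are assumptions; the document specifies only the sequence dimensionality reduction → transformer encoder → detection head:

```python
import torch
import torch.nn as nn

class ActionRecognitionHead(nn.Module):
    """Hypothetical phase-two module: dimensionality reduction,
    transformer encoder, and a classification head over action labels.
    Hyperparameters are illustrative, not the actual trained config."""

    def __init__(self, feat_dim=2048, model_dim=256, num_classes=200,
                 num_layers=2, num_heads=4):
        super().__init__()
        self.reduce = nn.Linear(feat_dim, model_dim)  # dimensionality reduction
        layer = nn.TransformerEncoderLayer(
            d_model=model_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(model_dim, num_classes)  # action-label head

    def forward(self, frame_features):         # (batch, frames, feat_dim)
        x = self.reduce(frame_features)
        x = self.encoder(x)
        x = x.mean(dim=1)                      # pool over the frame axis (assumed)
        return self.head(x)                    # (batch, num_classes)
```

The frame features would come from the frozen SimCLRv2 backbone; only this head would then need to be trained on the video datasets.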
Evaluation: The model achieves a top-1 accuracy of 64.7% and a top-5 accuracy of 86.4% on ActivityNet.
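Top-k accuracy, as reported above, counts a prediction as correct when the true label is among the k highest-scoring classes. A small self-contained sketch:

```python
def topk_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k top-scoring classes.

    scores: list of per-class score lists, one per sample
    labels: list of true class indices
    """
    hits = 0
    for row, label in zip(scores, labels):
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in topk
    return hits / len(labels)
```

Top-5 accuracy is always at least as high as top-1, which matches the reported 86.4% vs. 64.7%.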
License information: The training pipeline was developed at Fraunhofer. The underlying feature-extraction model (SimCLRv2) is released under the Apache License, Version 2.0.
References:
[1] Chen, T., Kornblith, S., Swersky, K., Norouzi, M., & Hinton, G. (2020). Big Self-Supervised Models are Strong Semi-Supervised Learners. arXiv preprint arXiv:2006.10029.
[2] Heilbron, F., Escorcia, V., Ghanem, B., & Niebles, J. (2015). ActivityNet: A large-scale video benchmark for human activity understanding. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 961-970).
[3] Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., Suleyman, M., & Zisserman, A. (2017). The Kinetics Human Action Video Dataset. arXiv preprint arXiv:1705.06950.