Text-to-Visual Search Engine
A Text-to-Visual Search Engine that enables searching images in a collection using natural-language sentences as queries.

Main Characteristic
The Text-to-Visual Search Engine takes a sentence as input and, querying an available image database, returns the images most relevant to it. Internally, it uses a multi-modal deep neural network to create multi-modal descriptors, and a similarity search engine to efficiently compute similarities among them.
The pipeline first computes visual descriptors from an existing database of images (offline indexing phase). Then, during the search phase, a query sentence is matched against the multi-modal index to retrieve related images, as sketched below.
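As a rough illustration of the two phases, the sketch below assumes the model exposes hypothetical encode_image and encode_text functions that map inputs into the shared multi-modal space as L2-normalized descriptors; these names and the brute-force ranking are illustrative only and are not part of the released code.

```python
# Minimal sketch of the offline-indexing / online-search pipeline.
# encode_image / encode_text are ASSUMED placeholders standing in for
# the asset's visual and textual encoding pipelines.
import numpy as np

def encode_image(image) -> np.ndarray:
    """Placeholder: return an L2-normalized visual descriptor (d,)."""
    raise NotImplementedError

def encode_text(sentence: str) -> np.ndarray:
    """Placeholder: return an L2-normalized textual descriptor (d,)."""
    raise NotImplementedError

# --- Offline indexing phase: embed every database image once. ---
def build_index(images) -> np.ndarray:
    # Stack descriptors into an (N, d) matrix that acts as the index.
    return np.stack([encode_image(img) for img in images])

# --- Search phase: embed the query sentence and rank the images. ---
def search(index: np.ndarray, sentence: str, k: int = 10):
    query = encode_text(sentence)      # (d,) descriptor of the query
    scores = index @ query            # cosine similarity (descriptors are normalized)
    top = np.argsort(-scores)[:k]     # indices of the k most similar images
    return top, scores[top]
```

For large image databases, the brute-force dot product above would in practice be replaced by the dedicated similarity search engine mentioned earlier, e.g., an approximate nearest-neighbor index.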
Research areas
Physical AI
Last updated
21.04.2023 - 14:37
Detailed Description
Additional information:
- Instructions on how to set up the model can be found in the GitHub repository: URL.
- The original code for training the model on MS-COCO is here: URL.
- A working version of the model, with image features extracted from Flickr100K, is published as an Acumos component in the AI4EU Experiments marketplace: URL.
References
If you find this asset useful, please cite the following papers:
- Messina, N., Falchi, F., Esuli, A., and Amato, G. (2021, January). Transformer reasoning network for image-text matching and retrieval. In 2020 25th International Conference on Pattern Recognition (ICPR) (pp. 5222-5229). IEEE.
- Messina, N., Amato, G., Esuli, A., Falchi, F., Gennaro, C., and Marchand-Maillet, S. (2021). Fine-grained visual textual alignment for cross-modal retrieval using transformer encoders. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 17(4), 1-23.
Trustworthy AI
It is a Transformer-based approach trained on images and sentences from public datasets.
GDPR Requirements
It is a Transformer-based approach trained on images and sentences from public datasets.