Text-to-Visual Search Engine
A Text-to-Visual Search Engine that enables searching images in a collection using natural-language sentences as queries.

Main Characteristic
The Text-to-Visual Search Engine takes a sentence as input and, querying an available image database, returns the images most relevant to it. Internally, it uses a multi-modal deep neural network to create multi-modal descriptors, and a similarity search engine to efficiently compute similarities among them.
The pipeline first computes visual descriptors from an existing database of images (offline indexing phase). Then, during the search phase, a query sentence is matched against the multi-modal index to retrieve related images, as sketched below.
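As a rough illustration of the two phases, the sketch below assumes the model exposes hypothetical encode_image and encode_text functions that map inputs into the shared multi-modal space as L2-normalized descriptors; these names and the brute-force ranking are illustrative only and are not part of the released code.

```python
# Minimal sketch of the offline-indexing / online-search pipeline.
# encode_image / encode_text are ASSUMED placeholders standing in for
# the asset's visual and textual encoding pipelines.
import numpy as np

def encode_image(image) -> np.ndarray:
    """Placeholder: return an L2-normalized visual descriptor (d,)."""
    raise NotImplementedError

def encode_text(sentence: str) -> np.ndarray:
    """Placeholder: return an L2-normalized textual descriptor (d,)."""
    raise NotImplementedError

# --- Offline indexing phase: embed every database image once. ---
def build_index(images) -> np.ndarray:
    # Stack descriptors into an (N, d) matrix that acts as the index.
    return np.stack([encode_image(img) for img in images])

# --- Search phase: embed the query sentence and rank the images. ---
def search(index: np.ndarray, sentence: str, k: int = 10):
    query = encode_text(sentence)      # (d,) descriptor of the query
    scores = index @ query            # cosine similarity (descriptors are normalized)
    top = np.argsort(-scores)[:k]     # indices of the k most similar images
    return top, scores[top]
```

For large image databases, the brute-force dot product above would in practice be replaced by the dedicated similarity search engine mentioned earlier, e.g., an approximate nearest-neighbor index.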
Research areas
Physical AI
Last updated
21.04.2023 - 14:37
Detailed Description
Additional information:
- Instructions on how to set up the model can be found in the GitHub repository: URL.
- The original code for training the model on MS-COCO is here: URL.
- A working version of the model, with image features extracted from Flickr100K, is published as an Acumos component in the AI4EU Experiments marketplace: URL.
References
If you find this asset useful, please cite the following papers:
- Messina, N., Falchi, F., Esuli, A., and Amato, G. (2021, January). Transformer reasoning network for image-text matching and retrieval. In 2020 25th International Conference on Pattern Recognition (ICPR) (pp. 5222-5229). IEEE.
- Messina, N., Amato, G., Esuli, A., Falchi, F., Gennaro, C., and Marchand-Maillet, S. (2021). Fine-grained visual textual alignment for cross-modal retrieval using transformer encoders. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 17(4), 1-23.
Trustworthy AI
It is a Transformer-based approach trained on images and sentences from public datasets.
GDPR Requirements
It is a Transformer-based approach trained on images and sentences from public datasets.