Zero-Shot Visual Concept Recognition
Recognizing predefined visual concepts in an image or a video

The concept recognition analysis service detects one or more concepts that describe a visual scene or an image.
Input: An image file or a video file. For a video file, you can specify which frames to process.
Output: For the image, or for each processed video frame, a list of detected concepts is returned, sorted by the confidence score associated with each concept.
Model:
The concept recognition analysis service uses OpenAI's multi-modal CLIP model in a zero-shot classification manner. This allows the user to modify the concept bank, or create a new one, to be detected in query images or videos without retraining the model. Each concept in the bank is defined by a natural-language sentence or paragraph of up to 1024 characters (e.g., “A visual showing prediction of rainy weather”) and an associated label that is returned in the result (e.g., “weather forecast – rainy”). A default concept bank is provided that covers many useful visual concepts found in commonly broadcast content (e.g., news, sports, weather).
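
To make the zero-shot mechanism concrete, the following is a minimal sketch of classifying one image against a concept bank, using the open-source CLIP weights through the Hugging Face transformers library. The model name, file name, and concept-bank entries are illustrative assumptions, not the service's actual defaults or internal implementation.

# A minimal sketch of zero-shot concept recognition with CLIP, using the
# open-source weights via the Hugging Face transformers library. The model
# name, file name, and concept-bank entries below are illustrative.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Each concept pairs a natural-language description (what CLIP scores)
# with the label returned in the result.
concept_bank = [
    {"description": "A visual showing prediction of rainy weather",
     "label": "weather forecast – rainy"},
    {"description": "A news anchor presenting from a television studio",
     "label": "news – studio anchor"},
    {"description": "Players on a field during a sports match",
     "label": "sports – field game"},
]

image = Image.open("frame.jpg")  # an image, or one extracted video frame
inputs = processor(
    text=[c["description"] for c in concept_bank],
    images=image,
    return_tensors="pt",
    padding=True,
)

with torch.no_grad():
    outputs = model(**inputs)

# CLIP returns image-text similarity logits; a softmax over the concept
# bank turns them into per-concept confidence scores.
scores = outputs.logits_per_image.softmax(dim=-1).squeeze(0)

# Rank concepts by confidence, as the service does for each processed frame.
ranked = sorted(
    zip((c["label"] for c in concept_bank), scores.tolist()),
    key=lambda item: item[1],
    reverse=True,
)
for label, score in ranked:
    print(f"{label}: {score:.3f}")

Because classification reduces to comparing image and text embeddings, swapping in a new concept bank only changes the text inputs; no model weights are touched.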
Result details:
The concept recognition analysis service returns a ranked list of recognized concepts with their confidence scores. The number of recognized concepts may vary from one image to another.
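
For illustration only, the ranked result for one processed frame might be represented as the following Python structure; the field names ("label", "confidence") and the values are hypothetical, not the service's actual response schema.

ranked_concepts = [
    {"label": "sports – field game", "confidence": 0.81},      # most confident
    {"label": "news – studio anchor", "confidence": 0.12},
    {"label": "weather forecast – rainy", "confidence": 0.07},  # least confident
]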
References:
- Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. “Learning Transferable Visual Models from Natural Language Supervision.” In Proceedings of the 38th International Conference on Machine Learning (ICML), 2021.