ℹ️ Features and requirements

Unitary's models are multimodal, meaning they can analyse text, images, and videos both in isolation and in combination. This includes audio/speech and on-screen text identified by optical character recognition (OCR). For example, if you send a video that includes audio, text within the frames, and a caption, Unitary will analyse all of these elements together to give you a contextual analysis of that video.

Unitary API response times range from sub-second to 24 hours. Please let us know if you want to negotiate a different response time.

Endpoint limitations

Modality    Maximum file size    Formats supported
Video                            .mp4, .mpeg, .webm, .mov, .mkv, .m4v
Image                            .png, .jpeg
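
A client can check a file's extension against the formats listed above before sending it. The following is a minimal sketch, not part of Unitary's API: the function name `detect_modality` and the labels "video" and "image" are assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the supported formats listed above.
VIDEO_EXTENSIONS = {".mp4", ".mpeg", ".webm", ".mov", ".mkv", ".m4v"}
IMAGE_EXTENSIONS = {".png", ".jpeg"}


def detect_modality(filename: str) -> Optional[str]:
    """Return "video" or "image" for a listed extension, else None.

    Hypothetical helper: it only inspects the file name and does not
    verify the actual file contents.
    """
    suffix = Path(filename).suffix.lower()
    if suffix in VIDEO_EXTENSIONS:
        return "video"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    return None
```

Because the check is driven purely by the list above, an extension such as `.jpg` is rejected by this sketch. Whether the API itself accepts it is not stated here.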




Recommendations on videos and images

  • For maximum efficiency, the recommended resolution is between 336 and 1200 pixels per side.

  • To increase accuracy, please send any text submitted by your users alongside the video or image in the same API request. Unitary's models will analyse everything together.

  • Media files can be sent either as multipart form uploads or, preferably, by including a url field in the request that points to where the media file can be downloaded. Pre-signed object-storage URLs are supported.

  • If you require sub-second latency, please make sure the server hosting the resource URL supports partial content delivery using the “Range” HTTP header.
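
The recommendations above can be sketched in client code. This is an illustration under stated assumptions, not Unitary's official client: only the `url` field name comes from the text above, while the `text` field name and every function name are hypothetical, and no real endpoint is called.

```python
from typing import Dict, Mapping, Optional, Tuple

MIN_SIDE = 336   # recommended lower bound, pixels per side
MAX_SIDE = 1200  # recommended upper bound, pixels per side


def within_recommended_range(width: int, height: int) -> bool:
    """True if both sides fall inside the recommended resolution range."""
    return MIN_SIDE <= min(width, height) and max(width, height) <= MAX_SIDE


def recommended_size(width: int, height: int) -> Tuple[int, int]:
    """Scale down so the longer side is at most MAX_SIDE, keeping aspect ratio.

    Media already within the upper bound is returned unchanged; this sketch
    never upscales, so a very elongated image may still have a shorter side
    below MIN_SIDE.
    """
    longest = max(width, height)
    if longest <= MAX_SIDE:
        return width, height
    scale = MAX_SIDE / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def build_request_body(media_url: str, user_text: Optional[str] = None) -> Dict[str, str]:
    """Build a request body that sends the media URL and user text together."""
    body = {"url": media_url}  # "url" is the field named in the text above
    if user_text:
        body["text"] = user_text  # hypothetical field name for user-submitted text
    return body


def range_probe_headers() -> Dict[str, str]:
    """Headers for a one-byte range request to test a media host."""
    return {"Range": "bytes=0-0"}


def supports_partial_content(status_code: int, headers: Mapping[str, str]) -> bool:
    """True if a response to a range request shows partial content delivery.

    A server that honours the range replies with 206 Partial Content and a
    Content-Range header; a plain 200 means the range was ignored.
    """
    names = {name.lower() for name in headers}
    return status_code == 206 and "content-range" in names
```

To use the last two helpers, send a GET request with `range_probe_headers()` to the media URL using any HTTP client, then pass the response status and headers to `supports_partial_content`.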

Add-on Features

Please let us know if you'd like to start using any of the following add-on features:

  • Include the Optical Character Recognition (OCR) transcription in the API response. The OCR transcription covers any text that appears in the image or video, for example captions for translations, words displayed on a T-shirt, or handwritten content. Unitary's models always check images and videos for such text and use it as input. With this add-on, the OCR transcript is also returned with the classification results.

  • Include speech transcriptions in the API response. A speech transcription is a literal transcription of the speech present in a video. Unitary's API feeds this speech into Unitary's models. With this add-on, the speech transcription is also returned with the classification results. This is currently available only in English and Spanish, and more languages may become available in the future.
