This blog is co-authored by James Le, Head of Developer Experience – TwelveLabs
The exponential growth of video content has created both opportunities and challenges. Content creators, marketers, and researchers are now faced with the daunting task of efficiently searching, analyzing, and extracting valuable insights from vast video libraries. Traditional search methods, such as keyword-based text search, often fall short with video because they cannot analyze the visual content, spoken words, or contextual elements within the video itself, leaving organizations struggling to search through and unlock the full potential of their multimedia assets.
With the integration of TwelveLabs’ Embed API and Amazon OpenSearch Service, we can change how we interact with and derive value from video content. By combining TwelveLabs’ advanced AI-powered video understanding technology with OpenSearch Service’s search and analytics capabilities, we can now perform advanced video discovery and gain deeper insights.
In this blog post, we show you the process of integrating TwelveLabs Embed API with OpenSearch Service to create a multimodal search solution. You’ll learn how to generate rich, contextual embeddings from video content and use OpenSearch Service’s vector database capabilities to enable search functionalities. By the end of this post, you’ll be equipped with the knowledge to implement a system that can transform the way your organization handles and extracts value from video content.
TwelveLabs’ multimodal embeddings process visual, audio, and text signals together to create unified representations, capturing the direct relationships between these modalities. This unified approach delivers precise, context-aware video search that matches human understanding of video content. Whether you’re a developer looking to enhance your applications with advanced video search capabilities, or a business leader seeking to optimize your content management strategies, this post will provide you with the tools and steps to implement multimodal search for your organizational data.
About TwelveLabs
TwelveLabs is an Advanced AWS Partner and AWS Marketplace Seller that offers video understanding solutions. Its Embed API is designed to revolutionize how you interact with and extract value from video content.
At its core, the Embed API transforms raw video content into meaningful, searchable data by using state-of-the-art machine learning models. These models extract and represent complex video information in the form of dense vector embeddings, each a standard 1024-dimensional vector that captures the essence of the video content across multiple modalities (image, text, and audio).
Key features of TwelveLabs Embed API
Below are the key features of TwelveLabs Embed API:
- Multimodal understanding: The API generates embeddings that encapsulate various aspects of the video, including visual expressions, body language, spoken words, and overall context.
- Temporal coherence: Unlike static image-based models, TwelveLabs’ embeddings capture the interrelations between different modalities over time, providing a more accurate representation of video content.
- Flexibility: The API supports native processing of all modalities present in videos, eliminating the need for separate text-only or image-only models.
- High performance: By using a video-native approach, the Embed API provides more accurate and temporally coherent interpretation of video content compared to traditional CLIP-like models.
Benefits and use cases
The Embed API offers numerous advantages for developers and businesses working with video content:
- Enhanced Search Capabilities: Enable powerful multimodal search across video libraries, allowing users to find relevant content based on visual, audio, or textual queries.
- Content Recommendation: Improve content recommendation systems by understanding the deep contextual similarities between videos.
- Scene Detection and Segmentation: Automatically detect and segment different scenes within videos for easier navigation and analysis.
- Content Moderation: Efficiently identify and flag inappropriate content across large video datasets.
Use cases include:
- Anomaly detection
- Diversity sorting
- Sentiment analysis
- Recommendations
Architecture overview
The architecture for using TwelveLabs Embed API and OpenSearch Service for advanced video search consists of the following components:
- TwelveLabs Embed API: This API generates 1024-dimensional vector embeddings from video content, capturing visual, audio, and textual elements.
- OpenSearch Vector Database: Stores and indexes the video embeddings generated by TwelveLabs.
- AWS Secrets Manager: Stores secrets such as API access keys and the Amazon OpenSearch Service username and password.
- Integration code: Uses the TwelveLabs SDK and the OpenSearch Service client to process videos, generate embeddings, and index them in OpenSearch Service.
The following diagram illustrates the workflow:
- A video file is stored in Amazon Simple Storage Service (Amazon S3). Embeddings of the video file are created using TwelveLabs Embed API.
- Embeddings generated from the TwelveLabs Embed API are now ingested to Amazon OpenSearch Service.
- Users can search the video embeddings using text, audio, or image queries. The TwelveLabs Embed API is used to create the corresponding query embeddings.
- The user searches video embeddings in Amazon OpenSearch Service and retrieves the corresponding vector.
The use case
For the demo, you will work with two videos: Robin bird forest Video by Federico Maderno from Pixabay and Island Video by Bellergy RC from Pixabay.
However, the use case can be expanded to various other segments. For example, a news organization might struggle with:
- Needle-in-haystack searches through thousands of hours of archival footage
- Manual metadata tagging that misses nuanced visual and audio context
- Cross-modal queries such as querying a video collection using text or audio descriptions
- Rapid content retrieval for breaking news tie-ins
By integrating TwelveLabs Embed API with OpenSearch Service, you can:
- Generate 1024-dimensional embeddings that capture each video’s visual concepts as well as its spoken narration, on-screen text, and audio cues.
- Enable multimodal search capabilities allowing users to:
- Find specific demonstrations using text-based queries.
- Locate activities through image-based queries.
- Identify segments using audio pattern matching.
- Reduce search time from hours to seconds for complex queries.
Solution walkthrough
The GitHub repository contains a notebook with detailed walkthrough instructions for implementing advanced video search capabilities by combining TwelveLabs’ Embed API with Amazon OpenSearch Service.
Prerequisites
Before you proceed further, verify that the following prerequisites are met:
- Confirm that you have an AWS account. Sign in to the AWS account.
- Create a TwelveLabs account, which is required to get the API key. TwelveLabs offers free tier pricing, but you can upgrade if necessary to meet your requirements.
- Have an Amazon OpenSearch Service domain. If you don’t have an existing domain, you can create one using the steps outlined in our public documentation for Creating and Managing Amazon OpenSearch Service Domain. Make sure that the OpenSearch Service domain is accessible from your Python environment. You can also use Amazon OpenSearch Serverless for this use case and update the interactions to OpenSearch Serverless using AWS SDKs.
Step 1: Set up the TwelveLabs SDK
Start by setting up the TwelveLabs SDK in your Python environment:
- Obtain your API key from TwelveLabs Dashboard.
- Follow the steps here to create a secret in AWS Secrets Manager. For example, name the secret TL_API_Key. Note down the ARN or name of the secret (TL_API_Key) to retrieve it. To retrieve a secret from another account, you must use an ARN. For an ARN, we recommend that you specify a complete ARN rather than a partial ARN (see Finding a secret from a partial ARN). Use this value for the SecretId in the code block below.
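The following is a minimal sketch of this setup, assuming the twelvelabs and boto3 Python packages are installed and that the secret stores the API key directly as its SecretString (if you stored it as a JSON key/value pair, parse it accordingly); adjust the Region to match your environment.

```python
import boto3
from twelvelabs import TwelveLabs

# Retrieve the TwelveLabs API key from AWS Secrets Manager
secrets_client = boto3.client("secretsmanager", region_name="us-east-1")  # adjust the Region
secret_response = secrets_client.get_secret_value(SecretId="TL_API_Key")
TL_API_KEY = secret_response["SecretString"]  # assumes the key is stored as a plain string

# Initialize the TwelveLabs client
twelvelabs_client = TwelveLabs(api_key=TL_API_KEY)
```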
Step 2: Generate video embeddings
Use the Embed API to create multimodal embeddings that are contextual vector representations for your videos and texts. TwelveLabs video embeddings capture all the subtle cues and interactions between different modalities, including the visual expressions, body language, spoken words, and the overall context of the video, encapsulating the essence of all these modalities and their interrelations over time.
To create video embeddings, you must first upload your videos, and the platform must finish processing them. Uploading and processing videos take some time. Consequently, creating embeddings is an asynchronous process that consists of three steps:
- Upload and process a video: When you start uploading a video, the platform creates a video embedding task and returns its unique task identifier.
- Monitor the status of your video embedding task: Use the unique identifier of your task to check its status periodically until it’s completed.
- Retrieve the embeddings: After the video embedding task is completed, retrieve the video embeddings by providing the task identifier. Learn more in the docs.
Video processing implementation
This demo depends on some video data. To use it, you will download two MP4 files and upload them to an Amazon S3 bucket.
- Click the links for the Robin bird forest Video by Federico Maderno from Pixabay and the Island Video by Bellergy RC from Pixabay.
- Download the 21723-320725678_small.mp4 and 2946-164933125_small.mp4 files.
- Create an S3 bucket if you don’t have one already, following the steps in the Creating a bucket doc. Note down the name of the bucket and replace it in the code block below (for example, MYS3BUCKET).
- Upload the 21723-320725678_small.mp4 and 2946-164933125_small.mp4 video files to the S3 bucket created in the previous step by following the steps in the Uploading objects doc. Note down the names of the objects and replace them in the code block below (for example, 21723-320725678_small.mp4 and 2946-164933125_small.mp4).
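One way to make the uploaded videos accessible to the Embed API is to generate presigned S3 URLs for the two objects, as in the sketch below; the bucket and object names are the placeholders mentioned above and should be replaced with your own values.

```python
import boto3

s3_client = boto3.client("s3")

S3_BUCKET = "MYS3BUCKET"  # replace with your bucket name
VIDEO_KEYS = ["21723-320725678_small.mp4", "2946-164933125_small.mp4"]  # replace with your object names

# Generate presigned URLs so the TwelveLabs Embed API can fetch the videos over HTTPS
video_urls = [
    s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": S3_BUCKET, "Key": key},
        ExpiresIn=3600,  # URL is valid for one hour
    )
    for key in VIDEO_KEYS
]
```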
Embedding generation process
With the SDK configured, generate embeddings for your video and monitor task completion with real-time updates. Here you use the Marengo 2.7 model to generate the embeddings:
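The following sketch illustrates the three asynchronous steps, assuming the twelvelabs_client and the VIDEO_KEYS/video_urls variables from the earlier snippets; the method and parameter names (embed.task.create, wait_for_done, retrieve, embeddings_float) follow a recent version of the TwelveLabs Python SDK, so check your installed version if they differ.

```python
def on_task_update(task):
    # Real-time status updates while the platform processes the video
    print(f"  Status={task.status}")

video_embeddings = {}
for video_key, video_url in zip(VIDEO_KEYS, video_urls):
    # 1. Upload and process the video: this creates an asynchronous embedding task
    task = twelvelabs_client.embed.task.create(
        model_name="Marengo-retrieval-2.7",
        video_url=video_url,
    )
    print(f"Created embedding task {task.id} for {video_key}")

    # 2. Monitor the task status until processing is complete
    task.wait_for_done(sleep_interval=5, callback=on_task_update)

    # 3. Retrieve the segment-level 1024-dimensional embeddings
    task = task.retrieve()
    video_embeddings[video_key] = task.video_embedding
    print(f"Retrieved {len(task.video_embedding.segments)} segment embeddings")
```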
Key features demonstrated include:
- Multimodal capture: 1024-dimensional vectors encoding visual, audio, and textual features
- Model specificity: Using Marengo-retrieval-2.7, which is optimized for retrieval use cases
- Progress tracking: Real-time status updates during embedding generation
Expected output
Step 3: Set up OpenSearch
To enable vector search capabilities, you first need to set up an OpenSearch client and test the connection. Follow these steps:
Install the required libraries
Install the necessary Python packages for working with OpenSearch:
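In a notebook environment, the installation might look like the following (opensearch-py is the OpenSearch Python client; boto3 is used for AWS access):

```python
# Install the OpenSearch Python client (and boto3, if not already present) into the notebook kernel
%pip install opensearch-py boto3
```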
Configure the OpenSearch client
Set up the OpenSearch client with your host details and authentication credentials:
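A minimal sketch is shown below; replace the host with your domain endpoint and supply the domain credentials (for example, retrieved from Secrets Manager), assuming the domain uses fine-grained access control with HTTP basic authentication.

```python
from opensearchpy import OpenSearch

OPENSEARCH_HOST = "search-my-domain.us-east-1.es.amazonaws.com"  # replace with your domain endpoint
OPENSEARCH_USER = "admin"              # for example, retrieved from Secrets Manager
OPENSEARCH_PASSWORD = "your-password"  # for example, retrieved from Secrets Manager

opensearch_client = OpenSearch(
    hosts=[{"host": OPENSEARCH_HOST, "port": 443}],
    http_auth=(OPENSEARCH_USER, OPENSEARCH_PASSWORD),
    use_ssl=True,
    verify_certs=True,
)

# Test the connection
print(opensearch_client.info())
```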
Expected output
If the connection is successful, you should see a message like the following:
This confirms that your OpenSearch client is properly configured and ready for use.
Step 4: Create an index in OpenSearch Service
Next, you create an index optimized for vector search to store the embeddings generated by the TwelveLabs Embed API.
Define the index configuration
The index is configured to support k-nearest neighbor (kNN) search with a 1024-dimensional vector field. You will use these values for this demo, but follow these best practices to find appropriate values for your application. Here’s the code:
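The sketch below shows one possible configuration; the HNSW method, cosine similarity space, and Lucene engine are illustrative choices, not the only valid ones.

```python
INDEX_NAME = "twelvelabs_index"

index_body = {
    "settings": {
        "index": {"knn": True}  # enable k-NN search on this index
    },
    "mappings": {
        "properties": {
            "embedding_field": {
                "type": "knn_vector",
                "dimension": 1024,  # matches the TwelveLabs embedding size
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "lucene",
                },
            },
            "video_title": {"type": "text"},
            "segment_start": {"type": "float"},
            "segment_end": {"type": "float"},
            "segment_id": {"type": "integer"},
        }
    },
}
```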
Create the index
Use the following code to create the index in OpenSearch Service:
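Using the opensearch_client and index_body from the previous snippets, the call might look like this:

```python
# Create the index only if it does not already exist
if not opensearch_client.indices.exists(index=INDEX_NAME):
    response = opensearch_client.indices.create(index=INDEX_NAME, body=index_body)
    print(response)
else:
    print(f"Index '{INDEX_NAME}' already exists")
```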
Expected output
After running this code, you should see details of the newly created index. For example:
The following screenshot confirms that an index named twelvelabs_index has been successfully created with a knn_vector field of dimension 1024 and other specified settings. With these steps completed, you now have an operational OpenSearch Service domain configured for vector search. This index will serve as the repository for storing embeddings generated from video content, enabling advanced multimodal search capabilities.
Step 5: Ingest embeddings to the created index in OpenSearch Service
With the TwelveLabs Embed API successfully generating video embeddings and the OpenSearch Service index configured, the next step is to ingest these embeddings into the index. This process helps ensure that the embeddings are stored in OpenSearch Service and made searchable for multimodal queries.
Embedding ingestion process
The following code demonstrates how to process and index the embeddings into OpenSearch Service:
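A sketch of the ingestion loop is shown below; it assumes the video_embeddings dictionary built in Step 2 and, for simplicity, uses the S3 object key as the video title. The segment attribute names follow the TwelveLabs SDK version used in the earlier snippets.

```python
doc_id = 0
for video_title, video_embedding in video_embeddings.items():
    for segment_id, segment in enumerate(video_embedding.segments):
        document = {
            "embedding_field": segment.embeddings_float,   # 1024-dimensional vector
            "video_title": video_title,
            "segment_start": segment.start_offset_sec,
            "segment_end": segment.end_offset_sec,
            "segment_id": segment_id,
        }
        # Index each segment as its own document, with an ID based on its position
        opensearch_client.index(index=INDEX_NAME, body=document, id=doc_id)
        doc_id += 1

print("All video segment embeddings indexed successfully")
```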
Explanation of the code
- Embedding extraction: The video_embedding.segments object contains a list of segment embeddings generated by the TwelveLabs Embed API. Each segment represents a specific portion of the video.
- Document creation: For each segment, a document is created with a key (embedding_field) that stores its 1024-dimensional vector, video_title with the title of the video, segment_start and segment_end indicating the timestamps of the video segment, and a segment_id.
- Indexing in OpenSearch: The index() method uploads each document to the twelvelabs_index created earlier. Each document is assigned a unique ID (doc_id) based on its position in the list.
Expected output
After the script runs successfully, you will see:
- A printed list of embeddings being indexed.
- A confirmation message:
Result
At this stage, all video segment embeddings are now stored in OpenSearch and ready for advanced multimodal search operations, such as text-to-video or image-to-video queries. This sets up the foundation for performing efficient and scalable searches across your video content.
Step 6: Perform vector search in OpenSearch Service
After a query embedding is generated, you use it as the query vector to perform a kNN search in the OpenSearch Service index. Below are the functions to perform a vector search and format the search results:
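A sketch of these helpers, reusing the opensearch_client and INDEX_NAME defined earlier, might look like the following:

```python
def vector_search(query_vector, k=5):
    """Run a kNN search against embedding_field and return the raw hits."""
    query = {
        "size": k,
        "query": {
            "knn": {
                "embedding_field": {
                    "vector": query_vector,
                    "k": k,
                }
            }
        },
        "_source": ["video_title", "segment_start", "segment_end", "segment_id"],
    }
    response = opensearch_client.search(index=INDEX_NAME, body=query)
    return response["hits"]["hits"]


def format_results(hits):
    """Print the similarity score, time range, and video title for each hit."""
    for hit in hits:
        source = hit["_source"]
        print(
            f"Score: {hit['_score']:.4f} | "
            f"{source['segment_start']:.1f}s - {source['segment_end']:.1f}s | "
            f"{source['video_title']}"
        )
```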
Key points:
- The _source field contains the video title, segment start, segment end, and segment ID corresponding to the video embeddings.
- The embedding_field in the query corresponds to the field where the video embeddings are stored.
- The k parameter specifies how many top results to retrieve based on similarity.
Step 7: Perform text-to-video search
You can use text-to-video search to retrieve video segments that are most relevant to a given textual query. In this solution, you will do this by using TwelveLabs’ text embedding capabilities and OpenSearch’s vector search functionality. Here’s how you can implement this step:
Generate text embeddings
To perform a search, you first need to convert the text query into a vector representation using the TwelveLabs Embed API:
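The sketch below assumes the embed.create interface of the TwelveLabs Python SDK and the twelvelabs_client created in Step 1:

```python
text_query = "Bird eating food"

# Create a 1024-dimensional embedding for the text query
text_result = twelvelabs_client.embed.create(
    model_name="Marengo-retrieval-2.7",
    text=text_query,
)
text_query_vector = text_result.text_embedding.segments[0].embeddings_float
```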
Key points:
- The Marengo-retrieval-2.7 model is used to generate a dense vector embedding for the query.
- The embedding captures the semantic meaning of the input text, enabling effective matching with video embeddings.
Perform vector search in OpenSearch Service
After the text embedding is generated, you use it as a query vector to perform a kNN search in the OpenSearch index:
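Reusing the helper functions from Step 6, the search might look like this:

```python
# Search the index with the text query embedding and print the top matches
text_hits = vector_search(text_query_vector, k=5)
format_results(text_hits)
```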
Expected output
The following illustrates the most similar results retrieved from OpenSearch Service.
Insights from results
- Each result includes a similarity score indicating how closely it matches the query, a time range indicating the start and end offset in seconds, and the video title.
- Observe that the top 2 results correspond to the robin bird video segments matching the Bird eating food query.
This process demonstrates how textual queries such as Bird eating food can effectively retrieve relevant video segments from an indexed library using TwelveLabs’ multimodal embeddings and OpenSearch’s powerful vector search capabilities.
Step 8: Perform audio-to-video search
You can use audio-to-video search to retrieve video segments that are most relevant to a given audio input. By using TwelveLabs’ audio embedding capabilities and OpenSearch’s vector search functionality, you can match audio features with video embeddings in the index. Here’s how to implement this step:
Generate audio embeddings
To perform the search, you first convert the audio input into a vector representation using the TwelveLabs Embed API:
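The sketch below mirrors the text case; the audio file URL is a hypothetical placeholder that should point to an audio clip accessible to the API (for example, a presigned S3 URL):

```python
# URL of the query audio clip; hypothetical placeholder, replace with your own
audio_url = "https://example.com/bird-chirping.mp3"

audio_result = twelvelabs_client.embed.create(
    model_name="Marengo-retrieval-2.7",
    audio_url=audio_url,
)
audio_query_vector = audio_result.audio_embedding.segments[0].embeddings_float
```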
Key points:
- The Marengo-retrieval-2.7 model is used to generate a dense vector embedding for the input audio.
- The embedding captures the semantic features of the audio, such as rhythm, tone, and patterns, enabling effective matching with video embeddings.
Perform vector search in OpenSearch Service
After the audio embedding is generated, you use it as a query vector to perform a k-nearest neighbor (kNN) search in OpenSearch:
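As with the text query, reuse the helpers from Step 6:

```python
# Search the index with the audio query embedding and print the top matches
audio_hits = vector_search(audio_query_vector, k=5)
format_results(audio_hits)
```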
Expected output
The following shows video segments retrieved from OpenSearch Service based on their similarity to the input audio.
Here, notice that segments from both videos are returned with low similarity scores.
Step 9: Perform image-to-video search
You can use image-to-video search to retrieve video segments that are visually similar to a given image. By using TwelveLabs’ image embedding capabilities and OpenSearch Service’s vector search functionality, you can match visual features from an image with video embeddings in the index. Here’s how to implement this step:
Generate image embeddings
To perform the search, you first convert the input image into a vector representation using the TwelveLabs Embed API:
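The sketch below follows the same pattern; the image URL is a hypothetical placeholder that should point to an image accessible to the API:

```python
# URL of the query image; hypothetical placeholder, replace with your own
image_url = "https://example.com/ocean.jpg"

image_result = twelvelabs_client.embed.create(
    model_name="Marengo-retrieval-2.7",
    image_url=image_url,
)
image_query_vector = image_result.image_embedding.segments[0].embeddings_float
```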
Key points:
- The Marengo-retrieval-2.7 model is used to generate a dense vector embedding for the input image.
- The embedding captures visual features such as shapes, colors, and patterns, enabling effective matching with video embeddings.
Perform vector search in OpenSearch
After the image embedding is generated, you use it as a query vector to perform a k-nearest neighbor (kNN) search in OpenSearch:
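Again reusing the helpers from Step 6:

```python
# Search the index with the image query embedding and print the top matches
image_hits = vector_search(image_query_vector, k=5)
format_results(image_hits)
```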
Expected output
The following shows video segments retrieved from OpenSearch based on their similarity to the input image.
Observe that an image of an ocean was used to search the videos. Video clips from the island video are retrieved with higher similarity scores in the first two results.
Clean up
To avoid charges, delete resources created while following this post. For Amazon OpenSearch Service domains, navigate to the AWS Management Console for Amazon OpenSearch Service dashboard and delete the domain.
Conclusion
The integration of TwelveLabs Embed API with OpenSearch Service provides a cutting-edge solution for advanced video search and analysis, unlocking new possibilities for content discovery and insights. By using TwelveLabs’ multimodal embeddings, which capture the intricate interplay of visual, audio, and textual elements in videos, and combining them with OpenSearch Service’s robust vector search capabilities, this solution enables highly nuanced and contextually relevant video search.
As industries increasingly rely on video content for communication, education, marketing, and research, this advanced search solution becomes indispensable. It empowers businesses to extract hidden insights from their video content, enhance user experiences in video-centric applications, and make data-driven decisions based on comprehensive video analysis.
This integration not only addresses current challenges in managing video content but also lays the foundation for future innovations in how we interact with and derive value from video data.
Get started
Ready to explore the power of TwelveLabs Embed API? Start your free trial today by visiting TwelveLabs Playground to sign up and receive your API key.
For developers looking to implement this solution, follow our detailed step-by-step guide on GitHub to integrate TwelveLabs Embed API with OpenSearch Service and build your own advanced video search application.
Unlock the full potential of your video content today!
About the Authors
James Le runs the Developer Experience function at TwelveLabs. He works with partners, developers, and researchers to bring state-of-the-art video foundation models to various multimodal video understanding use cases.
Gitika is a Senior WW Data & AI Partner Solutions Architect at Amazon Web Services (AWS). She works with partners on technical projects, providing architectural guidance and enablement to build their analytics practice.
Kruthi is a Senior Partner Solutions Architect specializing in AI and ML. She provides technical guidance to AWS Partners in following best practices to build secure, resilient, and highly available solutions in the AWS Cloud.