Embeddings and Vector Databases

In this tutorial, we will learn about vector stores and Chroma DB, an open-source database for storing and managing embeddings. We will also learn how to convert text into embeddings, add and remove documents, and perform similarity searches.

Joel Koh

19 Apr 2024

What Are Vector Stores?

Vector stores are databases designed specifically for storing and retrieving vector embeddings efficiently. They are needed because traditional relational (SQL) databases are not optimized for storing and querying large volumes of vector data.

Embeddings represent data (usually unstructured data such as text) as numerical vectors in a high-dimensional space, where similar items lie close together. Relational databases are built around exact-match and range lookups, so they are not well suited to searching these vector representations by similarity.

Vector stores index vectors and quickly search for similar ones using similarity algorithms. This allows applications to find related vectors given a target query vector.
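To make the idea of a similarity search concrete, here is a minimal brute-force sketch in plain Python. The 3-dimensional vectors and the names `store`, `cosine_similarity`, and `top_k` are made up for illustration; real embeddings have hundreds of dimensions, and vector stores typically rely on approximate nearest-neighbor indexes rather than comparing against every stored vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        vectors,
        key=lambda vec_id: cosine_similarity(query, vectors[vec_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy embeddings keyed by document id.
store = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

nearest = top_k([1.0, 0.0, 0.0], store, k=2)
print(nearest)
```

The query vector points in almost the same direction as the "cat" and "kitten" vectors, so those two ids rank ahead of "car".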

In the case of a personalized chatbot, the user enters a prompt for the generative AI model. The application then searches a collection of documents for similar text using a similarity search algorithm, and the retrieved information is passed to the model to generate a more personalized and accurate response. This is made possible by embedding and vector indexing within vector stores.

What is Chroma DB?

Chroma DB is an open-source vector store used for storing and retrieving vector embeddings. Its main use is to save embeddings along with metadata so that large language models can use them later. It can also power semantic search engines over text data.

How does Chroma DB work?

First, you create a collection, which is similar to a table in a relational database. By default, Chroma converts text into embeddings using all-MiniLM-L6-v2, but you can configure the collection to use another embedding model.

Next, add text documents to the newly created collection, each with metadata and a unique ID. When the collection receives the text, it automatically converts it into embeddings.

Finally, query the collection by text or embedding to retrieve similar documents. You can also filter the results based on metadata.
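The three steps above can be sketched with the `chromadb` Python client as follows. This is a minimal sketch under stated assumptions, not the tutorial's original listing: the collection name `demo_basics`, the document texts, the metadata values, and the 3-dimensional vectors are all made up for illustration. Precomputed vectors are passed through `embeddings` and `query_embeddings` so that the example runs offline. If you leave them out and query with `query_texts` instead, Chroma embeds the text itself with its default model, as described above.

```python
def run_basic_workflow():
    """Create a collection, add three documents, and run a filtered query."""
    # Third-party package (pip install chromadb), imported inside the
    # function so the definition loads even where chromadb is not installed.
    import chromadb

    client = chromadb.Client()  # in-memory client; nothing is written to disk

    # Step 1: create a collection (hypothetical name).
    collection = client.create_collection(name="demo_basics")

    # Step 2: add documents with metadata and unique IDs.
    # The toy vectors stand in for real embeddings.
    collection.add(
        ids=["id1", "id2", "id3"],
        documents=[
            "Chroma is a vector store.",
            "Cars need regular maintenance.",
            "Vector stores index embeddings.",
        ],
        metadatas=[{"source": "notes"}, {"source": "notes"}, {"source": "blog"}],
        embeddings=[[0.9, 0.1, 0.0], [0.0, 0.1, 0.9], [0.8, 0.2, 0.1]],
    )

    # Step 3: query by embedding and keep only records whose metadata matches.
    return collection.query(
        query_embeddings=[[1.0, 0.0, 0.0]],
        n_results=2,
        where={"source": "notes"},
    )
```

Calling `run_basic_workflow()` returns a dictionary whose `ids`, `documents`, `metadatas`, and `distances` entries each hold one list per query, ordered from nearest to farthest.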

Embeddings

You can use any of the high-performing embedding models that Chroma supports, and you can also create your own custom embedding functions.
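As a rough illustration of what a custom embedding function looks like, the sketch below follows the shape Chroma documents for them: a callable that receives a list of documents through an `input` argument and returns one vector per document. The class name `HashingEmbeddingFunction` and its word-hashing scheme are hypothetical and only meant to show the interface; it is not a useful semantic model. To plug something like this into a collection you would normally subclass Chroma's `EmbeddingFunction` and pass an instance as `embedding_function` when creating the collection, and the exact requirements may differ between Chroma versions.

```python
import hashlib
import math


class HashingEmbeddingFunction:
    """Toy embedding: count words per hash bucket, then L2-normalize."""

    def __init__(self, dim=8):
        self.dim = dim

    def __call__(self, input):
        vectors = []
        for text in input:
            vec = [0.0] * self.dim
            for word in text.lower().split():
                # Use the first byte of a stable hash to pick a bucket.
                digest = hashlib.sha256(word.encode("utf-8")).digest()
                vec[digest[0] % self.dim] += 1.0
            norm = math.sqrt(sum(v * v for v in vec)) or 1.0
            vectors.append([v / norm for v in vec])
        return vectors


embed = HashingEmbeddingFunction(dim=8)
vectors = embed(["Chroma stores embeddings", "chroma stores embeddings"])
print(len(vectors), len(vectors[0]))
```

Because the text is lowercased before hashing, the two inputs above map to the same vector.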

In this section, we will use the OpenAI embedding model called “text-embedding-ada-002” to convert text into embeddings.

After creating the OpenAI embedding function, you can pass it a list of text documents to generate their embeddings.
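A sketch of how this might look with Chroma's bundled OpenAI embedding function is shown below. It assumes that the `chromadb` and `openai` packages are installed and that your API key is available in the `OPENAI_API_KEY` environment variable. The helper name `make_openai_collection` and the collection name `openai_docs` are hypothetical, and the constructor arguments may vary between Chroma versions, so check them against the version you have installed.

```python
import os


def make_openai_collection(client, name="openai_docs"):
    """Return a collection whose documents are embedded through the OpenAI API."""
    api_key = os.environ.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("Set the OPENAI_API_KEY environment variable first.")

    # Third-party packages: pip install chromadb openai
    from chromadb.utils import embedding_functions

    openai_ef = embedding_functions.OpenAIEmbeddingFunction(
        api_key=api_key,
        model_name="text-embedding-ada-002",
    )
    # Documents added to this collection without explicit embeddings
    # are sent to the OpenAI API to be embedded.
    return client.get_or_create_collection(name=name, embedding_function=openai_ef)
```

With a client such as `chromadb.Client()`, calling `make_openai_collection(client)` and then `collection.add(ids=[...], documents=[...])` on the result would generate the embeddings through the OpenAI API, which requires network access and is billed to your account.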

Updating and Removing Data

Just like in relational databases, you can update or remove records in a collection. To update a record's text and metadata, provide its ID along with the new text and metadata.
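The following sketch shows one way to update and delete records with the `chromadb` client. The collection name `demo_updates`, the texts, and the toy vectors are made up for illustration. A new vector is passed along with the updated text so that the example runs offline; without it, Chroma would re-embed the new text using the collection's embedding function.

```python
def run_update_and_delete():
    """Add two records, update the first one, and delete the second."""
    # Third-party package (pip install chromadb), imported inside the
    # function so the definition loads even where chromadb is not installed.
    import chromadb

    client = chromadb.Client()
    collection = client.create_collection(name="demo_updates")  # hypothetical name
    collection.add(
        ids=["id1", "id2"],
        documents=["Chroma is a vector store.", "Draft note to delete."],
        metadatas=[{"source": "notes"}, {"source": "scratch"}],
        embeddings=[[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]],
    )

    # Update the text, metadata, and vector of the record with ID "id1".
    collection.update(
        ids=["id1"],
        documents=["Chroma DB is an open-source vector store."],
        metadatas=[{"source": "docs"}],
        embeddings=[[0.8, 0.2, 0.0]],
    )

    # Remove the record with ID "id2".
    collection.delete(ids=["id2"])

    return collection.count(), collection.get(ids=["id1"])
```

After the call, the collection holds a single record, and `get` returns the updated text and metadata for `id1`.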

Collection Management

In this section, we will learn about the collection utility functions for counting, renaming, listing, and deleting collections.

We will create a new collection called “vectordb” and add the information about the Chroma DB cheat sheet, documentation, and JS API with metadata.
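A sketch of these collection utilities is shown below. The collection name `vectordb` comes from the text above, while the document texts, metadata values, toy vectors, and the new name `chroma_info` are assumptions made for illustration. The return type of `list_collections` has differed between Chroma versions (collection objects in some, plain names in others), so the sketch handles both.

```python
def run_collection_management():
    """Create, count, rename, list, and delete a collection."""
    # Third-party package (pip install chromadb), imported inside the
    # function so the definition loads even where chromadb is not installed.
    import chromadb

    client = chromadb.Client()
    collection = client.create_collection(name="vectordb")
    collection.add(
        ids=["id1", "id2", "id3"],
        documents=[
            "Chroma DB cheat sheet",
            "Chroma DB documentation",
            "Chroma DB JS API",
        ],
        metadatas=[
            {"type": "cheat sheet"},
            {"type": "documentation"},
            {"type": "js api"},
        ],
        # Toy vectors so that no embedding model is needed.
        embeddings=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    )

    count = collection.count()  # number of records in the collection

    collection.modify(name="chroma_info")  # rename (hypothetical new name)

    def collection_names():
        # Works whether list_collections() yields objects or plain names.
        return [getattr(c, "name", c) for c in client.list_collections()]

    names_after_rename = collection_names()
    client.delete_collection(name="chroma_info")  # drop the collection entirely
    names_after_delete = collection_names()

    return count, names_after_rename, names_after_delete
```

The function returns the record count together with the collection names seen after the rename and after the deletion, which makes it easy to confirm that each step took effect.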

Conclusion

Vector stores like Chroma DB are becoming essential components of large language model systems. By providing specialized storage and efficient retrieval of vector embeddings, they enable fast access to relevant semantic information to power LLMs.

In this Chroma DB tutorial, we covered the basics of creating a collection, adding documents, converting text to embeddings, querying for semantic similarity, and managing the collections.

Joel leads software development at Fin X's Technology Management & Digital Solutions practice. He leads the firm's Asia Pacific office efforts in continually evolving and growing new capabilities to serve clients in technology-led transformations.
