Embeddings and Vector Databases With ChromaDB



The era of large language models (LLMs) is here, bringing with it rapidly evolving libraries like ChromaDB that help augment LLM applications. You’ve most likely heard of chatbots like OpenAI’s ChatGPT, and perhaps you’ve even experienced their remarkable ability to reason about natural language processing (NLP) problems.

Modern LLMs, while imperfect, can accurately solve a wide range of problems and provide correct answers to many questions. But, due to the limits of their training and the number of text tokens they can process, LLMs aren’t a silver bullet for all tasks.

You wouldn’t expect an LLM to provide relevant responses about topics that don’t appear in their training data. For example, if you asked ChatGPT to summarize information in confidential company documents, then you’d be out of luck. You could show some of these documents to ChatGPT, but there’s a limited number of documents that you can upload before you exceed ChatGPT’s maximum number of tokens. How would you select documents to show ChatGPT?

To address these shortcomings and scale your LLM applications, one great option is to use a vector database like ChromaDB. A vector database allows you to store encoded unstructured objects, like text, as lists of numbers that you can compare to one another. You can, for example, find a collection of documents relevant to a question that you want an LLM to answer.
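To preview the idea with toy numbers, the sketch below ranks stored vectors by cosine similarity to a query vector. The two-dimensional vectors here are made up purely for illustration — real embedding models produce vectors with hundreds or thousands of dimensions:

```python
import numpy as np

# Hypothetical 2-D "embeddings" for three stored documents.
# Real embeddings are much higher-dimensional.
documents = np.array([
    [1.0, 0.0],  # document 0
    [0.0, 1.0],  # document 1
    [0.7, 0.7],  # document 2
])

query = np.array([0.9, 0.1])  # made-up query vector

# Cosine similarity: dot product of unit-normalized vectors
doc_unit = documents / np.linalg.norm(documents, axis=1, keepdims=True)
query_unit = query / np.linalg.norm(query)
similarities = doc_unit @ query_unit

best = int(np.argmax(similarities))  # index of the most similar document
```

A vector database performs essentially this ranking, but at scale and with efficient indexing, so you can retrieve only the most relevant documents to pass to an LLM.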

In this tutorial, you’ll learn about:

  • Representing unstructured objects with vectors
  • Using word and text embeddings in Python
  • Harnessing the power of vector databases
  • Encoding and querying over documents with ChromaDB
  • Providing context to LLMs like ChatGPT with ChromaDB

After reading, you’ll have the foundational knowledge to use ChromaDB in your NLP or LLM applications. Before reading, you should be comfortable with the basics of Python and high school math.

Represent Data as Vectors

Before diving into embeddings and vector databases, you should understand what vectors are and what they represent. Feel free to skip ahead to the next section if you’re already comfortable with vector concepts. If you’re not or if you could use a refresher, then keep reading!

Vector Basics

You can describe vectors with variable levels of complexity, but one great starting place is to think of a vector as an array of numbers. For example, you could represent vectors using NumPy arrays as follows:

Python
>>> import numpy as np

>>> vector1 = np.array([1, 0])
>>> vector2 = np.array([0, 1])
>>> vector1
array([1, 0])

>>> vector2
array([0, 1])

In this code block, you import numpy and create two arrays, vector1 and vector2, representing vectors. This is one of the most common and useful ways to work with vectors in Python, and NumPy offers a variety of functionality to manipulate vectors. There are also several other libraries that you can use to work with vector data, such as PyTorch, TensorFlow, JAX, and Polars. You’ll stick with NumPy for this overview.

You’ve created two NumPy arrays that represent vectors. Now what? It turns out you can do a lot of cool things with vectors, but before continuing on, you’ll need to understand some key definitions and properties:

  • Dimension: The dimension of a vector is the number of elements that it contains. In the example above, vector1 and vector2 are both two-dimensional since they each have two elements. You can only visualize vectors with three dimensions or less, but generally, vectors can have any number of dimensions. In fact, as you’ll see later, vectors that encode words and text tend to have hundreds or thousands of dimensions.

  • Magnitude: The magnitude of a vector is a non-negative number that represents the vector’s size or length. You can also refer to the magnitude of a vector as the norm, and you can denote it with ||v|| or |v|. There are many different definitions of magnitude or norm, but the most common is the Euclidean norm or 2-norm. You’ll learn how to compute this later.

  • Unit vector: A unit vector is a vector with a magnitude of one. In the example above, vector1 and vector2 are unit vectors.

  • Direction: The direction of a vector specifies the line along which the vector points. You can represent direction using angles, unit vectors, or coordinates in different coordinate systems.

  • Dot product (scalar product): The dot product of two vectors, u and v, is a number given by u ⋅ v = ||u|| ||v|| cos(θ), where θ is the angle between the two vectors. Another way to compute the dot product is to do an element-wise multiplication of u and v and sum the results. The dot product is one of the most important and widely used vector operations because it measures the similarity between two vectors. You’ll see more of this later on.

  • Orthogonal vectors: Vectors are orthogonal if their dot product is zero, meaning that they’re at a 90 degree angle to each other. You can think of orthogonal vectors as being completely unrelated to each other.

  • Dense vector: A vector is considered dense if most of its elements are non-zero. Later on, you’ll see that words and text are most usefully represented with dense vectors because each dimension encodes meaningful information.

While there are many more definitions and properties to learn, these seven are the most important for this tutorial. To solidify these ideas with code, check out the following block. Note that for the rest of this tutorial, you’ll use v1, v2, and v3 to name your vectors:

Python
>>> import numpy as np

>>> v1 = np.array([1, 0])
>>> v2 = np.array([0, 1])
>>> v3 = np.array([np.sqrt(2), np.sqrt(2)])

>>> # Dimension
>>> v1.shape
(2,)

>>> # Magnitude
>>> np.sqrt(np.sum(v1**2))
1.0
>>> np.linalg.norm(v1)
1.0

>>> np.linalg.norm(v3)
2.0

>>> # Dot product
>>> np.sum(v1 * v2)
0

>>> v1 @ v3
1.4142135623730951

You first import numpy and create the arrays v1, v2, and v3. Calling v1.shape shows you the dimension of v1. You then see two different ways to compute the magnitude of a NumPy array. The first, np.sqrt(np.sum(v1**2)), uses the Euclidean norm that you learned about above. The second computation uses np.linalg.norm(), a NumPy function that computes the Euclidean norm of an array by default but can also compute other matrix and vector norms.

Lastly, you see two ways to calculate the dot product between two vectors. With np.sum(v1 * v2), you first compute the element-wise product of v1 and v2 in a vectorized fashion and then sum the results to produce a single number. A cleaner way to compute the dot product is the at-operator (@), as you see with v1 @ v3, because @ can perform both vector and matrix multiplications and its syntax is more concise.
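As a quick sanity check, NumPy also provides np.dot(), and all three approaches agree:

```python
import numpy as np

v1 = np.array([1, 0])
v2 = np.array([0, 1])

# Three equivalent ways to compute the same dot product
a = np.sum(v1 * v2)  # element-wise multiply, then sum
b = v1 @ v2          # the at-operator
c = np.dot(v1, v2)   # NumPy's dedicated function
# All three evaluate to 0 because v1 and v2 are orthogonal.
```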

While all of these vector definitions and properties may seem straightforward to compute, you might still be wondering what they actually mean and why they’re important to understand. One way to better understand vectors is to visualize them in two dimensions. In this context, you can represent vectors as arrows, like in the following plot:

[Figure: Representing vectors as arrows in two dimensions]

The above plot shows the visual representation of the vectors v1, v2, and v3 that you worked with in the last example. The tail of each vector arrow always starts at the origin, and the tip is located at the coordinates specified by the vector. As an example, the tip of v1 lies at (1, 0), and the tip of v3 lies at roughly (1.414, 1.414). The length of each vector arrow corresponds to the magnitude that you calculated earlier.

From this visual, you can make the following key inferences:

  1. v1 and v2 are unit vectors because their magnitude, given by the arrow length, is one. v3 isn’t a unit vector, and its magnitude is two, twice the size of v1 and v2.

  2. v1 and v2 are orthogonal because they meet at a 90 degree angle. You can see this visually, and you can also verify it computationally with the dot product. By the dot product definition, v1 ⋅ v2 = ||v1|| ||v2|| cos(θ), so when θ = 90°, cos(θ) = 0 and v1 ⋅ v2 = 0. Intuitively, you can think of v1 and v2 as being totally unrelated or having nothing to do with each other. This will become important later.

  3. v3 makes a 45 degree angle with both v1 and v2. This means that v3 will have a non-zero dot product with v1 and v2. This also means that v3 is equally related to both v1 and v2. In general, the smaller the angle between two vectors, the more they point toward a common direction.

Read the full article at https://realpython.com/chromadb-vector-database/ »




November 15, 2023 at 07:30PM
