Text embeddings help us measure the relatedness of texts (words, sentences, or whole paragraphs) in the context of an LLM.
LLM hosts such as OpenAI provide an API endpoint to fetch embeddings for given texts. The response of this embeddings API is a list of floats (a vector).
We could very well store the vector returned by the LLM API in a column of a database table for persistence.
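As a minimal sketch of such persistence, the vector can be serialized (here as JSON) into an ordinary text column. The table and column names below are made up for illustration; production systems often use a dedicated vector type or a BLOB instead.

```python
import json
import sqlite3

# In-memory database for illustration; schema names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT, embedding TEXT)"
)

# A short embedding vector, stored as a JSON string in a TEXT column.
vector = [0.12, -0.53, 0.91, 0.04]
conn.execute(
    "INSERT INTO documents (body, embedding) VALUES (?, ?)",
    ("hello world", json.dumps(vector)),
)

# Reading the vector back restores the original list of floats.
row = conn.execute("SELECT embedding FROM documents WHERE id = 1").fetchone()
restored = json.loads(row[0])
print(restored)  # [0.12, -0.53, 0.91, 0.04]
```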
Since embeddings capture relatedness in the context of an LLM, we can measure how close two texts are to each other by computing the vector distance between their embeddings.
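One common closeness measure is cosine similarity. The sketch below uses tiny made-up 3-dimensional "embeddings" (real ones have hundreds or thousands of dimensions); a value near 1 means the vectors point the same way, near 0 means they are unrelated.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings; the values are invented.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten))  # close to 1: related texts
print(cosine_similarity(cat, car))     # close to 0: unrelated texts
```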
Since we can store vectors in a database table, comparing an input vector against the stored vectors means computing the vector distance between the input vector and every row.
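That row-by-row comparison is a brute-force nearest-neighbor search. A sketch with invented data:

```python
import math

def euclidean(a, b):
    # Straight-line distance between two vectors of equal length.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Stand-in for a table of (id, embedding) rows; values are made up.
rows = [
    (1, [0.9, 0.1, 0.0]),
    (2, [0.0, 0.1, 0.9]),
    (3, [0.8, 0.2, 0.1]),
]

query = [0.82, 0.18, 0.08]

# Brute force: compute the distance to every row, then keep the closest.
nearest_id, nearest_dist = min(
    ((row_id, euclidean(vec, query)) for row_id, vec in rows),
    key=lambda pair: pair[1],
)
print(nearest_id)  # row 3 is closest to the query
```

This is O(n) per query: every row is touched, which is exactly what an index would let us avoid.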
Is there a way to create an index on vector values, so that every vector comparison is quicker than the brute-force approach of comparing against all the rows of vectors in the table?
It’s quite common to create indexes on columns to speed up comparisons, and thus fetching. It’s easy to create indexes on numeric, boolean, or string columns.
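For example, a conventional B-tree index on a numeric column is a one-liner in SQL. The schema below is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)"
)

# A conventional index on a numeric column: equality lookups on age
# no longer need to scan the whole table.
conn.execute("CREATE INDEX idx_users_age ON users (age)")

conn.executemany(
    "INSERT INTO users (name, age) VALUES (?, ?)",
    [("alice", 30), ("bob", 25), ("carol", 30)],
)

# EXPLAIN QUERY PLAN shows that SQLite uses the index for this query.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM users WHERE age = 30"
).fetchone()
print(plan)
```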
How do we create an index on a column of vectors?
Enter Approximate Nearest Neighbor (ANN). ANN helps us create an index on vectors and thus accelerate vector comparisons. Let’s look at tools that help us build ANN indexes:
- Libraries for Approximate Nearest Neighbors, such as Annoy
- Redis Search — Vector similarity
- Google Vertex AI — ANN service
- Facebook Faiss
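To build intuition for what these tools do under the hood, here is a toy sketch of one ANN idea, random-hyperplane hashing (locality-sensitive hashing): each vector gets a short bit signature based on which side of a few random hyperplanes it falls on, and a query is compared only against vectors in its own signature bucket. This is a simplified stand-in, not how Annoy or Faiss are actually implemented.

```python
import random

random.seed(42)

DIM = 8         # dimensionality of the toy vectors
NUM_PLANES = 4  # number of random hyperplanes, giving 4-bit signatures

# Random hyperplanes through the origin, each defined by a normal vector.
planes = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_PLANES)]

def signature(vec):
    # One bit per plane: which side of the hyperplane the vector falls on.
    return tuple(
        int(sum(p * v for p, v in zip(plane, vec)) >= 0) for plane in planes
    )

# "Index" 100 random vectors by bucketing them on their signature.
vectors = {i: [random.uniform(-1, 1) for _ in range(DIM)] for i in range(100)}
buckets = {}
for vec_id, vec in vectors.items():
    buckets.setdefault(signature(vec), []).append(vec_id)

# Query time: only vectors sharing the query's signature are candidates,
# instead of brute-forcing all 100 rows.
query = vectors[0]
candidates = buckets[signature(query)]
print(len(candidates), "candidates instead of", len(vectors))
```

Nearby vectors tend to land on the same side of most planes, so they usually share a bucket; the trade-off is that this is approximate, since a true nearest neighbor can occasionally hash into a different bucket.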
Use case walk-throughs
- Semantic Search with Approximate Nearest Neighbors and Text Embeddings
- Similarities on StackOverflow Questions
Mechanics of ANN — https://www.youtube.com/watch?v=DRbjpuqOsjk