Features
Approximate Nearest Neighbor Search
The primary functionality of the vector store is straightforward: identifying the most similar vectors to a given vector. While the concept is simple, translating it into a practical product poses significant challenges.
A simple, naive approach to searching a vector database is an exhaustive search: comparing the query vector to every other vector stored in the database, one by one. This consumes too many resources and results in very high latencies, making it impractical at scale. To address this problem, Approximate Nearest Neighbor (ANN) algorithms are used. ANN search approximates the true nearest neighbor, which means it might not find the absolute closest point, but it will find one that is close enough, with low latency and fewer resources.
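For illustration, here is a minimal sketch of an exhaustive search in Python with NumPy (the dataset and query are random placeholders):

```python
import numpy as np

# Toy dataset: 10,000 vectors of 128 dimensions (placeholder data).
rng = np.random.default_rng(0)
vectors = rng.random((10_000, 128), dtype=np.float32)
query = rng.random(128, dtype=np.float32)

# Exhaustive search: compute the distance to every stored vector...
distances = np.linalg.norm(vectors - query, axis=1)

# ...and keep the k smallest. The cost grows linearly with the dataset size.
k = 5
top_k_ids = np.argsort(distances)[:k]
print(top_k_ids, distances[top_k_ids])
```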
In the literature, the measure of how closely ANN search results match those of an exhaustive search is called the recall rate. The higher the recall rate, the better the results.
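For example, recall@k can be computed by comparing the IDs returned by the ANN index against the exhaustive-search ground truth (a sketch; `ann_ids` and `true_ids` are placeholder values):

```python
def recall_at_k(ann_ids, true_ids):
    """Fraction of the true top-k neighbors that the ANN search returned."""
    return len(set(ann_ids) & set(true_ids)) / len(true_ids)

# The ANN index found 9 of the 10 true nearest neighbors: recall = 0.9
print(recall_at_k(ann_ids=[1, 2, 3, 4, 5, 6, 7, 8, 9, 42],
                  true_ids=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))
```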
Several ANN algorithms, such as HNSW [1], NSG [2], and DiskANN [3], are available, each with its own characteristics. One of the difficult problems in ANN algorithms is that indexing and querying vectors may require keeping the whole dataset in memory; when the dataset is huge, the memory required for indexing may exceed what is available. The DiskANN algorithm tries to solve this problem by using the disk as the main storage for indexes and performing queries directly on disk. The DiskANN paper acknowledges that storing vectors on disk and using HNSW or NSG would again result in very high latencies, so DiskANN focuses on serving queries from disk with low latency and a good recall rate. This helps Upstash Vector stay cost-effective, and therefore cheaper than alternatives.
Even though DiskANN has its advantages, it also requires more work to be practical. The main problem is that you cannot insert into or update an existing index without reindexing all the vectors. For this problem, there is a follow-up paper, FreshDiskANN [4]. FreshDiskANN improves DiskANN by introducing a temporary in-memory index for up-to-date data. Queries are served both from the temporary (up-to-date) index and from the disk, and the temporary indexes are merged into the disk index from time to time behind the scenes.
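Below is a toy sketch of this two-tier pattern, purely for illustration (it is not Upstash's actual implementation, and both tiers use brute-force search here instead of real graph indexes):

```python
import numpy as np

class TwoTierIndex:
    """FreshDiskANN-style layout: a large 'disk' tier plus a small memory tier."""

    def __init__(self):
        self.disk_tier = {}    # stands in for the on-disk graph index
        self.memory_tier = {}  # holds recent upserts until the next merge

    def upsert(self, vector_id, vector):
        # Fresh inserts and updates land in the in-memory tier first.
        self.memory_tier[vector_id] = np.asarray(vector, dtype=np.float32)

    def query(self, query_vector, k):
        # Serve queries from both tiers, keeping the k closest overall.
        q = np.asarray(query_vector, dtype=np.float32)
        candidates = {**self.disk_tier, **self.memory_tier}
        scored = sorted(candidates.items(),
                        key=lambda item: float(np.linalg.norm(item[1] - q)))
        return [vector_id for vector_id, _ in scored[:k]]

    def merge(self):
        # Periodic background step: fold the memory tier into the disk tier.
        self.disk_tier.update(self.memory_tier)
        self.memory_tier.clear()

index = TwoTierIndex()
index.upsert("a", [0.1, 0.2])
index.upsert("b", [0.9, 0.8])
index.merge()                   # "a" and "b" now live on the "disk" tier
index.upsert("c", [0.15, 0.2])  # a fresh write, served from memory
print(index.query([0.1, 0.2], k=2))  # -> ['a', 'c']
```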
Upstash Vector is based on DiskANN and FreshDiskANN, with more improvements based on our tests and observations.
References
- [1] Malkov, Y. A., Yashunin, D. A. (2016). Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. CoRR, abs/1603.09320. [http://arxiv.org/abs/1603.09320]
- [2] Fu, C., Xiang, C., Wang, C., Cai, D. (2019). Fast Approximate Nearest Neighbor Search with Navigating Spreading-Out Graphs. Proceedings of the VLDB, 12(5), 461–474. doi: 10.14778/3303753.3303754. [http://www.vldb.org/pvldb/vol12/p461-fu.pdf]
- [3] Subramanya, S. J., Devvrit, Kadekodi, R., Krishnaswamy, R., Simhadri, H. V. (2019). DiskANN: Fast Accurate Billion-Point Nearest Neighbor Search on a Single Node. In Proceedings of the 33rd International Conference on Neural Information Processing Systems (NeurIPS '19), Article No. 1233, 13766–13776. [https://dl.acm.org/doi/abs/10.5555/3454287.3455520]
- [4] Singh, A., Subramanya, S. J., Krishnaswamy, R., Simhadri, H. V. (2021). FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search. CoRR, abs/2105.09613. [https://arxiv.org/abs/2105.09613]
Vector Similarity Functions
When creating a vector index in Upstash Vector, you have the flexibility to choose from different vector similarity functions. Each function yields distinct query results, catering to specific use cases. Here are the three supported similarity functions:
Cosine Similarity
Cosine similarity measures the cosine of the angle between two vectors. It is particularly useful when the magnitude of the vectors is not essential, and the focus is on the orientation.
Use Cases:
- Natural Language Processing (NLP): Ideal for comparing document embeddings or word vectors, as it captures semantic similarity irrespective of vector magnitude.
- Recommendation Systems: Effective in recommending items based on user preferences or content similarities.
Score calculation:
(1 + cosine_similarity(v1, v2)) / 2
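As a quick check, the sketch below computes this score with NumPy (`v1` and `v2` are placeholder vectors):

```python
import numpy as np

v1 = np.array([0.3, 0.7, 0.1])
v2 = np.array([0.4, 0.6, 0.2])

# Cosine similarity lies in [-1, 1]; the score maps it to [0, 1].
cosine_similarity = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
score = (1 + cosine_similarity) / 2
print(score)  # 1.0 = same orientation, 0.0 = opposite orientation
```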
Euclidean Distance
Euclidean distance calculates the straight-line distance between two vectors in a multi-dimensional space. It is well-suited for scenarios where the magnitude of vectors is crucial, providing a measure of their spatial separation.
Use Cases:
- Computer Vision: Useful in image processing tasks, such as image recognition or object detection, where the spatial arrangement of features is significant.
- Anomaly Detection: Valuable for detecting anomalies in datasets, as it considers both the direction and magnitude of differences between vectors.
Score calculation:
1 / (1 + squared_distance(v1, v2))
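A corresponding sketch with NumPy (`v1` and `v2` are placeholders):

```python
import numpy as np

v1 = np.array([0.3, 0.7, 0.1])
v2 = np.array([0.4, 0.6, 0.2])

# The squared distance lies in [0, inf); the score maps it to (0, 1],
# where 1.0 means the vectors are identical.
squared_distance = float(np.sum((v1 - v2) ** 2))
score = 1 / (1 + squared_distance)
print(score)
```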
Dot Product
The dot product measures similarity by multiplying the corresponding components of two vectors and summing the results. It provides a measure of alignment between vectors. Note that to use the dot product, the vectors need to be normalized to unit length.
Use Cases:
- Machine Learning Models: Commonly used in machine learning for tasks like sentiment analysis or classification, where feature alignment is critical.
- Collaborative Filtering: Effective in collaborative filtering scenarios, such as recommending items based on user behavior or preferences.
Score calculation:
(1 + dot_product(v1, v2)) / 2
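A corresponding sketch with NumPy, including the required normalization (`v1` and `v2` are placeholders):

```python
import numpy as np

v1 = np.array([0.3, 0.7, 0.1])
v2 = np.array([0.4, 0.6, 0.2])

# The dot product assumes unit-length vectors, so normalize first.
v1 = v1 / np.linalg.norm(v1)
v2 = v2 / np.linalg.norm(v2)

# For unit vectors the dot product lies in [-1, 1]; the score maps it to [0, 1].
score = (1 + np.dot(v1, v2)) / 2
print(score)
```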
Metadata
The metadata feature allows you to store additional context alongside your vectors. There are a couple of uses for this:
- You can store the source of the vector in the metadata and use it in your application from the query response.
- You can store metadata to further filter the results of a query. We will have more advanced filtering using this metadata soon.
You can set metadata with your vector as follows:
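Here is a sketch using the upstash-vector Python SDK (the URL and token are placeholders, and the ID, vector values, and metadata are made up for illustration; check the SDK reference for the exact signatures):

```python
from upstash_vector import Index

# Placeholders: use your own index URL and token.
index = Index(url="UPSTASH_VECTOR_REST_URL", token="UPSTASH_VECTOR_REST_TOKEN")

# Upsert a vector together with its metadata.
index.upsert(
    vectors=[
        ("id-0", [0.9215, 0.3897], {"url": "https://imgur.com/z9AVZLb"}),
    ]
)
```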
When you run a query or fetch, you can opt in to retrieving the metadata as follows:
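Query Example

A sketch with the Python SDK (placeholder credentials and vector values as above); setting `include_metadata=True` returns each match's metadata alongside its score:

```python
from upstash_vector import Index

index = Index(url="UPSTASH_VECTOR_REST_URL", token="UPSTASH_VECTOR_REST_TOKEN")

# Query the 5 most similar vectors and include their metadata.
result = index.query(
    vector=[0.9215, 0.3897],
    top_k=5,
    include_metadata=True,
)
for match in result:
    print(match.id, match.score, match.metadata)
```

Range Example

Similarly, a range (scan) over the index can opt in to metadata (again a sketch under the same assumptions):

```python
from upstash_vector import Index

index = Index(url="UPSTASH_VECTOR_REST_URL", token="UPSTASH_VECTOR_REST_TOKEN")

# Page through the index, including metadata for each vector.
page = index.range(cursor="", limit=10, include_metadata=True)
for vector in page.vectors:
    print(vector.id, vector.metadata)
```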