Implementing HNSW for Efficient Similarity Search in Go

Introduction

In the world of machine learning and information retrieval, finding similar items quickly in high-dimensional spaces is a common and challenging problem. The Hierarchical Navigable Small World (HNSW) algorithm has emerged as a powerful solution for approximate nearest neighbor search, offering an excellent balance between search speed and accuracy.

In this blog post, we’ll dive deep into a Go implementation of HNSW, exploring its structure, key components, and usage. Whether you’re building a recommendation system, an image similarity search engine, or any application requiring fast similarity lookups, this HNSW implementation can significantly boost your performance.

Understanding HNSW

HNSW is a graph-based algorithm that creates a multi-layer structure for efficient similarity search. The key ideas behind HNSW are:

  1. Hierarchical Structure: The algorithm builds multiple layers of graphs, with fewer nodes in higher layers.
  2. Small World Property: Each layer is constructed to have the “small world” property, allowing for quick navigation.
  3. Navigable Graphs: The structure allows for efficient “zooming in” on the target area as the search progresses from top to bottom layers.

Core Components

Let’s break down the main components of our HNSW implementation:

Vector and Distance Function

type Vector []float64

type DistanceFunc func(Vector, Vector) float64

The Vector type represents our data points, and DistanceFunc is a function type for calculating distances between vectors. This allows flexibility in choosing distance metrics based on your specific use case.

Node Structure

type Node struct {
    ID       int
    Vector   Vector
    Levels   []*Level
    MaxLevel int
    sync.RWMutex
}

type Level struct {
    Connections []*Node
}

Each Node represents a data point in our HNSW structure. It contains:

  • An ID for identification
  • The actual vector data
  • A slice of levels, each containing connections to other nodes
  • A read-write mutex for thread-safe operations

HNSW Structure

type HNSW struct {
    Nodes           map[int]*Node
    EntryPoint      *Node
    MaxLevel        int
    M               int // max number of connections
    Mmax            int // max number of connections for level 0
    EfConstruction  int
    Dim             int
    DistanceFunc    DistanceFunc
    mutex           sync.RWMutex
    deletedNodes    map[int]bool
}

The HNSW struct is the heart of our implementation. It manages the entire index structure and contains:

  • A map of all nodes
  • The entry point for searches
  • Configuration parameters like maximum connections and construction factor
  • The chosen distance function
  • Mutexes for thread-safety
  • A map to track deleted nodes
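The post’s usage example calls a `NewHNSW` constructor; a plausible implementation, repeating the struct definitions for self-containment (the parameter order — dimension, M, Mmax, efConstruction, distance function — is an assumption inferred from that example):

```go
package main

import "sync"

type Vector []float64

type DistanceFunc func(Vector, Vector) float64

type Node struct {
	ID       int
	Vector   Vector
	Levels   []*Level
	MaxLevel int
	sync.RWMutex
}

type Level struct {
	Connections []*Node
}

type HNSW struct {
	Nodes          map[int]*Node
	EntryPoint     *Node
	MaxLevel       int
	M              int // max number of connections
	Mmax           int // max number of connections for level 0
	EfConstruction int
	Dim            int
	DistanceFunc   DistanceFunc
	mutex          sync.RWMutex
	deletedNodes   map[int]bool
}

// NewHNSW initializes an empty index with its maps allocated, ready
// for concurrent Insert calls.
func NewHNSW(dim, m, mmax, efConstruction int, dist DistanceFunc) *HNSW {
	return &HNSW{
		Nodes:          make(map[int]*Node),
		M:              m,
		Mmax:           mmax,
		EfConstruction: efConstruction,
		Dim:            dim,
		DistanceFunc:   dist,
		deletedNodes:   make(map[int]bool),
	}
}
```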

Key Operations

Insertion

The Insert method adds a new vector to the HNSW structure:

  1. It creates a new node for the vector.
  2. If it’s the first node, it becomes the entry point.
  3. Otherwise, it traverses the structure from top to bottom to find the best insertion point.
  4. It then connects the new node to its neighbors at each level up to a randomly chosen maximum level.

func (h *HNSW) Insert(id int, vec Vector) {
    // ... (implementation details)
}
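Step 4 is where graph quality is decided: after linking, each affected node is pruned back to its connection budget. A simplified single-level sketch of that step, collapsing the per-level `Levels` slice into one `Connections` slice (`connect` and `prune` are hypothetical helpers, not part of the implementation above):

```go
package main

import "sort"

type Vector []float64

// Node is flattened to a single level here for brevity; the real
// structure keeps one Connections slice per level.
type Node struct {
	ID          int
	Vector      Vector
	Connections []*Node
}

// connect links n bidirectionally to its candidate neighbors, then
// prunes every affected node back to at most m connections.
func connect(n *Node, candidates []*Node, m int, dist func(Vector, Vector) float64) {
	for _, c := range candidates {
		n.Connections = append(n.Connections, c)
		c.Connections = append(c.Connections, n)
		prune(c, m, dist)
	}
	prune(n, m, dist)
}

// prune keeps only the m connections closest to n.
func prune(n *Node, m int, dist func(Vector, Vector) float64) {
	if len(n.Connections) <= m {
		return
	}
	sort.Slice(n.Connections, func(i, j int) bool {
		return dist(n.Vector, n.Connections[i].Vector) <
			dist(n.Vector, n.Connections[j].Vector)
	})
	n.Connections = n.Connections[:m]
}
```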

Search

The Search method finds the k nearest neighbors of a given vector:

  1. It starts from the entry point at the top level.
  2. At each level, it greedily moves to closer neighbors.
  3. At the bottom level, it performs a more thorough local search.
  4. Finally, it returns the k closest nodes found.

func (h *HNSW) Search(vec Vector, k int) []int {
    // ... (implementation details)
}
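The greedy movement in step 2 can be sketched in isolation, again using a flattened single-level `Node` (`greedyStep` is a hypothetical helper name):

```go
package main

type Vector []float64

type Node struct {
	ID          int
	Vector      Vector
	Connections []*Node
}

// greedyStep repeatedly moves from cur to whichever neighbor is
// closer to the query, stopping at a local minimum. This is the
// "zoom in" HNSW performs on every layer above the bottom one.
func greedyStep(cur *Node, query Vector, dist func(Vector, Vector) float64) *Node {
	for {
		best, bestD := cur, dist(cur.Vector, query)
		for _, nb := range cur.Connections {
			if d := dist(nb.Vector, query); d < bestD {
				best, bestD = nb, d
			}
		}
		if best == cur {
			return cur // no neighbor is closer: local minimum reached
		}
		cur = best
	}
}
```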

Deletion

The Delete method marks a node as deleted:

  1. It marks the node as deleted in the deletedNodes map.
  2. It removes the node from the main Nodes map.
  3. If the deleted node was the entry point, it chooses a new entry point.

func (h *HNSW) Delete(id int) {
    // ... (implementation details)
}
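The three steps map almost directly onto code. A minimal sketch using simplified stand-in types (`hnswLite`, `nodeLite`) rather than the full structs above:

```go
package main

import "sync"

// Simplified stand-ins for the real HNSW and Node types.
type nodeLite struct{ ID int }

type hnswLite struct {
	Nodes        map[int]*nodeLite
	EntryPoint   *nodeLite
	deletedNodes map[int]bool
	mutex        sync.RWMutex
}

func (h *hnswLite) Delete(id int) {
	h.mutex.Lock()
	defer h.mutex.Unlock()
	h.deletedNodes[id] = true // 1. tombstone the ID
	delete(h.Nodes, id)       // 2. drop it from the node map
	if h.EntryPoint != nil && h.EntryPoint.ID == id {
		h.EntryPoint = nil // 3. promote any surviving node
		for _, n := range h.Nodes {
			h.EntryPoint = n
			break
		}
	}
}
```

Tombstoning in `deletedNodes` lets searches skip deleted IDs without immediately rewiring every neighbor that still points at them.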

Persistence

The implementation includes methods to save and load the HNSW structure to/from a file:

func (h *HNSW) Save(filename string) error {
    // ... (implementation details)
}

func LoadHNSW(filename string) (*HNSW, error) {
    // ... (implementation details)
}

This feature is crucial for maintaining your index across application restarts or for distributing pre-built indexes.

Distance Metrics

The implementation provides two common distance metrics:

func EuclideanDistance(v1, v2 Vector) float64 {
    // ... (implementation details)
}

func CosineDistance(v1, v2 Vector) float64 {
    // ... (implementation details)
}

Euclidean distance is suitable for many applications, while cosine distance is often used for tasks like text similarity where the magnitude of vectors is less important than their direction.

Usage Example

Here’s how you might use this HNSW implementation in a Go application:

package main

import (
	"log"
	"math/rand"
	"sync"
)

// generateRandomVector returns a random vector of the given dimension.
func generateRandomVector(dim int) Vector {
	vec := make(Vector, dim)
	for i := range vec {
		vec[i] = rand.Float64()
	}
	return vec
}

func main() {
	// Create a new HNSW index with Euclidean distance
	index := NewHNSW(128, 16, 32, 200, EuclideanDistance)

	// Insert vectors concurrently
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			index.Insert(id, generateRandomVector(128))
		}(i)
	}
	wg.Wait()

	// Search for the 10 nearest neighbors of a query vector
	query := generateRandomVector(128)
	results := index.Search(query, 10)
	log.Println("nearest neighbors:", results)

	// Delete a vector
	index.Delete(5)

	// Save the index to a file
	if err := index.Save("hnsw_index.gob"); err != nil {
		log.Fatal(err)
	}

	// Load the index from a file
	loadedIndex, err := LoadHNSW("hnsw_index.gob")
	if err != nil {
		log.Fatal(err)
	}
	log.Println("loaded index with", len(loadedIndex.Nodes), "nodes")

	// Use a different distance metric
	_ = NewHNSW(128, 16, 32, 200, CosineDistance)
}

Performance Considerations

While this implementation is feature-rich and thread-safe, there are some areas where performance could be optimized:

  1. Lock Granularity: The current implementation uses a global lock for insertions. For high-concurrency scenarios, implementing more fine-grained locking could improve performance.

  2. Memory Usage: The Node structure stores full vectors. For very large datasets, consider storing vectors separately and keeping only references in the Node structure.

  3. Batch Operations: Implementing batch insert and search operations could improve throughput for certain use cases.

  4. Distance Computation: For high-dimensional data, the distance computation can become a bottleneck. Consider using approximated distance computations or dimensionality reduction techniques.

Conclusion

This HNSW implementation in Go provides a powerful tool for approximate nearest neighbor search in high-dimensional spaces. Its hierarchical structure allows for logarithmic search time complexity, making it suitable for large-scale similarity search applications.

By offering thread-safety, persistence, and flexibility in distance metrics, this implementation can be readily integrated into a wide range of Go applications, from recommendation systems to image retrieval engines.

Remember that while HNSW offers excellent performance, it’s an approximate algorithm. The trade-off between search speed and accuracy can be tuned through parameters like M, Mmax, and EfConstruction. Always benchmark with your specific dataset to find the optimal configuration.

Happy similarity searching!