Single-Module Pipeline: `vector-db`

This document walks through how to assemble and use a single-module pipeline that contains only a vector-db module.

Vector databases store and manage data points represented as vectors in multidimensional space, thus enabling efficient searches and analytics based on vector distances. They can be applied in Retrieval-Augmented Generation (RAG), recommendation systems, image and video retrieval based on content similarity, and anomaly detection in large datasets.

Note that this module by itself will not generate a particularly easy-to-use pipeline, given that you must already have NPY files ready to process. We suggest also taking a look at this example pipeline or this example pipeline, which respectively take TXT files and JSON files and enable vector (a.k.a. semantic) search on them.

The document is divided into the following sections:

Pipeline Setup
Required Input Format
Using the Default Model
Using the semantic_search Method
Querying Output Databases Locally

Pipeline Setup

Let's first instantiate a single-module vector-db pipeline.

We use the create_pipeline method for this, passing only the vector-db module name into module_chain.

# create a pipeline with a single vector-db module
pipeline = krixik.create_pipeline(name="modules-vector-db-docs", module_chain=["vector-db"])

Required Input Format

The vector-db module accepts NPY file inputs consisting of single NumPy arrays. Each row in the array is a vector that the vector-db module then indexes for vector search.

Let's take a quick look at a valid input file, and then process it:

# examine contents of input file
import numpy as np

np.load(data_dir + "input/vectors.npy")

array([[0, 1],
       [1, 0],
       [1, 1]])
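
If you do not yet have an NPY file in this format, the following minimal sketch shows one way to produce one with NumPy. The array values mirror the example above; the temporary output path is a hypothetical placeholder, so replace it with your own input directory.

```python
import os
import tempfile

import numpy as np

# Three 2-dimensional vectors, one per row (same values as the example above).
vectors = np.array([[0, 1], [1, 0], [1, 1]])

# The module expects a single array whose rows are vectors, so check for 2-D.
assert vectors.ndim == 2, "expected a 2-D array: one vector per row"

# Hypothetical output location; replace with your own input directory.
npy_path = os.path.join(tempfile.mkdtemp(), "vectors.npy")
np.save(npy_path, vectors)

# Reload to confirm that the file round-trips as a single array.
reloaded = np.load(npy_path)
print(reloaded.shape)
```

Saving with np.save (rather than np.savez) writes exactly one array per file, which matches the single-array input described above.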

Using the Default Model

Let's process our test input file using the vector-db module's default (and currently only) model: faiss.

Given that this is the default model, we need not specify model selection through the optional modules argument in the process method.

# process the file with the default model
process_output = pipeline.process(
    local_file_path=data_dir + "input/vectors.npy",  # the initial local filepath where the input file is stored
    local_save_directory=data_dir + "output",  # the local directory that the output file will be saved to
    expire_time=60 * 30,  # process data will be deleted from the Krixik system in 30 minutes
    wait_for_process=True,  # wait for process to complete before returning IDE control to user
    verbose=False,  # do not display process update printouts upon running code
)

The output of this process is printed below. To learn more about each component of the output, review documentation for the process method.

Because the output of this particular module-model pair is a FAISS database file, process_output is null (None in Python). However, the output file has been saved to the location listed under the process_output_files key. The file_id of the processed input is used as the filename prefix for the output file.

# nicely print the output of this process
import json

print(json.dumps(process_output, indent=2))

{
  "status_code": 200,
  "pipeline": "modules-vector-db-docs",
  "request_id": "1b5e2995-bb36-4789-a14a-2653642284ca",
  "file_id": "7030acbe-2342-4899-b38e-9501788a0bf9",
  "message": "SUCCESS - output fetched for file_id 7030acbe-2342-4899-b38e-9501788a0bf9.Output saved to location(s) listed in process_output_files.",
  "warnings": [],
  "process_output": null,
  "process_output_files": [
    "../../../data/output/7030acbe-2342-4899-b38e-9501788a0bf9.faiss"
  ]
}
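
As a small illustration of how these fields relate, the sketch below pulls the database path out of the output and checks the file_id prefix. The dictionary is a hand-copied subset of the example output above, used here only so the snippet runs on its own; in practice you would use the process_output variable returned by the process method.

```python
import os

# Values copied from the example output above; in practice, use the
# dictionary returned by pipeline.process instead of this stand-in.
example_output = {
    "file_id": "7030acbe-2342-4899-b38e-9501788a0bf9",
    "process_output": None,
    "process_output_files": [
        "../../../data/output/7030acbe-2342-4899-b38e-9501788a0bf9.faiss"
    ],
}

# The first (and here only) entry is the saved FAISS database file.
db_file_path = example_output["process_output_files"][0]

# Split the filename into its stem and extension.
file_name = os.path.basename(db_file_path)
stem, extension = os.path.splitext(file_name)

# The file_id of the processed input is used as the filename prefix.
has_file_id_prefix = file_name.startswith(example_output["file_id"])

print(db_file_path)
print(extension, has_file_id_prefix)
```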

Using the `semantic_search` Method

Any pipeline containing a vector-db module preceded by a text-embedder module has access to the semantic_search method, which lets you run semantic queries on the created vector database(s).

As the single-module pipeline created above lacks the text-embedder module, the semantic_search method will not work on it. Review documentation for this pipeline example or this pipeline example, both of which meet the requirements for the method: the former ingests TXT files, and the latter JSON files.

Querying Output Databases Locally

In addition to what's provided by the semantic_search method, you can locally perform queries on the generated vector database whose location is indicated in process_output_files.

Below is a simple function for running vector searches locally on the database generated above.

Note: In order to execute this code you will need to install the FAISS library. Depending on the specs of your local setup, install faiss-cpu or faiss-gpu.

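# make sure that you've installed faiss (faiss-cpu or faiss-gpu)
# the leading "!" is notebook syntax; from a shell, run: pip install faiss-cpu
!pip install faiss-cpu
import faiss
import numpy as np
from typing import Tuple


def query_vector_db(query_vector: np.ndarray, k: int, db_file_path: str) -> Tuple[np.ndarray, np.ndarray]:
    # read in vector db
    faiss_index = faiss.read_index(db_file_path)

    # faiss operates on contiguous float32 arrays of shape (n_queries, dim)
    query_vector = np.ascontiguousarray(query_vector, dtype=np.float32)

    # perform query: returns the k best scores and the row indices of the matching vectors
    similarities, indices = faiss_index.search(query_vector, k)

    # convert similarity scores to distances
    distances = 1 - similarities
    return distances, indices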

Now query your database using a small sample array with the function above. The results are printed below:

# perform test query using the above query function
original_vectors = np.load(data_dir + "input/vectors.npy")
query_vector = np.array([[0, 1]])
distances, indices = query_vector_db(query_vector, 2, process_output["process_output_files"][0])
print(f"input query vector: {query_vector[0]}")
print(f"closest vector from original: {original_vectors[indices[0][0]]}")
print(f"distance from query to this vector: {distances[0][0]}")
print(f"second closest vector from original: {original_vectors[indices[0][1]]}")
print(f"distance from query to this vector: {distances[0][1]}")

input query vector: [0 1]
closest vector from original: [0 1]
distance from query to this vector: 0.0
second closest vector from original: [1 1]
distance from query to this vector: 0.2928932309150696
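
The printed distances are consistent with cosine distance (one minus cosine similarity): the query [0, 1] is identical to the first stored vector, and its cosine similarity with [1, 1] is 1/√2 ≈ 0.7071, giving 1 - 0.7071 ≈ 0.2929. This is an inference from the numbers above rather than a documented property of the module. The sketch below reproduces the values with plain NumPy; the helper name cosine_distance is hypothetical and not part of Krixik or FAISS.

```python
import numpy as np


def cosine_distance(query: np.ndarray, vectors: np.ndarray) -> np.ndarray:
    """Return 1 - cosine similarity between one query and each row of vectors."""
    query = query.astype(np.float64)
    vectors = vectors.astype(np.float64)
    # Dot product of the query with every row, divided by the product of norms.
    similarities = (vectors @ query) / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    )
    return 1.0 - similarities


stored = np.array([[0, 1], [1, 0], [1, 1]])
query = np.array([0, 1])

distances = cosine_distance(query, stored)
order = np.argsort(distances)  # row indices sorted from closest to farthest

print(order[:2])
print(distances[order[:2]])
```

The two closest rows come out as index 0 ([0, 1]) and index 2 ([1, 1]), matching the order of the query results above.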