Retrieval-Augmented Generation (RAG) has emerged as a transformative approach in AI, addressing key limitations of traditional large language models (LLMs). By combining the generative capabilities of LLMs with dynamic access to external knowledge sources, RAG systems deliver more accurate, contextual, and up-to-date responses. This advancement has particular significance in domains requiring specialized knowledge or real-time information, from technical documentation to financial analysis.
Understanding Basic RAG Architecture
At its core, RAG operates through a three-stage process (a minimal code sketch follows the list):
Document vectorization: Converting chunked documents into embeddings stored in a vector database
Similarity-based retrieval: Matching query vectors with the most relevant document vectors using metrics like cosine similarity
Context-enhanced generation: Combining the original query with retrieved content to generate more informed responses
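As a toy illustration of stages two and three, here is a minimal sketch that assumes the query and the chunks have already been embedded into unit-normalized vectors (the function name and prompt template are illustrative, not part of any particular library):
import numpy as np

def build_rag_prompt(query: str, query_vec: np.ndarray, chunks: list, chunk_vecs: np.ndarray, top_k: int = 3) -> str:
    # Stage 2: rank chunks by cosine similarity. With unit-normalized embeddings,
    # the dot product equals the cosine similarity.
    scores = chunk_vecs @ query_vec
    top_idx = np.argsort(scores)[::-1][:top_k]
    context = "\n\n".join(chunks[i] for i in top_idx)

    # Stage 3: combine the retrieved context with the original query and hand the
    # resulting prompt to the LLM.
    return f"Answer the question using the context below.\n\nContext:\n{context}\n\nQuestion: {query}"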
The effectiveness of a RAG system hinges on both retrieval accuracy and generation quality. While traditional metrics like precision, recall, and F1 score evaluate retrieval performance, techniques such as LLM-based evaluation and similarity scoring assess the quality of generated responses. In this document, we focus on creating a vector DB from a given dataset and evaluating its retrieval quality.
Document vectorization
You use an embedding model to vectorize a document. If the document is large, you may want to chunk the document using various chunking techniques, especially if the document size is larger than the maximum context length supported by the embedding model.
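As one simple illustration of chunking, here is a fixed-size character chunker with overlap (one of many possible strategies; the chunk and overlap sizes below are arbitrary choices, not recommendations):
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list:
    # Split a long document into overlapping character windows so each chunk
    # stays well under the embedding model's maximum context length.
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks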
If you have a lot of documents or document chunks, it's more efficient to run them through the embedding model in a batch. You can use the Parasail Batch APIs to achieve this. The following notebook code shows the process.
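The code cells below assume roughly the following imports (inferred from the code that follows); the OpenAI-compatible Python client is used to call the Parasail endpoints:
import json
import os
import time
from typing import Any, Dict, List

import numpy as np
from openai import OpenAI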
We use a dataset of chunked Anthropic documentation for this experiment:
with open("datasets/anthropic_docs.json", "r") as f:
dataset = json.load(f)
print(f"Number of documents: {len(dataset)}")
print(f"Number of chunks: {len(dataset[0]['chunks'])}")
print(f"First chunk:\n{dataset[0]['chunks'][15]['chunk_content'][:100]}")
Output:
Number of documents: 1
Number of chunks: 232
Example chunk:
Model options
Enterprise use cases often mean complex needs and edge cases. Anthropic offers a ran
We define a function that converts a dataset into a vector DB:
def create_vector_db(dataset: List[Dict[str, Any]], embedding_model_name: str) -> Dict[str, Any]:
    # A list of all the chunk_content items
    texts_to_embed = []
    # A list of chunk metadata
    metadata_list = []
    for doc in dataset:
        for chunk in doc["chunks"]:
            texts_to_embed.append(chunk["chunk_content"])
            metadata_list.append({
                "chunk_id": chunk["chunk_id"],
                "chunk_content": chunk["chunk_content"],
            })

    # Run embeddings on all the chunks.
    embeddings, total_num_tokens = run_parasail_batch(
        texts_to_embed, embedding_model_name
    )

    vector_db = {
        "embeddings": embeddings,
        "metadata": metadata_list,
        "embedding_model_name": embedding_model_name,
    }
    print("Vector DB is created")
    print("Number of embedding vectors in DB: ", len(vector_db["embeddings"]))
    print(f"Number of embeddings tokens: {total_num_tokens}")
    return vector_db
We now define a function that computes embeddings of a list of texts using the Parasail Batch API.
def run_parasail_batch(texts_to_embed: List[str], embedding_model_name: str):
    client = OpenAI(
        base_url="https://api.saas.parasail.io/v1",
        api_key=os.getenv("PARASAIL_BATCH_API_KEY"),
    )

    # Make an embedding request for each string in texts_to_embed, and store all
    # the requests in a file. The format of this file follows the OpenAI batch input format.
    batch_input_file_name = "batch_input_file.jsonl"
    with open(batch_input_file_name, "w") as batch_input_file:
        for i, text in enumerate(texts_to_embed):
            batch_input_file.write(
                json.dumps(
                    {
                        "custom_id": f"request-{i}",
                        "method": "POST",
                        "url": "/v1/embeddings",
                        "body": {
                            "model": embedding_model_name,
                            "input": text,
                        },
                    }
                )
                + "\n"
            )

    # Upload the batch input file to Parasail.
    uploaded_file = client.files.create(
        file=open(batch_input_file_name, "rb"), purpose="batch"
    )

    # Create a batch job for the uploaded file.
    batch_obj = client.batches.create(
        input_file_id=uploaded_file.id,
        completion_window="24h",
        endpoint="/v1/embeddings",
    )

    # Wait for the batch job to complete.
    while batch_obj.status != "completed":
        if batch_obj.status in ["failed", "expired", "cancelling", "cancelled"]:
            raise RuntimeError(
                f"Parasail batch job failed with status: {batch_obj.status}.\nErrors: {batch_obj.errors}"
            )
        time.sleep(30)
        batch_obj = client.batches.retrieve(batch_obj.id)
    print(f"Parasail batch job {batch_obj.id} completed successfully")

    # Get the output of the batch job.
    batch_output = client.files.content(batch_obj.output_file_id).content

    # The order of results in batch_output is not guaranteed to match the order of requests
    # in the input file, so we store outputs in a dictionary keyed by custom_id.
    batch_output_dict = {}
    for line in batch_output.splitlines():
        line = json.loads(line)
        batch_output_dict[line["custom_id"]] = line

    # Parse the batch output to get the embeddings and the total number of tokens.
    embeddings = []
    total_num_tokens = 0
    for i in range(len(texts_to_embed)):
        response_body = batch_output_dict[f"request-{i}"]["response"]["body"]
        embed = response_body["data"][0]["embedding"]
        embeddings.append(embed)
        num_tokens = response_body["usage"]["prompt_tokens"]
        total_num_tokens += num_tokens
    return embeddings, total_num_tokens
We can now create a vector DB from our dataset using an open-source embedding model.
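The call itself is a one-liner using the function defined above; the model name below is an assumption based on the gte-Qwen2-7B-instruct query-instruction format used in the evaluation section, so substitute whichever embedding model you actually deployed:
# Model name is an assumption inferred from the instruction format used later;
# replace it with the embedding model you actually use.
vector_db = create_vector_db(dataset, "Alibaba-NLP/gte-Qwen2-7B-instruct")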
Parasail batch job batch-aqjf7etmdt completed successfully
Vector DB is created
Number of embedding vectors in DB: 232
Number of embeddings tokens: 217826
Evaluating the retrieval performance of vector DBs
Now we want to measure the quality of our vector DB using some retrieval metrics.
We use an evaluation dataset that consists of 100 query/answer pairs. The chunks that are relevant to each query are listed along with the answer.
Let's look at one of the queries and its relevant chunks.
with open("datasets/anthropic_docs_eval.json", "r") as f:
dataset = json.load(f)
print(f"Number of queries: {len(dataset)}")
print("First eval item:")
print(f" query: {dataset[0]['query']}")
print(f" answer: {dataset[0]['answer']}")
print(f" chunk_ids: {dataset[0]['chunk_ids']}")
Number of queries: 100
First eval item:
query: How can you create multiple test cases for an evaluation in the Anthropic Evaluation tool?
answer: To create multiple test cases in the Anthropic Evaluation tool, click the 'Add Test Case' button, fill in values for each variable in your prompt, and repeat the process to create additional test case scenarios.
chunk_ids: ['https://docs.anthropic.com/en/docs/test-and-evaluate/eval-tool#creating-test-cases', 'https://docs.anthropic.com/en/docs/build-with-claude/develop-tests#building-evals-and-test-cases']
We now define a function that takes the evaluation dataset and the vector DB as arguments, and returns a number of retrieval metrics. We first retrieve the most similar chunks for each query using the vector DB, and then compare the retrieved chunks with the golden chunks listed in the evaluation dataset.
def evaluate(dataset: List[Dict[str, Any]], vector_db: Dict[str, Any], top_k: int):
    # Collect the queries and prepend an instruction to each query.
    # The format of the instruction is from https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct
    queries = [
        f"Instruct: Represent this query for retrieving supporting documents\nQuery: {q['query']}"
        for q in dataset
    ]

    # Run embeddings on the queries.
    query_embeddings, total_num_tokens = run_parasail_batch(
        queries, vector_db["embedding_model_name"]
    )
    print(f"Number of tokens used for embedding queries: {total_num_tokens}")

    # Compute similarities between queries and vector DB embeddings.
    query_embeddings = np.array(query_embeddings)
    db_embeddings = np.array(vector_db["embeddings"])
    similarities = np.dot(query_embeddings, db_embeddings.T)

    # Sort vector DB embeddings according to similarity scores (descending).
    top_indices = np.argsort(similarities, axis=1)[:, ::-1]

    # Find the retrieved chunks for all the queries.
    all_retrieved_chunks = [
        [
            {
                "metadata": vector_db["metadata"][idx],
                "similarity": float(similarities[i][idx]),
            }
            for idx in top_indices[i][:top_k]
        ]
        for i in range(len(queries))
    ]

    # Collect golden chunk IDs for every query.
    all_golden_chunks = [query["chunk_ids"] for query in dataset]

    # Compute evaluation metrics.
    precisions = []
    recalls = []
    mrrs = []
    for i in range(len(queries)):
        retrieved_chunks = [
            retrieved_chunk["metadata"].get("chunk_id", None)
            for retrieved_chunk in all_retrieved_chunks[i][:top_k]
        ]
        golden_chunks = set(all_golden_chunks[i])
        true_positives = len(set(retrieved_chunks) & golden_chunks)

        # Precision@k: fraction of retrieved chunks that are golden.
        precision = true_positives / len(retrieved_chunks) if retrieved_chunks else 0
        precisions.append(precision)

        # Recall@k: fraction of golden chunks that were retrieved.
        recall = true_positives / len(golden_chunks) if golden_chunks else 0
        recalls.append(recall)

        # MRR: reciprocal rank of the first golden chunk among the retrieved chunks.
        mrr = 0
        for rank, link in enumerate(retrieved_chunks, 1):
            if link in golden_chunks:
                mrr = 1 / rank
                break
        mrrs.append(mrr)

    avg_precision = sum(precisions) / len(precisions)
    avg_recall = sum(recalls) / len(recalls)
    avg_mrr = sum(mrrs) / len(mrrs)
    f1 = 2 * (avg_precision * avg_recall) / (avg_precision + avg_recall)

    results = {
        "Average Precision (MAP)": avg_precision,
        "Average Recall": avg_recall,
        "Average MRR": avg_mrr,
        "F1 Score": f1,
    }
    return results
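To make these metrics concrete with a small worked example: if a query has golden chunks {A, B} and the top-5 retrieved chunks are [C, A, D, E, F], then precision@5 = 1/5 = 0.2 (one of the five retrieved chunks is golden), recall = 1/2 = 0.5 (one of the two golden chunks was retrieved), and the reciprocal rank is 1/2 = 0.5 because the first golden chunk appears at rank 2. Each of these quantities is averaged over all queries to produce the numbers reported below.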
Let's now evaluate the vector DB.
with open("datasets/anthropic_docs_eval.json", "r") as f:
dataset = json.load(f)
results = evaluate(dataset, vector_db, top_k=5)
for k, v in results.items():
print(f"{k}: {v:.4f}")
Parasail batch job batch-ol5qoa5vm2 completed successfully
Number of tokens used for embedding queries: 3735
Average Precision (MAP): 0.2840
Average Recall: 0.7625
Average MRR: 0.8553
F1 Score: 0.4139
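When interpreting these numbers, note that precision@5 has a low ceiling whenever a query has fewer than five golden chunks (with two golden chunks, the best possible precision@5 is 0.4), so recall and MRR are often the more informative retrieval metrics for this setup.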