RAG


The Evolution of AI Text Generation

Retrieval-Augmented Generation (RAG) has emerged as a transformative approach in AI, addressing key limitations of traditional large language models (LLMs). By combining the generative capabilities of LLMs with dynamic access to external knowledge sources, RAG systems deliver more accurate, contextual, and up-to-date responses. This advancement has particular significance in domains requiring specialized knowledge or real-time information, from technical documentation to financial analysis.

Understanding Basic RAG Architecture

At its core, RAG operates through a three-stage process:

  1. Document vectorization: Converting chunked documents into embeddings stored in a vector database

  2. Similarity-based retrieval: Matching query vectors with the most relevant document vectors using metrics like cosine similarity

  3. Context-enhanced generation: Combining the original query with retrieved content to generate more informed responses

The effectiveness of a RAG system hinges on both retrieval accuracy and generation quality. Traditional metrics like precision, recall, and F1 scores evaluate retrieval performance, while techniques such as LLM-based evaluation and similarity scoring assess the quality of generated responses. In this document, we focus on how to create and evaluate a vector DB for a given dataset.
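
To make the third stage concrete, the generation step can be as simple as placing the retrieved chunks into the prompt of a chat completion request. The sketch below is only an illustration and is not part of the notebook that follows: client stands for any OpenAI-compatible client (for example, one pointed at a Parasail endpoint), and the chat model name is whatever model you choose to serve.

def generate_with_context(client, chat_model: str, query: str, retrieved_chunks: list[str]) -> str:
    # Stage 3: combine the user query with the retrieved chunks so the LLM can
    # answer from the supplied context rather than from its parametric memory alone.
    context = "\n\n".join(retrieved_chunks)
    response = client.chat.completions.create(
        model=chat_model,
        messages=[
            {"role": "system", "content": "Answer the question using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content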

Document vectorization

You use an embedding model to vectorize a document. If the document is large, and especially if it exceeds the maximum context length supported by the embedding model, you will want to split it into chunks first using one of the many available chunking techniques.
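
For example, a simple fixed-size chunking strategy with overlap might look like the sketch below. This is only an illustration; production pipelines often split on sentence, paragraph, or section boundaries instead, and measure chunk size in tokens rather than characters.

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    # Split a long document into overlapping character windows so that each chunk
    # stays comfortably below the embedding model's maximum context length.
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks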

If you have a lot of documents or document chunks, it's more efficient to run them through the embedding model in a batch. You can use the Parasail Batch APIs to achieve this. The following notebook code shows the process.
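
The code in this guide comes from a notebook. For completeness, the snippets below assume the following imports, plus a Parasail API key available in the PARASAIL_BATCH_API_KEY environment variable:

import json
import os
import time
from typing import Any, Dict, List

import numpy as np
from openai import OpenAI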

We use the Anthropic documentation dataset for this experiment:

with open("datasets/anthropic_docs.json", "r") as f:
    dataset = json.load(f)

print(f"Number of documents: {len(dataset)}")
print(f"Number of chunks: {len(dataset[0]['chunks'])}")
print(f"First chunk:\n{dataset[0]['chunks'][15]['chunk_content'][:100]}")

Output:

Number of documents: 1
Number of chunks: 232
Sample chunk:
Model options


Enterprise use cases often mean complex needs and edge cases. Anthropic offers a ran

We define a function that converts a dataset into a vector DB:

def create_vector_db(dataset: List[Dict[str, Any]], embedding_model_name: str) -> Dict[str, Any]:
    # A list of all the chunk_content items
    texts_to_embed = []
    # A list of chunk metadata
    metadata_list = []
    for doc in dataset:
        for chunk in doc["chunks"]:
            texts_to_embed.append(chunk["chunk_content"])
            metadata_list.append({
                "chunk_id": chunk["chunk_id"],
                "chunk_content": chunk["chunk_content"],
            })

    # Run embeddings on all the chunks
    embeddings, total_num_tokens = run_parasail_batch(
        texts_to_embed, embedding_model_name
    )

    vector_db = {
        "embeddings": embeddings,
        "metadata": metadata_list,
        "embedding_model_name": embedding_model_name,
    }
    print("Vector DB is created")
    print("Number of embedding vectors in DB: ", len(vector_db["embeddings"]))
    print(f"Number of embeddings tokens: {total_num_tokens}")
    return vector_db

We now define a function that computes embeddings for a list of texts using the Parasail Batch API.

def run_parasail_batch(texts_to_embed: List[str], embedding_model_name: str):
    client = OpenAI(
        base_url="https://api.saas.parasail.io/v1",
        api_key=os.getenv("PARASAIL_BATCH_API_KEY"),
    )

    # Create one embedding request per string in texts_to_embed and write all the
    # requests to a JSONL file that follows the OpenAI batch input format.
    batch_input_file_name = "batch_input_file.jsonl"
    with open(batch_input_file_name, "w") as batch_input_file:
        for i, text in enumerate(texts_to_embed):
            batch_input_file.write(
                json.dumps(
                    {
                        "custom_id": f"request-{i}",
                        "method": "POST",
                        "url": "/v1/embeddings",
                        "body": {
                            "model": embedding_model_name,
                            "input": text,
                        },
                    }
                )
                + "\n"
            )

    # Upload the batch input file to Parasail.
    batch_input_file = client.files.create(
        file=open(batch_input_file_name, "rb"), purpose="batch"
    )

    # Create a batch job for the uploaded file.
    batch_obj = client.batches.create(
        input_file_id=batch_input_file.id,
        completion_window="24h",
        endpoint="/v1/embeddings",
    )

    # Wait for the batch job to complete.
    while batch_obj.status != "completed":
        if batch_obj.status in ["failed", "expired", "cancelling", "cancelled"]:
            raise RuntimeError(
                f"Parasail batch job failed with status: {batch_obj.status}.\nErrors: {batch_obj.errors}"
            )

        time.sleep(30)
        batch_obj = client.batches.retrieve(batch_obj.id)
    print(f"Parasail batch job {batch_obj.id} completed successfully")

    # Get the output of the batch job.
    batch_output = client.files.content(batch_obj.output_file_id).content

    # The order of results in batch_output is not guaranteed to match the order of requests
    # in batch_input_file, thus we store outputs in a dictionary using custom_id as key.
    batch_output_dict = {}
    for line in batch_output.splitlines():
        line = json.loads(line)
        batch_output_dict[line["custom_id"]] = line

    # Parse the batch output to get the embeddings and the total number of tokens.
    embeddings = []
    total_num_tokens = 0
    for i in range(len(texts_to_embed)):
        response_body = batch_output_dict[f"request-{i}"]["response"]["body"]
        embed = response_body["data"][0]["embedding"]
        embeddings.append(embed)

        num_tokens = response_body["usage"]["prompt_tokens"]
        total_num_tokens += num_tokens

    return embeddings, total_num_tokens

We can now create a vector DB from our dataset using an open-source embedding model.

vector_db = create_vector_db(dataset, "Alibaba-NLP/gte-Qwen2-7B-instruct")

Output:

Parasail batch job batch-aqjf7etmdt completed successfully

Vector DB is created
Number of embedding vectors in DB:  232
Number of embeddings tokens: 217826
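
Before evaluating retrieval quality in bulk, here is a minimal sketch of how a single query could be matched against the vector DB we just built. It reuses the run_parasail_batch helper, the instruction prefix recommended for this embedding model, and the same dot-product scoring used by the evaluation function below; in practice you would more likely embed individual user queries with a synchronous request rather than a batch job.

def retrieve_top_k(query: str, vector_db: Dict[str, Any], top_k: int = 3) -> List[Dict[str, Any]]:
    # Embed the query with the same model that produced the vector DB.
    prompt = (
        "Instruct: Represent this query for retrieving supporting documents"
        f"\nQuery: {query}"
    )
    query_embeddings, _ = run_parasail_batch([prompt], vector_db["embedding_model_name"])

    # Score every chunk with a dot product and keep the top_k best matches.
    scores = np.dot(np.array(vector_db["embeddings"]), np.array(query_embeddings[0]))
    best = np.argsort(scores)[::-1][:top_k]
    return [{**vector_db["metadata"][i], "similarity": float(scores[i])} for i in best]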

Evaluating the retrieval performance of vector DBs

Now we want to measure the quality of our vector DB using some retrieval metrics.

We use an evaluation dataset that consists of 100 query/answer pairs. The chunks that are relevant to each query are listed along with the answer.

Let's look at one of the queries and its relevant chunks.

with open("datasets/anthropic_docs_eval.json", "r") as f:
    dataset = json.load(f)

print(f"Number of queries: {len(dataset)}")
print("First eval item:")
print(f"    query: {dataset[0]['query']}")
print(f"    answer: {dataset[0]['answer']}")
print(f"    chunk_ids: {dataset[0]['chunk_ids']}")

Output:

Number of queries: 100
First eval item:
    query: How can you create multiple test cases for an evaluation in the Anthropic Evaluation tool?
    answer: To create multiple test cases in the Anthropic Evaluation tool, click the 'Add Test Case' button, fill in values for each variable in your prompt, and repeat the process to create additional test case scenarios.
    chunk_ids: ['https://docs.anthropic.com/en/docs/test-and-evaluate/eval-tool#creating-test-cases', 'https://docs.anthropic.com/en/docs/build-with-claude/develop-tests#building-evals-and-test-cases']

We now define a function that takes the evaluation dataset and the vector DB as arguments, and returns a number of retrieval metrics. We first retrieve the most similar chunks for each query using the vector DB, and then compare the retrieved chunks with the golden chunks listed in the evaluation dataset.

def evaluate(dataset: List[Dict[str, Any]], vector_db: Dict[str, Any], top_k: int):
    # Collect the queries and prepend an instruction to each query.
    # The format of the instruction is from https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct
    queries = [
        f"Instruct: Represent this query for retrieving supporting documents\nQuery: {q['query']}"
        for q in dataset
    ]

    # Run embeddings on the queries.
    query_embeddings, total_num_tokens = run_parasail_batch(
        queries, vector_db["embedding_model_name"]
    )
    print(f"Number of tokens used for embedding queries: {total_num_tokens}")

    # Compute similarities between queries and vector DB embeddings.
    query_embeddings = np.array(query_embeddings)
    db_embeddings = np.array(vector_db["embeddings"])
    similarities = np.dot(query_embeddings, db_embeddings.T)
    # Sort vector DB embeddings according to similarity scores.
    top_indices = np.argsort(similarities, axis=1)[:, ::-1]

    # Find the retrieved chunks for all the queries.
    all_retrieved_chunks = [
        [
            {
                "metadata": vector_db["metadata"][idx],
                "similarity": float(similarities[i][idx]),
            }
            for idx in top_indices[i][:top_k]
        ]
        for i in range(len(queries))
    ]

    # Collect golden chunk IDs for every query.
    all_golden_chunks = [query["chunk_ids"] for query in dataset]

    # Compute evaluation metrics.
    precisions = []
    recalls = []
    mrrs = []
    for i in range(len(queries)):
        retrieved_chunks = [
            retrieved_chunk["metadata"].get("chunk_id", None)
            for retrieved_chunk in all_retrieved_chunks[i][:top_k]
        ]
        golden_chunks = set(all_golden_chunks[i])

        true_positives = len(set(retrieved_chunks) & golden_chunks)

        precision = true_positives / len(retrieved_chunks) if retrieved_chunks else 0
        precisions.append(precision)

        recall = true_positives / len(golden_chunks) if golden_chunks else 0
        recalls.append(recall)

        # Reciprocal rank of the first retrieved chunk that is also a golden chunk.
        mrr = 0
        for rank, link in enumerate(retrieved_chunks, 1):
            if link in golden_chunks:
                mrr = 1 / rank
                break
        mrrs.append(mrr)

    avg_precision = sum(precisions) / len(precisions)
    avg_recall = sum(recalls) / len(recalls)
    avg_mrr = sum(mrrs) / len(mrrs)
    # Guard against division by zero when both averages are zero.
    denom = avg_precision + avg_recall
    f1 = 2 * (avg_precision * avg_recall) / denom if denom > 0 else 0.0
    results = {
        "Average Precision (MAP)": avg_precision,
        "Average Recall": avg_recall,
        "Average MRR": avg_mrr,
        "F1 Score": f1,
    }
    return results
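
For reference, the metrics computed above follow the standard top-k retrieval definitions, with k equal to the top_k argument and each metric averaged over the evaluation queries:

  1. Precision@k: the fraction of the top_k retrieved chunks that are golden chunks

  2. Recall@k: the fraction of a query's golden chunks that appear among the top_k retrieved chunks

  3. MRR: 1 / rank of the first golden chunk in the retrieved list, or 0 if no golden chunk is retrieved

  4. F1: 2 * P * R / (P + R), computed here from the averaged precision and recall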

Let's now evaluate the vector DB.

with open("datasets/anthropic_docs_eval.json", "r") as f:
    dataset = json.load(f)

results = evaluate(dataset, vector_db, top_k=5)
for k, v in results.items():
    print(f"{k}: {v:.4f}")

Output:

Parasail batch job batch-ol5qoa5vm2 completed successfully
Number of tokens used for embedding queries: 3735
Average Precision@k: 0.2840
Average Recall: 0.7625
Average MRR: 0.8553
F1 Score: 0.4139
