Parasail supports the OpenAI batch API, including the format of the input and output files.
A batch input file is a .jsonl file where each line is a batch request. Parasail processes these requests and returns the results as an output .jsonl file.
Batch input file
The batch input wraps the same data structures used by interactive requests into an offline format. Each line in the input file is one request.
The request is a JSON dictionary with the following keys:
custom_id: A unique value that the user creates so they can later match outputs to inputs.
method: HTTP method, currently POST.
url: The target endpoint, one of /v1/chat/completions or /v1/embeddings.
body: The same as the body of an interactive request.
For embeddings requests, we strongly recommend setting "encoding_format": "base64" (see below).
Example batch input files
{"custom_id": "b5b938a55cc349d13f08a2586f96807d", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "max_completion_tokens": 10, "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Say pong."}]}}
{"custom_id": "ad95b85d915346e29b7afa52314e94b8", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "max_completion_tokens": 1024, "messages": [{"role": "user", "content": "What is the capital of New York?"}]}}
{"custom_id": "c587d5391f4524068bb1d6a4a00b5177", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "max_completion_tokens": 1024, "messages": [{"role": "user", "content": "Write two very funny dad jokes."}], "temperature": 0.5}}
Example embeddings batch input file:
{"custom_id": "request-1", "method": "POST", "url": "/v1/embeddings", "body": {"encoding_format": "base64", "model": "intfloat/e5-mistral-7b-instruct", "input": "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way—in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only."}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/embeddings", "body": {"encoding_format": "base64", "model": "intfloat/e5-mistral-7b-instruct", "input": "With unit cost falling as the number of components per circuit rises, by 1975 economics may dictate squeezing as many as 65 000 components on a single silicon chip."}}
{"custom_id": "request-3", "method": "POST", "url": "/v1/embeddings", "body": {"encoding_format": "base64", "model": "intfloat/e5-mistral-7b-instruct", "input": "We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence."}}
For embeddings, we strongly recommend using "encoding_format": "base64". This setting asks the server to return each embedding as a Base64-encoded binary array instead of a plain-text float array. It typically reduces file size by nearly 4x with no loss of precision, and it is what the interactive OpenAI client uses by default.
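If you process the results yourself, a Base64-encoded vector can be unpacked back into floats. The following sketch is illustrative: it assumes an output file named example_batch_output.jsonl, a response body in the standard OpenAI embeddings format, and little-endian float32 values (which is what the Base64 encoding carries); numpy does the decoding.

import base64
import json

import numpy as np

# Read one output line and decode its base64-encoded embedding back into floats.
with open("example_batch_output.jsonl") as file:
    result = json.loads(file.readline())

embedding_b64 = result["response"]["body"]["data"][0]["embedding"]
embedding = np.frombuffer(base64.b64decode(embedding_b64), dtype=np.float32)
print(len(embedding))  # dimensionality of the embedding vector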
Limits
Both Parasail and OpenAI limit the size of batch input files:
Up to 50,000 requests (lines)
Up to 100MB total input file size
Workloads exceeding these limits must be split into multiple batches.
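One way to stay under both limits is to roll over to a new input file whenever the next request would exceed either one. The sketch below is only one approach; the request dictionaries and file name prefix are up to you, and only the 50,000-request and 100 MB constants come from the limits above.

import json

MAX_REQUESTS = 50_000
MAX_BYTES = 100 * 1024 * 1024

def write_batches(requests, prefix="batch_input"):
    """Write request dicts to one or more .jsonl files, respecting both limits."""
    out, file_index, line_count, byte_count = None, 0, 0, 0
    for request in requests:
        line = json.dumps(request) + "\n"
        size = len(line.encode("utf-8"))
        # Start a new file when either limit would be exceeded.
        if out is None or line_count >= MAX_REQUESTS or byte_count + size > MAX_BYTES:
            if out is not None:
                out.close()
            file_index += 1
            out = open(f"{prefix}_{file_index}.jsonl", "w", encoding="utf-8")
            line_count, byte_count = 0, 0
        out.write(line)
        line_count += 1
        byte_count += size
    if out is not None:
        out.close()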
Batch output file
The batch output file wraps the interactive responses into an offline format. Each line in the output file is the response to one request.
Important: The order of responses may differ from the order of requests in the input file.
The response is a JSON dictionary. The most important keys:
custom_id: The same as the value in the request. Use this to match responses to requests.
response: The HTTP response, as a dictionary:
status_code: HTTP status code. 200 if the request succeeded; otherwise, the same HTTP error code the request would have received interactively.
body: The same as the body of an interactive response.
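Because the output order is not guaranteed, the usual pattern is to join results back to inputs by custom_id. The sketch below assumes chat completion requests and the illustrative file names used elsewhere on this page; the response body is read the same way as an interactive /v1/chat/completions response.

import json

# Index the original requests by custom_id.
inputs = {}
with open("example_batch_input.jsonl") as file:
    for line in file:
        request = json.loads(line)
        inputs[request["custom_id"]] = request

# Walk the output file (order may differ) and join on custom_id.
with open("example_batch_output.jsonl") as file:
    for line in file:
        result = json.loads(line)
        request = inputs[result["custom_id"]]
        response = result["response"]
        if response["status_code"] == 200:
            print(request["custom_id"], response["body"]["choices"][0]["message"]["content"])
        else:
            print(request["custom_id"], "failed with status", response["status_code"])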
It is easy to create a batch input file. Exactly how depends on your workflow. The following code snippets get you started.
import json

prompts = ["What is the capital of New York?", "Write two very funny dad jokes.", "Say pong."]

with open("example_batch_input.jsonl", "w") as file:
    for i, prompt in enumerate(prompts):
        file.write(
            json.dumps(
                {
                    "custom_id": f"request-{i}",
                    "method": "POST",
                    "url": "/v1/chat/completions",
                    "body": {
                        # same request body as for an interactive /v1/chat/completions request
                        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
                        "max_completion_tokens": 1024,
                        "messages": [
                            {"role": "system", "content": "You are a helpful assistant."},
                            {"role": "user", "content": prompt},
                        ],
                    },
                }
            )
            + "\n"
        )
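Once the input file exists, it can be submitted with the standard OpenAI Python client, since Parasail supports the OpenAI batch API. The base URL and API key below are placeholders for your own Parasail credentials, and the file name matches the snippet above.

from openai import OpenAI

# Placeholders: substitute your Parasail base URL and API key.
client = OpenAI(base_url="<parasail-api-base-url>", api_key="<your-api-key>")

batch_file = client.files.create(file=open("example_batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)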