Batch predictions

This guide shows you how to get batch predictions from Llama models.

Batch prediction lets you efficiently send multiple text-only prompts that aren't latency-sensitive to a Llama model. Unlike online predictions, where you send one input prompt per request, you can batch a large number of input prompts in a single request.

There are no charges for batch predictions during the Preview period.

Supported Llama models

Vertex AI supports batch predictions for select Llama models, such as llama-3.1-405b-instruct-maas, which is used in the examples in this guide.

Prepare your input

You can provide your input prompts in a BigQuery table or as a JSONL file in Cloud Storage:

  • BigQuery: Input data is stored in a BigQuery table. Ideal when your data already resides in BigQuery, for large-scale structured datasets, and when you want to leverage SQL for data preparation.
  • Cloud Storage: Input data is stored as a JSONL file in a Cloud Storage bucket. Suitable for unstructured or semi-structured data, when data is coming from various sources, or for simpler, file-based workflows.

The input for both sources must follow the OpenAI API schema JSON format, as shown in the following example:

{"custom_id": "test-request-0", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta/llama-3.1-405b-instruct-maas", "messages": [{"role": "system", "content": "You are a chef."}, {"role": "user", "content": "Give me a recipe for banana bread"}], "max_tokens": 1000}}

BigQuery

Your BigQuery input table must have the following schema:

  • custom_id: An ID for each request to match the input with the output.
  • method: The request method.
  • url: The request endpoint.
  • body(JSON): Your input prompt.

Note the following:

  • Your input table can have other columns. These are ignored by the batch job and passed directly to the output table.
  • The batch prediction job reserves response(JSON) and id as column names for the output. To avoid conflicts, don't use these names for columns in your input table.
  • The batch prediction job drops the method and url columns, so they aren't included in the output table.
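
As a minimal sketch of preparing such a table with the google-cloud-bigquery client, the following Python code loads one example row. The project and table IDs are placeholders, and passing the body column as a JSON-encoded string is an assumption about how the JSON column type is loaded; adapt it to your environment.

import json
from google.cloud import bigquery

client = bigquery.Client(project="PROJECT_ID")  # placeholder project

table_id = "myproject.mydataset.input_table"  # placeholder table
schema = [
    bigquery.SchemaField("custom_id", "STRING"),
    bigquery.SchemaField("method", "STRING"),
    bigquery.SchemaField("url", "STRING"),
    bigquery.SchemaField("body", "JSON"),
]

body = {
    "model": "meta/llama-3.1-405b-instruct-maas",
    "messages": [
        {"role": "system", "content": "You are a chef."},
        {"role": "user", "content": "Give me a recipe for banana bread"},
    ],
    "max_tokens": 1000,
}
rows = [
    {
        "custom_id": "test-request-0",
        "method": "POST",
        "url": "/v1/chat/completions",
        # Assumption: the JSON column accepts a JSON-encoded string on load.
        "body": json.dumps(body),
    }
]

load_job = client.load_table_from_json(
    rows, table_id, job_config=bigquery.LoadJobConfig(schema=schema)
)
load_job.result()  # Wait for the load to complete.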

Cloud Storage

Your input must be a JSONL file in a Cloud Storage bucket. Each line in the file must be a valid JSON object that follows the required schema.
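
For illustration, the following Python sketch writes two requests in this format to a local file. The second prompt and the file name are made up for the example; the request fields match the schema shown earlier.

import json

# Build two example requests that follow the OpenAI API schema shown above.
requests = []
for i, prompt in enumerate(
    ["Give me a recipe for banana bread", "Give me a recipe for sourdough"]
):
    requests.append(
        {
            "custom_id": f"test-request-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "meta/llama-3.1-405b-instruct-maas",
                "messages": [
                    {"role": "system", "content": "You are a chef."},
                    {"role": "user", "content": prompt},
                ],
                "max_tokens": 1000,
            },
        }
    )

# JSONL: one JSON object per line.
with open("batch_input.jsonl", "w") as f:
    for request in requests:
        f.write(json.dumps(request) + "\n")

You can then upload the file to your bucket, for example with gsutil cp batch_input.jsonl gs://bucketname/path/to/.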

Request a batch prediction

To request a batch prediction from a Llama model, use input from BigQuery or Cloud Storage. You can choose to output predictions to either a BigQuery table or a JSONL file in a Cloud Storage bucket, regardless of your input source.

BigQuery

Specify your BigQuery input table, model, and output location. The batch prediction job and your table must be in the same region.

REST

After you set up your environment, you can use REST to request batch predictions. The following sample sends a request to the publisher model endpoint.

Before using any of the request data, make the following replacements:

  • JOB_NAME: A display name for your batch prediction job.
  • LOCATION: A region that supports Llama models.
  • PROJECT_ID: Your project ID.
  • MODEL: The name of the model to get predictions from, such as llama-3.1-405b-instruct-maas.
  • INPUT_URI: The BigQuery table where your batch prediction input is located, such as myproject.mydataset.input_table.
  • OUTPUT_FORMAT: To output to a BigQuery table, specify bigquery. To output to a Cloud Storage bucket, specify jsonl.
  • DESTINATION: For BigQuery, specify bigqueryDestination. For Cloud Storage, specify gcsDestination.
  • OUTPUT_URI_FIELD_NAME: For BigQuery, specify outputUri. For Cloud Storage, specify outputUriPrefix.
  • OUTPUT_URI: For BigQuery, specify the table location such as myproject.mydataset.output_result. For Cloud Storage, specify the bucket and folder location such as gs://mybucket/path/to/outputfile.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/batchPredictionJobs

Request JSON body:

{
  "displayName": "JOB_NAME",
  "model": "publishers/meta/models/MODEL",
  "inputConfig": {
    "instancesFormat": "bigquery",
    "bigquerySource": {
      "inputUri": "INPUT_URI"
    }
  },
  "outputConfig": {
    "predictionsFormat": "OUTPUT_FORMAT",
    "DESTINATION": {
      "OUTPUT_URI_FIELD_NAME": "OUTPUT_URI"
    }
  }
}

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/batchPredictionJobs"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/batchPredictionJobs" | Select-Object -Expand Content

You should receive a JSON response that describes the new batch prediction job, including its ID and state.
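
If you prefer the Vertex AI SDK for Python over raw REST, a sketch along these lines should create the same job. Whether BatchPredictionJob.create accepts a publisher model name such as publishers/meta/models/MODEL is an assumption here, and the bq:// URIs are placeholders.

from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="LOCATION")

# Assumption: the publisher model resource name is accepted as model_name.
job = aiplatform.BatchPredictionJob.create(
    job_display_name="JOB_NAME",
    model_name="publishers/meta/models/MODEL",
    instances_format="bigquery",
    predictions_format="bigquery",
    bigquery_source="bq://myproject.mydataset.input_table",
    bigquery_destination_prefix="bq://myproject.mydataset.output_result",
)
print(job.resource_name)  # Ends in the job ID used for status checks later.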

Cloud Storage

Specify your JSONL file's Cloud Storage location, model, and output location.

REST

After you set up your environment, you can use REST to request batch predictions. The following sample sends a request to the publisher model endpoint.

Before using any of the request data, make the following replacements:

  • JOB_NAME: A display name for your batch prediction job.
  • LOCATION: A region that supports Llama models.
  • PROJECT_ID: Your project ID.
  • MODEL: The name of the model to get predictions from, such as llama-3.1-405b-instruct-maas.
  • INPUT_URI: The Cloud Storage location of your JSONL batch prediction input, such as gs://bucketname/path/to/jsonl.
  • OUTPUT_FORMAT: To output to a BigQuery table, specify bigquery. To output to a Cloud Storage bucket, specify jsonl.
  • DESTINATION: For BigQuery, specify bigqueryDestination. For Cloud Storage, specify gcsDestination.
  • OUTPUT_URI_FIELD_NAME: For BigQuery, specify outputUri. For Cloud Storage, specify outputUriPrefix.
  • OUTPUT_URI: For BigQuery, specify the table location such as myproject.mydataset.output_result. For Cloud Storage, specify the bucket and folder location such as gs://mybucket/path/to/outputfile.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/batchPredictionJobs

Request JSON body:

{
  "displayName": "JOB_NAME",
  "model": "publishers/meta/models/MODEL",
  "inputConfig": {
    "instancesFormat": "jsonl",
    "gcsSource": {
      "uris": "INPUT_URI"
    }
  },
  "outputConfig": {
    "predictionsFormat": "OUTPUT_FORMAT",
    "DESTINATION": {
      "OUTPUT_URI_FIELD_NAME": "OUTPUT_URI"
    }
  }
}

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/batchPredictionJobs"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/batchPredictionJobs" | Select-Object -Expand Content

You should receive a JSON response that describes the new batch prediction job, including its ID and state.
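
The equivalent hedged SDK sketch for Cloud Storage input and output changes only the formats and the source and destination arguments:

from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="LOCATION")

# Assumption: the publisher model resource name is accepted as model_name.
job = aiplatform.BatchPredictionJob.create(
    job_display_name="JOB_NAME",
    model_name="publishers/meta/models/MODEL",
    instances_format="jsonl",
    predictions_format="jsonl",
    gcs_source="gs://bucketname/path/to/jsonl",
    gcs_destination_prefix="gs://mybucket/path/to/outputfile",
)
print(job.resource_name)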

Get the status of a batch prediction job

After you submit your request, you can get the status of the batch prediction job to check if it's complete. The time it takes for the job to complete depends on the number of input items that you submitted.

REST

After you set up your environment, you can use REST to get the status of a batch prediction job. The following sample sends a request to the batch prediction job endpoint.

Before using any of the request data, make the following replacements:

  • PROJECT_ID: Your project ID.
  • LOCATION: The region where your batch job is located.
  • JOB_ID: The batch job ID that was returned when you created the job.

HTTP method and URL:

GET https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/batchPredictionJobs/JOB_ID

To send your request, choose one of these options:

curl

Execute the following command:

curl -X GET \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/batchPredictionJobs/JOB_ID"

PowerShell

Execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method GET `
-Headers $headers `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/batchPredictionJobs/JOB_ID" | Select-Object -Expand Content

You should receive a JSON response that includes the job's state, such as JOB_STATE_SUCCEEDED when the job has completed.
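
With the Vertex AI SDK for Python, a minimal sketch of the same status check looks like this, where JOB_ID is the numeric ID from the create response:

from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="LOCATION")

job = aiplatform.BatchPredictionJob(batch_prediction_job_name="JOB_ID")
print(job.state)  # For example, JobState.JOB_STATE_RUNNING or JOB_STATE_SUCCEEDED.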

Retrieve your output

After the batch prediction job completes, the output is saved to the destination that you specified in the request.

  • BigQuery: The output is in the response(JSON) column of your destination table.
  • Cloud Storage: The output is saved as a JSONL file in your destination bucket.
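
As a minimal sketch for Cloud Storage output, assuming you have copied the output file locally (for example with gsutil cp) and that each line keeps the custom_id alongside a response field mirroring the response(JSON) column, you could read it like this:

import json

# Assumes a local copy of the output JSONL file; the exact file name
# inside the destination folder may vary.
with open("predictions.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # Assumption: each line carries custom_id and a response field.
        print(record.get("custom_id"), record.get("response"))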