Mistral AI models

With Mistral AI models on Vertex AI, you can access fully managed and serverless models as APIs. To use a Mistral AI model on Vertex AI, you send a request directly to the Vertex AI API endpoint. Since Mistral AI models use a managed API, you don't need to provision or manage any infrastructure.

This document provides an overview of the available Mistral AI models on Vertex AI and explains how to use them.

To reduce the perception of latency for end users, you can stream responses. A streaming response uses server-sent events (SSE) to return the response incrementally.
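
As a minimal sketch (assuming the requests library, an access token from gcloud, and the streamRawPredict endpoint shown later in this document), the following Python code shows one way to consume such a stream. The choices/delta chunk shape is an assumption based on Mistral's chat completion format:

import json
import subprocess

import requests

# Assumption: gcloud is installed and authenticated.
token = subprocess.check_output(
    ["gcloud", "auth", "print-access-token"], text=True
).strip()

# Replace LOCATION, PROJECT_ID, and MODEL with your values.
url = (
    "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID"
    "/locations/LOCATION/publishers/mistralai/models/MODEL:streamRawPredict"
)
body = {
    "model": "MODEL",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": True,
}

with requests.post(
    url, json=body, headers={"Authorization": f"Bearer {token}"}, stream=True
) as response:
    for line in response.iter_lines(decode_unicode=True):
        # SSE data lines are prefixed with "data: "; the stream typically
        # ends with a "data: [DONE]" sentinel.
        if not line or not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        # Assumption: each chunk carries an incremental delta in
        # choices[0].delta.content.
        print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)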

Mistral AI models are priced on a pay-as-you-go basis. For pricing details, see Mistral AI model pricing on the Vertex AI pricing page.

Model | Description | Use case
Mistral OCR (25.05) | An Optical Character Recognition (OCR) API for document understanding. | Extracting text and understanding complex document elements such as tables, charts, and mathematical expressions, especially for RAG systems.
Mistral Small 3.1 (25.03) | A versatile and efficient model with multimodal capabilities and a large context window. | Low-latency applications requiring chat, instruction following, programming, and understanding long documents.
Mistral Large (24.11) | A powerful, multilingual model with advanced reasoning and agent-like capabilities. | Complex tasks requiring agentic behavior, function calling, multilingual support, and advanced coding or reasoning.
Codestral (25.01) | A specialized model designed for code generation and completion. | Code generation, completion, and explanation across 80+ programming languages; building AI applications for developers.

Available Mistral AI models

The following Mistral AI models are available for use in Vertex AI. To access a model, go to its model card in the Model Garden.

Mistral OCR (25.05)

Mistral OCR (25.05) is an Optical Character Recognition API for document understanding. It is designed to understand complex document elements, including interleaved imagery, mathematical expressions, tables, and advanced layouts such as LaTeX formatting. The model can provide a deeper understanding of rich documents such as scientific papers with charts, graphs, equations, and figures.

You can use Mistral OCR (25.05) in combination with a RAG system that takes multimodal documents (such as slides or complex PDFs) as input.

You can also couple Mistral OCR (25.05) with other Mistral models to reformat the results. This combination can help you present the extracted content in a structured and coherent format that is suitable for various downstream applications and analyses.
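
As an illustrative sketch, an OCR request body might look like the following. The document field shape follows Mistral's OCR API; confirm the exact schema on the model card:

{
  "model": "mistral-ocr-2505",
  "document": {
    "type": "document_url",
    "document_url": "data:application/pdf;base64,BASE64_ENCODED_PDF"
  }
}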

Go to the Mistral OCR (25.05) model card

Mistral Small 3.1 (25.03)

Mistral Small 3.1 (25.03) features multimodal capabilities and a context window of up to 128,000 tokens. The model can process and understand visual inputs and long documents, expanding its range of applications compared to the previous Mistral AI Small model. It is a versatile model designed for tasks such as programming, mathematical reasoning, document understanding, and dialogue, and it targets low-latency applications with high efficiency compared to other models of similar quality.

Mistral Small 3.1 (25.03) has undergone a full post-training process to align the model with human preferences and needs. This alignment makes the model suitable for applications that require chat or precise instruction following.

Go to the Mistral Small 3.1 (25.03) model card

Mistral Large (24.11)

Mistral Large (24.11) is the latest version of Mistral AI's Large model, with improved reasoning and function calling capabilities.

  • Agent capabilities: Includes built-in function calling and can generate JSON outputs.
  • Multilingual: Supports dozens of languages, including English, French, German, Spanish, Italian, Chinese, Japanese, Korean, Portuguese, Dutch, and Polish.
  • Coding proficiency: Trained on more than 80 programming languages, such as Python, Java, C, C++, JavaScript, and Bash, as well as more specialized languages like Swift and Fortran.
  • Advanced reasoning: Offers advanced mathematical and reasoning capabilities.

Go to the Mistral Large (24.11) model card

Codestral (25.01)

Codestral (25.01) is designed for code generation tasks. It helps developers write and interact with code through a shared instruction and completion API endpoint. Because it is proficient in code and can converse in a variety of languages, you can use Codestral (25.01) to design advanced AI applications for software developers.

  • Codestral (25.01) is fluent in over 80 programming languages, including Python, Java, C, C++, JavaScript, and Bash. It also performs well on more specialized languages like Swift and Fortran.
  • Codestral (25.01) can help improve developer productivity and reduce errors by completing coding functions, writing tests, and completing partial code using a fill-in-the-middle mechanism.
  • Codestral (25.01) is a 24B parameter model with a 128,000-token context window, designed for high performance and low latency.

Codestral (25.01) is optimized for the following use cases:

  • Generates code and provides code completion, suggestions, and translation.
  • Adds code between user-defined start and end points using a fill-in-the-middle mechanism, which is useful for tasks that require a specific piece of code to be generated (see the sketch after this list).
  • Summarizes and explains your code.
  • Reviews code quality by refactoring code, fixing bugs, and generating test cases.
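
As an illustrative sketch, a fill-in-the-middle request body might look like the following. The prompt and suffix field names follow Mistral's FIM API and should be confirmed against the Codestral model card:

{
  "model": "codestral-2501",
  "prompt": "def fibonacci(n):",
  "suffix": "return result",
  "max_tokens": 64
}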

Go to the Codestral (25.01) model card

Use Mistral AI models

You can use curl commands to send requests to the Vertex AI endpoint using the following model names:

  • For Mistral OCR (25.05), use mistral-ocr-2505
  • For Mistral Small 3.1 (25.03), use mistral-small-2503
  • For Mistral Large (24.11), use mistral-large-2411
  • For Mistral Nemo, use mistral-nemo
  • For Codestral (25.01), use codestral-2501

For more information about using the Mistral AI SDK, see the Mistral AI Vertex AI documentation.
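
As a minimal sketch of SDK usage, the following Python code uses Mistral's Google Cloud client. The mistralai-gcp package name, the MistralGoogleCloud client, and the chat.complete method are assumptions to verify against that documentation:

# Assumed install: pip install mistralai-gcp
from mistralai_gcp import MistralGoogleCloud

# Uses application default credentials; replace the region and project ID.
client = MistralGoogleCloud(region="us-central1", project_id="PROJECT_ID")

response = client.chat.complete(
    model="mistral-large-2411",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)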

Before you begin

Before you can use Mistral AI models with Vertex AI, complete the following steps. To use Vertex AI, you must enable the Vertex AI API (aiplatform.googleapis.com). If you already have a project with the Vertex AI API enabled, you can use that project instead of creating a new one.

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. Enable the Vertex AI API.

    Enable the API

  5. Go to the Model Garden model card for the Mistral AI model that you want to use, and then click Enable.

Make a streaming call to a Mistral AI model

The following sample makes a streaming call to a Mistral AI model.

REST

After you set up your environment, you can use REST to test a text prompt. The following sample sends a request to the publisher model endpoint.

Before using any of the request data, make the following replacements:

  • LOCATION: A region that supports Mistral AI models.
  • MODEL: The model name you want to use. In the request body, exclude the @ model version number.
  • ROLE: The role associated with a message. You can specify user or assistant. The first message must use the user role. The models operate with alternating user and assistant turns. If the final message uses the assistant role, then the response content continues immediately from the content in that message. You can use this to constrain part of the model's response (see the example after this list).
  • STREAM: A boolean that specifies whether the response is streamed. Stream the response to reduce the perception of latency for end users. Set to true to stream the response and false to return the response all at once.
  • CONTENT: The content, such as text, of the user or assistant message.
  • MAX_TOKENS: The maximum number of tokens that can be generated in the response. A token is approximately 3.5 characters. 100 tokens correspond to roughly 60-80 words.

    Specify a lower value for shorter responses and a higher value for potentially longer responses.
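
For example, the following messages array (values are illustrative) ends with an assistant turn, so the model continues the response from that partial content:

"messages": [
  { "role": "user", "content": "List three French cheeses." },
  { "role": "assistant", "content": "1." }
]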

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/mistralai/models/MODEL:streamRawPredict

Request JSON body:

 { "model": MODEL,   "messages": [    {     "role": "ROLE",     "content": "CONTENT"    }],   "max_tokens": MAX_TOKENS,   "stream": true } 

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/mistralai/models/MODEL:streamRawPredict"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/mistralai/models/MODEL:streamRawPredict" | Select-Object -Expand Content

You should receive a JSON response similar to the following.
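
The following sketch is illustrative of the general shape of the streamed SSE output in Mistral's chat completion format; exact fields and values vary by model and API version:

data: {"id": "cmpl-123", "object": "chat.completion.chunk", "created": 1700000000, "model": "mistral-small-2503", "choices": [{"index": 0, "delta": {"content": "Hello"}, "finish_reason": null}]}

data: [DONE]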

Make a unary call to a Mistral AI model

The following sample makes a unary call to a Mistral AI model.

REST

After you set up your environment, you can use REST to test a text prompt. The following sample sends a request to the publisher model endpoint.

Before using any of the request data, make the following replacements:

  • LOCATION: A region that supports Mistral AI models.
  • MODEL: The model name you want to use. In the request body, exclude the @ model version number.
  • ROLE: The role associated with a message. You can specify user or assistant. The first message must use the user role. The models operate with alternating user and assistant turns. If the final message uses the assistant role, then the response content continues immediately from the content in that message. You can use this to constrain part of the model's response.
  • STREAM: A boolean that specifies whether the response is streamed. Stream the response to reduce the perception of latency for end users. Set to true to stream the response and false to return the response all at once.
  • CONTENT: The content, such as text, of the user or assistant message.
  • MAX_TOKENS: The maximum number of tokens that can be generated in the response. A token is approximately 3.5 characters. 100 tokens correspond to roughly 60-80 words.

    Specify a lower value for shorter responses and a higher value for potentially longer responses.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/mistralai/models/MODEL:rawPredict

Request JSON body:

 { "model": MODEL,   "messages": [    {     "role": "ROLE",     "content": "CONTENT"    }],   "max_tokens": MAX_TOKENS,   "stream": false } 

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/mistralai/models/MODEL:rawPredict"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/mistralai/models/MODEL:rawPredict" | Select-Object -Expand Content

You should receive a JSON response similar to the following.
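
The following sketch is illustrative of the general shape of a response in Mistral's chat completion format; exact fields and values vary by model and API version:

{
  "id": "cmpl-123",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "mistral-small-2503",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "completion_tokens": 9,
    "total_tokens": 14
  }
}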

Mistral AI model region availability and quotas

For Mistral AI models, a quota applies for each region where the model is available. The quota is specified in queries per minute (QPM) and tokens per minute (TPM). TPM includes both input and output tokens.

Model | Region | Quotas | Context length
Mistral OCR (25.05) | us-central1 | QPM: 30; Pages per request: 30 (1 page = 1 million input tokens and 1 million output tokens) | 30 pages
Mistral OCR (25.05) | europe-west4 | QPM: 30; Pages per request: 30 (1 page = 1 million input tokens and 1 million output tokens) | 30 pages
Mistral Small 3.1 (25.03) | us-central1 | QPM: 60; TPM: 200,000 | 128,000
Mistral Small 3.1 (25.03) | europe-west4 | QPM: 60; TPM: 200,000 | 128,000
Mistral Large (24.11) | us-central1 | QPM: 60; TPM: 400,000 | 128,000
Mistral Large (24.11) | europe-west4 | QPM: 60; TPM: 400,000 | 128,000
Mistral Nemo | us-central1 | QPM: 60; TPM: 400,000 | 128,000
Mistral Nemo | europe-west4 | QPM: 60; TPM: 400,000 | 128,000
Codestral (25.01) | us-central1 | QPM: 60; TPM: 400,000 | 32,000
Codestral (25.01) | europe-west4 | QPM: 60; TPM: 400,000 | 32,000

If you want to increase any of your quotas for Generative AI on Vertex AI, you can use the Google Cloud console to request a quota increase. To learn more about quotas, see Work with quotas.