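Send a processing request

After you set up your Google Cloud account and create a processor, you can send requests to your Document AI processor.

The code for sending a request is the same for every processor. The differences between processors show up in the information that each one outputs.

With the v1 Document AI API or in the Google Cloud console, you can send a processing request to a specific processor version. If you don't specify a processor version, the default version is used. For more information, see Managing processor versions.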

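Online processing

Online (synchronous) requests let you send a single document for processing. Document AI processes the request immediately and returns a document.

Send a request to a processor

The following code samples show how to send a request to a processor.

REST

This sample shows how to provide the document content in a rawDocument object (the raw document content in bytes, supplied as a base64-encoded string).

Alternatively, you can specify an inlineDocument, which uses the same Document JSON format that Document AI returns. This lets you chain requests by passing the same format back and forth (for example, if you classify a document and then extract its contents).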
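The chaining idea can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the first response is assumed to be already parsed from JSON into a dict with the Document under the "document" key, and the helper name build_inline_request is hypothetical. The snippet only builds the next request body; it does not call the API.

```python
import json


def build_inline_request(previous_response: dict, field_mask: str = "") -> dict:
    """Build a process request body that reuses a previously returned Document.

    `previous_response` is assumed to be the parsed JSON body of an earlier
    process call, which carries the Document under the "document" key.
    """
    body = {"inlineDocument": previous_response["document"]}
    if field_mask:
        body["fieldMask"] = field_mask
    return body


# Hypothetical response from a first request (for example, a classifier).
first_response = {"document": {"mimeType": "application/pdf", "text": "Invoice 42"}}
second_request = build_inline_request(first_response, field_mask="text,entities")
print(json.dumps(second_request, indent=2))
```

Because the Document is passed through unchanged, the second processor receives exactly what the first one returned.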

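Before using any of the request data, make the following replacements:

  • LOCATION: the location of your processor, for example:
    • us - United States
    • eu - European Union
  • PROJECT_ID: your Google Cloud project ID.
  • PROCESSOR_ID: the ID of your custom processor.
  • skipHumanReview: a boolean that disables human review (supported only by human-in-the-loop processors).
    • true - skips human review
    • false - enables human review (default)
  • MIME_TYPE: one of the valid MIME type options.
  • IMAGE_CONTENT: one of the valid inline document contents, represented as a byte stream. For JSON representations, this is the base64 encoding (ASCII string) of your binary image data. The string should look similar to the following:
    • /9j/4QAYRXhpZgAA...9tAVx/zDQDlGxn//2Q==
    For more information, see the Base64 encoding topic.
  • FIELD_MASK: specifies which fields to include in the Document output. This is a comma-separated list of fully qualified field names in FieldMask format.
    • Example: text,entities,pages.pageNumber
  • INDIVIDUAL_PAGES: the list of individual pages to process.
    • Alternatively, provide the field fromStart or fromEnd to process a specific number of pages from the beginning or end of the document.

† The document content can also be specified using base64-encoded content in an inlineDocument object.

HTTP method and URL: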

POST https://LOCATION-documentai.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID:process

Request JSON body:

{
  "skipHumanReview": skipHumanReview,
  "rawDocument": {
    "mimeType": "MIME_TYPE",
    "content": "IMAGE_CONTENT"
  },
  "fieldMask": "FIELD_MASK",
  "processOptions": {
    "individualPageSelector": {
      "pages": [INDIVIDUAL_PAGES]
    }
  }
}
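As a rough sketch, the request body above could be assembled in Python like this. The function name build_process_request and its parameters are illustrative assumptions, not part of any Document AI client library; the code only builds the JSON body and base64-encodes the document bytes.

```python
import base64
import json
from typing import List, Optional


def build_process_request(
    content: bytes,
    mime_type: str,
    field_mask: Optional[str] = None,
    pages: Optional[List[int]] = None,
    skip_human_review: bool = False,
) -> dict:
    """Assemble the JSON body for an online process request."""
    body = {
        "skipHumanReview": skip_human_review,
        "rawDocument": {
            "mimeType": mime_type,
            # JSON cannot carry raw bytes, so the content is base64-encoded.
            "content": base64.b64encode(content).decode("ascii"),
        },
    }
    if field_mask:
        body["fieldMask"] = field_mask
    if pages:
        body["processOptions"] = {"individualPageSelector": {"pages": pages}}
    return body


# Placeholder bytes stand in for a real file read with open(path, "rb").
request_body = build_process_request(
    b"%PDF-1.4 sample bytes",
    "application/pdf",
    field_mask="text,entities,pages.pageNumber",
    pages=[1],
)
print(json.dumps(request_body, indent=2))
```

Writing the printed output to request.json would give a file suitable for the curl and PowerShell commands in this section.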

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-documentai.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID:process"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-documentai.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID:process" | Select-Object -Expand Content

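If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format. The response body contains an instance of Document.

Send a request to a processor version

Before using any of the request data, make the following replacements:

  • LOCATION: the location of your processor, for example:
    • us - United States
    • eu - European Union
  • PROJECT_ID: your Google Cloud project ID.
  • PROCESSOR_ID: the ID of your custom processor.
  • PROCESSOR_VERSION: the processor version identifier. For more information, see Select a processor version. For example:
    • pretrained-TYPE-vX.X-YYYY-MM-DD
    • stable
    • rc
  • skipHumanReview: a boolean that disables human review (supported only by human-in-the-loop processors).
    • true - skips human review
    • false - enables human review (default)
  • MIME_TYPE: one of the valid MIME type options.
  • IMAGE_CONTENT: one of the valid inline document contents, represented as a byte stream. For JSON representations, this is the base64 encoding (ASCII string) of your binary image data. The string should look similar to the following:
    • /9j/4QAYRXhpZgAA...9tAVx/zDQDlGxn//2Q==
    For more information, see the Base64 encoding topic.
  • FIELD_MASK: specifies which fields to include in the Document output. This is a comma-separated list of fully qualified field names in FieldMask format.
    • Example: text,entities,pages.pageNumber

† The document content can also be specified using base64-encoded content in an inlineDocument object.

HTTP method and URL: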

POST https://LOCATION-documentai.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID/processorVersions/PROCESSOR_VERSION:process
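The processor and processor-version endpoints differ only in the processorVersions path segment. A small Python sketch of how the URL could be composed is shown below; the helper name process_url and the example IDs are hypothetical.

```python
from typing import Optional


def process_url(
    location: str,
    project_id: str,
    processor_id: str,
    processor_version: Optional[str] = None,
    method: str = "process",
) -> str:
    """Compose the REST URL for a processor or a specific processor version."""
    resource = f"projects/{project_id}/locations/{location}/processors/{processor_id}"
    if processor_version:
        # Omitting the version targets the processor's default version.
        resource += f"/processorVersions/{processor_version}"
    return f"https://{location}-documentai.googleapis.com/v1/{resource}:{method}"


default_url = process_url("us", "my-project", "abc123")
versioned_url = process_url("eu", "my-project", "abc123", processor_version="stable")
print(default_url)
print(versioned_url)
```

Note that the location appears twice, once in the host name and once in the resource path, so both must agree.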

Request JSON body:

{
  "skipHumanReview": skipHumanReview,
  "rawDocument": {
    "mimeType": "MIME_TYPE",
    "content": "IMAGE_CONTENT"
  },
  "fieldMask": "FIELD_MASK"
}

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-documentai.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID/processorVersions/PROCESSOR_VERSION:process"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-documentai.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID/processorVersions/PROCESSOR_VERSION:process" | Select-Object -Expand Content

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format. The response body contains an instance of Document.

C#

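For more information, see the Document AI C# API reference documentation.

To authenticate to Document AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.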

using Google.Cloud.DocumentAI.V1;
using Google.Protobuf;
using System;
using System.IO;

public class QuickstartSample
{
    public Document Quickstart(
        string projectId = "your-project-id",
        string locationId = "your-processor-location",
        string processorId = "your-processor-id",
        string localPath = "my-local-path/my-file-name",
        string mimeType = "application/pdf"
    )
    {
        // Create client
        var client = new DocumentProcessorServiceClientBuilder
        {
            Endpoint = $"{locationId}-documentai.googleapis.com"
        }.Build();

        // Read in local file
        using var fileStream = File.OpenRead(localPath);
        var rawDocument = new RawDocument
        {
            Content = ByteString.FromStream(fileStream),
            MimeType = mimeType
        };

        // Initialize request argument(s)
        var request = new ProcessRequest
        {
            Name = ProcessorName.FromProjectLocationProcessor(projectId, locationId, processorId).ToString(),
            RawDocument = rawDocument
        };

        // Make the request
        var response = client.ProcessDocument(request);

        var document = response.Document;
        Console.WriteLine(document.Text);
        return document;
    }
}

Java

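For more information, see the Document AI Java API reference documentation.

To authenticate to Document AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.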

import com.google.cloud.documentai.v1.Document;
import com.google.cloud.documentai.v1.DocumentProcessorServiceClient;
import com.google.cloud.documentai.v1.DocumentProcessorServiceSettings;
import com.google.cloud.documentai.v1.ProcessRequest;
import com.google.cloud.documentai.v1.ProcessResponse;
import com.google.cloud.documentai.v1.RawDocument;
import com.google.protobuf.ByteString;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeoutException;

public class ProcessDocument {
  public static void processDocument()
      throws IOException, InterruptedException, ExecutionException, TimeoutException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "your-project-id";
    String location = "your-project-location"; // Format is "us" or "eu".
    String processorId = "your-processor-id";
    String filePath = "path/to/input/file.pdf";
    processDocument(projectId, location, processorId, filePath);
  }

  public static void processDocument(
      String projectId, String location, String processorId, String filePath)
      throws IOException, InterruptedException, ExecutionException, TimeoutException {
    // Initialize client that will be used to send requests. This client only needs
    // to be created once, and can be reused for multiple requests. After completing
    // all of your requests, call the "close" method on the client to safely clean up
    // any remaining background resources.
    String endpoint = String.format("%s-documentai.googleapis.com:443", location);
    DocumentProcessorServiceSettings settings =
        DocumentProcessorServiceSettings.newBuilder().setEndpoint(endpoint).build();
    try (DocumentProcessorServiceClient client = DocumentProcessorServiceClient.create(settings)) {
      // The full resource name of the processor, e.g.:
      // projects/project-id/locations/location/processors/processor-id
      // You must create new processors in the Cloud Console first
      String name =
          String.format("projects/%s/locations/%s/processors/%s", projectId, location, processorId);

      // Read the file.
      byte[] imageFileData = Files.readAllBytes(Paths.get(filePath));

      // Wrap the raw file bytes in a ByteString for the request.
      ByteString content = ByteString.copyFrom(imageFileData);

      RawDocument document =
          RawDocument.newBuilder().setContent(content).setMimeType("application/pdf").build();

      // Configure the process request.
      ProcessRequest request =
          ProcessRequest.newBuilder().setName(name).setRawDocument(document).build();

      // Recognizes text entities in the PDF document
      ProcessResponse result = client.processDocument(request);
      Document documentResponse = result.getDocument();

      // Get all of the document text as one big string
      String text = documentResponse.getText();

      // Read the text recognition output from the processor
      System.out.println("The document contains the following paragraphs:");
      Document.Page firstPage = documentResponse.getPages(0);
      List<Document.Page.Paragraph> paragraphs = firstPage.getParagraphsList();

      for (Document.Page.Paragraph paragraph : paragraphs) {
        String paragraphText = getText(paragraph.getLayout().getTextAnchor(), text);
        System.out.printf("Paragraph text:\n%s\n", paragraphText);
      }

      // Form parsing provides additional output about
      // form-formatted PDFs. You must create a form
      // processor in the Cloud Console to see full field details.
      System.out.println("The following form key/value pairs were detected:");

      for (Document.Page.FormField field : firstPage.getFormFieldsList()) {
        String fieldName = getText(field.getFieldName().getTextAnchor(), text);
        String fieldValue = getText(field.getFieldValue().getTextAnchor(), text);

        System.out.println("Extracted form fields pair:");
        System.out.printf("\t(%s, %s)\n", fieldName, fieldValue);
      }
    }
  }

  // Extract shards from the text field
  private static String getText(Document.TextAnchor textAnchor, String text) {
    if (textAnchor.getTextSegmentsList().size() > 0) {
      int startIdx = (int) textAnchor.getTextSegments(0).getStartIndex();
      int endIdx = (int) textAnchor.getTextSegments(0).getEndIndex();
      return text.substring(startIdx, endIdx);
    }
    return "[NO TEXT]";
  }
}

Node.js

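For more information, see the Document AI Node.js API reference documentation.

To authenticate to Document AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.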

/**
 * TODO(developer): Uncomment these variables before running the sample.
 */
// const projectId = 'YOUR_PROJECT_ID';
// const location = 'YOUR_PROJECT_LOCATION'; // Format is 'us' or 'eu'
// const processorId = 'YOUR_PROCESSOR_ID'; // Create processor in Cloud Console
// const filePath = '/path/to/local/pdf';

const {DocumentProcessorServiceClient} =
  require('@google-cloud/documentai').v1;

// Instantiates a client
const client = new DocumentProcessorServiceClient();

async function processDocument() {
  // The full resource name of the processor, e.g.:
  // projects/project-id/locations/location/processors/processor-id
  // You must create new processors in the Cloud Console first
  const name = `projects/${projectId}/locations/${location}/processors/${processorId}`;

  // Read the file into memory.
  const fs = require('fs').promises;
  const imageFile = await fs.readFile(filePath);

  // Convert the image data to a Buffer and base64 encode it.
  const encodedImage = Buffer.from(imageFile).toString('base64');

  const request = {
    name,
    rawDocument: {
      content: encodedImage,
      mimeType: 'application/pdf',
    },
  };

  // Recognizes text entities in the PDF document
  const [result] = await client.processDocument(request);
  const {document} = result;

  // Get all of the document text as one big string
  const {text} = document;

  // Extract shards from the text field
  const getText = textAnchor => {
    if (!textAnchor.textSegments || textAnchor.textSegments.length === 0) {
      return '';
    }

    // First shard in document doesn't have startIndex property
    const startIndex = textAnchor.textSegments[0].startIndex || 0;
    const endIndex = textAnchor.textSegments[0].endIndex;

    return text.substring(startIndex, endIndex);
  };

  // Read the text recognition output from the processor
  console.log('The document contains the following paragraphs:');
  const [page1] = document.pages;
  const {paragraphs} = page1;

  for (const paragraph of paragraphs) {
    const paragraphText = getText(paragraph.layout.textAnchor);
    console.log(`Paragraph text:\n${paragraphText}`);
  }

  // Form parsing provides additional output about
  // form-formatted PDFs. You must create a form
  // processor in the Cloud Console to see full field details.
  console.log('\nThe following form key/value pairs were detected:');

  const {formFields} = page1;
  for (const field of formFields) {
    const fieldName = getText(field.fieldName.textAnchor);
    const fieldValue = getText(field.fieldValue.textAnchor);

    console.log('Extracted key value pair:');
    console.log(`\t(${fieldName}, ${fieldValue})`);
  }
}

Python

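For more information, see the Document AI Python API reference documentation.

To authenticate to Document AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.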

from typing import Optional

from google.api_core.client_options import ClientOptions
from google.cloud import documentai  # type: ignore

# TODO(developer): Uncomment these variables before running the sample.
# project_id = "YOUR_PROJECT_ID"
# location = "YOUR_PROCESSOR_LOCATION" # Format is "us" or "eu"
# processor_id = "YOUR_PROCESSOR_ID" # Create processor before running sample
# file_path = "/path/to/local/pdf"
# mime_type = "application/pdf" # Refer to https://cloud.google.com/document-ai/docs/file-types for supported file types
# field_mask = "text,entities,pages.pageNumber"  # Optional. The fields to return in the Document object.
# processor_version_id = "YOUR_PROCESSOR_VERSION_ID" # Optional. Processor version to use


def process_document_sample(
    project_id: str,
    location: str,
    processor_id: str,
    file_path: str,
    mime_type: str,
    field_mask: Optional[str] = None,
    processor_version_id: Optional[str] = None,
) -> None:
    # You must set the `api_endpoint` if you use a location other than "us".
    opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")

    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    if processor_version_id:
        # The full resource name of the processor version, e.g.:
        # `projects/{project_id}/locations/{location}/processors/{processor_id}/processorVersions/{processor_version_id}`
        name = client.processor_version_path(
            project_id, location, processor_id, processor_version_id
        )
    else:
        # The full resource name of the processor, e.g.:
        # `projects/{project_id}/locations/{location}/processors/{processor_id}`
        name = client.processor_path(project_id, location, processor_id)

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    # Load binary data
    raw_document = documentai.RawDocument(content=image_content, mime_type=mime_type)

    # For more information: https://cloud.google.com/document-ai/docs/reference/rest/v1/ProcessOptions
    # Optional: Additional configurations for processing.
    process_options = documentai.ProcessOptions(
        # Process only specific pages
        individual_page_selector=documentai.ProcessOptions.IndividualPageSelector(
            pages=[1]
        )
    )

    # Configure the process request
    request = documentai.ProcessRequest(
        name=name,
        raw_document=raw_document,
        field_mask=field_mask,
        process_options=process_options,
    )

    result = client.process_document(request=request)

    # For a full list of `Document` object attributes, reference this page:
    # https://cloud.google.com/document-ai/docs/reference/rest/v1/Document
    document = result.document

    # Read the text recognition output from the processor
    print("The document contains the following text:")
    print(document.text)

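Batch processing

Batch (asynchronous) requests let you send multiple documents in a single request. Document AI returns an operation that you can poll for the status of the request. When the operation finishes, it contains a BatchProcessMetadata that points to the Cloud Storage bucket where the processed results are stored.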
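The polling pattern can be sketched in Python as below. This is an illustration rather than client-library code: wait_for_operation and its parameters are hypothetical, and the fetch callable stands in for whatever retrieves the current operation state (for example, a GET on the operation resource). The operation is assumed to be a dict that contains "done": true once it has finished.

```python
import time
from typing import Callable


def wait_for_operation(
    fetch: Callable[[], dict],
    interval_seconds: float = 5.0,
    max_attempts: int = 60,
) -> dict:
    """Call `fetch` repeatedly until the returned operation reports done."""
    for _ in range(max_attempts):
        operation = fetch()
        if operation.get("done"):
            return operation
        time.sleep(interval_seconds)
    raise TimeoutError("operation did not finish within the allotted attempts")


# Stand-in for real API calls: the operation is pending twice, then done.
_responses = iter(
    [
        {"name": "operations/1"},
        {"name": "operations/1"},
        {"name": "operations/1", "done": True},
    ]
)
result = wait_for_operation(lambda: next(_responses), interval_seconds=0)
print(result)
```

A fixed interval keeps the sketch short; a production loop would more likely back off between attempts.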

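If the input files you want to access are in a bucket in a different project, you must grant access to that bucket before you can access the files. See Set up file access.

Send a request to a processor

The following code samples show how to send a batch process request to a processor.

REST

This sample shows how to send a POST request to the batchProcess method for asynchronous processing of large documents. The example uses the access token for a service account set up for the project using the Google Cloud CLI. For instructions on installing the Google Cloud CLI, setting up a project with a service account, and obtaining an access token, see Before you begin.

A batchProcess request starts a long-running operation and stores results in a Cloud Storage bucket. This sample also shows how to get the status of the long-running operation after it has started.

Send a processing request

Before using any of the request data, make the following replacements:

  • LOCATION: the location of your processor, for example:
    • us - United States
    • eu - European Union
  • PROJECT_ID: your Google Cloud project ID.
  • PROCESSOR_ID: the ID of your custom processor.
  • INPUT_BUCKET_FOLDER: a Cloud Storage bucket/directory to read input files from, expressed in the following form:
    • gs://bucket/directory/
    The requesting user must have read permission to the bucket.
  • MIME_TYPE: one of the valid MIME type options.
  • OUTPUT_BUCKET_FOLDER: a Cloud Storage bucket/directory to save output files to, expressed in the following form:
    • gs://bucket/directory/
    The requesting user must have write permission to the bucket.
  • skipHumanReview: a boolean that disables human review (supported only by human-in-the-loop processors).
    • true - skips human review
    • false - enables human review (default)
  • FIELD_MASK: specifies which fields to include in the Document output. This is a comma-separated list of fully qualified field names in FieldMask format.
    • Example: text,entities,pages.pageNumber

† Instead of using gcsPrefix to include all of the files in a GCS folder, you can also use documents to list each file individually:

  "inputDocuments": {
    "gcsDocuments": {
      "documents": [
        {
          "gcsUri": "gs://BUCKET/PATH/TO/DOCUMENT1.ext",
          "mimeType": "MIME_TYPE"
        },
        {
          "gcsUri": "gs://BUCKET/PATH/TO/DOCUMENT2.ext",
          "mimeType": "MIME_TYPE"
        }
      ]
    }
  }

HTTP method and URL: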

POST https://LOCATION-documentai.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID:batchProcess

Request JSON body:

{
  "inputDocuments": {
    "gcsPrefix": {
      "gcsUriPrefix": "INPUT_BUCKET_FOLDER"
    }
  },
  "documentOutputConfig": {
    "gcsOutputConfig": {
      "gcsUri": "OUTPUT_BUCKET_FOLDER",
      "fieldMask": "FIELD_MASK"
    }
  },
  "skipHumanReview": BOOLEAN
}
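The two ways of naming input files (a gcsPrefix folder or an explicit gcsDocuments list) can be captured in one small Python helper. This is a sketch only: build_batch_request, its parameters, and the bucket names are hypothetical, and the function just assembles the JSON body shown above.

```python
import json
from typing import List, Optional


def build_batch_request(
    output_uri: str,
    input_prefix: Optional[str] = None,
    documents: Optional[List[dict]] = None,
    field_mask: Optional[str] = None,
    skip_human_review: bool = False,
) -> dict:
    """Assemble a batchProcess body from a folder prefix or a document list."""
    if (input_prefix is None) == (documents is None):
        raise ValueError("provide exactly one of input_prefix or documents")
    if input_prefix is not None:
        # Every file under the folder is processed.
        input_documents = {"gcsPrefix": {"gcsUriPrefix": input_prefix}}
    else:
        # Each entry needs its own gcsUri and mimeType.
        input_documents = {"gcsDocuments": {"documents": documents}}
    gcs_output = {"gcsUri": output_uri}
    if field_mask:
        gcs_output["fieldMask"] = field_mask
    return {
        "inputDocuments": input_documents,
        "documentOutputConfig": {"gcsOutputConfig": gcs_output},
        "skipHumanReview": skip_human_review,
    }


by_prefix = build_batch_request(
    "gs://example-output/results/", input_prefix="gs://example-input/invoices/"
)
by_list = build_batch_request(
    "gs://example-output/results/",
    documents=[{"gcsUri": "gs://example-input/a.pdf", "mimeType": "application/pdf"}],
    field_mask="text,entities",
)
print(json.dumps(by_prefix, indent=2))
print(json.dumps(by_list, indent=2))
```

Rejecting calls that pass both or neither input form keeps an ambiguous request from being sent.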

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-documentai.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID:batchProcess"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-documentai.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID:batchProcess" | Select-Object -Expand Content

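You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID"
}

Send a request to a processor version

Before using any of the request data, make the following replacements:

  • LOCATION: the location of your processor, for example:
    • us - United States
    • eu - European Union
  • PROJECT_ID: your Google Cloud project ID.
  • PROCESSOR_ID: the ID of your custom processor.
  • PROCESSOR_VERSION: the processor version identifier. For more information, see Select a processor version. For example:
    • pretrained-TYPE-vX.X-YYYY-MM-DD
    • stable
    • rc
  • INPUT_BUCKET_FOLDER: a Cloud Storage bucket/directory to read input files from, expressed in the following form:
    • gs://bucket/directory/
    The requesting user must have read permission to the bucket.
  • MIME_TYPE: one of the valid MIME type options.
  • OUTPUT_BUCKET_FOLDER: a Cloud Storage bucket/directory to save output files to, expressed in the following form:
    • gs://bucket/directory/
    The requesting user must have write permission to the bucket.
  • skipHumanReview: a boolean that disables human review (supported only by human-in-the-loop processors).
    • true - skips human review
    • false - enables human review (default)
  • FIELD_MASK: specifies which fields to include in the Document output. This is a comma-separated list of fully qualified field names in FieldMask format.
    • Example: text,entities,pages.pageNumber

† Instead of using gcsPrefix to include all of the files in a GCS folder, you can also use documents to list each file individually:

  "inputDocuments": {
    "gcsDocuments": {
      "documents": [
        {
          "gcsUri": "gs://BUCKET/PATH/TO/DOCUMENT1.ext",
          "mimeType": "MIME_TYPE"
        },
        {
          "gcsUri": "gs://BUCKET/PATH/TO/DOCUMENT2.ext",
          "mimeType": "MIME_TYPE"
        }
      ]
    }
  }

HTTP method and URL: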

POST https://LOCATION-documentai.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID/processorVersions/PROCESSOR_VERSION:batchProcess

Request JSON body:

{
  "inputDocuments": {
    "gcsPrefix": {
      "gcsUriPrefix": "INPUT_BUCKET_FOLDER"
    }
  },
  "documentOutputConfig": {
    "gcsOutputConfig": {
      "gcsUri": "OUTPUT_BUCKET_FOLDER",
      "fieldMask": "FIELD_MASK"
    }
  },
  "skipHumanReview": BOOLEAN
}

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-documentai.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID/processorVersions/PROCESSOR_VERSION:batchProcess"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-documentai.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/processors/PROCESSOR_ID/processorVersions/PROCESSOR_VERSION:batchProcess" | Select-Object -Expand Content

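You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID"
}

If the request is successful, the Document AI API returns the name of the operation.

Get results

To get the results of your request, you must send a GET request to the operations resource. The following shows how to send such a request. For more information, see the long-running operations documentation.

Before using any of the request data, make the following replacements:

  • PROJECT_ID: your Google Cloud project ID.
  • LOCATION: the location where the LRO is running, for example:
    • us - United States
    • eu - European Union
  • OPERATION_ID: the ID of your operation. The ID is the last element of the operation name. For example:
    • Operation name: projects/PROJECT_ID/locations/LOCATION/operations/bc4e1d412863e626
    • Operation ID: bc4e1d412863e626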
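Since the operation ID is the last path element of the operation name, it can be extracted with a one-line split. The helper name operation_id_from_name below is hypothetical.

```python
def operation_id_from_name(operation_name: str) -> str:
    """Return the last path element of an operation name."""
    return operation_name.rstrip("/").rsplit("/", 1)[-1]


name = "projects/PROJECT_ID/locations/LOCATION/operations/bc4e1d412863e626"
operation_id = operation_id_from_name(name)
print(operation_id)
```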

HTTP method and URL:

GET https://LOCATION-documentai.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID

To send your request, choose one of these options:

curl

Execute the following command:

curl -X GET \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
"https://LOCATION-documentai.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID"

PowerShell

Execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method GET `
-Headers $headers `
-Uri "https://LOCATION-documentai.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.documentai.v1.BatchProcessMetadata",
    "state": "SUCCEEDED",
    "stateMessage": "Processed 1 document(s) successfully",
    "createTime": "TIMESTAMP",
    "updateTime": "TIMESTAMP",
    "individualProcessStatuses": [
      {
        "inputGcsSource": "INPUT_BUCKET_FOLDER/DOCUMENT1.ext",
        "status": {},
        "outputGcsDestination": "OUTPUT_BUCKET_FOLDER/OPERATION_ID/0",
        "humanReviewStatus": {
          "state": "ERROR",
          "stateMessage": "Sharded document protos are not supported for human review."
        }
      }
    ]
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.documentai.v1.BatchProcessResponse"
  }
}

The response body contains an instance of Operation with information about the status of the operation. If the operation has completed successfully, the metadata field is populated with an instance of BatchProcessMetadata, which contains information about the processed documents.
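A finished operation can be reduced to a per-document summary with a few lines of Python. This sketch assumes the response has been parsed into a dict shaped like the sample above, and it treats an empty status object as success (an assumption based on a status code of 0 being omitted from the JSON). The function name summarize_operation and the bucket paths are hypothetical.

```python
def summarize_operation(operation: dict) -> list:
    """Return (input, output, ok) tuples for each processed document."""
    if not operation.get("done"):
        raise ValueError("operation is still running")
    metadata = operation.get("metadata", {})
    summary = []
    for status in metadata.get("individualProcessStatuses", []):
        # Assumption: a missing "code" means 0 (OK); any other value is an error.
        ok = status.get("status", {}).get("code", 0) == 0
        summary.append(
            (status.get("inputGcsSource"), status.get("outputGcsDestination"), ok)
        )
    return summary


sample_operation = {
    "done": True,
    "metadata": {
        "state": "SUCCEEDED",
        "individualProcessStatuses": [
            {
                "inputGcsSource": "gs://example-input/doc1.pdf",
                "status": {},
                "outputGcsDestination": "gs://example-output/123/0",
            }
        ],
    },
}
for source, destination, ok in summarize_operation(sample_operation):
    print(source, "->", destination, "ok" if ok else "failed")
```

The outputGcsDestination values are the Cloud Storage locations to read the resulting Document JSON files from.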

C#

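For more information, see the Document AI C# API reference documentation.

To authenticate to Document AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.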

using Google.Api.Gax;
using Google.Cloud.DocumentAI.V1;
using Google.LongRunning;

public sealed partial class GeneratedDocumentProcessorServiceClientSnippets
{
    /// <summary>Snippet for BatchProcessDocuments</summary>
    /// <remarks>
    /// This snippet has been automatically generated and should be regarded as a code template only.
    /// It will require modifications to work:
    /// - It may require correct/in-range values for request initialization.
    /// - It may require specifying regional endpoints when creating the service client as shown in
    ///   https://cloud.google.com/dotnet/docs/reference/help/client-configuration#endpoint.
    /// </remarks>
    public void BatchProcessDocumentsRequestObject()
    {
        // Create client
        DocumentProcessorServiceClient documentProcessorServiceClient = DocumentProcessorServiceClient.Create();
        // Initialize request argument(s)
        BatchProcessRequest request = new BatchProcessRequest
        {
            ResourceName = new UnparsedResourceName("a/wildcard/resource"),
            SkipHumanReview = false,
            InputDocuments = new BatchDocumentsInputConfig(),
            DocumentOutputConfig = new DocumentOutputConfig(),
            ProcessOptions = new ProcessOptions(),
            Labels = { { "", "" }, },
        };
        // Make the request
        Operation<BatchProcessResponse, BatchProcessMetadata> response = documentProcessorServiceClient.BatchProcessDocuments(request);

        // Poll until the returned long-running operation is complete
        Operation<BatchProcessResponse, BatchProcessMetadata> completedResponse = response.PollUntilCompleted();
        // Retrieve the operation result
        BatchProcessResponse result = completedResponse.Result;

        // Or get the name of the operation
        string operationName = response.Name;
        // This name can be stored, then the long-running operation retrieved later by name
        Operation<BatchProcessResponse, BatchProcessMetadata> retrievedResponse = documentProcessorServiceClient.PollOnceBatchProcessDocuments(operationName);
        // Check if the retrieved long-running operation has completed
        if (retrievedResponse.IsCompleted)
        {
            // If it has completed, then access the result
            BatchProcessResponse retrievedResult = retrievedResponse.Result;
        }
    }
}

Go

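For more information, see the Document AI Go API reference documentation.

To authenticate to Document AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.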

package main

import (
	"context"

	documentai "cloud.google.com/go/documentai/apiv1"
	documentaipb "cloud.google.com/go/documentai/apiv1/documentaipb"
)

func main() {
	ctx := context.Background()
	// This snippet has been automatically generated and should be regarded as a code template only.
	// It will require modifications to work:
	// - It may require correct/in-range values for request initialization.
	// - It may require specifying regional endpoints when creating the service client as shown in:
	//   https://pkg.go.dev/cloud.google.com/go#hdr-Client_Options
	c, err := documentai.NewDocumentProcessorClient(ctx)
	if err != nil {
		// TODO: Handle error.
	}
	defer c.Close()

	req := &documentaipb.BatchProcessRequest{
		// TODO: Fill request struct fields.
		// See https://pkg.go.dev/cloud.google.com/go/documentai/apiv1/documentaipb#BatchProcessRequest.
	}
	op, err := c.BatchProcessDocuments(ctx, req)
	if err != nil {
		// TODO: Handle error.
	}

	resp, err := op.Wait(ctx)
	if err != nil {
		// TODO: Handle error.
	}
	// TODO: Use resp.
	_ = resp
}

Java

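For more information, see the Document AI Java API reference documentation.

To authenticate to Document AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.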

import com.google.api.gax.longrunning.OperationFuture;
import com.google.api.gax.paging.Page;
import com.google.cloud.documentai.v1.BatchDocumentsInputConfig;
import com.google.cloud.documentai.v1.BatchProcessMetadata;
import com.google.cloud.documentai.v1.BatchProcessRequest;
import com.google.cloud.documentai.v1.BatchProcessResponse;
import com.google.cloud.documentai.v1.Document;
import com.google.cloud.documentai.v1.DocumentOutputConfig;
import com.google.cloud.documentai.v1.DocumentOutputConfig.GcsOutputConfig;
import com.google.cloud.documentai.v1.DocumentProcessorServiceClient;
import com.google.cloud.documentai.v1.DocumentProcessorServiceSettings;
import com.google.cloud.documentai.v1.GcsDocument;
import com.google.cloud.documentai.v1.GcsDocuments;
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Bucket;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import com.google.protobuf.util.JsonFormat;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BatchProcessDocument {
  public static void batchProcessDocument()
      throws IOException, InterruptedException, TimeoutException, ExecutionException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "your-project-id";
    String location = "your-project-location"; // Format is "us" or "eu".
    String processorId = "your-processor-id";
    String outputGcsBucketName = "your-gcs-bucket-name";
    String outputGcsPrefix = "PREFIX";
    String inputGcsUri = "gs://your-gcs-bucket/path/to/input/file.pdf";
    batchProcessDocument(
        projectId, location, processorId, inputGcsUri, outputGcsBucketName, outputGcsPrefix);
  }

  public static void batchProcessDocument(
      String projectId,
      String location,
      String processorId,
      String gcsInputUri,
      String gcsOutputBucketName,
      String gcsOutputUriPrefix)
      throws IOException, InterruptedException, TimeoutException, ExecutionException {
    // Initialize client that will be used to send requests. This client only needs
    // to be created once, and can be reused for multiple requests. After completing
    // all of your requests, call the "close" method on the client to safely clean up
    // any remaining background resources.
    String endpoint = String.format("%s-documentai.googleapis.com:443", location);
    DocumentProcessorServiceSettings settings =
        DocumentProcessorServiceSettings.newBuilder().setEndpoint(endpoint).build();
    try (DocumentProcessorServiceClient client = DocumentProcessorServiceClient.create(settings)) {
      // The full resource name of the processor, e.g.:
      // projects/project-id/locations/location/processors/processor-id
      // You must create new processors in the Cloud Console first
      String name =
          String.format("projects/%s/locations/%s/processors/%s", projectId, location, processorId);

      GcsDocument gcsDocument =
          GcsDocument.newBuilder().setGcsUri(gcsInputUri).setMimeType("application/pdf").build();

      GcsDocuments gcsDocuments = GcsDocuments.newBuilder().addDocuments(gcsDocument).build();

      BatchDocumentsInputConfig inputConfig =
          BatchDocumentsInputConfig.newBuilder().setGcsDocuments(gcsDocuments).build();

      String fullGcsPath = String.format("gs://%s/%s/", gcsOutputBucketName, gcsOutputUriPrefix);
      GcsOutputConfig gcsOutputConfig = GcsOutputConfig.newBuilder().setGcsUri(fullGcsPath).build();

      DocumentOutputConfig documentOutputConfig =
          DocumentOutputConfig.newBuilder().setGcsOutputConfig(gcsOutputConfig).build();

      // Configure the batch process request.
      BatchProcessRequest request =
          BatchProcessRequest.newBuilder()
              .setName(name)
              .setInputDocuments(inputConfig)
              .setDocumentOutputConfig(documentOutputConfig)
              .build();

      OperationFuture<BatchProcessResponse, BatchProcessMetadata> future =
          client.batchProcessDocumentsAsync(request);

      // Batch process document using a long-running operation.
      // You can wait for now, or get results later.
      // Note: first request to the service takes longer than subsequent
      // requests.
      System.out.println("Waiting for operation to complete...");
      future.get();

      System.out.println("Document processing complete.");

      Storage storage = StorageOptions.newBuilder().setProjectId(projectId).build().getService();
      Bucket bucket = storage.get(gcsOutputBucketName);

      // List all of the files in the Storage bucket.
      Page<Blob> blobs = bucket.list(Storage.BlobListOption.prefix(gcsOutputUriPrefix + "/"));
      int idx = 0;
      for (Blob blob : blobs.iterateAll()) {
        if (!blob.isDirectory()) {
          System.out.printf("Fetched file #%d\n", ++idx);
          // Read the results

          // Download and store json data in a temp file.
          File tempFile = File.createTempFile("file", ".json");
          Blob fileInfo = storage.get(BlobId.of(gcsOutputBucketName, blob.getName()));
          fileInfo.downloadTo(tempFile.toPath());

          // Parse json file into Document.
          FileReader reader = new FileReader(tempFile);
          Document.Builder builder = Document.newBuilder();
          JsonFormat.parser().merge(reader, builder);

          Document document = builder.build();

          // Get all of the document text as one big string.
          String text = document.getText();

          // Read the text recognition output from the processor
          System.out.println("The document contains the following paragraphs:");
          Document.Page page1 = document.getPages(0);
          List<Document.Page.Paragraph> paragraphList = page1.getParagraphsList();
          for (Document.Page.Paragraph paragraph : paragraphList) {
            String paragraphText = getText(paragraph.getLayout().getTextAnchor(), text);
            System.out.printf("Paragraph text:%s\n", paragraphText);
          }

          // Form parsing provides additional output about
          // form-formatted PDFs. You must create a form
          // processor in the Cloud Console to see full field details.
          System.out.println("The following form key/value pairs were detected:");

          for (Document.Page.FormField field : page1.getFormFieldsList()) {
            String fieldName = getText(field.getFieldName().getTextAnchor(), text);
            String fieldValue = getText(field.getFieldValue().getTextAnchor(), text);

            System.out.println("Extracted form fields pair:");
            System.out.printf("\t(%s, %s)\n", fieldName, fieldValue);
          }

          // Clean up temp file.
          tempFile.deleteOnExit();
        }
      }
    }
  }

  // Extract shards from the text field
  private static String getText(Document.TextAnchor textAnchor, String text) {
    if (textAnchor.getTextSegmentsList().size() > 0) {
      int startIdx = (int) textAnchor.getTextSegments(0).getStartIndex();
      int endIdx = (int) textAnchor.getTextSegments(0).getEndIndex();
      return text.substring(startIdx, endIdx);
    }
    return "[NO TEXT]";
  }
}

Node.js

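For more information, see the Document AI Node.js API reference documentation.

To authenticate to Document AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.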

/**
 * TODO(developer): Uncomment these variables before running the sample.
 */
// const projectId = 'YOUR_PROJECT_ID';
// const location = 'YOUR_PROJECT_LOCATION'; // Format is 'us' or 'eu'
// const processorId = 'YOUR_PROCESSOR_ID';
// const gcsInputUri = 'YOUR_SOURCE_PDF';
// const gcsOutputUri = 'YOUR_STORAGE_BUCKET';
// const gcsOutputUriPrefix = 'YOUR_STORAGE_PREFIX';

// Imports the Google Cloud client library
const {DocumentProcessorServiceClient} =
  require('@google-cloud/documentai').v1;
const {Storage} = require('@google-cloud/storage');

// Instantiates Document AI, Storage clients
const client = new DocumentProcessorServiceClient();
const storage = new Storage();

const {default: PQueue} = require('p-queue');

async function batchProcessDocument() {
  const name = `projects/${projectId}/locations/${location}/processors/${processorId}`;

  // Configure the batch process request.
  const request = {
    name,
    inputDocuments: {
      gcsDocuments: {
        documents: [
          {
            gcsUri: gcsInputUri,
            mimeType: 'application/pdf',
          },
        ],
      },
    },
    documentOutputConfig: {
      gcsOutputConfig: {
        gcsUri: `${gcsOutputUri}/${gcsOutputUriPrefix}/`,
      },
    },
  };

  // Batch process document using a long-running operation.
  // You can wait for now, or get results later.
  // Note: first request to the service takes longer than subsequent
  // requests.
  const [operation] = await client.batchProcessDocuments(request);

  // Wait for operation to complete.
  await operation.promise();
  console.log('Document processing complete.');

  // Query Storage bucket for the results file(s).
  const query = {
    prefix: gcsOutputUriPrefix,
  };

  console.log('Fetching results ...');

  // List all of the files in the Storage bucket
  const [files] = await storage.bucket(gcsOutputUri).getFiles(query);

  // Add all asynchronous downloads to queue for execution.
  const queue = new PQueue({concurrency: 15});
  const tasks = files.map((fileInfo, index) => async () => {
    // Get the file as a buffer
    const [file] = await fileInfo.download();

    console.log(`Fetched file #${index + 1}:`);

    // The results stored in the output Storage location
    // are formatted as a document object.
    const document = JSON.parse(file.toString());
    const {text} = document;

    // Extract shards from the text field
    const getText = textAnchor => {
      if (!textAnchor.textSegments || textAnchor.textSegments.length === 0) {
        return '';
      }

      // First shard in document doesn't have startIndex property
      const startIndex = textAnchor.textSegments[0].startIndex || 0;
      const endIndex = textAnchor.textSegments[0].endIndex;

      return text.substring(startIndex, endIndex);
    };

    // Read the text recognition output from the processor
    console.log('The document contains the following paragraphs:');

    const [page1] = document.pages;
    const {paragraphs} = page1;
    for (const paragraph of paragraphs) {
      const paragraphText = getText(paragraph.layout.textAnchor);
      console.log(`Paragraph text:\n${paragraphText}`);
    }

    // Form parsing provides additional output about
    // form-formatted PDFs. You must create a form
    // processor in the Cloud Console to see full field details.
    console.log('\nThe following form key/value pairs were detected:');

    const {formFields} = page1;
    for (const field of formFields) {
      const fieldName = getText(field.fieldName.textAnchor);
      const fieldValue = getText(field.fieldValue.textAnchor);

      console.log('Extracted key value pair:');
      console.log(`\t(${fieldName}, ${fieldValue})`);
    }
  });
  await queue.addAll(tasks);
}

Python

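For more information, see the Document AI Python API reference documentation.

To authenticate to Document AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.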

import re
from typing import Optional

from google.api_core.client_options import ClientOptions
from google.api_core.exceptions import InternalServerError
from google.api_core.exceptions import RetryError
from google.cloud import documentai  # type: ignore
from google.cloud import storage

# TODO(developer): Uncomment these variables before running the sample.
# project_id = "YOUR_PROJECT_ID"
# location = "YOUR_PROCESSOR_LOCATION" # Format is "us" or "eu"
# processor_id = "YOUR_PROCESSOR_ID" # Create processor before running sample
# gcs_output_uri = "YOUR_OUTPUT_URI" # Must end with a trailing slash `/`. Format: gs://bucket/directory/subdirectory/
# processor_version_id = "YOUR_PROCESSOR_VERSION_ID" # Optional. Example: pretrained-ocr-v1.0-2020-09-23

# TODO(developer): You must specify either `gcs_input_uri` and `mime_type` or `gcs_input_prefix`
# gcs_input_uri = "YOUR_INPUT_URI" # Format: gs://bucket/directory/file.pdf
# input_mime_type = "application/pdf"
# gcs_input_prefix = "YOUR_INPUT_URI_PREFIX" # Format: gs://bucket/directory/
# field_mask = "text,entities,pages.pageNumber"  # Optional. The fields to return in the Document object.


def batch_process_documents(
    project_id: str,
    location: str,
    processor_id: str,
    gcs_output_uri: str,
    processor_version_id: Optional[str] = None,
    gcs_input_uri: Optional[str] = None,
    input_mime_type: Optional[str] = None,
    gcs_input_prefix: Optional[str] = None,
    field_mask: Optional[str] = None,
    timeout: int = 400,
) -> None:
    # You must set the `api_endpoint` if you use a location other than "us".
    opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")

    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    if gcs_input_uri:
        # Specify specific GCS URIs to process individual documents
        gcs_document = documentai.GcsDocument(
            gcs_uri=gcs_input_uri, mime_type=input_mime_type
        )
        # Load GCS Input URI into a List of document files
        gcs_documents = documentai.GcsDocuments(documents=[gcs_document])
        input_config = documentai.BatchDocumentsInputConfig(gcs_documents=gcs_documents)
    else:
        # Specify a GCS URI Prefix to process an entire directory
        gcs_prefix = documentai.GcsPrefix(gcs_uri_prefix=gcs_input_prefix)
        input_config = documentai.BatchDocumentsInputConfig(gcs_prefix=gcs_prefix)

    # Cloud Storage URI for the Output Directory
    gcs_output_config = documentai.DocumentOutputConfig.GcsOutputConfig(
        gcs_uri=gcs_output_uri, field_mask=field_mask
    )

    # Where to write results
    output_config = documentai.DocumentOutputConfig(gcs_output_config=gcs_output_config)

    if processor_version_id:
        # The full resource name of the processor version, e.g.:
        # projects/{project_id}/locations/{location}/processors/{processor_id}/processorVersions/{processor_version_id}
        name = client.processor_version_path(
            project_id, location, processor_id, processor_version_id
        )
    else:
        # The full resource name of the processor, e.g.:
        # projects/{project_id}/locations/{location}/processors/{processor_id}
        name = client.processor_path(project_id, location, processor_id)

    request = documentai.BatchProcessRequest(
        name=name,
        input_documents=input_config,
        document_output_config=output_config,
    )

    # BatchProcess returns a Long Running Operation (LRO)
    operation = client.batch_process_documents(request)

    # Continually polls the operation until it is complete.
    # This could take some time for larger files
    # Format: projects/{project_id}/locations/{location}/operations/{operation_id}
    try:
        print(f"Waiting for operation {operation.operation.name} to complete...")
        operation.result(timeout=timeout)
    # Catch exception when operation doesn't finish before timeout
    except (RetryError, InternalServerError) as e:
        print(e.message)

    # NOTE: Can also use callbacks for asynchronous processing
    #
    # def my_callback(future):
    #   result = future.result()
    #
    # operation.add_done_callback(my_callback)

    # After the operation is complete,
    # get output document information from operation metadata
    metadata = documentai.BatchProcessMetadata(operation.metadata)

    if metadata.state != documentai.BatchProcessMetadata.State.SUCCEEDED:
        raise ValueError(f"Batch Process Failed: {metadata.state_message}")

    storage_client = storage.Client()

    print("Output files:")
    # One process per Input Document
    for process in list(metadata.individual_process_statuses):
        # output_gcs_destination format: gs://BUCKET/PREFIX/OPERATION_NUMBER/INPUT_FILE_NUMBER/
        # The Cloud Storage API requires the bucket name and URI prefix separately
        matches = re.match(r"gs://(.*?)/(.*)", process.output_gcs_destination)
        if not matches:
            print(
                "Could not parse output GCS destination:",
                process.output_gcs_destination,
            )
            continue

        output_bucket, output_prefix = matches.groups()

        # Get List of Document Objects from the Output Bucket
        output_blobs = storage_client.list_blobs(output_bucket, prefix=output_prefix)

        # Document AI may output multiple JSON files per source file
        for blob in output_blobs:
            # Document AI should only output JSON files to GCS
            if blob.content_type != "application/json":
                print(
                    f"Skipping non-supported file: {blob.name} - Mimetype: {blob.content_type}"
                )
                continue

            # Download JSON File as bytes object and convert to Document Object
            print(f"Fetching {blob.name}")
            document = documentai.Document.from_json(
                blob.download_as_bytes(), ignore_unknown_fields=True
            )

            # For a full list of Document object attributes, please reference this page:
            # https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.Document

            # Read the text recognition output from the processor
            print("The document contains the following text:")
            print(document.text)
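
The Java and Node.js samples resolve paragraph and form-field text through text anchors, while the Python sample prints only `document.text`. The following is a minimal sketch of the same lookup in Python, assuming you work with the output JSON as a plain dictionary (for example, from `json.loads`). The helper name `text_from_anchor` and the sample data are hypothetical and not part of the client library. Unlike the samples above, it joins every segment instead of reading only the first one, and it treats the indexes as character offsets into the text, as those samples do.

```python
from typing import Any, Dict


def text_from_anchor(text_anchor: Dict[str, Any], full_text: str) -> str:
    """Join the text of every segment referenced by a text anchor."""
    pieces = []
    for segment in text_anchor.get("textSegments", []):
        # The first shard may omit startIndex, and JSON output may carry
        # the indexes as strings, so default to 0 and convert explicitly.
        start = int(segment.get("startIndex", 0))
        end = int(segment.get("endIndex", 0))
        pieces.append(full_text[start:end])
    return "".join(pieces)


# Hypothetical, hand-written stand-in for one output JSON file.
sample_document = {
    "text": "Name: Ada\nCity: London\n",
    "pages": [
        {
            "paragraphs": [
                {"layout": {"textAnchor": {"textSegments": [{"endIndex": "10"}]}}},
                {
                    "layout": {
                        "textAnchor": {
                            "textSegments": [{"startIndex": "10", "endIndex": "23"}]
                        }
                    }
                },
            ]
        }
    ],
}

for paragraph in sample_document["pages"][0]["paragraphs"]:
    anchor = paragraph["layout"]["textAnchor"]
    print(repr(text_from_anchor(anchor, sample_document["text"])))
```

An anchor without segments yields an empty string here, which mirrors the Node.js helper. The Java helper returns "[NO TEXT]" instead.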

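PHP

For more information, see the Document AI PHP API reference documentation.

To authenticate to Document AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.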

use Google\ApiCore\ApiException;
use Google\ApiCore\OperationResponse;
use Google\Cloud\DocumentAI\V1\BatchProcessRequest;
use Google\Cloud\DocumentAI\V1\BatchProcessResponse;
use Google\Cloud\DocumentAI\V1\Client\DocumentProcessorServiceClient;
use Google\Rpc\Status;

/**
 * LRO endpoint to batch process many documents. The output is written
 * to Cloud Storage as JSON in the [Document] format.
 *
 * @param string $name The resource name of
 *                     [Processor][google.cloud.documentai.v1.Processor] or
 *                     [ProcessorVersion][google.cloud.documentai.v1.ProcessorVersion].
 *                     Format: `projects/{project}/locations/{location}/processors/{processor}`,
 *                     or
 *                     `projects/{project}/locations/{location}/processors/{processor}/processorVersions/{processorVersion}`
 */
function batch_process_documents_sample(string $name): void
{
    // Create a client.
    $documentProcessorServiceClient = new DocumentProcessorServiceClient();

    // Prepare the request message.
    $request = (new BatchProcessRequest())
        ->setName($name);

    // Call the API and handle any network failures.
    try {
        /** @var OperationResponse $response */
        $response = $documentProcessorServiceClient->batchProcessDocuments($request);
        $response->pollUntilComplete();

        if ($response->operationSucceeded()) {
            /** @var BatchProcessResponse $result */
            $result = $response->getResult();
            printf('Operation successful with response data: %s' . PHP_EOL, $result->serializeToJsonString());
        } else {
            /** @var Status $error */
            $error = $response->getError();
            printf('Operation failed with error data: %s' . PHP_EOL, $error->serializeToJsonString());
        }
    } catch (ApiException $ex) {
        printf('Call failed with message: %s' . PHP_EOL, $ex->getMessage());
    }
}

/**
 * Helper to execute the sample.
 *
 * This sample has been automatically generated and should be regarded as a code
 * template only. It will require modifications to work:
 *  - It may require correct/in-range values for request initialization.
 *  - It may require specifying regional endpoints when creating the service client,
 *    please see the apiEndpoint client configuration option for more details.
 */
function callSample(): void
{
    $name = '[NAME]';

    batch_process_documents_sample($name);
}

Ruby

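For more information, see the Document AI Ruby API reference documentation.

To authenticate to Document AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.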

require "google/cloud/document_ai/v1"

##
# Snippet for the batch_process_documents call in the DocumentProcessorService service
#
# This snippet has been automatically generated and should be regarded as a code
# template only. It will require modifications to work:
# - It may require correct/in-range values for request initialization.
# - It may require specifying regional endpoints when creating the service
# client as shown in https://cloud.google.com/ruby/docs/reference.
#
# This is an auto-generated example demonstrating basic usage of
# Google::Cloud::DocumentAI::V1::DocumentProcessorService::Client#batch_process_documents.
#
def batch_process_documents
  # Create a client object. The client can be reused for multiple calls.
  client = Google::Cloud::DocumentAI::V1::DocumentProcessorService::Client.new

  # Create a request. To set request fields, pass in keyword arguments.
  request = Google::Cloud::DocumentAI::V1::BatchProcessRequest.new

  # Call the batch_process_documents method.
  result = client.batch_process_documents request

  # The returned object is of type Gapic::Operation. You can use it to
  # check the status of an operation, cancel it, or wait for results.
  # Here is how to wait for a response.
  result.wait_until_done! timeout: 60
  if result.response?
    p result.response
  else
    puts "No response received."
  end
end

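Create document batches with the Python SDK

Batch processing accepts at most 1,000 files per request. If you have more documents to process, you must split them into multiple batches.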
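
To illustrate the splitting itself, here is a minimal sketch that chunks a list of Cloud Storage URIs into groups of at most 1,000. It uses only the Python standard library. The helper name `chunk_uris`, the constant `MAX_FILES_PER_BATCH`, and the bucket name are hypothetical, and each resulting group would still need to be sent as its own batch request.

```python
from typing import Iterator, List, Sequence

# Per-request file limit stated in this section.
MAX_FILES_PER_BATCH = 1000


def chunk_uris(
    uris: Sequence[str], batch_size: int = MAX_FILES_PER_BATCH
) -> Iterator[List[str]]:
    """Yield consecutive slices of `uris`, each holding at most `batch_size` items."""
    if not 1 <= batch_size <= MAX_FILES_PER_BATCH:
        raise ValueError(f"batch_size must be between 1 and {MAX_FILES_PER_BATCH}")
    for start in range(0, len(uris), batch_size):
        yield list(uris[start : start + batch_size])


# Hypothetical input: 2,500 documents in an example bucket.
uris = [f"gs://example-bucket/invoices/{i}.pdf" for i in range(2500)]
batches = list(chunk_uris(uris))
print([len(batch) for batch in batches])  # [1000, 1000, 500]
```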

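Document AI Toolbox is an SDK for Python that provides utility functions for Document AI. One of these functions creates batches of documents from a Cloud Storage folder for processing.

To learn more about how Document AI Toolbox helps with post-processing, see Handle the processing response.

Code samples

The following code sample demonstrates how to use Document AI Toolbox.

Document batches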

from google.cloud import documentai
from google.cloud.documentai_toolbox import gcs_utilities

# TODO(developer): Uncomment these variables before running the sample.
# Given unprocessed documents in path gs://bucket/path/to/folder
# gcs_bucket_name = "bucket"
# gcs_prefix = "path/to/folder"
# batch_size = 50


def create_batches_sample(
    gcs_bucket_name: str,
    gcs_prefix: str,
    batch_size: int = 50,
) -> None:
    # Creating batches of documents for processing
    batches = gcs_utilities.create_batches(
        gcs_bucket_name=gcs_bucket_name, gcs_prefix=gcs_prefix, batch_size=batch_size
    )

    print(f"{len(batches)} batch(es) created.")
    for batch in batches:
        print(f"{len(batch.gcs_documents.documents)} files in batch.")
        print(batch.gcs_documents.documents)

        # Use as input for batch_process_documents()
        # Refer to https://cloud.google.com/document-ai/docs/send-request
        # for how to send a batch processing request
        request = documentai.BatchProcessRequest(
            name="processor_name", input_documents=batch
        )
        print(request)
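
The sample above takes the bucket name and the prefix as separate arguments, while the batch processing samples use full `gs://` URIs. If you start from a full URI, a small helper can split it. The following is a sketch using only the Python standard library. The function name `split_gcs_uri` is hypothetical and not part of Document AI Toolbox.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_gcs_uri(gcs_uri: str) -> Tuple[str, str]:
    """Split gs://bucket/path/to/folder into ("bucket", "path/to/folder")."""
    parsed = urlparse(gcs_uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"Not a Cloud Storage URI: {gcs_uri!r}")
    # Drop the leading and trailing slashes so the prefix matches the
    # "path/to/folder" form used in the sample above.
    return parsed.netloc, parsed.path.strip("/")


gcs_bucket_name, gcs_prefix = split_gcs_uri("gs://bucket/path/to/folder/")
print(gcs_bucket_name, gcs_prefix)  # bucket path/to/folder
```

The two values could then be passed as `gcs_bucket_name` and `gcs_prefix` to `create_batches_sample`.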