This page shows you how to create a Vertex AI dataset from your text data so you can start training classification models. You can create a dataset using either the Google Cloud console or the Vertex AI API.
Create an empty dataset and import or associate your data
Google Cloud console
Use the following instructions to create an empty dataset and either import or associate your data.
- In the Google Cloud console, in the Vertex AI section, go to the Datasets page.
- Click Create to open the create dataset details page.
- Modify the Dataset name field to create a descriptive dataset display name.
- Select the Text tab.
- Select Single-label classification or Multi-label classification.
- Select a region from the Region drop-down list.
- Click Create to create your empty dataset and advance to the data import page.
- Choose one of the following options from the Select an import method section:
Upload data from your computer
- In the Select an import method section, choose to upload data from your computer.
- Click Select files and choose all the local files to upload to a Cloud Storage bucket.
- In the Select a Cloud Storage path section, click Browse to choose a Cloud Storage bucket location to upload your data to.
Upload an import file from your computer
- Click Upload an import file from your computer.
- Click Select files and choose the local import file to upload to a Cloud Storage bucket.
- In the Select a Cloud Storage path section, click Browse to choose a Cloud Storage bucket location to upload your file to.
Select an import file from Cloud Storage
- Click Select an import file from Cloud Storage.
- In the Select a Cloud Storage path section, click Browse to choose the import file in Cloud Storage.
- Click Continue.
Data import can take several hours, depending on the size of your data. You can close this tab and return to it later. You will receive an email when your data is imported.
API
To create a machine learning model, you must first have a representative collection of data to train with. After importing data, you can make modifications and start model training.
Create a dataset
Use the following samples to create a dataset for your data.
REST
Before using any of the request data, make the following replacements:
- LOCATION: Region where the dataset will be stored. This must be a region that supports dataset resources. For example, us-central1. See List of available locations.
- PROJECT_ID: Your project ID.
- DATASET_NAME: Name for the dataset.
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasets
Request JSON body:
{
  "display_name": "DATASET_NAME",
  "metadata_schema_uri": "gs://google-cloud-aiplatform/schema/dataset/metadata/text_1.0.0.yaml"
}
To send your request, choose one of these options:
curl
Save the request body in a file named request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasets"
PowerShell
Save the request body in a file named request.json, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasets" | Select-Object -Expand Content
You should see output similar to the following. You can use the OPERATION_ID in the response to get the status of the operation.
{
  "name": "projects/PROJECT_NUMBER/locations/LOCATION/datasets/DATASET_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1.CreateDatasetOperationMetadata",
    "genericMetadata": {
      "createTime": "2020-07-07T21:27:35.964882Z",
      "updateTime": "2020-07-07T21:27:35.964882Z"
    }
  }
}
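If you prefer to assemble the same create request from a script, the following is a minimal Python sketch that uses only the standard library. It builds the request but does not send it. The function name `build_create_dataset_request` and the placeholder values are assumptions made for illustration; the URL, headers, and body fields mirror the REST request shown above.

```python
import json
import urllib.request

# Metadata schema URI for text datasets, copied from the request body above.
TEXT_METADATA_SCHEMA_URI = (
    "gs://google-cloud-aiplatform/schema/dataset/metadata/text_1.0.0.yaml"
)


def build_create_dataset_request(project_id, location, dataset_name, access_token):
    """Build (but do not send) the POST request that creates an empty text dataset."""
    url = (
        f"https://{location}-aiplatform.googleapis.com/v1/"
        f"projects/{project_id}/locations/{location}/datasets"
    )
    body = {
        "display_name": dataset_name,
        "metadata_schema_uri": TEXT_METADATA_SCHEMA_URI,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical placeholder values; a real token would come from
    # `gcloud auth print-access-token`.
    request = build_create_dataset_request(
        "my-project", "us-central1", "my-text-dataset", "ACCESS_TOKEN"
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # To actually send the request: urllib.request.urlopen(request)
```

Keeping request construction separate from sending lets you inspect the URL and body before making a call that creates a resource.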
Terraform
The following sample uses the google_vertex_ai_dataset Terraform resource to create a text dataset named text-dataset.
To learn how to apply or remove a Terraform configuration, see Basic Terraform commands.
Java
Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Node.js
Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
The following sample uses the Vertex AI SDK for Python to both create a dataset and import data. If you run this sample code, then you can skip the Import data section of this guide.
This particular sample imports data for single-label classification. If your model has a different objective, then you must adjust the code.
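As a rough sketch of what such a create-and-import call can look like with the Vertex AI SDK for Python, consider the following. The function name, parameter names, and argument values are illustrative assumptions, and the code requires the `google-cloud-aiplatform` package plus Application Default Credentials to run against a real project. Treat it as a starting point, not as the canonical sample.

```python
def create_and_import_text_dataset(project, location, display_name, src_uris, sync=True):
    """Create a text dataset and import single-label classification data into it.

    src_uris is a Cloud Storage path (or list of paths) to the import file(s).
    """
    # Imported inside the function so this module still loads where the SDK
    # is not installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location)

    dataset = aiplatform.TextDataset.create(
        display_name=display_name,
        gcs_source=src_uris,
        # Single-label objective; a different objective needs a different schema.
        import_schema_uri=(
            aiplatform.schema.dataset.ioformat.text.single_label_classification
        ),
        sync=sync,
    )
    # Block until dataset creation and data import have finished.
    dataset.wait()

    print(dataset.display_name)
    print(dataset.resource_name)
    return dataset
```

A call might look like `create_and_import_text_dataset("PROJECT_ID", "us-central1", "DATASET_NAME", "IMPORT_FILE_URI")`, with the placeholders replaced by your own values.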
Import data
After you create an empty dataset, you can import your data into the dataset. If you used the Vertex AI SDK for Python to create the dataset, then you might have already imported data when you created the dataset. If so, you can skip this section.
Select the tab below for your objective:
Single-label classification
REST
Before using any of the request data, make the following replacements:
- LOCATION: Region where your dataset will be stored. For example, us-central1.
- PROJECT_ID: Your project ID.
- DATASET_ID: ID of the dataset.
- IMPORT_FILE_URI: Path to the CSV or JSON Lines file in Cloud Storage that lists data items stored in Cloud Storage to use for model training; for import file formats and limitations, see Preparing text data.
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasets/DATASET_ID:import
Request JSON body:
{
  "import_configs": [
    {
      "gcs_source": {
        "uris": "IMPORT_FILE_URI"
      },
      "import_schema_uri": "gs://google-cloud-aiplatform/schema/dataset/ioformat/text_classification_single_label_io_format_1.0.0.yaml"
    }
  ]
}
To send your request, choose one of these options:
curl
Save the request body in a file named request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasets/DATASET_ID:import"
PowerShell
Save the request body in a file named request.json, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasets/DATASET_ID:import" | Select-Object -Expand Content
You should see output similar to the following. You can use the OPERATION_ID in the response to get the status of the operation.
{
  "name": "projects/PROJECT_NUMBER/locations/LOCATION/datasets/DATASET_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1.ImportDataOperationMetadata",
    "genericMetadata": {
      "createTime": "2020-07-08T20:32:02.543801Z",
      "updateTime": "2020-07-08T20:32:02.543801Z"
    }
  }
}
Java
Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Node.js
Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
Multi-label classification
REST
Before using any of the request data, make the following replacements:
- LOCATION: Region where your dataset will be stored. For example, us-central1.
- PROJECT_ID: Your project ID.
- DATASET_ID: ID of the dataset.
- IMPORT_FILE_URI: Path to the CSV or JSON Lines file in Cloud Storage that lists data items stored in Cloud Storage to use for model training; for import file formats and limitations, see Preparing text data.
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasets/DATASET_ID:import
Request JSON body:
{
  "import_configs": [
    {
      "gcs_source": {
        "uris": "IMPORT_FILE_URI"
      },
      "import_schema_uri": "gs://google-cloud-aiplatform/schema/dataset/ioformat/text_classification_multi_label_io_format_1.0.0.yaml"
    }
  ]
}
To send your request, choose one of these options:
curl
Save the request body in a file named request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasets/DATASET_ID:import"
PowerShell
Save the request body in a file named request.json, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/datasets/DATASET_ID:import" | Select-Object -Expand Content
You should see output similar to the following. You can use the OPERATION_ID in the response to get the status of the operation.
{
  "name": "projects/PROJECT_NUMBER/locations/LOCATION/datasets/DATASET_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1.ImportDataOperationMetadata",
    "genericMetadata": {
      "createTime": "2020-07-08T20:32:02.543801Z",
      "updateTime": "2020-07-08T20:32:02.543801Z"
    }
  }
}
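The single-label and multi-label import requests above differ only in the import schema URI. The following Python sketch makes that explicit by selecting the schema from the objective and building the URL and body for the `:import` call. The function name `build_import_request` and the objective keys `single_label` and `multi_label` are hypothetical names chosen for this example. The sketch passes the import file as a one-element list, on the assumption that `uris` accepts a list of Cloud Storage paths.

```python
import json

# Import schema URIs copied from the two request bodies above.
IMPORT_SCHEMA_URIS = {
    "single_label": (
        "gs://google-cloud-aiplatform/schema/dataset/ioformat/"
        "text_classification_single_label_io_format_1.0.0.yaml"
    ),
    "multi_label": (
        "gs://google-cloud-aiplatform/schema/dataset/ioformat/"
        "text_classification_multi_label_io_format_1.0.0.yaml"
    ),
}


def build_import_request(project_id, location, dataset_id, import_file_uri, objective):
    """Return the (url, body) pair for importing data into an existing dataset."""
    if objective not in IMPORT_SCHEMA_URIS:
        raise ValueError(f"Unknown objective: {objective!r}")
    url = (
        f"https://{location}-aiplatform.googleapis.com/v1/"
        f"projects/{project_id}/locations/{location}/datasets/{dataset_id}:import"
    )
    body = {
        "import_configs": [
            {
                # Assumption: the import file is passed as a one-element list.
                "gcs_source": {"uris": [import_file_uri]},
                "import_schema_uri": IMPORT_SCHEMA_URIS[objective],
            }
        ]
    }
    return url, body


if __name__ == "__main__":
    # Hypothetical placeholder values for illustration only.
    url, body = build_import_request(
        "my-project", "us-central1", "1234567890",
        "gs://my-bucket/import.csv", "single_label",
    )
    print("POST", url)
    print(json.dumps(body, indent=2))
```

You can write the printed body to request.json and send it with the curl or PowerShell command shown above.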
Get operation status
Some requests start long-running operations that require time to complete. These requests return an operation name, which you can use to view the operation's status or cancel the operation. Vertex AI provides helper methods to make calls against long-running operations. For more information, see Working with long-running operations.
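As an illustration, the following Python sketch extracts the operation name from a create or import response and derives a status URL from it. It assumes that the status of an operation can be read with a GET request to the `v1/` path followed by the full operation name, and that the returned operation carries a `done` field once it finishes. Confirm both against Working with long-running operations before relying on them. The helper names are hypothetical.

```python
import json


def parse_operation_name(response_text):
    """Return (full operation name, OPERATION_ID) from a create/import response."""
    name = json.loads(response_text)["name"]
    # The OPERATION_ID is the last path segment of the operation name.
    return name, name.rsplit("/", 1)[-1]


def operation_status_url(location, operation_name):
    """Build the URL to poll for the operation's status (assumed GET endpoint)."""
    return f"https://{location}-aiplatform.googleapis.com/v1/{operation_name}"


def is_done(operation):
    """Treat a missing `done` field as 'still running'."""
    return bool(operation.get("done", False))


if __name__ == "__main__":
    # Sample response shaped like the output shown earlier on this page.
    sample = json.dumps({
        "name": "projects/PROJECT_NUMBER/locations/us-central1/"
                "datasets/DATASET_ID/operations/OPERATION_ID",
    })
    name, operation_id = parse_operation_name(sample)
    print(operation_id)
    print("GET", operation_status_url("us-central1", name))
```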