This page shows you how to use Vertex AI managed datasets to train your custom models. Managed datasets offer the following benefits:
- Manage your datasets in a central location.
- Easily create labels and multiple annotation sets.
- Create tasks for human labeling using integrated data labeling.
- Track lineage to models for governance and iterative development.
- Compare model performance by training AutoML and custom models using the same datasets.
- Generate data statistics and visualizations.
- Automatically split data into training, test, and validation sets.
Before you begin
Before you can use a managed dataset in your training application, you must create your dataset. You must create the dataset and the training pipeline that you use for training in the same region, and that region must be one where Dataset resources are available.
Access a dataset from your training application
When you create a custom training pipeline, you can specify that your training application uses a Vertex AI dataset.
At runtime, Vertex AI passes metadata about your dataset to your training application by setting the following environment variables in your training container.
- AIP_DATA_FORMAT: The format that your dataset is exported in. Possible values include jsonl, csv, or bigquery.
- AIP_TRAINING_DATA_URI: The BigQuery URI of your training data or the Cloud Storage URI of your training data file.
- AIP_VALIDATION_DATA_URI: The BigQuery URI for your validation data or the Cloud Storage URI of your validation data file.
- AIP_TEST_DATA_URI: The BigQuery URI for your test data or the Cloud Storage URI of your test data file.
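For example, a training application written in Python can read these variables at startup. This is a minimal sketch; the variable names are the ones listed above, and the comments reflect the value formats described on this page.

import os

# Read the dataset metadata that Vertex AI injects into the training container.
data_format = os.environ["AIP_DATA_FORMAT"]            # "jsonl", "csv", or "bigquery"
training_uri = os.environ["AIP_TRAINING_DATA_URI"]     # gs://... wildcard or bq://... table
validation_uri = os.environ["AIP_VALIDATION_DATA_URI"]
test_uri = os.environ["AIP_TEST_DATA_URI"]

print(f"Data format: {data_format}")
print(f"Training data: {training_uri}")
print(f"Validation data: {validation_uri}")
print(f"Test data: {test_uri}")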
If the AIP_DATA_FORMAT of your dataset is jsonl or csv, the data URI values refer to Cloud Storage URIs, like gs://bucket_name/path/training-*. To keep the size of each data file relatively small, Vertex AI splits your dataset into multiple files. Because your training, validation, or test data may be split into multiple files, the URIs are provided in wildcard format. Learn more about downloading objects using the Cloud Storage code samples.
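As an illustration, the following minimal sketch reads every shard that matches the wildcard URI. It assumes the google-cloud-storage client library is installed in the training container.

import os
from google.cloud import storage

def read_gcs_shards(wildcard_uri):
    # wildcard_uri has the form gs://bucket_name/path/training-*
    bucket_name, _, blob_pattern = wildcard_uri[len("gs://"):].partition("/")
    prefix = blob_pattern.rstrip("*")  # list every object that matches the prefix
    client = storage.Client()
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        yield blob.download_as_text()

for shard in read_gcs_shards(os.environ["AIP_TRAINING_DATA_URI"]):
    pass  # parse each shard as JSON Lines or CSV, depending on AIP_DATA_FORMAT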
If your AIP_DATA_FORMAT is bigquery, the data URI values refer to BigQuery URIs, like bq://project.dataset.table. Learn more about paging through BigQuery data.
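For example, a training application might page through the table with the google-cloud-bigquery client library. This is a minimal sketch and assumes that library is available in the training container; the column name in the comment is only illustrative.

import os
from google.cloud import bigquery

# Strip the bq:// prefix to get a project.dataset.table ID.
table_id = os.environ["AIP_TRAINING_DATA_URI"][len("bq://"):]

client = bigquery.Client()
table = bigquery.TableReference.from_string(table_id)
for row in client.list_rows(table, page_size=1000):
    pass  # each row is accessible by column name, for example row["feature_1"]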
Dataset format
Use the following sections to learn more about how Vertex AI formats your data when passing a dataset to your training application.
Image datasets
Image datasets are passed to your training application in JSON Lines format. Refer to the section for your dataset's objective to learn more about how Vertex AI formats your dataset.
Single-label classification
Vertex AI uses the following publicly accessible schema when exporting a single-label image classification dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.
Each data item in your exported dataset uses the following format. This example includes line breaks for readability.
{ "imageGcsUri": "gs://bucket/filename.ext", "classificationAnnotation": { "displayName": "LABEL", "annotationResourceLabels": { "aiplatform.googleapis.com/annotation_set_name": "displayName", "env": "prod" } }, "dataItemResourceLabels": { "aiplatform.googleapis.com/ml_use": "training/test/validation" } }
Field notes:
- imageGcsUri: The Cloud Storage URI of this image.
- annotationResourceLabels: Contains any number of key-value string pairs. Vertex AI uses this field to specify the annotation set.
- dataItemResourceLabels: Contains any number of key-value string pairs. Specifies the machine learning use of the data item, such as training, test, or validation.
Example JSON Lines
{"imageGcsUri": "gs://bucket/filename1.jpeg", "classificationAnnotation": {"displayName": "daisy"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}} {"imageGcsUri": "gs://bucket/filename2.gif", "classificationAnnotation": {"displayName": "dandelion"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}} {"imageGcsUri": "gs://bucket/filename3.png", "classificationAnnotation": {"displayName": "roses"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}} {"imageGcsUri": "gs://bucket/filename4.bmp", "classificationAnnotation": {"displayName": "sunflowers"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}} {"imageGcsUri": "gs://bucket/filename5.tiff", "classificationAnnotation": {"displayName": "tulips"}, "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "validation"}} ...
Multi-label classification
Vertex AI uses the following publicly accessible schema when exporting a multi-label image classification dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.
Each data item in your exported dataset uses the following format. This example includes line breaks for readability.
{ "imageGcsUri": "gs://bucket/filename.ext", "classificationAnnotations": [ { "displayName": "LABEL1", "annotationResourceLabels": { "aiplatform.googleapis.com/annotation_set_name":"displayName", "label_type": "flower_type" } }, { "displayName": "LABEL2", "annotationResourceLabels": { "aiplatform.googleapis.com/annotation_set_name":"displayName", "label_type": "image_shot_type" } } ], "dataItemResourceLabels": { "aiplatform.googleapis.com/ml_use": "training/test/validation" } }
Field notes:
- imageGcsUri: The Cloud Storage URI of this image.
- annotationResourceLabels: Contains any number of key-value string pairs. Vertex AI uses this field to specify the annotation set.
- dataItemResourceLabels: Contains any number of key-value string pairs. Specifies the machine learning use of the data item, such as training, test, or validation.
Example JSON Lines
{"imageGcsUri": "gs://bucket/filename1.jpeg", "classificationAnnotations": [{"displayName": "daisy"}, {"displayName": "full_shot"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}} {"imageGcsUri": "gs://bucket/filename2.gif", "classificationAnnotations": [{"displayName": "dandelion"}, {"displayName": "medium_shot"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}} {"imageGcsUri": "gs://bucket/filename3.png", "classificationAnnotations": [{"displayName": "roses"}, {"displayName": "extreme_closeup"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}} {"imageGcsUri": "gs://bucket/filename4.bmp", "classificationAnnotations": [{"displayName": "sunflowers"}, {"displayName": "closeup"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}} {"imageGcsUri": "gs://bucket/filename5.tiff", "classificationAnnotations": [{"displayName": "tulips"}, {"displayName": "extreme_closeup"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "validation"}} ...
Object detection
Vertex AI uses the following publicly accessible schema when exporting an object detection dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.
gs://google-cloud-aiplatform/schema/dataset/ioformat/image_bounding_box_io_format_1.0.0.yaml
Each data item in your exported dataset uses the following format. This example includes line breaks for readability.
{ "imageGcsUri": "gs://bucket/filename.ext", "boundingBoxAnnotations": [ { "displayName": "OBJECT1_LABEL", "xMin": "X_MIN", "yMin": "Y_MIN", "xMax": "X_MAX", "yMax": "Y_MAX", "annotationResourceLabels": { "aiplatform.googleapis.com/annotation_set_name": "displayName", "env": "prod" } }, { "displayName": "OBJECT2_LABEL", "xMin": "X_MIN", "yMin": "Y_MIN", "xMax": "X_MAX", "yMax": "Y_MAX" } ], "dataItemResourceLabels": { "aiplatform.googleapis.com/ml_use": "test/train/validation" } }
Field notes:
- imageGcsUri: The Cloud Storage URI of this image.
- annotationResourceLabels: Contains any number of key-value string pairs. Vertex AI uses this field to specify the annotation set.
- dataItemResourceLabels: Contains any number of key-value string pairs. Specifies the machine learning use of the data item, such as training, test, or validation.
Example JSON Lines
{"imageGcsUri": "gs://bucket/filename1.jpeg", "boundingBoxAnnotations": [{"displayName": "Tomato", "xMin": "0.3", "yMin": "0.3", "xMax": "0.7", "yMax": "0.6"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}} {"imageGcsUri": "gs://bucket/filename2.gif", "boundingBoxAnnotations": [{"displayName": "Tomato", "xMin": "0.8", "yMin": "0.2", "xMax": "1.0", "yMax": "0.4"},{"displayName": "Salad", "xMin": "0.0", "yMin": "0.0", "xMax": "1.0", "yMax": "1.0"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}} {"imageGcsUri": "gs://bucket/filename3.png", "boundingBoxAnnotations": [{"displayName": "Baked goods", "xMin": "0.5", "yMin": "0.7", "xMax": "0.8", "yMax": "0.8"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}} {"imageGcsUri": "gs://bucket/filename4.tiff", "boundingBoxAnnotations": [{"displayName": "Salad", "xMin": "0.1", "yMin": "0.2", "xMax": "0.8", "yMax": "0.9"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "validation"}} ...
Tabular datasets
Vertex AI passes tabular data to your training application in CSV format or as a URI to a BigQuery table or view. For more information about the data source format and requirements, see Preparing your import source. Refer to the dataset in the Google Cloud console for more information about the dataset schema.
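For the CSV case, a minimal sketch that loads all training shards into a single DataFrame. It assumes pandas and gcsfs are installed in the training container.

import os
import gcsfs
import pandas as pd

fs = gcsfs.GCSFileSystem()
# Expand the wildcard URI (gs://bucket/path/training-*) into the individual shards.
shard_paths = fs.glob(os.environ["AIP_TRAINING_DATA_URI"])
train_df = pd.concat((pd.read_csv(f"gs://{path}") for path in shard_paths),
                     ignore_index=True)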
Text datasets
Text datasets are passed to your training application in JSON Lines format. Refer to the section for your dataset's objective to learn more about how Vertex AI formats your dataset.
Single-label classification
Vertex AI uses the following publicly accessible schema when exporting a single-label text classification dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.
Each data item in your exported dataset uses the following format. This example includes line breaks for readability.
{ "classificationAnnotation": { "displayName": "label" }, "textContent": "inline_text", "dataItemResourceLabels": { "aiplatform.googleapis.com/ml_use": "training|test|validation" } } { "classificationAnnotation": { "displayName": "label2" }, "textGcsUri": "gcs_uri_to_file", "dataItemResourceLabels": { "aiplatform.googleapis.com/ml_use": "training|test|validation" } }
Multi-label classification
Vertex AI uses the following publicly accessible schema when exporting a multi-label text classification dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.
Each data item in your exported dataset uses the following format. This example includes line breaks for readability.
{ "classificationAnnotations": [{ "displayName": "label1" },{ "displayName": "label2" }], "textGcsUri": "gcs_uri_to_file", "dataItemResourceLabels": { "aiplatform.googleapis.com/ml_use": "training|test|validation" } } { "classificationAnnotations": [{ "displayName": "label2" },{ "displayName": "label3" }], "textContent": "inline_text", "dataItemResourceLabels": { "aiplatform.googleapis.com/ml_use": "training|test|validation" } }
Entity extraction
Vertex AI uses the following publicly accessible schema when exporting an entity extraction dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.
gs://google-cloud-aiplatform/schema/dataset/ioformat/text_extraction_io_format_1.0.0.yaml
Each data item in your exported dataset uses the following format. This example includes line breaks for readability.
{ "textSegmentAnnotations": [ { "startOffset":number, "endOffset":number, "displayName": "label" }, ... ], "textContent": "inline_text", "dataItemResourceLabels": { "aiplatform.googleapis.com/ml_use": "training|test|validation" } } { "textSegmentAnnotations": [ { "startOffset":number, "endOffset":number, "displayName": "label" }, ... ], "textGcsUri": "gcs_uri_to_file", "dataItemResourceLabels": { "aiplatform.googleapis.com/ml_use": "training|test|validation" } }
Sentiment analysis
Vertex AI uses the following publicly accessible schema when exporting a sentiment analysis dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.
gs://google-cloud-aiplatform/schema/trainingjob/definition/automl_text_sentiment_1.0.0.yaml
Each data item in your exported dataset uses the following format. This example includes line breaks for readability.
{ "sentimentAnnotation": { "sentiment": number, "sentimentMax": number }, "textContent": "inline_text", "dataItemResourceLabels": { "aiplatform.googleapis.com/ml_use": "training|test|validation" } } { "sentimentAnnotation": { "sentiment": number, "sentimentMax": number }, "textGcsUri": "gcs_uri_to_file", "dataItemResourceLabels": { "aiplatform.googleapis.com/ml_use": "training|test|validation" } }
Video datasets
Video datasets are passed to your training application in JSON Lines format. Refer to the section for your dataset's objective to learn more about how Vertex AI formats your dataset.
Action recognition
Vertex AI uses the following publicly accessible schema when exporting an action recognition dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.
gs://google-cloud-aiplatform/schema/dataset/ioformat/video_action_recognition_io_format_1.0.0.yaml
Each data item in your exported dataset uses the following format. This example includes line breaks for readability.
{ "videoGcsUri': "gs://bucket/filename.ext", "timeSegments": [{ "startTime": "start_time_of_fully_annotated_segment", "endTime": "end_time_of_segment"}], "timeSegmentAnnotations": [{ "displayName": "LABEL", "startTime": "start_time_of_segment", "endTime": "end_time_of_segment" }], "dataItemResourceLabels": { "ml_use": "train|test" } }
Note: The time segments here are used to calculate the timestamps of the actions. The startTime and endTime of timeSegmentAnnotations can be equal, in which case they correspond to the key frame of the action.
Example JSON Lines
{"videoGcsUri": "gs://demo/video1.mp4", "timeSegmentAnnotations": [{"displayName": "cartwheel", "startTime": "1.0s", "endTime": "12.0s"}], "dataItemResourceLabels": {"ml_use": "training"}} {"videoGcsUri": "gs://demo/video2.mp4", "timeSegmentAnnotations": [{"displayName": "swing", "startTime": "4.0s", "endTime": "9.0s"}], "dataItemResourceLabels": {"ml_use": "test"}} ...
Classification
Vertex AI uses the following publicly accessible schema when exporting a classification dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.
gs://google-cloud-aiplatform/schema/dataset/ioformat/video_classification_io_format_1.0.0.yaml
Each data item in your exported dataset uses the following format. This example includes line breaks for readability.
{ "videoGcsUri": "gs://bucket/filename.ext", "timeSegmentAnnotations": [{ "displayName": "LABEL", "startTime": "start_time_of_segment", "endTime": "end_time_of_segment" }], "dataItemResourceLabels": { "aiplatform.googleapis.com/ml_use": "train|test" } }
Example JSON Lines - Video classification:
{"videoGcsUri": "gs://demo/video1.mp4", "timeSegmentAnnotations": [{"displayName": "cartwheel", "startTime": "1.0s", "endTime": "12.0s"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}} {"videoGcsUri": "gs://demo/video2.mp4", "timeSegmentAnnotations": [{"displayName": "swing", "startTime": "4.0s", "endTime": "9.0s"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}} ...
Object tracking
Vertex AI uses the following publicly accessible schema when exporting an object tracking dataset. This schema dictates the format of the data export files. The schema's structure follows the OpenAPI schema.
gs://google-cloud-aiplatform/schema/dataset/ioformat/object_tracking_io_format_1.0.0.yaml
Each data item in your exported dataset uses the following format. This example includes line breaks for readability.
{ "videoGcsUri": "gs://bucket/filename.ext", "TemporalBoundingBoxAnnotations": [{ "displayName": "LABEL", "xMin": "leftmost_coordinate_of_the_bounding box", "xMax": "rightmost_coordinate_of_the_bounding box", "yMin": "topmost_coordinate_of_the_bounding box", "yMax": "bottommost_coordinate_of_the_bounding box", "timeOffset": "timeframe_object-detected" "instanceId": "instance_of_object "annotationResourceLabels": "resource_labels" }], "dataItemResourceLabels": { "aiplatform.googleapis.com/ml_use": "train|test" } }
Example JSON Lines
{"videoGcsUri": "gs://demo-data/video1.mp4", "temporal_bounding_box_annotations": [{"displayName": "horse", "instance_id": "-1", "time_offset": "4.000000s", "xMin": "0.668912", "yMin": "0.560642", "xMax": "1.000000", "yMax": "1.000000"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"videoGcsUri": "gs://demo-data/video2.mp4", "temporal_bounding_box_annotations": [{"displayName": "horse", "instance_id": "-1", "time_offset": "71.000000s", "xMin": "0.679056", "yMin": "0.070957", "xMax": "0.801716", "yMax": "0.290358"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}
...