This page shows you how to start and manage a Dataproc Metastore managed migration.
You can configure a migration using the Dataproc Metastore APIs.
Before you begin
- Understand how a managed migration works.
- Set up the managed migration prerequisites.
Start migration
When you start a migration, Dataproc Metastore connects to Cloud SQL and uses Cloud SQL as its backend database. During this process, Dataproc Metastore runs a pipeline that copies data from Cloud SQL to its own database (Spanner).
Dataproc Metastore continues to use Cloud SQL as its backend and replicates data until you complete the migration.
Before you start a migration, make sure you have set up the managed migration prerequisites.
Start migration considerations
A Dataproc Metastore service can only run a single migration at a time.
A migration remains active until you complete the migration process. There is no deadline for completing a migration. For example, the migration can take 1 day, 30 days, or a year.
Scheduled backups are not blocked during a migration. However, a backup taken while the migration is running might be incomplete. To avoid any issues, disable scheduled backups while the migration is in progress.
Starting a migration triggers the following state changes:
- Dataproc Metastore moves to the MIGRATING state.
- The migration execution state moves to RUNNING.
- The migration execution phase moves to REPLICATION.
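The three state changes listed above can be checked together in a script. The following is a minimal sketch, not part of the Dataproc Metastore tooling: the function name `is_replicating` is hypothetical, and it assumes you have already read the service state, the migration execution state, and the migration execution phase from the API or the console.

```shell
# Hypothetical helper: succeeds only when all three values match what this
# page describes right after a migration starts.
#   $1: Dataproc Metastore service state
#   $2: migration execution state
#   $3: migration execution phase
is_replicating() {
  [ "$1" = "MIGRATING" ] &&
    [ "$2" = "RUNNING" ] &&
    [ "$3" = "REPLICATION" ]
}

# Example: report whether replication is in progress.
if is_replicating "MIGRATING" "RUNNING" "REPLICATION"; then
  echo "replication in progress"
fi
```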
Console
Get started
In the Google Cloud console, open the Dataproc Metastore page:
On the Dataproc Metastore page, click the name of the service you want to migrate to.
The Service detail page opens.
At the top of the page, click Migrate Data.
The Create migration page opens to the Connectivity tab and displays the Cloud SQL database configuration settings for Dataproc Metastore.
Cloud SQL database configuration for DPMS
In the Instance connection name field, enter the instance connection name of the Cloud SQL database, in the following format: project_id:region:instance_name.
In the IP address field, enter the IP address required to connect to the Cloud SQL instance.
In the Port field, enter 3306.
In the Hive database name field, enter the name of the database being used as the backend of the self-managed Hive Metastore.
In the Username field, enter the username that you use to connect Cloud SQL to the Hive Metastore.
In the Password field, enter the password that you use to connect Cloud SQL to the Hive Metastore.
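Before you enter the instance connection name, you can check that it has the project_id:region:instance_name shape described above. This is a minimal sketch only: the function name `is_instance_connection_name` is hypothetical, and it assumes that none of the three parts contains a colon or a space (so it rejects domain-scoped project IDs).

```shell
# Hypothetical helper: succeeds when $1 has exactly three non-empty,
# colon-separated parts, for example my-project:us-central1:my-instance.
is_instance_connection_name() {
  case "$1" in
    *" "*) return 1 ;;          # spaces are not expected in any part
    *:*:*:*) return 1 ;;        # more than two colons
    :* | *: | *::*) return 1 ;; # an empty part
    *:*:*) return 0 ;;          # exactly three non-empty parts
    *) return 1 ;;              # fewer than two colons
  esac
}

# Example with placeholder values.
if is_instance_connection_name "my-project:us-central1:my-instance"; then
  echo "format looks valid"
fi
```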
SOCKS5 Proxy service
In the Proxy Subnet field, enter a subnet of Regular type. The subnetwork should be present in the Cloud SQL VPC network. This subnet is used to deploy the intermediate SOCKS5 proxy service.
In the Nat Subnet field, enter a subnet of Private Service Connect type. This subnetwork should be present in the Cloud SQL VPC network and is used to publish the SOCKS5 proxy service using Private Service Connect.
Click Continue.
The Change Data Capture (CDC) tab opens and displays the Cloud SQL database configuration settings for Datastream.
Cloud SQL database configuration for data stream
In the Username field, enter the username that you use to log in to the Cloud SQL CDC used by Datastream.
In the Password field, enter the password that you use to log in to the Cloud SQL CDC used by Datastream.
In the VPC network field, enter the network in the same VPC network as the Cloud SQL instance used by Datastream to establish a private connection to the CDC.
In the Subnet IP range field, enter a subnet IP range of at least /29. Datastream uses this IP range to establish peering to the VPC network.
In the Reverse proxy subnet field, enter the subnetwork you created in the same VPC network as the Cloud SQL instance. Datastream uses this subnetwork to host a reverse proxy connection for the Datastream CDC. The subnet must be configured in the same region as the Dataproc Metastore service.
GCS configuration
For the Bucket ID, select the Cloud Storage path to store CDC data during the migration.
In the Root path field, enter the root path inside the Cloud Storage bucket. The stream event data is written to this path.
Click Create.
REST
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type:application/json" \
  -X POST -d \
  '{
    "migration_execution": {
      "cloud_sql_migration_config": {
        "cloud_sql_connection_config": {
          "instance_connection_name": "INSTANCE_CONNECTION_NAME",
          "hive_database_name": "HIVE_DATABASE_NAME",
          "ip_address": "IP_ADDRESS",
          "port": 3306,
          "username": "CONNECTION_USERNAME",
          "password": "CONNECTION_PASSWORD",
          "proxy_subnet": "PROXY_SUBNET",
          "nat_subnet": "NAT_SUBNET"
        },
        "cdc_config": {
          "username": "CDC_USERNAME",
          "password": "CDC_PASSWORD",
          "vpc_network": "VPC_NETWORK",
          "subnet_ip_range": "SUBNET_IP_RANGE",
          "reverse_proxy_subnet": "REVERSE_PROXY_SUBNET_ID",
          "bucket": "BUCKET_NAME",
          "root_path": "ROOT_PATH"
        }
      }
    }
  }' \
  https://metastore.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/services/SERVICE:startMigration
Replace the following:
- SERVICE: the name or ID of your Dataproc Metastore service.
- PROJECT_ID: the project ID of the Google Cloud project your Dataproc Metastore service resides in.
- LOCATION: the Google Cloud region in which your Dataproc Metastore service resides.
Cloud SQL Migration configuration
- INSTANCE_CONNECTION_NAME: the instance connection name for the Cloud SQL database, in the following format: PROJECT_ID:LOCATION:CLOUDSQL_INSTANCE_ID.
- HIVE_DATABASE_NAME: the name of the self-managed Hive database connected to Cloud SQL.
- IP_ADDRESS: the IP address required to connect to the Cloud SQL instance.
- CONNECTION_USERNAME: the username that you use to connect Cloud SQL to the Hive Metastore.
- CONNECTION_PASSWORD: the password that you use to connect Cloud SQL to the Hive Metastore.
- PROXY_SUBNET: the subnetwork used in the Cloud SQL VPC network. This subnetwork hosts an intermediate proxy to provide connectivity across transitive networks.
- NAT_SUBNET: a Private Service Connect subnet that lets the Dataproc Metastore service access the intermediate proxy. The subnet should have a prefix length of at least /29 and be in the IPv4 range.
CDC configuration
- CDC_USERNAME: the username that the Datastream service uses to log in to Cloud SQL.
- CDC_PASSWORD: the password that the Datastream service uses to log in to Cloud SQL.
- VPC_NETWORK: a network in the same VPC network as the Cloud SQL instance used by Datastream to establish a private connection to the CDC.
- SUBNET_IP_RANGE: a subnet IP range of at least /29 used by Datastream to establish peering to the VPC network.
- REVERSE_PROXY_SUBNET_ID: a subnetwork in the same VPC network as the Cloud SQL instance used by Datastream. The subnetwork is used to host a reverse proxy connection for the Datastream CDC. The subnet must be configured in the same region as the Dataproc Metastore service.
- BUCKET_NAME: the Cloud Storage path to store CDC data during the migration.
- ROOT_PATH: the root path inside the Cloud Storage bucket. The stream event data is written to this path.
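If you prefer not to paste passwords into the command line, one option is to generate the start migration request body from environment variables and pass it to curl as a file. The following is a sketch under stated assumptions: the function name `start_migration_body` and the file name `body.json` are hypothetical, the environment variable names mirror the placeholders above, and the values are assumed to contain no double quotes or backslashes (the function does not escape them).

```shell
# Hypothetical helper: prints the start migration request body to stdout,
# filling each field from an environment variable of the same name as the
# placeholder. PORT defaults to 3306, the value this page uses.
start_migration_body() {
  cat <<EOF
{
  "migration_execution": {
    "cloud_sql_migration_config": {
      "cloud_sql_connection_config": {
        "instance_connection_name": "${INSTANCE_CONNECTION_NAME}",
        "hive_database_name": "${HIVE_DATABASE_NAME}",
        "ip_address": "${IP_ADDRESS}",
        "port": ${PORT:-3306},
        "username": "${CONNECTION_USERNAME}",
        "password": "${CONNECTION_PASSWORD}",
        "proxy_subnet": "${PROXY_SUBNET}",
        "nat_subnet": "${NAT_SUBNET}"
      },
      "cdc_config": {
        "username": "${CDC_USERNAME}",
        "password": "${CDC_PASSWORD}",
        "vpc_network": "${VPC_NETWORK}",
        "subnet_ip_range": "${SUBNET_IP_RANGE}",
        "reverse_proxy_subnet": "${REVERSE_PROXY_SUBNET_ID}",
        "bucket": "${BUCKET_NAME}",
        "root_path": "${ROOT_PATH}"
      }
    }
  }
}
EOF
}

# Usage (not run here): write the body to a file, then send it.
#   start_migration_body > body.json
#   curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
#     -H "Content-Type:application/json" \
#     -X POST -d @body.json \
#     https://metastore.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/services/SERVICE:startMigration
```

Delete the generated file after the request is sent, because it contains both passwords in plain text.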
Complete migration
When you complete a migration, Dataproc Metastore connects to Spanner and starts to use Spanner as its backend database.
Completing a migration triggers the following state changes:
- Dataproc Metastore moves back to the ACTIVE state.
- The migration execution state moves to SUCCEEDED.
Console
In the Google Cloud console, open the Dataproc Metastore page.
At the top of the page, click Migrate Data.
The Migrate Data page opens and displays your completed managed migrations.
REST
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type:application/json" \
-X POST -d '' \
https://metastore.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/services/SERVICE:completeMigration
Replace the following:
- SERVICE: the name or ID of your Dataproc Metastore service.
- PROJECT_ID: the project ID of the Google Cloud project your Dataproc Metastore service resides in.
- LOCATION: the Google Cloud region in which your Dataproc Metastore service resides.
Cancel migration
When you cancel a migration, Dataproc Metastore reverts any changes and starts using the Spanner database type as its backend database. Any data that was transferred during the migration is deleted.
Canceling a migration triggers the following state changes:
- Dataproc Metastore moves back to the ACTIVE state.
- The migration execution state moves to CANCELLED.
Console
In the Google Cloud console, open the Dataproc Metastore page.
At the top of the page, click Migrate Data.
The Migrate Data page opens and displays your canceled managed migrations.
REST
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type:application/json" \
-X POST -d '' \
https://metastore.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/services/SERVICE:cancelMigration
Replace the following:
- SERVICE: the name or ID of your Dataproc Metastore service.
- PROJECT_ID: the project ID of the Google Cloud project your Dataproc Metastore service resides in.
- LOCATION: the Google Cloud region in which your Dataproc Metastore service resides.
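The start, complete, and cancel requests on this page target the same service URL and differ only in the method suffix after the colon. The following sketch builds that URL in one place so a script cannot mistype the suffix. The function name `migration_method_url` is hypothetical, and the example values are placeholders.

```shell
# Hypothetical helper: prints the request URL for one of the three
# migration methods shown on this page.
#   $1: project ID   $2: location   $3: service   $4: method name
migration_method_url() {
  case "$4" in
    startMigration | completeMigration | cancelMigration) ;;
    *)
      echo "unknown migration method: $4" >&2
      return 1
      ;;
  esac
  printf 'https://metastore.googleapis.com/v1/projects/%s/locations/%s/services/%s:%s\n' \
    "$1" "$2" "$3" "$4"
}

# Example with placeholder values.
migration_method_url "my-project" "us-central1" "my-service" "cancelMigration"
```

You can pass the printed URL as the last argument of the curl commands shown in the REST sections.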
Get migration details
Get details about a single managed migration.
Console
In the Google Cloud console, open the Dataproc Metastore page.
At the top of the page, click Migrate Data.
The Migrate Data page opens and displays your managed migrations.
To get more migration details, click the name of a managed migration.
REST
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
-X GET \
https://metastore.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/services/SERVICE/migrationExecutions/MIGRATION_ID
Replace the following:
- SERVICE: the name or ID of your Dataproc Metastore service.
- PROJECT_ID: the project ID of the Google Cloud project your Dataproc Metastore service resides in.
- LOCATION: the Google Cloud region in which your Dataproc Metastore service resides.
- MIGRATION_ID: the name or ID of your Dataproc Metastore migration.
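To use the migration details in a script, you can pull the state and phase out of the JSON response. The following is a rough sketch that uses only sed. The function name `json_string_field` is hypothetical, and the sample response is a trimmed, made-up illustration that assumes the response exposes `state` and `phase` as string fields, one field per line. For real responses, a JSON parser such as jq is more robust.

```shell
# Hypothetical helper: reads JSON on stdin and prints the first string value
# found for the field named in $1. Assumes one field per line.
json_string_field() {
  sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Made-up, trimmed response used only to exercise the helper.
sample_response='{
  "name": "projects/PROJECT_ID/locations/LOCATION/services/SERVICE/migrationExecutions/MIGRATION_ID",
  "state": "RUNNING",
  "phase": "REPLICATION"
}'

state="$(printf '%s\n' "$sample_response" | json_string_field state)"
phase="$(printf '%s\n' "$sample_response" | json_string_field phase)"
echo "state=$state phase=$phase"
```

In practice, you would pipe the output of the GET request shown above into `json_string_field` instead of using the sample text.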
List migrations
List managed migrations.
Console
In the Google Cloud console, open the Dataproc Metastore page.
At the top of the page, click Migrate Data.
The Migrate Data page opens and displays your managed migrations.
Verify that the page lists your migrations.
REST
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
-X GET \
https://metastore.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/services/SERVICE/migrationExecutions
Replace the following:
- SERVICE: the name or ID of your Dataproc Metastore service.
- PROJECT_ID: the project ID of the Google Cloud project your Dataproc Metastore service resides in.
- LOCATION: the Google Cloud region in which your Dataproc Metastore service resides.
Delete migrations
Delete managed migrations.
Console
In the Google Cloud console, open the Dataproc Metastore page.
At the top of the page, click Migrate Data.
The Migrate Data page opens and displays your managed migrations.
Select the migration and click Delete.
REST
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
-X DELETE \
https://metastore.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/services/SERVICE/migrationExecutions/MIGRATION_ID
Replace the following:
- SERVICE: the name or ID of your Dataproc Metastore service.
- PROJECT_ID: the project ID of the Google Cloud project your Dataproc Metastore service resides in.
- LOCATION: the Google Cloud region in which your Dataproc Metastore service resides.
- MIGRATION_ID: the name or ID of the Dataproc Metastore migration.