Create a Hadoop cluster
You can use Dataproc to create one or more Compute Engine instances that can connect to a Bigtable instance and run Hadoop jobs. This page explains how to use Dataproc to automate the following tasks:
- Installing Hadoop and the HBase client for Java
- Configuring Hadoop and Bigtable
- Setting the correct authorization scopes for Bigtable
After you create your Dataproc cluster, you can use the cluster to run Hadoop jobs that read and write data to and from Bigtable.
This page assumes that you are already familiar with Hadoop. For additional information about Dataproc, see the Dataproc documentation.
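Hadoop jobs on the cluster talk to Bigtable through the standard HBase API. For orientation, the following is a minimal sketch (not taken from the sample repository) of how a client might connect with the HBase client for Java and write and read a single cell; the table name test-table, column family cf, and qualifier greeting are assumptions for illustration, and the sketch assumes the bigtable-hbase client dependency is on the classpath.

```java
import com.google.cloud.bigtable.hbase.BigtableConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BigtableHelloWorld {
  public static void main(String[] args) throws Exception {
    // Replace with your project ID and Bigtable instance ID.
    String projectId = "my-project";
    String instanceId = "my-bigtable-instance";

    // BigtableConfiguration.connect returns a standard HBase Connection
    // that is backed by Bigtable instead of an HBase cluster.
    try (Connection connection = BigtableConfiguration.connect(projectId, instanceId)) {
      // Assumes a table named "test-table" with a column family "cf" already exists.
      Table table = connection.getTable(TableName.valueOf("test-table"));

      // Write a single cell.
      Put put = new Put(Bytes.toBytes("row-1"));
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("greeting"), Bytes.toBytes("hello"));
      table.put(put);

      // Read the cell back and print its value.
      Result result = table.get(new Get(Bytes.toBytes("row-1")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("greeting"))));
    }
  }
}
```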
Before you begin
Before you begin, you'll need to complete the following tasks:
- Create a Bigtable instance. Be sure to note the project ID and Bigtable instance ID.
- Enable the Cloud Bigtable, Cloud Bigtable Admin, Dataproc, and Cloud Storage JSON APIs.
- Verify that your user account is in a role that includes the storage.objects.get permission. To check, open the IAM page in the Google Cloud console.
- Install the Google Cloud CLI. See the gcloud CLI setup instructions for details.
- Install Apache Maven, which is used to run a sample Hadoop job.
On Debian GNU/Linux or Ubuntu, run the following command:
sudo apt-get install maven
On RedHat Enterprise Linux or CentOS, run the following command:
sudo yum install maven
On macOS, install Homebrew, then run the following command:
brew install maven
- Clone the GitHub repository GoogleCloudPlatform/cloud-bigtable-examples, which contains an example of a Hadoop job that uses Bigtable:
git clone https://github.com/GoogleCloudPlatform/cloud-bigtable-examples.git
Create a Cloud Storage bucket
Dataproc uses a Cloud Storage bucket to store temporary files. To prevent file-naming conflicts, create a new bucket for Dataproc.
Cloud Storage bucket names must be globally unique across all buckets. Choose a bucket name that is likely to be available, such as a name that incorporates your Google Cloud project's name.
After you choose a name, use the following command to create a new bucket, replacing values in brackets with the appropriate values:
gcloud storage buckets create gs://[BUCKET_NAME] --project=[PROJECT_ID]
Create the Dataproc cluster
Run the following command to create a Dataproc cluster with four worker nodes, replacing values in brackets with the appropriate values:
gcloud dataproc clusters create [DATAPROC_CLUSTER_NAME] --bucket [BUCKET_NAME] \
--region [REGION] --num-workers 4 --master-machine-type n1-standard-4 \
--worker-machine-type n1-standard-4
See the gcloud dataproc clusters create documentation for additional settings that you can configure. If you get an error message that includes the text Insufficient 'CPUS' quota, try setting the --num-workers flag to a lower value.
Test the Dataproc cluster
After you set up your Dataproc cluster, you can test the cluster by running a sample Hadoop job that counts the number of times a word appears in a text file. The sample job uses Bigtable to store the results of the operation. You can use this sample job as a reference when you set up your own Hadoop jobs.
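For orientation only, the following is a rough sketch of the kind of reducer such a word-count job might use to write its results to Bigtable through the HBase MapReduce API. It is not the sample's actual code; the column family cf and qualifier count are illustrative assumptions.

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// Sums the counts for each word and writes one row per word to the output
// table. Column family "cf" and qualifier "count" are assumptions for
// illustration; the sample job may use different names.
public class WordCountTableReducer
    extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {

  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : counts) {
      sum += count.get();
    }

    // The row key is the word; the total count is stored as a single cell.
    byte[] rowKey = Bytes.toBytes(word.toString());
    Put put = new Put(rowKey);
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"),
        Bytes.toBytes(Integer.toString(sum)));
    context.write(new ImmutableBytesWritable(rowKey), put);
  }
}
```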
Run the sample Hadoop job
- In the directory where you cloned the GitHub repository, change to the directory java/dataproc-wordcount. Run the following command to build the project, replacing values in brackets with the appropriate values:
mvn clean package -Dbigtable.projectID=[PROJECT_ID] \
    -Dbigtable.instanceID=[BIGTABLE_INSTANCE_ID]
- Run the following command to start the Hadoop job, replacing values in brackets with the appropriate values:
./cluster.sh start [DATAPROC_CLUSTER_NAME]
When the job is complete, it displays the name of the output table, which is the word WordCount followed by a hyphen and a unique number:
Output table is: WordCount-1234567890
Verify the results of the Hadoop job
Optionally, after you run the Hadoop job, you can use the cbt CLI to verify that the job ran successfully:
- Open a terminal window in Cloud Shell.
- Install the cbt CLI:
gcloud components update
gcloud components install cbt
- Scan the output table to view the results of the Hadoop job, replacing [TABLE_NAME] with the name of your output table:
cbt -instance [BIGTABLE_INSTANCE_ID] read [TABLE_NAME]
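If you prefer to check the results programmatically instead of with the cbt CLI, a minimal sketch like the following reads every row of the output table with the HBase client for Java. The project ID, instance ID, and table name shown are placeholders to replace with your own values.

```java
import com.google.cloud.bigtable.hbase.BigtableConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanWordCountOutput {
  public static void main(String[] args) throws Exception {
    String projectId = "my-project";            // your project ID
    String instanceId = "my-bigtable-instance"; // your Bigtable instance ID
    String tableName = "WordCount-1234567890";  // the output table printed by the job

    try (Connection connection = BigtableConfiguration.connect(projectId, instanceId);
         Table table = connection.getTable(TableName.valueOf(tableName));
         ResultScanner scanner = table.getScanner(new Scan())) {
      // Print each row key; the sample stores one word per row.
      for (Result row : scanner) {
        System.out.println(Bytes.toString(row.getRow()));
      }
    }
  }
}
```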
Now that you've verified that the cluster is set up correctly, you can use it to run your own Hadoop jobs.
Delete the Dataproc cluster
When you are done using the Dataproc cluster, run the following command to shut down and delete the cluster, replacing [DATAPROC_CLUSTER_NAME] with the name of your Dataproc cluster and [REGION] with the cluster's region:
gcloud dataproc clusters delete [DATAPROC_CLUSTER_NAME] --region [REGION]
What's next
- Learn more about Dataproc.
- Get started with the HBase client for Java.