Limit forecasted values for a time series model

This tutorial teaches you how to use limits to narrow the forecasted results returned by a time series model. In this tutorial, you create two time series models over the same data, one model which uses limits and one model that doesn't use limits. This lets you compare the results returned by the models and understand the difference that specifying limits makes.

You use the new_york.citibike_trips data to train the models in this tutorial. This dataset contains information about Citi Bike trips in New York City.

Before following this tutorial, you should be familiar with single time series forecasting. Complete the Single time series forecasting from Google Analytics data tutorial for an introduction to this topic.

Required Permissions

  • To create the dataset, you need the bigquery.datasets.create IAM permission.
  • To create the connection resource, you need the following permissions:

    • bigquery.connections.create
    • bigquery.connections.get
  • To create the model, you need the following permissions:

    • bigquery.jobs.create
    • bigquery.models.create
    • bigquery.models.getData
    • bigquery.models.updateData
    • bigquery.connections.delegate
  • To run inference, you need the following permissions:

    • bigquery.models.getData
    • bigquery.jobs.create

For more information about IAM roles and permissions in BigQuery, see Introduction to IAM.

Objectives

In this tutorial, you use the following:

  • The CREATE MODEL statement: to create a time series model.
  • The ML.FORECAST function: to forecast daily total visits.

Costs

This tutorial uses billable components of Google Cloud, including the following:

  • BigQuery
  • BigQuery ML

For more information about BigQuery costs, see the BigQuery pricing page.

For more information about BigQuery ML costs, see BigQuery ML pricing.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the BigQuery API.

    Enable the API

  5. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  6. Make sure that billing is enabled for your Google Cloud project.

  7. Enable the BigQuery API.

    Enable the API

Create a dataset

Create a BigQuery dataset to store your ML model:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to the BigQuery page

  2. In the Explorer pane, click your project name.

  3. Click View actions > Create dataset.

    Create dataset.

  4. On the Create dataset page, do the following:

    • For Dataset ID, enter bqml_tutorial.

    • For Location type, select Multi-region, and then select US (multiple regions in United States).

      The public datasets are stored in the US multi-region. For simplicity, store your dataset in the same location.

    • Leave the remaining default settings as they are, and click Create dataset.

      Create dataset page.

Visualize the time series you want to forecast

Before creating the model, it is useful to see what your input time series looks like.

In the following query, the FROM bigquery-public-data.new_york.citibike_trips clause indicates that you are querying the citibike_trips table in the new_yorkdataset.

In the SELECT statement, the query uses the EXTRACT function to extract the date information from the starttime column. The query uses the COUNT(*) clause to get the daily total number of Citi Bike trips.

#standardSQL
SELECT
   EXTRACT(DATE from starttime) AS date,
   COUNT(*) AS num_trips
FROM
  `bigquery-public-data`.new_york.citibike_trips
GROUP BY date

To run the query, use the following steps:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following GoogleSQL query in the query editor.

    #standardSQL
    SELECT
       EXTRACT(DATE from starttime) AS date,
       COUNT(*) AS num_trips
    FROM
      `bigquery-public-data`.new_york.citibike_trips
    GROUP BY date
  3. Click Run. The query results similar to the following.

    Query output.

  4. Use the Google Cloud console to chart the time series data. In the Query results pane, click the Chart tab. In the Chart configuration pane, choose Bar for the Chart type:

    Result_visualization.

Create a time series model

Create a time series model, using the NYC Citi Bike trips data.

The following GoogleSQL query creates a model that forecasts daily total bike trips. The CREATE MODEL statement creates and trains a model named bqml_tutorial.nyc_citibike_arima_model.

#standardSQL
CREATE OR REPLACE MODEL bqml_tutorial.nyc_citibike_arima_model
  OPTIONS (
    model_type = 'ARIMA_PLUS',
    time_series_timestamp_col = 'date',
    time_series_data_col = 'num_trips',
    time_series_id_col = 'start_station_id')
AS
SELECT
  EXTRACT(DATE FROM starttime) AS date,
  COUNT(*) AS num_trips,
  start_station_id
FROM
  `bigquery-public-data`.new_york.citibike_trips
WHERE starttime > '2014-07-11' AND starttime < '2015-02-11'
GROUP BY date, start_station_id;

The OPTIONS(model_type='ARIMA_PLUS', time_series_timestamp_col='date', ...) clause indicates that you are creating an ARIMA-based time series model. By default, auto_arima=TRUE, so the auto.ARIMA algorithm automatically tunes the hyperparameters in ARIMA_PLUS models. The algorithm fits dozens of candidate models and chooses the best one with the lowest Akaike information criterion (AIC). Additionally, because the default is data_frequency='AUTO_FREQUENCY', the training process automatically infers the data frequency of the input time series. The CREATE MODEL statement uses decompose_time_series=TRUE by default, so both the history and forecast parts of the time series are saved in the model. Setting the parameter time_series_id_col = 'start_station_id' causes the model to fit and forecast multiple time series using a single query based on the start_station_id. You can use this information to further understand how the time series is forecasted by fetching the separate time series components such as seasonal periods.

Run the CREATE MODEL query to create and train your model:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following GoogleSQL query in the query editor.

    #standardSQL
    CREATE OR REPLACE MODEL bqml_tutorial.nyc_citibike_arima_model
    OPTIONS (
      model_type = 'ARIMA_PLUS',
      time_series_timestamp_col = 'date',
      time_series_data_col = 'num_trips',
      time_series_id_col = 'start_station_id')
    AS
    SELECT
    EXTRACT(DATE FROM starttime) AS date,
    COUNT(*) AS num_trips,
    start_station_id
    FROM
    `bigquery-public-data`.new_york.citibike_trips
    WHERE starttime > '2014-07-11' AND starttime < '2015-02-11'
    GROUP BY date, start_station_id;
  3. Click Run.

    The query takes approximately 80 seconds to complete, after which your model (nyc_citibike_arima_model) appears in the Explorer pane. Because the query uses a CREATE MODEL statement to create a model, there are no query results.

Forecast the time series and visualize the results

To explain how the time series is forecasted, visualize all the sub-time series components, such as seasonality and trend, using the ML.FORECAST function.

To do this, follow these steps:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following GoogleSQL query in the query editor.

    #standardSQL
    SELECT
    forecast_timestamp AS forecast_timestamp,
    start_station_id AS start_station_id,
    history_value AS history_value,
    forecast_value AS forecast_value
    FROM
    (
      (
         SELECT
         DATE(forecast_timestamp) AS forecast_timestamp,
         NULL AS history_value,
         forecast_value AS forecast_value,
         start_station_id AS start_station_id,
         FROM
         ML.FORECAST(
            MODEL bqml_tutorial.`nyc_citibike_arima_model`,
            STRUCT(
               365 AS horizon,
               0.9 AS confidence_level))
      )
      UNION ALL
      (
         SELECT
         DATE(date_name) AS forecast_timestamp,
         num_trips AS history_value,
         NULL AS forecast_value,
         start_station_id AS start_station_id,
         FROM
         (
            SELECT
               EXTRACT(DATE FROM starttime) AS date_name,
               COUNT(*) AS num_trips,
               start_station_id AS start_station_id
            FROM
               `bigquery-public-data`.new_york.citibike_trips
            WHERE
               starttime > '2014-07-11'
               AND starttime < '2015-02-11'
            GROUP BY
               date_name, start_station_id
         )
      )
    )
    WHERE start_station_id = 79
    ORDER BY
    forecast_timestamp, start_station_id
  3. Click Run. The query results similar to the following:

    BQUI_chart.

  4. Use the Google Cloud console to chart the time series data. In the Query results pane, click the Chart tab:

    Result_visualization.

The chart shows that the forecasted values for the daily total number of Citi Bike trips where start_station_id=79 are negative numbers, which isn't useful. Using a model with limits instead improves the forecasted data.

Create a time series model with limits

Create a time series model with limits, using the NYC Citi Bike trips data.

The following GoogleSQL query creates a model that forecasts daily total bike trips. The CREATE MODEL statement creates and trains a model named bqml_tutorial.nyc_citibike_arima_model_with_limits. The key difference between this model and the model you created previously is the addition of the forecast_limit_lower_bound=0 option. This option causes the model to only forecast values that are greater than 0, based on the values in the column specified by the time_series_data_col argument, in this case num_trips.

#standardSQL
CREATE OR REPLACE MODEL bqml_tutorial.nyc_citibike_arima_model
   OPTIONS (
      model_type = 'ARIMA_PLUS',
      time_series_timestamp_col = 'date',
      time_series_data_col = 'num_trips',
      time_series_id_col = 'start_station_id',
      forecast_limit_lower_bound = 0)
   AS
   SELECT
   EXTRACT(DATE FROM starttime) AS date,
   COUNT(*) AS num_trips,
   start_station_id
   FROM
   `bigquery-public-data`.new_york.citibike_trips
   WHERE starttime > '2014-07-11' AND starttime < '2015-02-11'
   GROUP BY date, start_station_id;

Run the CREATE MODEL query to create and train your model:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following GoogleSQL query in the query editor.

    #standardSQL
    CREATE OR REPLACE MODEL bqml_tutorial.nyc_citibike_arima_model
    OPTIONS (
      model_type = 'ARIMA_PLUS',
      time_series_timestamp_col = 'date',
      time_series_data_col = 'num_trips',
      time_series_id_col = 'start_station_id',
      forecast_limit_lower_bound = 0)
    AS
    SELECT
    EXTRACT(DATE FROM starttime) AS date,
    COUNT(*) AS num_trips,
    start_station_id
    FROM
    `bigquery-public-data`.new_york.citibike_trips
    WHERE starttime > '2014-07-11' AND starttime < '2015-02-11'
    GROUP BY date, start_station_id;
  3. Click Run.

    The query takes approximately 100 seconds to complete, after which your model (nyc_citibike_arima_model_with_limits) appears in the Explorer pane. Because the query uses a CREATE MODEL statement to create a model, there are no query results.

Forecast the time series by using the model with limits

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following GoogleSQL query in the query editor.

    #standardSQL
    SELECT
    forecast_timestamp AS forecast_timestamp,
    start_station_id AS start_station_id,
    history_value AS history_value,
    forecast_value AS forecast_value
    FROM
    (
      (
         SELECT
         DATE(forecast_timestamp) AS forecast_timestamp,
         NULL AS history_value,
         forecast_value AS forecast_value,
         start_station_id AS start_station_id,
         FROM
         ML.FORECAST(
            MODEL bqml_tutorial.`nyc_citibike_arima_model`,
            STRUCT(
               365 AS horizon,
               0.9 AS confidence_level))
      )
      UNION ALL
      (
         SELECT
         DATE(date_name) AS forecast_timestamp,
         num_trips AS history_value,
         NULL AS forecast_value,
         start_station_id AS start_station_id,
         FROM
         (
            SELECT
               EXTRACT(DATE FROM starttime) AS date_name,
               COUNT(*) AS num_trips,
               start_station_id AS start_station_id
            FROM
               `bigquery-public-data`.new_york.citibike_trips
            WHERE
               starttime > '2014-07-11'
               AND starttime < '2015-02-11'
            GROUP BY
               date_name, start_station_id
         )
      )
    )
    WHERE start_station_id = 79
    ORDER BY
    forecast_timestamp, start_station_id
    ORDER BY
    forecast_timestamp,start_station_id
  3. Click Run.

    BQUI_chart.

  4. Use the Google Cloud console to chart the time series data. In the Query results pane, click the Chart tab:

    Result_visualization.

The ARIMA PLUS model detects that the daily total number of Citi Bike trips where start_station_id=79 is decreasing. Future forecasting values will follow this trend and give relatively smaller forecasting numbers the farther into the future you go. The chart shows that the forecasted values for the daily total number of Citi Bike trips where start_station_id=79 are positive numbers, which is more useful. The model with limits detects that the daily total number of Citi Bike trips where start_station_id=79 is decreasing, but it still gives meaningful forecasting values.

As this tutorial shows, the forecast_limit_lower_bound and forecast_limit_upper_bound options can help you get more meaningful forecasting values in similar scenarios to the one shown here, such as when forecasting stock prices or future sales numbers.

Delete your dataset

Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:

  1. If necessary, open the BigQuery page in the Google Cloud console.

    Go to the BigQuery page

  2. In the navigation, click the bqml_tutorial dataset you created.

  3. Click Delete dataset on the right side of the window. This action deletes the dataset, the table, and all the data.

  4. In the Delete dataset dialog, confirm the delete command by typing the name of your dataset (bqml_tutorial) and then click Delete.

Delete your project

To delete the project:

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next