Google Data Cloud innovations for continuous real-time intelligence
Sachin Agarwal
Group Product Manager, Google Cloud
Shan Kulandaivel
Group Product Manager, Google Cloud
Organizations are increasingly looking to drive outcomes by harnessing real-time analytics. In the current AI era, it is crucial to deliver up-to-date information to AI systems that help make informed decisions, identify trends and anomalies, and implement proactive and effective interventions. To fully realize the benefits of real-time intelligence for visibility, predictions, and activation, you need streaming infrastructure that is easy to use, robust, scalable, and cost-efficient.
Google invented modern stream data processing when we published the original Dataflow paper describing our Dataflow service. The unique way Dataflow implements concepts such as windowing, triggers, and checkpointing ensures the continued processing of all kinds of data, including late-arriving data. Google has been named a Leader in the Forrester Wave™ Streaming Data Platforms, Q4 2023 report. Sanjeev Mohan, Principal Analyst at SanjMo and former Gartner VP, also recognized that Dataflow is well integrated with many other Google Cloud products to provide a full platform for real-time applications.
Using Google Cloud’s data, AI, and real-time solutions, many enterprises are delivering and acting on real-time insights to drive significant business impact:
- Spotify leverages Dataflow for large-scale generation of ML podcast previews and plans to keep pushing the boundaries of what’s possible with data engineering and data science to build better experiences for their customers and creators.
- Puma increased their average order value by 19% by better understanding how to tailor content to customers and with access to real-time inventory levels up to 4x faster, helping shoppers find the right products at the nearest stores.
- Compass works with local governments in Australia to improve the safety of their roads with real-time monitoring across 1.5M+ datasets processed daily from connected vehicles.
- Tyson Foods is using Google Cloud for the next generation of smart factories using unstructured data, such as images or videos, to train vision models to monitor real-time IoT-connected sensors and optimize patterns. They rely on BigQuery for secure, repeatable, and scalable enterprise solutions.
Over the years, we’ve extended our streaming capabilities and democratized access to streaming in a number of ways. This includes enhancements to Dataflow that provide flexibility over GPU and CPU usage, real-time pipeline enrichment, new managed IO services, and at-least-once processing; new capabilities in BigQuery with continuous real-time query processing integrated with AI; and a new Apache Kafka service.
Dataflow innovations
We added new features in Dataflow ML to make the most common machine learning use cases easier, more performant, and more cost-effective. Dataflow’s new right fitting lets users mix and match compute types so that GPUs are used only when necessary, reducing cost. The new Enrichment transform provides real-time ML feature enrichment that gracefully handles spikes and unexpected behavior in a Dataflow pipeline, reducing toil and accelerating your ability to leverage the latest data in your ML models.
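As a rough illustration of what real-time feature enrichment involves, the sketch below joins each incoming event with features fetched from a lookup service and caches results so that a burst of events with the same key does not multiply lookups. This is plain Python and not the Dataflow Enrichment transform's API; `enrich`, `fake_lookup`, `FAKE_FEATURE_STORE`, and the field names are all hypothetical, and the cache is only one assumed way of absorbing spikes.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Event = Dict[str, object]
Features = Dict[str, object]


def enrich(
    events: Iterable[Event],
    lookup: Callable[[str], Features],
    key_field: str = "product_id",  # hypothetical join key
) -> Iterator[Event]:
    """Yield each event merged with the features found for its key."""
    cache: Dict[str, Features] = {}
    for event in events:
        key = str(event[key_field])
        if key not in cache:
            # Only the first event for a key reaches the lookup service.
            cache[key] = lookup(key)
        yield {**event, **cache[key]}


# Stand-in for a remote feature store (an assumption for this sketch).
FAKE_FEATURE_STORE: Dict[str, Features] = {
    "p1": {"category": "shoes"},
    "p2": {"category": "hats"},
}
lookup_calls: List[str] = []


def fake_lookup(key: str) -> Features:
    """Record each call so the effect of caching is visible."""
    lookup_calls.append(key)
    return FAKE_FEATURE_STORE.get(key, {})


events = [
    {"product_id": "p1", "qty": 2},
    {"product_id": "p2", "qty": 1},
    {"product_id": "p1", "qty": 5},
]
enriched = list(enrich(events, fake_lookup))
print(enriched[2])        # {'product_id': 'p1', 'qty': 5, 'category': 'shoes'}
print(len(lookup_calls))  # 2
```

Three events produce only two lookups here because the second `p1` event is served from the cache.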
The new IcebergIO connector streams data directly into Apache Iceberg data lake tables. IcebergIO is the first of many IOs that will be taking advantage of the new Managed IO feature in Dataflow. Dataflow Managed IO provides additional benefits like automatically updating the connector with newer versions or applying patches without any action required.
Dataflow streaming provides an exactly-once guarantee, meaning that the effects of data processed in the pipeline are reflected exactly once, even for late data. For lower-latency and lower-cost streaming data ingestion, we introduced the new at-least-once processing mode, in which each input record is processed at least once, which is particularly helpful when the data source already provides those guarantees.
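The difference between the two guarantees can be pictured with a small, self-contained sketch. It is not how Dataflow implements either mode; it only assumes that each record carries a unique ID (the hypothetical `record_id`) and that a retry can deliver the same record twice.

```python
from typing import Iterable, List, Set, Tuple

# (record_id, value); record_id is a hypothetical unique key per record.
Record = Tuple[str, int]


def at_least_once_sum(deliveries: Iterable[Record]) -> int:
    """Apply every delivery as it arrives; a redelivered record counts again."""
    total = 0
    for _record_id, value in deliveries:
        total += value
    return total


def exactly_once_sum(deliveries: Iterable[Record]) -> int:
    """Remember seen record IDs so each record's effect is applied only once."""
    seen: Set[str] = set()
    total = 0
    for record_id, value in deliveries:
        if record_id in seen:
            continue  # duplicate delivery, already reflected in the total
        seen.add(record_id)
        total += value
    return total


# A retry causes record "b" to be delivered twice.
deliveries: List[Record] = [("a", 10), ("b", 5), ("b", 5), ("c", 1)]
print(at_least_once_sum(deliveries))  # 21
print(exactly_once_sum(deliveries))   # 16
```

The exactly-once version pays for its guarantee with extra state (the set of seen IDs) and a check per record, which is the kind of overhead an at-least-once mode avoids.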
BigQuery continuous queries
At Next ’24, we announced the preview of continuous queries in BigQuery. Leveraging the infrastructure and techniques that power Dataflow, users can now directly create stream processing jobs that produce real-time change streams based on the latest data coming into BigQuery. In addition, these real-time streams can be processed with any AI or ML functions, including LLM operations using Vertex AI. Customers can do this with simple SQL, dramatically lowering the barrier for organizations and users to realize the benefits of real-time intelligence and streaming infrastructure.
At Next ’24, we also extended support for all three major open source data lake formats in BigLake: Apache Iceberg, Apache Hudi, and Delta Lake, natively integrated with BigQuery. This includes a fully managed experience for Iceberg, enabling support for streaming across all data types and even across clouds using BigQuery Omni. We also released a new whitepaper, BigQuery's Evolution toward a Multi-Cloud Lakehouse, which is to be presented at the 2024 SIGMOD event.
New Apache Kafka service
Finally, at Next ’24, we announced the forthcoming release of a managed Apache Kafka service called Apache Kafka for BigQuery. This is a fully managed, end-to-end service for Apache Kafka that will automate the operational and security work that comes with running such a service yourself. It is compatible with your existing applications and integrated with BigQuery, so you can quickly and easily load your Kafka streaming data into BigQuery through the Storage Write API, BigQuery’s high-performance streaming ingestion API. You can express interest to be notified about the preview.
Getting started
Refer to the documentation to learn more about Dataflow and BigQuery. If you are new to Dataflow, take the foundational training. We’re very excited to bring you all the latest innovations and can’t wait to see what you build with our real-time analytics solutions.