VAIS:Retail onboarding - user events ingestion best practices guide


VAIS:Retail's data ingestion pipeline encompasses both product catalog and user event data. This data stream provides the foundation for robust model training and continuous evaluation through feedback mechanisms. Accurate and complete data ingestion is not just a prerequisite; it is an ongoing process essential for maintaining the adaptability of the underlying models. This, in turn, directly influences the quality and relevance of search results, offering significant returns on investment.

Consider these data ingestion best practices when architecting your retail search solution to maximize efficiency and effectiveness. 

User event ingestion

Please note that the following content does not refer to historical events ingestion. For more information on historical events ingestion, please see here.

1. User events ingestion in VAIS:Retail: balancing real-time responsiveness and batch efficiency

Mirroring the catalog ingestion process, VAIS:Retail offers dual mechanisms for user event data: bulk import and real-time streaming. This flexibility caters to diverse customer backend architectures. However, unlike the catalog where a hybrid approach is feasible, a dedicated ingestion strategy is strongly recommended for user events, typically favoring either bulk import or real-time streaming, with the latter being the predominant choice in practical implementations.

While both methods, assuming accurate and complete data, yield comparable outcomes in model training, KPI measurement, and revenue optimization, subtle trade-offs exist between the two. Bulk import, for instance, might offer greater efficiency for processing large volumes of historical data, while real-time streaming ensures immediate responsiveness to recent user interactions.

The ideal choice ultimately depends on the specific requirements of your retail environment, such as the desired latency for incorporating user events into model training and the volume of events being generated.

More details about bulk import and real-time event streaming can be found here.

2. Real-time streaming: nuances and advantages

While both bulk import and real-time streaming offer effective mechanisms for user event ingestion in VAIS:Retail, real-time streaming presents several nuanced advantages:

  1. Simplified scalability: Real-time streaming often negates the need for additional ETL (Extract, Transform, Load) pipelines, facilitating easier scalability as event volumes increase. The infrastructure can typically handle growth organically without major adjustments.
  2. Seamless integration with Google Analytics (GA4): Websites instrumented with GA4 can leverage Google Tag Manager (GTM) to seamlessly capture and stream real-time events into VAIS:Retail, eliminating the need for custom event tracking implementations.
  3. Direct API calls: Real-time events can be transmitted directly from the frontend or through a simple proxy server via REST APIs, streamlining the integration process and reducing potential points of failure (a minimal write-event sketch follows this list).
  4. Up-to-date KPI measurement and error reporting: Faster processing of real-time events translates to more current KPI measurements and error reporting. This allows for quicker identification and resolution of issues, enhancing the overall robustness of the system.
  5. Unlocking real-time personalization: By capturing and processing user events in real time, VAIS:Retail can deliver personalized search results based on recent browsing and purchase behavior, creating a more engaging and tailored shopping experience for users.
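
To ground the direct-API option above, here is a minimal sketch of streaming a single detail-page-view event with the google-cloud-retail Python client. The project ID, visitor ID, and product ID are placeholders, and a production integration would add authentication context, attribution tokens, and error handling.

```python
# Minimal sketch: write one real-time user event to VAIS:Retail.
# Requires `pip install google-cloud-retail`; PROJECT_ID, visitor and product
# IDs below are placeholders.
import datetime

from google.cloud import retail_v2

PARENT = "projects/PROJECT_ID/locations/global/catalogs/default_catalog"


def write_detail_page_view(visitor_id: str, product_id: str) -> retail_v2.UserEvent:
    client = retail_v2.UserEventServiceClient()

    user_event = retail_v2.UserEvent(
        event_type="detail-page-view",   # standard VAIS:Retail event type
        visitor_id=visitor_id,           # stable per-browser / per-session identifier
        event_time=datetime.datetime.now(datetime.timezone.utc),
        product_details=[
            retail_v2.ProductDetail(product=retail_v2.Product(id=product_id))
        ],
    )

    request = retail_v2.WriteUserEventRequest(parent=PARENT, user_event=user_event)
    return client.write_user_event(request=request)


if __name__ == "__main__":
    print(write_detail_page_view("visitor-123", "sku-456"))
```

The same call can sit behind a thin proxy endpoint when you prefer not to expose credentials or request shaping to the frontend.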

Potential downsides of real-time streaming:

  1. Limited re-ingestion: Unlike bulk imports, where data can be staged in BigQuery or cloud storage, real-time events are not easily re-ingested in case of errors or malformed data. This necessitates robust validation mechanisms at the ingestion point to ensure data integrity (an illustrative pre-flight check is sketched after this list).
  2. Challenges with custom analysis: If custom analysis of event data is required, it often necessitates exporting events to BigQuery for further processing. This can become a bottleneck for high-volume event streams, potentially slowing down analysis and reporting.
  3. Limited retrospective debugging and event data transfer: Sending events directly without staging limits the ability to perform retrospective forensics on events that have already been ingested.
    A related limitation is that a lower environment may not receive enough events to train models effectively. Keeping a staging area for event data (BQ or GCS) also makes it possible to import or transfer the events into other environments, albeit after minor data transformations if necessary.
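
Because of the limited re-ingestion noted above, it is worth validating events before they are sent. The snippet below is an illustrative pre-flight check only: the required-field rules reflect common VAIS:Retail event types (for example, product details on detail-page-view and a search query on search events), but the exact rule set and the dead-letter handling are assumptions to adapt to your own schema.

```python
# Illustrative pre-flight validation before streaming an event.
# Field names follow the VAIS:Retail user event JSON schema; the rule set is a
# simplified subset and should be extended for the event types you send.
REQUIRED_BY_TYPE = {
    "detail-page-view": ["productDetails"],
    "add-to-cart": ["productDetails"],
    "purchase-complete": ["productDetails", "purchaseTransaction"],
    "search": ["searchQuery"],
}


def validate_event(event: dict) -> list:
    """Return a list of problems; an empty list means the event looks sendable."""
    problems = []
    if not event.get("eventType"):
        problems.append("missing eventType")
    if not event.get("visitorId"):
        problems.append("missing visitorId")
    for field in REQUIRED_BY_TYPE.get(event.get("eventType", ""), []):
        if not event.get(field):
            problems.append(f"missing {field} for {event['eventType']}")
    return problems


event = {"eventType": "search", "visitorId": "visitor-123"}
issues = validate_event(event)
if issues:
    # Route the payload to a dead-letter sink instead of calling the write API.
    print("rejected:", issues)
```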

Understanding these nuances can empower retailers to make informed decisions about the most suitable user event ingestion strategy for their specific needs and priorities within the VAIS:Retail ecosystem.

3. Bulk Import via BigQuery (or GCS): resiliency and analytics advantages

Leveraging BigQuery (or GCS) as a staging area for user event data in VAIS:Retail offers distinct advantages:

  1. Enhanced resiliency: Storing events in BQ (or GCS) provides a reliable backup mechanism, enabling easy purging and re-ingestion if necessary. This resiliency safeguards against data loss and simplifies recovery in case of errors or inconsistencies.
    The import method also has built-in resiliency: events that fail to ingest are stored in error buckets along with the error details.
  2. In-place custom analytics: With events readily accessible in BQ, custom analytics can be performed directly on the user event data without the need for additional data export or transfer processes. This streamlines analysis workflows and facilitates real-time insights.
  3. Leveraging existing events: Bulk imports can leverage existing user event data collected in various formats. A straightforward ETL (Extract, Transform, Load) process can convert this data into the VAIS:Retail format, eliminating the need for extensive frontend changes or complex integrations (the import call itself is sketched after this list).
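
To make the bulk path concrete, the sketch below triggers an import from a BigQuery table that already holds events in the VAIS:Retail user event schema, again using the google-cloud-retail Python client. Project, dataset, and table names are placeholders.

```python
# Minimal sketch: bulk-import user events staged in BigQuery.
# The table is assumed to already match the VAIS:Retail user event schema.
from google.cloud import retail_v2

PARENT = "projects/PROJECT_ID/locations/global/catalogs/default_catalog"


def import_events_from_bigquery(dataset_id: str, table_id: str):
    client = retail_v2.UserEventServiceClient()

    input_config = retail_v2.UserEventInputConfig(
        big_query_source=retail_v2.BigQuerySource(
            project_id="PROJECT_ID",
            dataset_id=dataset_id,
            table_id=table_id,
        )
    )

    request = retail_v2.ImportUserEventsRequest(parent=PARENT, input_config=input_config)

    # import_user_events returns a long-running operation; result() blocks until
    # the import finishes and raises on a terminal error.
    operation = client.import_user_events(request=request)
    return operation.result()


if __name__ == "__main__":
    print(import_events_from_bigquery("retail_events", "curated_user_events"))
```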

Potential downsides of bulk import:

  1. Limited real-time personalization: Real-time personalization capabilities are constrained by the frequency of bulk imports. The time lag between event generation and ingestion can impact the responsiveness of personalized search results.
  2. Slower KPI measurement and error reporting: Compared to real-time streaming, bulk imports introduce delays in KPI measurement and error reporting due to the batch-oriented nature of the process. This can hinder immediate responses to emerging trends or issues.
  3. ETL pipeline infrastructure: Unlike real-time streaming, ETL pipelines need to be built and monitored for failures, and a mechanism to retry imports for failed events (after they are fixed) also has to be implemented; this may require some initial development effort (a rough retry sketch follows this list).
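
Where the retry mechanism mentioned above is needed, one simple pattern is to wrap the long-running import operation, back off between attempts, and surface the sampled per-row errors for a fix-and-reimport loop. This is a rough sketch under the same client assumptions as the earlier import example; the backoff policy and timeout are arbitrary.

```python
# Rough sketch: retry a user-event import and surface sampled row errors.
# Assumes the same PARENT and request construction as the earlier import example.
import time

from google.cloud import retail_v2


def import_with_retry(request: retail_v2.ImportUserEventsRequest, attempts: int = 3):
    client = retail_v2.UserEventServiceClient()
    for attempt in range(1, attempts + 1):
        operation = client.import_user_events(request=request)
        try:
            response = operation.result(timeout=3600)
        except Exception as exc:  # e.g. timeout or terminal import failure
            print(f"attempt {attempt} failed: {exc}")
            time.sleep(60 * attempt)  # simple linear backoff between attempts
            continue
        # error_samples carries a sample of per-row failures; log them for the
        # fix-and-reimport process rather than blindly retrying the whole batch.
        if response.error_samples:
            print(f"{len(response.error_samples)} sample row errors reported")
        return response
    raise RuntimeError("import did not succeed after retries")
```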

Understanding these trade-offs can guide retailers in selecting the most suitable user event ingestion approach for their specific use cases and priorities within VAIS:Retail.

4. Scaling user event ingestion in VAIS:Retail: preparing for traffic surges and ensuring data integrity

After establishing your chosen user event ingestion method and pipeline, proactive planning for scaling scenarios is paramount.

  • High-traffic events like Black Friday and Cyber Monday (BFCM) can trigger a 10x or even 20x surge in user activity compared to average daily levels.
  • Ensuring sufficient quotas and the scalability of your ingestion system to handle such spikes is crucial. These events often manifest as sudden, planned bursts of traffic rather than gradual increases, making preparedness even more critical.
  • Missing events during these peak periods can significantly hamper model training, degrade search performance, and skew KPI measurements. Debugging such issues can be challenging as events form the basis for both KPI tracking and general troubleshooting.
  • Implementing robust alerting mechanisms is essential. These alerts can proactively notify you of deteriorating data quality, which is often a consequence of missing or erroneous event data (a simple volume check is sketched after this list).
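
One lightweight way to back such alerts is a scheduled job that compares the last hour's ingested event count against an expected floor. The sketch below assumes events also land in a BigQuery raw table (as in the reference architecture later in this post); the dataset, table, timestamp column, and threshold are placeholders.

```python
# Illustrative volume check: flag when the last hour's event count drops below
# an expected floor. Table name, column name, and threshold are placeholders.
from google.cloud import bigquery

MIN_EVENTS_PER_HOUR = 10_000  # tune from historical traffic; raise ahead of BFCM


def last_hour_event_count() -> int:
    client = bigquery.Client()
    query = """
        SELECT COUNT(*) AS event_count
        FROM `PROJECT_ID.retail_events.raw_user_events`
        WHERE event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
    """
    row = next(iter(client.query(query).result()))
    return row.event_count


if __name__ == "__main__":
    count = last_hour_event_count()
    if count < MIN_EVENTS_PER_HOUR:
        # Hook this into your alerting channel (Cloud Monitoring, e-mail, chat, ...).
        print(f"ALERT: only {count} events in the last hour "
              f"(expected at least {MIN_EVENTS_PER_HOUR})")
```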

By anticipating these scenarios and taking preventative measures, you can maintain the reliability and accuracy of your user event data, even under extreme load conditions. This ensures that your VAIS:Retail system continues to deliver optimal performance, accurate analytics, and a seamless user experience during peak traffic periods.

5. Reference architecture for batch events ingestion

[Diagram: Reference architecture for batch events ingestion]

This design details a robust, scalable architecture for the efficient ingestion of user events into Vertex AI Search for Retail (VAIS:R). The architecture leverages a combination of Google Cloud Platform (GCP) services, including Pub/Sub, Dataflow, BigQuery, Cloud Workflows, and Cloud Storage, to manage the ingestion process in a staged, controlled manner.

Architectural overview

The architecture employs a multi-stage approach to ensure the reliable and accurate transfer of user event data into VAIS:R. Key components include:

  • Pub/Sub: Acts as the initial entry point for user events, providing a scalable and durable messaging system.
  • Dataflow (Streaming Events): Continuously reads raw user events from Pub/Sub and writes them into BigQuery raw event tables, also capturing any failed events for debugging.
  • BigQuery: Serves as the primary data warehousing solution, storing raw events, transformed events, and various metadata related to the ingestion process.
  • Cloud Workflows: Orchestrates the hourly batch processing of raw events, ensuring data integrity and facilitating error handling.
  • Cloud Storage: Provides temporary storage for Dataflow during processing and for archiving failed event logs.
  • VAIS:R (Retail API): The final destination for user event data, enabling advanced search and recommendation capabilities.

Step-by-step data flow

  1. Event streaming and Raw data persistence:
    • User events are published to Pub/Sub topics from the customer source system.
    • A Dataflow streaming pipeline (Streaming Events) continuously reads events from Pub/Sub.
    • Successful raw events are written into BigQuery raw event tables.
    • Any failures during this initial ingestion are captured and stored in separate BigQuery raw failed event tables for analysis and troubleshooting (a simplified Beam-style sketch of this stage follows the step list).
  2. Hourly batch processing and transformation:
    • Cloud Workflows triggers an hourly scheduled Dataflow pipeline (Incr Batch Events).
    • This pipeline reads raw events from BigQuery tables.
    • Events are transformed into the required VAIS:R format.
    • Transformed events are written into BigQuery curated event tables.
    • Any transformation failures are captured in BigQuery curated failed event tables.
  3. Data validation and preparation:
    • Cloud Workflows executes a BigQuery stored procedure to create or update a view (Incr Update Event View) that reflects the latest transformed event data from the last hour.
    • The workflow then performs a validation check on this view, ensuring that the number of transformed events falls within predefined thresholds.
  4. VAIS:R import:
    • If the validation is successful, Cloud Workflows invokes the VAIS:R import event API, pointing it to the BigQuery view containing the transformed events.
    • VAIS:R then imports these events for further processing and indexing.
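
To make the first stage concrete, here is a simplified Beam-style sketch of the "Streaming Events" job: it reads raw events from Pub/Sub, parses them, and writes successes and parse failures to separate, pre-created BigQuery tables. It is an illustration of the pattern, not the exact pipeline; the subscription name, table names, and schemas are placeholders.

```python
# Simplified Beam sketch of the "Streaming Events" stage: Pub/Sub -> BigQuery,
# with parse failures routed to a separate raw-failed table. Both BigQuery
# tables are assumed to exist already; all resource names are placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class ParseEvent(beam.DoFn):
    def process(self, message: bytes):
        try:
            yield json.loads(message.decode("utf-8"))
        except Exception as exc:
            # Keep the original payload so the event can be inspected and replayed.
            yield beam.pvalue.TaggedOutput(
                "failed",
                {"raw": message.decode("utf-8", "replace"), "error": str(exc)},
            )


def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as pipeline:
        parsed = (
            pipeline
            | "ReadEvents" >> beam.io.ReadFromPubSub(
                subscription="projects/PROJECT_ID/subscriptions/user-events-sub")
            | "Parse" >> beam.ParDo(ParseEvent()).with_outputs("failed", main="ok")
        )
        parsed.ok | "WriteRaw" >> beam.io.WriteToBigQuery(
            "PROJECT_ID:retail_events.raw_user_events",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
        parsed.failed | "WriteFailed" >> beam.io.WriteToBigQuery(
            "PROJECT_ID:retail_events.raw_failed_events",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )


if __name__ == "__main__":
    run()
```

The hourly "Incr Batch Events" transformation and the Workflows-driven validation then operate on the raw table written here, and the final import into VAIS:R uses the same import call shown earlier in the bulk-import section, pointed at the curated BigQuery view.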

Benefits of this architecture

  • Scalability: The use of Pub/Sub and Dataflow enables the system to handle large volumes of user events, ensuring smooth operation even during peak periods.
  • Reliability: Multiple stages of data persistence and error handling mechanisms (failed event tables) ensure that data loss is minimized and that issues can be quickly identified and addressed.
  • Data integrity: Data validation checks before the final import into VAIS:R ensure that only high-quality, correctly formatted data is ingested.
  • Flexibility: The modular design allows for easy adaptation to changing requirements or the addition of new data sources or processing steps.
  • Maintainability: The use of managed GCP services reduces operational overhead and simplifies ongoing maintenance.

Additional considerations

  • Security: Appropriate security measures, such as access controls and encryption, should be implemented to protect sensitive user event data.
  • Monitoring and logging: Comprehensive monitoring and logging should be set up to track the performance of the system and identify potential issues proactively.
  • Cost optimization: Resource utilization should be monitored to optimize costs and ensure efficient use of GCP services.

This architecture provides a solid foundation for ingesting user events into VAIS:R. By leveraging the strengths of various GCP services and incorporating best practices for data processing and error handling, it enables organizations to build robust, scalable, and maintainable solutions for powering advanced search and recommendation experiences.