NLP2SQL using dynamic RAG based few-shot examples

SKJB
Staff

Editor’s note: This is the second post in a three-part series on how generative AI is changing how we interact with databases. This post and part 3 cover specific approaches and architectures for building an NLP2SQL solution. Make sure to check out part 1 for an introduction to NLP2SQL.

Generative AI is transforming the way we interact with technology, and one such use case is natural language processing to SQL (NLP2SQL) command generation.

NLP2SQL enables you to retrieve information and generate insights from structured database tables using natural language queries. Large language models leverage information about the database as context to translate the natural language query into a SQL query.

NLP2SQL allows for seamless interaction with databases through natural language, enabling users to retrieve information in human-like conversations. This unlocks opportunities not only for individuals with little or no SQL expertise to delve into datasets like professionals, but also for analysts to generate queries faster and enhance overall productivity. However, generative AI has a major challenge, “hallucination,” which presents a significant obstacle for NLP2SQL use cases in terms of:

  • Higher accuracy and reliability
  • Domain or use case specific best practices
  • Continuous learning & improvement of the model
  • Handling SQL Dialects

While there are different approaches to generating SQL from free-form text using large language models, this article focuses on an approach using dynamic RAG based few-shot examples.

Before we start exploring the NLP2SQL custom approach in more detail, refer to the “Prompt Design - Best Practices” and “Getting Started with Text Embeddings + Vertex AI Vector Search” colab notebooks to better understand the basic concepts related to prompt design, few-shot examples, and retrieval augmented generation (RAG) pipelines.

Architecture

“Dynamic RAG based few shot learning” is a combination of the “RAG” and “few shot examples” techniques that are commonly used in custom generative AI application development.

[Figure: overview of combining RAG with few-shot examples]

In this approach, the RAG pipeline is employed to dynamically retrieve all the relevant few-shot examples for the prompt from the knowledge base, using the similarity score computed between the input query and each knowledge base entry.

Examples whose similarity scores meet the established threshold are integrated into the prompt and serve as a reference for the large language model (LLM). This enables the model to follow patterns and best practices that are highly relevant to the specific input query, as opposed to relying on generic examples that attempt to address a broad range of scenarios, which in turn increases the overall accuracy and reliability of the results. In addition, metadata from the knowledge base can be used to capture SQL query templates, which serve as grounding for each response.
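As a minimal sketch of this retrieval step (the function names, in-memory knowledge base format, and the 0.75 threshold are illustrative assumptions, not a reference implementation):

import numpy as np
from vertexai.language_models import TextEmbeddingModel

embedd_model = TextEmbeddingModel.from_pretrained("textembedding-gecko-multilingual@001")

def retrieve_few_shot_examples(query, kb_entries, kb_embeddings, threshold=0.75, top_k=3):
    """Return the top-k knowledge base entries above a similarity threshold.

    kb_entries: list of {"question": ..., "sql_query": ...} dicts.
    kb_embeddings: np.ndarray of shape (num_entries, embedding_dim).
    """
    query_vec = np.array(embedd_model.get_embeddings([query])[0].values)
    # Cosine similarity between the input query and every knowledge base entry
    sims = kb_embeddings @ query_vec / (
        np.linalg.norm(kb_embeddings, axis=1) * np.linalg.norm(query_vec))
    ranked = np.argsort(sims)[::-1][:top_k]
    # Keep only the examples that clear the established threshold
    return [kb_entries[i] for i in ranked if sims[i] >= threshold]

def format_examples(examples):
    # Render retrieved pairs in the format expected by the prompt template
    return "\n".join(f"Q: {e['question']}\nSQL: {e['sql_query']}" for e in examples)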

Note: Prompt engineering is not included in the scope of this document; it is assumed that context setting & prompt design are already handled based on the domain of the use case.


[Figure: dynamic RAG based few-shot example architecture]

The dynamic RAG-based few-shot example architecture can be divided into three distinct components: "Creating and Managing the Knowledge Base," "Similarity Search and SQL Generation," and "Database Output in Natural Language." Each component plays a crucial role in the overall functionality of the architecture, and we will explore each one in more detail.

  1. Creating and managing the knowledge base
    1. This module creates and updates the knowledge base. The knowledge base includes:
      1. Metadata about the tables, columns, and relationships such as foreign keys and primary keys.
      2. Key-value pairs of natural language queries and their ground truth SQL queries.
    2. Components to create and update the knowledge base include:
      1. A batch job to process metadata and embeddings from a CSV file and store them in BigQuery or AlloyDB (see the sketch after this list).
      2. A Cloud Function to create an embedding of the natural language query and use it to retrieve the top-k matching results from the knowledge base.
  2. Similarity search & SQL generation
    1. This module uses a large language model, together with the examples retrieved from the knowledge base and prompt templates, to generate a SQL query for a given natural language query.
    2. The generated query is logged in the knowledge base for human evaluation.
  3. Database output in natural language
    1. The generated query is executed, and the results of the SQL query are summarized using a large language model.
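As an illustrative sketch of the batch component above (the CSV layout, table name, and batching size are assumptions for this example; loading list columns into BigQuery requires pyarrow):

import pandas as pd
from google.cloud import bigquery
from vertexai.language_models import TextEmbeddingModel

embedd_model = TextEmbeddingModel.from_pretrained("textembedding-gecko-multilingual@001")

# Assumed CSV columns: question (natural language), sql_query (ground truth SQL)
kb = pd.read_csv("knowledge_base.csv")

# Embed the natural language questions in small batches
embeddings = []
for start in range(0, len(kb), 5):
    batch = kb["question"].iloc[start:start + 5].tolist()
    embeddings.extend(e.values for e in embedd_model.get_embeddings(batch))
kb["embedding"] = embeddings

# Store entries and embeddings in BigQuery (destination table is a placeholder)
bq_client = bigquery.Client()
bq_client.load_table_from_dataframe(kb, "your_dataset.nlp2sql_knowledge_base").result()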

Products used

The aforementioned "Dynamic RAG based few shot examples" architecture uses the following components, which can be harnessed when implementing these solutions. The same architecture can also be implemented with other equivalent GCP and OSS services.

  • Cloud Storage: Cloud Storage is a managed service designed to store unstructured data. It offers the flexibility to store any quantity of data and retrieve it on an as-needed basis, ensuring efficient utilization of storage and data retrieval resources.
  • BigQuery: BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective analytics data warehouse that lets you run analytics over vast amounts of data in near real time. BigQuery now also supports a vector search feature that can be used to search embeddings to identify semantically similar entities (see the sketch after this list).
  • Vertex AI Vector Search: Vector Search harnesses the capabilities of vector search technology to explore and retrieve a vast corpus of semantically similar or related items with remarkable efficiency. 
  • Vertex Generative AI models: Vertex AI is a fully managed, unified AI development platform for building and using generative AI. It offers the widest variety of models, with first-party (Gemini, Imagen, Codey), third-party (Anthropic's Claude 3), and open models (Gemma, Llama 3) available in Model Garden.
  • Cloud Function: Run your code in the cloud with no servers or containers to manage with our scalable, pay-as-you-go functions as a service (FaaS) product.
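For example, once the knowledge base embeddings are stored in BigQuery, a query along the following lines could retrieve the closest entries; the table and column names are illustrative, and the VECTOR_SEARCH syntax should be checked against the current BigQuery documentation:

from google.cloud import bigquery

bq_client = bigquery.Client()
# query_embedding is assumed to hold the floats returned by the embedding model
sql = """
SELECT base.question, base.sql_query, distance
FROM VECTOR_SEARCH(
  TABLE your_dataset.nlp2sql_knowledge_base, 'embedding',
  (SELECT @query_embedding AS embedding),
  top_k => 5, distance_type => 'COSINE')
"""
job_config = bigquery.QueryJobConfig(query_parameters=[
    bigquery.ArrayQueryParameter("query_embedding", "FLOAT64", query_embedding)
])
top_matches = list(bq_client.query(sql, job_config=job_config).result())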

A few alternative services can also be explored based on the complexity:

  • GCP AlloyDB for PostgreSQL: AlloyDB AI provides high-performance, pgvector-compatible search that executes vector queries up to 10x faster than standard PostgreSQL. It enables the generation of vector embeddings from within the database and integrates with open-source gen AI tools (see the sketch after this list).
  • Cloud Run: Run frontend and backend services, batch jobs, deploy websites and applications, and queue processing workloads without the need to manage infrastructure.
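If the knowledge base lives in AlloyDB instead, a pgvector similarity query could look like the following sketch (connection details, table, and column names are illustrative assumptions):

import psycopg2

conn = psycopg2.connect(host="your-alloydb-host", dbname="nlp2sql",
                        user="app", password="...")
with conn.cursor() as cur:
    # '<=>' is pgvector's cosine distance operator; smaller means more similar
    cur.execute(
        """
        SELECT question, sql_query, embedding <=> %s::vector AS distance
        FROM nlp2sql_knowledge_base
        ORDER BY distance
        LIMIT 5
        """,
        (str(query_embedding),),
    )
    top_matches = cur.fetchall()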

Implementation using Google generative AI

NLP2SQL using GCP Foundation models & services (Code link)

GenAI models used:

# Load the code & text embeddings models (imports included for completeness)
from vertexai.generative_models import GenerativeModel
from vertexai.language_models import CodeGenerationModel, TextEmbeddingModel

embedd_model = TextEmbeddingModel.from_pretrained("textembedding-gecko-multilingual@001")
gemini_model = GenerativeModel("gemini-1.0-pro")
code_model = CodeGenerationModel.from_pretrained("code-bison")

In this sample workflow, the multilingual embedding model "textembedding-gecko-multilingual" is used to demonstrate how a knowledge base developed in one language can support input queries in any other language supported by the model. Alternatively, "textembedding-gecko" can be used when the knowledge base and queries require only English language support.
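As a quick illustration of this cross-lingual behavior (the sample sentences are assumptions; the exact similarity value will vary and should be verified against your own data):

import numpy as np

def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The same question in English and Spanish should land close together in the
# multilingual embedding space, so an English-only knowledge base can still
# serve non-English input queries.
en, es = embedd_model.get_embeddings([
    "What is the total value of fuel transactions last month?",
    "¿Cuál es el valor total de las transacciones de combustible del mes pasado?",
])
print(cosine(en.values, es.values))  # expected to be relatively high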

Prompt Template for sample use case:

table = """CREATE TABLE transactions (
  Cust_ID VARCHAR(255),
  Date DATE,
  Time TIME,
  Transaction_Description VARCHAR(255),
  Value FLOAT,
  Fuel VARCHAR(255),
  Foreign_Currency VARCHAR(255),
  Smartbuy VARCHAR(255),
  Points_Accrued FLOAT
)
"""
# Output response is requested in English to avoid localized content
# generation in the SQL, since the current DB only has entries in the
# English locale.
# code_parameters (temperature, max output tokens, and so on) are assumed
# to be defined per use case, for example:
code_parameters = {"temperature": 0.2, "max_output_tokens": 1024}
response = code_model.predict(
    prefix = f"""Write a SQL Query based on given schema details for the given question and context in English only
CONTEXT:
{table}
EXAMPLES: {similar_query}
INPUT:{text_query_test}""",
    **code_parameters
)

The aforementioned prompt template is merely a demonstration of how prompts can be crafted for a specific database type. Based on the specific use case and the level of database support required, the prompt may need further optimization by incorporating additional details and constraints.
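To close the loop described in the architecture, the generated SQL can be executed and its results summarized in natural language. A minimal sketch, assuming the transactions table lives in BigQuery and reusing response, text_query_test, and gemini_model from the snippets above:

from google.cloud import bigquery

bq_client = bigquery.Client()

# Strip any markdown fences the model may add around the generated SQL
generated_sql = response.text.strip().removeprefix("```sql").removesuffix("```").strip()

rows = bq_client.query(generated_sql).result()
result_table = "\n".join(str(dict(row.items())) for row in rows)

# Summarize the raw query output in natural language with Gemini
summary = gemini_model.generate_content(
    f"""Summarize the following SQL query results in plain English.
Question: {text_query_test}
Results:
{result_table}"""
)
print(summary.text)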

Design considerations

This section examines some of the current challenges and limitations, along with several approaches that can be investigated to alleviate them, in order to achieve a higher overall level of precision in the generated SQL or database code.

Current challenges and limitations

Any solution or method comes with its own set of challenges and limitations; certain challenges pertain to implementation, while others are technical in nature. In this instance, the following are two high-level challenges and limitations of the solution:

  • Knowledge base:

The current approach needs a knowledge base of frequently asked questions and their corresponding SQL commands, with or without metadata information.

  • Technical limitations to support multiple tables or Data warehouse:

The current demo and workflow only cover a database with a single table. In production scenarios, information will mostly be spread across multiple tables or data warehouses.

A knowledge base functions as a repository of grounding for all generated responses and enhances the overall performance and accuracy of the solution. It serves as a foundation and provides best practices for large language models (LLMs) to reference when generating the final Structured Query Language (SQL) command. Due to the complexities of natural language, it is challenging to address all scenarios comprehensively from the outset. Incorporating a human in the loop as part of the knowledge base workflow can mitigate this issue over the long term: a default set of few-shot examples can be used when no valid similarity results are identified, and the generated results can then be refined by human reviewers and incorporated into the knowledge base to improve future outcomes for similar queries.

Working with multiple tables presents a technical limitation because a question can be asked in various ways without providing any context to the LLM regarding the correlation between the tables and schema information. Additionally, including all the table and schema information in the context window increases the overall input context length and, in some models, exceeds the maximum input token length. Several approaches can be explored in such cases, and the following section covers this topic in greater detail.

Supporting multiple tables or a data warehouse

A few approaches can support NLP2SQL for a multi-table database or a complete data warehouse:

Generating a persona-based database table

In this methodology, a consolidated database table is generated by fetching all the relevant information pertaining to the intended user profile from various tables or data repositories, leveraging data processing and transformation techniques (see the sketch after the list below).

This approach helps:

  • Reduce the data leakage risk, since only the subset of data relevant to the user profile is accessible
  • Reduce the input context size, which in turn reduces cost and latency
  • Improve the accuracy of the response, since all the data is available in a single table or a couple of tables
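As a sketch of how such a consolidated table might be materialized (the dataset, table, and column names are purely illustrative):

from google.cloud import bigquery

bq_client = bigquery.Client()
# Flatten customer and transaction data into one persona-scoped table so the
# LLM sees a single simple schema instead of the full warehouse.
persona_table_sql = """
CREATE OR REPLACE TABLE analytics.cardholder_transactions AS
SELECT c.cust_id, t.date, t.transaction_description, t.value, t.points_accrued
FROM raw.customers AS c
JOIN raw.transactions AS t ON t.cust_id = c.cust_id
WHERE c.profile_type = 'CARDHOLDER'
"""
bq_client.query(persona_table_sql).result()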

[Figure: persona-based database table generation]

Multi-level RAG based approach

In this methodology, we utilize multiple tiers of RAG pipelines: the first tier identifies the few-shot examples most relevant to the input query, and subsequent tiers retrieve the pertinent database schemas that are compatible with both the few-shot examples and the input query.

This approach helps in cases where further data processing and transformation to create a consolidated, user profile based database are not possible. A minimal sketch follows.
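This sketch of two-tier retrieval reuses the helpers from the earlier retrieval snippet and assumes a pre-built index of schema descriptions (schema_texts) with their embeddings (schema_embeddings); all of these names are illustrative:

import numpy as np

# Tier 1: retrieve the few-shot examples most similar to the input query
examples = retrieve_few_shot_examples(text_query_test, kb_entries, kb_embeddings)

# Tier 2: retrieve only the table schemas relevant to the query and the
# retrieved examples, instead of sending the whole warehouse schema.
tier2_query = text_query_test + "\n" + format_examples(examples)
qvec = np.array(embedd_model.get_embeddings([tier2_query])[0].values)
sims = schema_embeddings @ qvec / (
    np.linalg.norm(schema_embeddings, axis=1) * np.linalg.norm(qvec))
relevant_schemas = [schema_texts[i] for i in np.argsort(sims)[::-1][:3]]

# The selected schemas become the CONTEXT section of the prompt template
context = "\n".join(relevant_schemas)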

[Figure: multi-level RAG based approach]

Foundation models + SDP API for PII dataset

In certain applications, a subset of the application data can contain Personally Identifiable Information (PII). PII might necessitate masking in specific applications and use cases, based on regional compliance guidelines and requirements related to model or data residency.

In such cases, the “Sensitive Data Protection” APIs, previously known as “Cloud Data Loss Prevention” (Cloud DLP), can be employed to mask PII data effectively before transferring it to a model for further processing. Based on the use case, the masked content can later be re-identified (de-masked) through secure key based post-processing.
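A minimal sketch of the masking step with the Sensitive Data Protection client library (the project ID and info types are assumptions; reversible de-masking via a crypto-based transformation is omitted for brevity):

from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/your-project-id"  # placeholder project

def mask_pii(text):
    """Replace detected PII values with their info type names before the
    text is sent to the model, e.g. 'Email Jane' -> 'Email [PERSON_NAME]'."""
    response = dlp.deidentify_content(
        request={
            "parent": parent,
            "inspect_config": {
                "info_types": [{"name": "PERSON_NAME"}, {"name": "EMAIL_ADDRESS"}]
            },
            "deidentify_config": {
                "info_type_transformations": {
                    "transformations": [{
                        "primitive_transformation": {
                            "replace_with_info_type_config": {}
                        }
                    }]
                }
            },
            "item": {"value": text},
        }
    )
    return response.item.value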

[Figure: PII masking and de-masking with Sensitive Data Protection]

Deployment

The above solution can be packaged and deployed based on the integration and inference needs of the application under development; several options were proposed in the architecture outlined above.

What's next

Go beyond traditional approaches and explore the potential of multi-agent architectures! Dive into the latest blog post from one of our colleagues, “Unlocking Insights — Natural Language QnA on Structured Data through a Multi-agent Architecture,” to discover a novel approach to tackling this NLP2SQL challenge.

 
