Editor’s note: This is the second post in a three-part series on how generative AI is changing how we interact with databases. This post and part 3 cover specific approaches and architectures for building an NLP2SQL solution. Make sure to check out part 1 for an introduction to NLP2SQL.
Generative AI is transforming the way we interact with technology, and one such use case is natural language to SQL (NLP2SQL) command generation.
NLP2SQL enables you to retrieve information and generate insights from structured database tables using natural language queries. Large language models leverage information about the database as context to translate a natural language query into a SQL query.
NLP2SQL allows for seamless interaction with databases through natural language, enabling users to retrieve information in human-like conversations. This unlocks opportunities not only for individuals with limited or no SQL expertise to explore datasets like professionals, but also for analysts to generate queries faster and enhance overall productivity. However, generative AI has a major challenge, “hallucination,” which presents a significant obstacle for NLP2SQL use cases that need to generate reliable results.
While there are different approaches to generating SQL from free-form text using large language models, this article focuses on an approach using dynamic RAG-based few-shot examples.
Before we start exploring the NLP2SQL custom approach in more detail, refer to the “Prompt Design - Best Practices” and “Getting Started with Text Embeddings + Vertex AI Vector Search” Colab notebooks to better understand the basic concepts related to prompt design, few-shot examples, and retrieval augmented generation (RAG) pipelines.
“Dynamic RAG-based few-shot learning” combines two techniques, “RAG” and “few-shot examples,” that are commonly used in custom generative AI application development.
In this approach, the RAG pipeline dynamically retrieves the most relevant few-shot examples for the prompt from the knowledge base, using the similarity score computed between the input query and each entry.
Examples whose similarity scores meet the established threshold are integrated into the prompt, serving as references for the large language model (LLM). This enables the model to identify patterns and best practices that are highly relevant to the specific input query, rather than relying on generic examples that attempt to address a broad range of scenarios, which in turn increases the overall accuracy and reliability of the model. In addition, metadata from the knowledge base can capture SQL query templates, which serve as grounding for each response.
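To make the retrieval step concrete, here is a minimal sketch of threshold-based few-shot retrieval. It assumes the knowledge base entries already carry precomputed embeddings (in practice these would come from the embedding model); the function names, the entry structure, and the threshold value are illustrative, not part of the original solution.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve_few_shot_examples(query_embedding, knowledge_base, threshold=0.75, top_k=3):
    """Return up to top_k (question, sql) pairs whose similarity to the
    input query meets the threshold, most similar first."""
    scored = [
        (cosine_similarity(query_embedding, entry["embedding"]), entry)
        for entry in knowledge_base
    ]
    scored = [(score, e) for score, e in scored if score >= threshold]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(e["question"], e["sql"]) for _, e in scored[:top_k]]
```

In production, a vector database such as Vertex AI Vector Search would perform this similarity search at scale instead of the in-memory loop shown here.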
Note: Prompt engineering is not in the scope of this document; it is assumed that context setting and prompt design are already handled based on the domain of the use case.
The dynamic RAG-based few-shot example architecture can be divided into three distinct components: “Creating and Managing the Knowledge Base,” “Similarity Search and SQL Generation,” and “Database Output in Natural Language.” Each component plays a crucial role in the overall functionality of the architecture, and we will explore each section in more detail.
The architectural design of “dynamic RAG-based few-shot examples” above presents the following components that can be harnessed when implementing the solution. The same architecture can also be implemented with other equivalent GCP and OSS services.
A few alternative services can also be explored, depending on the complexity:
NLP2SQL using GCP Foundation models & services (Code link)
GenAI Model used:
# Load the embeddings, text, and code generation models
from vertexai.language_models import TextEmbeddingModel, CodeGenerationModel
from vertexai.generative_models import GenerativeModel

embedd_model = TextEmbeddingModel.from_pretrained("textembedding-gecko-multilingual@001")
gemini_model = GenerativeModel("gemini-1.0-pro")
code_model = CodeGenerationModel.from_pretrained("code-bison")
In this sample workflow, the multilingual embedding model "textembedding-gecko-multilingual" is used to demonstrate how a knowledge base developed in one language can support input queries in any other language the model supports. Alternatively, "textembedding-gecko" can be used when the knowledge base and queries require only English support.
Prompt Template for sample use case:
table = """CREATE TABLE transactions (
Cust_ID VARCHAR(255),
Date DATE,
Time TIME,
Transaction_Description VARCHAR(255),
Value FLOAT,
Fuel VARCHAR(255),
Foreign_Currency VARCHAR(255),
Smartbuy VARCHAR(255),
Points_Accrued FLOAT
)
"""
# Output response is requested in English to avoid localized content generation
# in SQL, since the current DB only has entries in the English locale
response = code_model.predict(
    prefix=f"""Write a SQL Query based on given schema details for the given question and context in English only
CONTEXT:
{table}
EXAMPLES: {similar_query}
INPUT: {text_query_test}""",
    **code_parameters,
)
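The prompt above interpolates the retrieved examples directly. As a minimal illustration of how the retrieved (question, SQL) pairs could be formatted into that `EXAMPLES:` section, here is a hypothetical helper; the function name and formatting are assumptions for this sketch, not part of the sample code.

```python
def build_prompt(table_ddl, examples, user_question):
    """Assemble the NLP2SQL prompt from the schema DDL, retrieved
    few-shot (question, sql) pairs, and the user's question."""
    example_text = "\n".join(
        f"Question: {q}\nSQL: {sql}" for q, sql in examples
    )
    return (
        "Write a SQL Query based on given schema details "
        "for the given question and context in English only\n"
        f"CONTEXT:\n{table_ddl}\n"
        f"EXAMPLES:\n{example_text}\n"
        f"INPUT: {user_question}"
    )
```

Keeping prompt assembly in one function makes it easy to adjust the template per database dialect without touching the retrieval or model-call code.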
The prompt template above is only a demonstration of how prompts can be crafted for a specific database type. Depending on the use case and the level of database support required, the prompt may need further optimization with additional details and constraints.
This section examines some current challenges and limitations, along with several approaches that can mitigate them to achieve higher overall precision in the generated SQL or database code.
Every solution or method comes with its own set of challenges and limitations; some pertain to implementation, while others are technical in nature. Here are two high-level challenges and limitations of this solution:
The current approach needs a knowledge base of frequently asked questions and their corresponding SQL commands, with or without metadata.
The current demo and workflow cover only a database with a single table. In production scenarios, information will mostly be spread across multiple tables or data warehouses.
A knowledge base functions as a repository of grounding for all generated responses and enhances the overall performance and accuracy of the solution. It serves as a foundation and provides best practices for large language models (LLMs) to reference when generating the final SQL command. Due to the complexities of natural language, it is challenging to address all scenarios comprehensively from the outset. Incorporating a human in the loop as part of the knowledge base workflow can mitigate this issue over the long term: a default set of few-shot examples can be used when no valid similarity results are found, and humans can then refine the generated results, which are incorporated back into the knowledge base to improve future outcomes for similar queries.
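The human-in-the-loop growth pattern described above can be sketched as follows. This is a minimal illustration; the class name, the default examples, and the entry structure are hypothetical, and a real system would persist entries in a vector store rather than an in-memory list.

```python
# Generic fallback examples used when no similar entry is found
DEFAULT_EXAMPLES = [
    ("How many rows are in the table?", "SELECT COUNT(*) FROM transactions"),
]

class KnowledgeBase:
    """Grows over time as humans validate generated SQL."""

    def __init__(self):
        self.entries = []  # each entry: {"question", "sql", "embedding"}

    def examples_for_prompt(self, matches):
        """Use retrieved matches when available, otherwise fall back
        to the generic default examples."""
        return matches if matches else DEFAULT_EXAMPLES

    def add_validated(self, question, sql, embedding):
        """Called after human review approves a generated query, so
        future similar queries retrieve a known-good example."""
        self.entries.append(
            {"question": question, "sql": sql, "embedding": embedding}
        )
```

Over time, the ratio of queries answered via retrieved examples (rather than defaults) is a useful signal of knowledge base coverage.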
Working on multiple tables presents a technical limitation because a question can be asked in various ways without providing any context to the LLM model regarding the correlation between the tables and schema information. Additionally, including all the table and schema information in the context window increases the overall input context length and, in some models, exceeds the maximum input token lengths. Several approaches can be explored in such cases, and the following section explores that topic in greater detail.
A few approaches to support NLP2SQL for a multiple table database or a complete data warehouse:
In this approach, a consolidated database is created by fetching all the relevant information for the intended user profile from various tables or data repositories, using data processing and transformation techniques.
This approach helps:
In this approach, we utilize multiple tiers of RAG pipelines: first to identify relevant few-shot examples aligned with the input query, and then to retrieve the pertinent tables or databases compatible with both the few-shot examples and the input query.
This approach helps in cases where further data processing and transformation to create a consolidated database based on the user profile is not possible.
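As a minimal sketch of the second retrieval tier, the function below selects only the table schemas referenced by the retrieved few-shot SQL or mentioned in the question, keeping the prompt under the model's input token limit. The function name and the keyword-matching heuristic are assumptions for illustration; a production system would more likely run a second embedding-based similarity search over table descriptions.

```python
def select_relevant_tables(user_question, table_schemas, few_shot_examples):
    """Tier 2 of the retrieval pipeline: keep only the schemas of tables
    that appear in the retrieved few-shot SQL or whose name occurs in
    the user's question, so the prompt context stays small."""
    question = user_question.lower()
    referenced = set()
    # Tables referenced by the SQL of the retrieved few-shot examples
    for _, sql in few_shot_examples:
        for name in table_schemas:
            if name.lower() in sql.lower():
                referenced.add(name)
    # Tables mentioned by name directly in the question
    for name in table_schemas:
        if name.lower() in question:
            referenced.add(name)
    return {name: table_schemas[name] for name in referenced}
```

Only the selected schemas are then interpolated into the prompt's `CONTEXT:` section, instead of the full data warehouse DDL.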
In certain applications, a subset of the application data can contain personally identifiable information (PII). PII might need to be masked in specific applications and use cases, based on regional compliance guidelines and requirements related to model or data residency.
In such cases, the “Sensitive Data Protection” APIs (previously known as “Cloud Data Loss Prevention,” or Cloud DLP) can be employed to mask PII effectively before sending data to a model for further processing. Based on the use case, the processed data can then be de-masked to re-identify the masked content through secure key post-processing.
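To illustrate the mask-then-re-identify flow conceptually, here is a simplified, self-contained sketch using regex-based detection and a reversible token map. This is not the Sensitive Data Protection API (which performs detection and de-identification server-side with far more infoType coverage); the patterns, token format, and function names are assumptions for this example only.

```python
import re
import uuid

# Toy detectors standing in for Sensitive Data Protection infoTypes
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def mask_pii(text):
    """Replace detected PII with opaque tokens; return the masked text
    plus the key map needed to re-identify values after processing."""
    key_map = {}
    for label, pattern in PII_PATTERNS.items():
        for match in set(pattern.findall(text)):
            token = f"[{label}_{uuid.uuid4().hex[:8]}]"
            key_map[token] = match
            text = text.replace(match, token)
    return text, key_map

def unmask_pii(text, key_map):
    """Restore the original values from the secure key map."""
    for token, original in key_map.items():
        text = text.replace(token, original)
    return text
```

The key map plays the role of the secure key used for re-identification: the model only ever sees the opaque tokens, and the original values are restored in post-processing.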
The above solution can be packaged and deployed based on the integration and inference requirements of the application under development. Several options are proposed in the architecture outlined above:
Go beyond traditional approaches and explore the potential of multi-agent architectures! Dive into the latest blog post from one of our colleagues, “Unlocking Insights — Natural Language QnA on Structured Data through a Multi-agent Architecture,” to discover a novel approach to tackling this NLP2SQL challenge.
Author:
Other contributors: