SQL Translate#

This example will show you how to translate natural language to SQL queries, the original example is on OpenAI Example, the difference is that we will teach you how to cache the response for exact and similar matches with gptcache, it will be very simple, you just need to add an extra step to initialize the cache.

Before running the example, make sure the OPENAI_API_KEY environment variable is set by executing echo $OPENAI_API_KEY. If it is not already set, it can be set by using export OPENAI_API_KEY=YOUR_API_KEY on Unix/Linux/MacOS systems or set OPENAI_API_KEY=YOUR_API_KEY on Windows systems.

Then we can learn the usage and acceleration effect of gptcache by the following code, which consists of three parts, the original openai way, the exact search and the similar search.

OpenAI API original usage#

import time
import openai

question = "A query to list the names of the departments which employed more than 10 employees in the last 3 months\nSELECT"
def response_text(openai_resp):
    return openai_resp["choices"][0]["text"]

start_time = time.time()
response = openai.Completion.create(
  model="text-davinci-003",
  prompt="### Postgres SQL tables, with their properties:\n#\n# Employee(id, name, department_id)\n# Department(id, name, address)\n# Salary_Payments(id, employee_id, amount, date)\n#\n### " + question,
  temperature=0,
  max_tokens=150,
  top_p=1.0,
  frequency_penalty=0.0,
  presence_penalty=0.0,
  stop=["#", ";"]
)
print("\nSELECT", response_text(response))
print("Time consuming: {:.2f}s".format(time.time() - start_time))
SELECT  d.name 
FROM Department d 
INNER JOIN Employee e ON d.id = e.department_id 
INNER JOIN Salary_Payments sp ON e.id = sp.employee_id 
WHERE sp.date > NOW() - INTERVAL '3 months' 
GROUP BY d.name 
HAVING COUNT(*) > 10
Time consuming: 2.78s

OpenAI API + GPTCache, exact match cache#

Initalize the cache to run GPTCache and import openai form gptcache.adapter, which will automatically set the map data manager to match the exact cahe, more details refer to build your cache.

import time

def response_text(openai_resp):
    return openai_resp["choices"][0]["text"]

print("Cache loading.....")

# To use GPTCache, that's all you need
# -------------------------------------------------
from gptcache import cache
from gptcache.processor.pre import get_prompt

cache.init(pre_embedding_func=get_prompt)
cache.set_openai_key()
# -------------------------------------------------

questions = [
    "A query to list the names of the departments which employed more than 10 employees in the last 3 months\nSELECT",
    "Query the names of the departments which employed more than 10 employees in the last 3 months\nSELECT",
    "List the names of the departments which employed more than 10 employees in the last 3 months\nSELECT",
]


for question in questions:
    start_time = time.time()
    response = openai.Completion.create(
      model="text-davinci-003",
      prompt="### Postgres SQL tables, with their properties:\n#\n# Employee(id, name, department_id)\n# Department(id, name, address)\n# Salary_Payments(id, employee_id, amount, date)\n#\n### " + question,
      temperature=0,
      max_tokens=150,
      top_p=1.0,
      frequency_penalty=0.0,
      presence_penalty=0.0,
      stop=["#", ";"]
    )
    print("\nSELECT", response_text(response))
    print("Time consuming: {:.2f}s".format(time.time() - start_time))
Cache loading.....

SELECT  d.name 
FROM Department d 
INNER JOIN Employee e ON d.id = e.department_id 
INNER JOIN Salary_Payments sp ON e.id = sp.employee_id 
WHERE sp.date > NOW() - INTERVAL '3 months' 
GROUP BY d.name 
HAVING COUNT(*) > 10
Time consuming: 1.88s

SELECT  d.name
FROM Department d
INNER JOIN Employee e ON d.id = e.department_id
INNER JOIN Salary_Payments sp ON e.id = sp.employee_id
WHERE sp.date > NOW() - INTERVAL '3 months'
GROUP BY d.name
HAVING COUNT(*) > 10
Time consuming: 1.46s

SELECT  d.name
FROM Department d
INNER JOIN Employee e ON d.id = e.department_id
INNER JOIN Salary_Payments sp ON e.id = sp.employee_id
WHERE sp.date > DATE_SUB(CURDATE(), INTERVAL 3 MONTH)
GROUP BY d.name
HAVING COUNT(*) > 10
Time consuming: 2.08s

OpenAI API + GPTCache, similar search cache#

Set the cache with pre_embedding_func to preprocess the input data, embedding_func to generate embedding for the text, and data_manager to manager the cache data, similarity_evaluation to evaluate the similarities, more details refer to build your cache.

import time


def response_text(openai_resp):
    return openai_resp["choices"][0]["text"]

from gptcache import cache
from gptcache.adapter import openai
from gptcache.embedding import Onnx
from gptcache.processor.pre import get_prompt
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

print("Cache loading.....")

onnx = Onnx()
data_manager = get_data_manager(CacheBase("sqlite"), VectorBase("faiss", dimension=onnx.dimension))
cache.init(pre_embedding_func=get_prompt,
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
    )
cache.set_openai_key()

questions = [
    "A query to list the names of the departments which employed more than 10 employees in the last 3 months\nSELECT",
    "Query the names of the departments which employed more than 10 employees in the last 3 months\nSELECT",
    "List the names of the departments which employed more than 10 employees in the last 3 months\nSELECT",
]


for question in questions:
    start_time = time.time()
    response = openai.Completion.create(
      model="text-davinci-003",
      prompt="### Postgres SQL tables, with their properties:\n#\n# Employee(id, name, department_id)\n# Department(id, name, address)\n# Salary_Payments(id, employee_id, amount, date)\n#\n### " + question,
      temperature=0,
      max_tokens=150,
      top_p=1.0,
      frequency_penalty=0.0,
      presence_penalty=0.0,
      stop=["#", ";"]
    )
    print(question, response_text(response))
    print("Time consuming: {:.2f}s".format(time.time() - start_time))
Cache loading.....
A query to list the names of the departments which employed more than 10 employees in the last 3 months
SELECT  d.name 
FROM Department d 
INNER JOIN Employee e ON d.id = e.department_id 
INNER JOIN Salary_Payments sp ON e.id = sp.employee_id 
WHERE sp.date > NOW() - INTERVAL '3 months' 
GROUP BY d.name 
HAVING COUNT(*) > 10
Time consuming: 1.89s
Query the names of the departments which employed more than 10 employees in the last 3 months
SELECT  d.name 
FROM Department d 
INNER JOIN Employee e ON d.id = e.department_id 
INNER JOIN Salary_Payments sp ON e.id = sp.employee_id 
WHERE sp.date > NOW() - INTERVAL '3 months' 
GROUP BY d.name 
HAVING COUNT(*) > 10
Time consuming: 0.25s
List the names of the departments which employed more than 10 employees in the last 3 months
SELECT  d.name 
FROM Department d 
INNER JOIN Employee e ON d.id = e.department_id 
INNER JOIN Salary_Payments sp ON e.id = sp.employee_id 
WHERE sp.date > NOW() - INTERVAL '3 months' 
GROUP BY d.name 
HAVING COUNT(*) > 10
Time consuming: 0.22s

We find that the performance improvement when searching the similar because the three statements of the query are similar, and hitting cache in gptcache, so it will return the cached results directly instead of requesting. And you can then also try running the query again for exact search, which will also speed up.