SQL Translate#
This example will show you how to translate natural language to SQL queries, the original example is on OpenAI Example, the difference is that we will teach you how to cache the response for exact and similar matches with gptcache, it will be very simple, you just need to add an extra step to initialize the cache.
Before running the example, make sure the OPENAI_API_KEY
environment variable is set by executing echo $OPENAI_API_KEY
. If it is not already set, it can be set by using export OPENAI_API_KEY=YOUR_API_KEY
on Unix/Linux/MacOS systems or set OPENAI_API_KEY=YOUR_API_KEY
on Windows systems.
Then we can learn the usage and acceleration effect of gptcache by the following code, which consists of three parts, the original openai way, the exact search and the similar search.
OpenAI API original usage#
import time
import openai
question = "A query to list the names of the departments which employed more than 10 employees in the last 3 months\nSELECT"
def response_text(openai_resp):
return openai_resp["choices"][0]["text"]
start_time = time.time()
response = openai.Completion.create(
model="text-davinci-003",
prompt="### Postgres SQL tables, with their properties:\n#\n# Employee(id, name, department_id)\n# Department(id, name, address)\n# Salary_Payments(id, employee_id, amount, date)\n#\n### " + question,
temperature=0,
max_tokens=150,
top_p=1.0,
frequency_penalty=0.0,
presence_penalty=0.0,
stop=["#", ";"]
)
print("\nSELECT", response_text(response))
print("Time consuming: {:.2f}s".format(time.time() - start_time))
SELECT d.name
FROM Department d
INNER JOIN Employee e ON d.id = e.department_id
INNER JOIN Salary_Payments sp ON e.id = sp.employee_id
WHERE sp.date > NOW() - INTERVAL '3 months'
GROUP BY d.name
HAVING COUNT(*) > 10
Time consuming: 2.78s
OpenAI API + GPTCache, exact match cache#
Initalize the cache to run GPTCache and import openai
form gptcache.adapter
, which will automatically set the map data manager to match the exact cahe, more details refer to build your cache.
import time
def response_text(openai_resp):
return openai_resp["choices"][0]["text"]
print("Cache loading.....")
# To use GPTCache, that's all you need
# -------------------------------------------------
from gptcache import cache
from gptcache.processor.pre import get_prompt
cache.init(pre_embedding_func=get_prompt)
cache.set_openai_key()
# -------------------------------------------------
questions = [
"A query to list the names of the departments which employed more than 10 employees in the last 3 months\nSELECT",
"Query the names of the departments which employed more than 10 employees in the last 3 months\nSELECT",
"List the names of the departments which employed more than 10 employees in the last 3 months\nSELECT",
]
for question in questions:
start_time = time.time()
response = openai.Completion.create(
model="text-davinci-003",
prompt="### Postgres SQL tables, with their properties:\n#\n# Employee(id, name, department_id)\n# Department(id, name, address)\n# Salary_Payments(id, employee_id, amount, date)\n#\n### " + question,
temperature=0,
max_tokens=150,
top_p=1.0,
frequency_penalty=0.0,
presence_penalty=0.0,
stop=["#", ";"]
)
print("\nSELECT", response_text(response))
print("Time consuming: {:.2f}s".format(time.time() - start_time))
Cache loading.....
SELECT d.name
FROM Department d
INNER JOIN Employee e ON d.id = e.department_id
INNER JOIN Salary_Payments sp ON e.id = sp.employee_id
WHERE sp.date > NOW() - INTERVAL '3 months'
GROUP BY d.name
HAVING COUNT(*) > 10
Time consuming: 1.88s
SELECT d.name
FROM Department d
INNER JOIN Employee e ON d.id = e.department_id
INNER JOIN Salary_Payments sp ON e.id = sp.employee_id
WHERE sp.date > NOW() - INTERVAL '3 months'
GROUP BY d.name
HAVING COUNT(*) > 10
Time consuming: 1.46s
SELECT d.name
FROM Department d
INNER JOIN Employee e ON d.id = e.department_id
INNER JOIN Salary_Payments sp ON e.id = sp.employee_id
WHERE sp.date > DATE_SUB(CURDATE(), INTERVAL 3 MONTH)
GROUP BY d.name
HAVING COUNT(*) > 10
Time consuming: 2.08s
OpenAI API + GPTCache, similar search cache#
Set the cache with pre_embedding_func
to preprocess the input data, embedding_func
to generate embedding for the text, and data_manager
to manager the cache data, similarity_evaluation
to evaluate the similarities, more details refer to build your cache.
import time
def response_text(openai_resp):
return openai_resp["choices"][0]["text"]
from gptcache import cache
from gptcache.adapter import openai
from gptcache.embedding import Onnx
from gptcache.processor.pre import get_prompt
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation
print("Cache loading.....")
onnx = Onnx()
data_manager = get_data_manager(CacheBase("sqlite"), VectorBase("faiss", dimension=onnx.dimension))
cache.init(pre_embedding_func=get_prompt,
embedding_func=onnx.to_embeddings,
data_manager=data_manager,
similarity_evaluation=SearchDistanceEvaluation(),
)
cache.set_openai_key()
questions = [
"A query to list the names of the departments which employed more than 10 employees in the last 3 months\nSELECT",
"Query the names of the departments which employed more than 10 employees in the last 3 months\nSELECT",
"List the names of the departments which employed more than 10 employees in the last 3 months\nSELECT",
]
for question in questions:
start_time = time.time()
response = openai.Completion.create(
model="text-davinci-003",
prompt="### Postgres SQL tables, with their properties:\n#\n# Employee(id, name, department_id)\n# Department(id, name, address)\n# Salary_Payments(id, employee_id, amount, date)\n#\n### " + question,
temperature=0,
max_tokens=150,
top_p=1.0,
frequency_penalty=0.0,
presence_penalty=0.0,
stop=["#", ";"]
)
print(question, response_text(response))
print("Time consuming: {:.2f}s".format(time.time() - start_time))
Cache loading.....
A query to list the names of the departments which employed more than 10 employees in the last 3 months
SELECT d.name
FROM Department d
INNER JOIN Employee e ON d.id = e.department_id
INNER JOIN Salary_Payments sp ON e.id = sp.employee_id
WHERE sp.date > NOW() - INTERVAL '3 months'
GROUP BY d.name
HAVING COUNT(*) > 10
Time consuming: 1.89s
Query the names of the departments which employed more than 10 employees in the last 3 months
SELECT d.name
FROM Department d
INNER JOIN Employee e ON d.id = e.department_id
INNER JOIN Salary_Payments sp ON e.id = sp.employee_id
WHERE sp.date > NOW() - INTERVAL '3 months'
GROUP BY d.name
HAVING COUNT(*) > 10
Time consuming: 0.25s
List the names of the departments which employed more than 10 employees in the last 3 months
SELECT d.name
FROM Department d
INNER JOIN Employee e ON d.id = e.department_id
INNER JOIN Salary_Payments sp ON e.id = sp.employee_id
WHERE sp.date > NOW() - INTERVAL '3 months'
GROUP BY d.name
HAVING COUNT(*) > 10
Time consuming: 0.22s
We find that the performance improvement when searching the similar because the three statements of the query are similar, and hitting cache in gptcache, so it will return the cached results directly instead of requesting. And you can then also try running the query again for exact search, which will also speed up.