How to better configure your cache#

Last update time: 2023.6.26

Latest version: v0.1.32

Before reading the following content, you need to understand the basic composition of GPTCache; make sure you have finished reading:

Introduction to GPTCache initialization#

GPTCache core components include:

  • pre-process func

  • embedding

  • data manager

    • cache store

    • vector store

    • object store (optional, for multi-modal caches)

  • similarity evaluation

  • post-process func

The above core components need to be set when a similar cache is initialized, and most of them have default values. In addition to these, there are some extra parameters, including:

  • config, the cache configuration, such as the similarity threshold and the parameter values of some specific preprocessing functions;

  • next_cache, can be used to set up a multi-level cache.

    For example, suppose there are two GPTCaches, L1 and L2, where L1 sets L2 as its next cache during initialization.

    When a user request arrives, if the L1 cache misses, the lookup falls through to the L2 cache.

    If L2 also misses, the LLM is called and the result is stored in both the L1 and L2 caches.

    If L2 hits, the cached result is also stored back in the L1 cache.

The above is the basic description of all initialization parameters.
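
For example, a minimal sketch of such a two-level setup using the Cache class (the variable names and data paths below are only illustrative):

from gptcache import Cache
from gptcache.manager import get_data_manager

# build two exact-match caches; L1 falls back to L2 on a miss
l2_cache = Cache()
l2_cache.init(data_manager=get_data_manager(data_path="l2_data_map.txt"))

l1_cache = Cache()
l1_cache.init(
    data_manager=get_data_manager(data_path="l1_data_map.txt"),
    next_cache=l2_cache,
)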

In GPTCache lib, there is a global cache object. If the llm request does not set the cache object, this global object is used.

There are currently three methods of initializing the cache, namely:

  1. The init method of the Cache class defaults to exact key matching, which is a simple map cache, that is:

def init(
    self,
    cache_enable_func=cache_all,
    pre_func=last_content,
    embedding_func=string_embedding,
    data_manager: DataManager = get_data_manager(),
    similarity_evaluation=ExactMatchEvaluation(),
    post_func=temperature_softmax,
    config=Config(),
    next_cache=None,
):
    pass

  2. The init_similar_cache method in the api package defaults to similarity matching with onnx+sqlite+faiss

def init_similar_cache(
    data_dir: str = "api_cache",
    cache_obj: Optional[Cache] = None,
    pre_func: Callable = get_prompt,
    embedding: Optional[BaseEmbedding] = None,
    data_manager: Optional[DataManager] = None,
    evaluation: Optional[SimilarityEvaluation] = None,
    post_func: Callable = temperature_softmax,
    config: Config = Config(),
):
    pass

  3. The init_similar_cache_from_config method in the api package initializes the cache through a yaml file, and the default is similarity matching with onnx+sqlite+faiss. More details: GPTCache server configuration

def init_similar_cache_from_config(config_dir: str, cache_obj: Optional[Cache] = None):
    pass
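
As a quick example of the second method (a sketch assuming the onnx, sqlite and faiss dependencies are installed; put and get are the helper functions from the same api package):

from gptcache.adapter.api import init_similar_cache, put, get

# initialize the global cache with the onnx+sqlite+faiss defaults
init_similar_cache(data_dir="api_cache")

put("what is github?", "GitHub is an online software development platform.")
# a semantically similar question should now hit the cache
print(get("can you explain what GitHub is?"))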

Pre-Process function#

The preprocessing function is mainly used to obtain the user's question from the llm request parameters, assemble that information into a string, and return it. The return value is the input of the embedding model.

It is worth noting that different llms need different preprocessing functions, because each llm has a different request parameter list, and the parameter names that carry the user's question also differ.

Of course, you can also choose different pre-processing logic based on other llm request parameters.

A preprocessing function receives two parameters, and it can return either one or two values.

def foo_pre_process_func(data: Dict[str, Any], **params: Dict[str, Any]) -> Any:
    pass

Here, data holds the user's request parameters, and params holds some additional parameters, such as the cache config, which can be obtained through params.get("cache_config", None).

If there is no special requirement, the function can return a single value, which is used both as the input of the embedding model and as the key of the current request in the cache.

Two values can also be returned: the first is used as the key of the current request in the cache, and the second is used as the input of the embedding model. This is currently mainly used to handle long openai chat conversations. In a long dialogue, the first return value is the user's original long dialogue, built by simple string concatenation, while the second return value is the key information of the dialogue extracted by some model, which shortens the embedding input.
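
A minimal sketch of a custom preprocessing function for an openai-style chat request (the function name is hypothetical; it mimics the built-in last_content):

from typing import Any, Dict


def my_last_content(data: Dict[str, Any], **params: Dict[str, Any]) -> Any:
    # use the content of the last message as both the cache key and the embedding input
    return data.get("messages")[-1]["content"]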

Currently available preprocessing functions:

all source code reference: processor/pre

all preprocessing api reference: gptcache.processor.pre

If you are confused about the role of the following preprocessing functions, you can check the api reference, which contains simple function examples.

openai chat completion#

  • last_content: get the last content of the message list.

  • last_content_without_prompt: get the last content of the message list without prompts content. It needs to be used with the prompts parameter in Config. If it is not set, it will have the same effect as last_content.

  • last_content_without_template: get the template values of the last content in the message list, excluding the template content. The functionality is similar to the previous one, but it can handle more complex templates. last_content_without_prompt only performs a simple string check, so the user's prompt must be contiguous, while last_content_without_template supports string templates; please refer to the api reference for specific usage.

  • all_content: simply concat the contents of the messages list in the user request.

  • concat_all_queries: concat the content and role info of the message list.

  • context_process: deal with long openai dialogues by compressing the dialogue through some method and extracting its core content as the cache key.

langchain llm#

  • get_prompt: get the prompt of the llm request params.

langchain chat llm#

  • get_messages_last_content: get the last content of the llm request message object array.

openai image#

  • get_prompt: get the prompt of the llm request params.

openai audio#

  • get_file_name: get the file name of the llm request params

  • get_file_bytes: get the file bytes of the llm request params

openai moderation#

  • get_openai_moderation_input: get the input param of the openai moderation request params

llama#

  • get_prompt: get the prompt of the llm request params.

replicate (image -> text, image and text -> text)#

  • get_input_str: get the image and question str of the llm request params

  • get_input_image_file_name: get the image file name of the llm request params

stable diffusion#

  • get_prompt: get the prompt of the llm request params.

minigpt4#

  • get_image_question: get the image and question str of the llm request params

  • get_image: get the image of the llm request params

dolly#

  • get_inputs: get the inputs of the llm request params

NOTE: Different llms require different preprocessing functions to be selected when the cache is initialized. If none of the built-in ones fit, you can customize your own.
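
For example, a sketch of switching the preprocessing function when initializing a similar cache for openai chat requests (otherwise keeping the init_similar_cache defaults):

from gptcache.adapter.api import init_similar_cache
from gptcache.processor.pre import last_content

# replace the default get_prompt with last_content for openai chat requests
init_similar_cache(pre_func=last_content)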

Embedding#

The embedding model converts the input into a multidimensional numeric vector; the available models are grouped by input type.

The choice of embedding model has a large impact on how accurate the cache is. A few points worth noting: the languages supported by the model and the number of tokens it supports. In addition, generally speaking, under the same compute resources, large models are more accurate but time-consuming, while small models run faster but are less accurate.

all embedding api reference: embedding api

text#

audio#

image#

NOTE: you need to select the appropriate embedding model according to the data type, and also check the languages supported by the embedding model.
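
For example, a small sketch using the default onnx text embedding; its dimension is needed later when creating the vector store (see the Data Manager section):

from gptcache.embedding import Onnx

onnx = Onnx()
print(onnx.dimension)  # the vector dimension, required when creating a vector store
embedding_vector = onnx.to_embeddings("what is GPTCache?")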

Data Manager#

For a text similarity cache, only a cache store and a vector store are needed. For a multi-modal cache, an object store is additionally required. The choice of storage is not related to the llm type, but note that the vector dimension needs to be set when using a vector store.

cache store#

  • sqlite

  • duckdb

  • mysql

  • mariadb

  • sqlserver

  • oracle

  • postgresql

vector store#

  • milvus

  • faiss

  • chromadb

  • hnswlib

  • pgvector

  • docarray

  • usearch

  • redis

object store#

  • local

  • s3

how to get a data manager#

  • Use manager_factory to get it by the store names.

    scalar_params holds the parameters required to build the cache store;

    vector_params holds the parameters required to build the vector store.

from gptcache.manager import manager_factory

data_manager = manager_factory("sqlite,faiss", data_dir="./workspace", scalar_params={}, vector_params={"dimension": 128})

  • Combine each store object through the get_data_manager method

from gptcache.manager import get_data_manager, CacheBase, VectorBase

data_manager = get_data_manager(CacheBase('sqlite'), VectorBase('faiss', dimension=128))

Note that each store has more initialization parameters; you can check the store's constructor in the store api reference.
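
For example, a sketch that ties the embedding dimension to the vector store, following the note about the vector dimension (reusing the onnx embedding from the Embedding section):

from gptcache.embedding import Onnx
from gptcache.manager import manager_factory

onnx = Onnx()
# the vector store dimension must match the embedding dimension
data_manager = manager_factory(
    "sqlite,faiss",
    data_dir="./workspace",
    vector_params={"dimension": onnx.dimension},
)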

Similarity Evaluation#

If you want the cache to perform well, an appropriate similarity evaluation is just as critical as the embedding model and the vector store.

The similarity evaluation mainly scores the recalled cache data against the current user's llm request and returns a float value. The simplest approach is to use the embedding distance; there are of course other methods, such as using a model to judge the similarity of two questions.

The following similarity evaluation components already exist:

  1. SearchDistanceEvaluation, vector search distance, simple, fast, but not very accurate

  2. OnnxModelEvaluation, use a model to compare the degree of correlation between the two questions. The small model only supports 512 tokens; it is more accurate than the search distance

  3. NumpyNormEvaluation, calculate the distance between the embedding vectors of the llm request and the cached data, which is fast and simple, with accuracy almost the same as the search distance

  4. KReciprocalEvaluation, use the k-reciprocal algorithm to calculate the similarity for reranking, recalling multiple cache entries for comparison. It needs multiple recalls, so it is more time-consuming but relatively more accurate. For more information, refer to the api reference

  5. CohereRerankEvaluation, use the cohere rerank api server, more accurate, at a cost, more details: cohere rerank

  6. SequenceMatchEvaluation, sequence matching, suitable for multi-round dialogues; it evaluates each round of the dialogue separately and then combines the weighted scores into a final score

  7. TimeEvaluation, evaluate by cache creation time to avoid using stale cache entries

  8. SbertCrossencoderEvaluation, use the sbert model for rerank evaluation, which is currently the best similarity evaluation found

For more detailed usage, refer to the api doc.

Of course, if you want a better similarity evaluation, you need to customize it for your scenario, for example by combining the existing similarity evaluations. If you want a better caching effect in long conversations, you may need to combine SequenceMatchEvaluation and TimeEvaluation; of course there may be a better way.
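
For example, a sketch of swapping in a different similarity evaluation together with a similarity threshold (the threshold value here is only illustrative):

from gptcache import Config
from gptcache.adapter.api import init_similar_cache
from gptcache.similarity_evaluation import OnnxModelEvaluation

# use the onnx model evaluation instead of the default,
# and only accept cached answers whose score exceeds the threshold
init_similar_cache(
    evaluation=OnnxModelEvaluation(),
    config=Config(similarity_threshold=0.75),
)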

Post-Process function#

Post-processing mainly determines the final answer to the user's question from all the cached data that meets the similarity threshold. One entry can be selected from the cached data list according to a certain strategy, or a model can be used to fine-tune the answers, so that similar questions can get different answers.

Currently Existing Postprocessing Functions:

  1. temperature_softmax, select according to the softmax strategy, which can ensure that the obtained cached answer has a certain randomness

  2. first, get the most similar cached answer

  3. random, randomly fetch a similar cached answer
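
For example, a small sketch of calling a post-processing function directly (the answers and scores below are made up, assuming temperature_softmax takes the candidate answers, their scores and a temperature):

from gptcache.processor.post import temperature_softmax

# hypothetical cached answers and their similarity scores
answers = ["cached answer A", "cached answer B"]
scores = [0.9, 0.8]

# a higher temperature gives more randomness in which answer is returned
chosen = temperature_softmax(answers, scores, temperature=0.3)
print(chosen)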