Similarity Evaluation#

kreciprocal#

gptcache.similarity_evaluation.kreciprocal.euclidean_distance_calculate(vec_l: numpy.array, vec_r: numpy.array)[source]#
class gptcache.similarity_evaluation.kreciprocal.KReciprocalEvaluation(vectordb: gptcache.manager.vector_data.base.VectorBase, top_k: int = 3, max_distance: float = 4.0, positive: bool = False)[source]#

Bases: gptcache.similarity_evaluation.distance.SearchDistanceEvaluation

Using K-reciprocal re-ranking to evaluate sentence pair similarity.

This evaluator borrows the popular K-reciprocal re-ranking method for similarity evaluation. A K-reciprocal relation refers to the mutual nearest-neighbor relationship between two embeddings: each embedding is among the K nearest neighbors of the other under a given distance metric. This evaluator checks whether the query embedding is among the candidate cache embedding's top_k nearest neighbors; if it is not, the pair is considered dissimilar. Otherwise, the distance is kept and passed on to a SearchDistanceEvaluation check. max_distance bounds this distance to the interval [0, max_distance]. positive indicates that the distance is directly proportional to the similarity of the two entities; if positive is False, the distance is subtracted from max_distance to obtain the final score.
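A minimal sketch of the decision rule described above (a hypothetical helper for illustration; query_in_top_k would come from the vector database lookup, which is not shown):

def k_reciprocal_score_sketch(distance, query_in_top_k, max_distance=4.0, positive=False):
    # If the query is not among the candidate's top_k neighbors,
    # treat the pair as maximally distant, i.e. dissimilar.
    if not query_in_top_k:
        distance = max_distance
    # Bound the distance to [0, max_distance].
    distance = max(0.0, min(distance, max_distance))
    # With positive=False, a smaller distance yields a higher score.
    return distance if positive else max_distance - distance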

Parameters
  • vectordb (gptcache.manager.vector_data.base.VectorBase) – the vector database used to retrieve embeddings for the k-reciprocal test.

  • top_k (int) – for each retrieved candidate, this method tests whether the query is among the candidate's top-k nearest neighbors.

  • max_distance (float) – the upper bound of the distance.

  • positive (bool) – True if a larger distance indicates greater similarity between two entities; otherwise False.

Example

from gptcache.similarity_evaluation import KReciprocalEvaluation
from gptcache.manager.vector_data.faiss import Faiss
from gptcache.manager.vector_data.base import VectorData
import numpy as np

# a local Faiss index: index file path, dimension 3, top_k 10
faiss = Faiss('./none', 3, 10)
cached_data = np.array([0.57735027, 0.57735027, 0.57735027])
faiss.mul_add([VectorData(id=0, data=cached_data)])
evaluation = KReciprocalEvaluation(vectordb=faiss, top_k=2, max_distance=4.0, positive=False)
query = np.array([0.61396013, 0.55814557, 0.55814557])
score = evaluation.evaluation(
    {
        'question': 'question1',
        'embedding': query
    },
    {
        'question': 'question2',
        'embedding': cached_data
    }
)
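With positive=False and max_distance=4.0 as configured above, the returned score lies in [0, 4], and a higher score indicates a more similar pair.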
static normalize(vec: numpy.ndarray)[source]#

Normalize the input vector.

Parameters

vec (numpy.array) – the numpy vector to normalize.

Returns

normalized vector.

evaluation(src_dict: Dict[str, Any], cache_dict: Dict[str, Any], **_) float[source]#

Evaluate the similarity score of the pair.

Parameters
  • src_dict (Dict) – the query dictionary to evaluate with cache.

  • cache_dict (Dict) – the cache dictionary.

Returns

evaluation score.

sequence_match#

gptcache.similarity_evaluation.sequence_match.euclidean_distance_calculate(vec_l: numpy.array, vec_r: numpy.array)[source]#
gptcache.similarity_evaluation.sequence_match.reweight(weights, length)[source]#
class gptcache.similarity_evaluation.sequence_match.SequenceMatchEvaluation(weights: List[float], embedding_extractor: str, embedding_config=None)[source]#

Bases: gptcache.similarity_evaluation.similarity_evaluation.SimilarityEvaluation

Evaluate sentence pair similarity using SequenceMatchEvaluation.

Parameters
  • weights (List[float]) – List of weights corresponding to each sequence element for calculating the weighted distance.

  • embedding_extractor (str) – name of the embedding extractor used to obtain embeddings from the text content, e.g. 'onnx'.

Example

from gptcache.similarity_evaluation import SequenceMatchEvaluation
from gptcache.embedding import Onnx

weights = [0.5, 0.3, 0.2]
evaluation = SequenceMatchEvaluation(weights, 'onnx')

query = {
    'question': 'USER: "foo2" USER: "foo4"',
}

cache = {
    'question': 'USER: "foo6" USER: "foo8"',
}

score = evaluation.evaluation(query, cache)
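Conceptually, each dialogue turn is embedded separately and the per-turn distances are combined using the configured weights (with reweight() presumably renormalizing them when the number of turns differs). A minimal illustration of the weighting step, with hypothetical values rather than the library internals:

import numpy as np

# one embedding distance per dialogue turn (hypothetical values)
turn_distances = np.array([0.8, 0.3])
# weights renormalized for two turns
weights = np.array([0.6, 0.4])
weighted_distance = float(np.dot(weights, turn_distances))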
static normalize(vec: numpy.ndarray)[source]#

Normalize the input vector.

Parameters

vec (numpy.array) – the numpy vector to normalize.

Returns

normalized vector.

evaluation(src_dict: Dict[str, Any], cache_dict: Dict[str, Any], **_) float[source]#

Evaluate the similarity score of the pair.

Parameters
  • src_dict (Dict) – the query dictionary to evaluate with cache.

  • cache_dict (Dict) – the cache dictionary.

Returns

evaluation score.

range() Tuple[float, float][source]#

Range of similarity score.

Returns

minimum and maximum of similarity score.

sbert_crossencoder#

class gptcache.similarity_evaluation.sbert_crossencoder.SbertCrossencoderEvaluation(model: str = 'cross-encoder/quora-distilroberta-base')[source]#

Bases: gptcache.similarity_evaluation.similarity_evaluation.SimilarityEvaluation

Using SBERT cross-encoders to evaluate sentence pair similarity.

This evaluator uses a cross-encoder model to evaluate the similarity of two sentences.

Parameters

model (str) – model name of SbertCrossencoderEvaluation, defaults to 'cross-encoder/quora-distilroberta-base'. For more models, refer to https://www.sbert.net/docs/pretrained_cross-encoders.html#quora-duplicate-questions.

Example

from gptcache.similarity_evaluation import SbertCrossencoderEvaluation

evaluation = SbertCrossencoderEvaluation()
score = evaluation.evaluation(
    {
        'question': 'What is the color of sky?'
    },
    {
        'question': 'hello'
    }
)
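The cross-encoder scores the question pair jointly rather than comparing independent embeddings, so an unrelated pair like the one above should score near the minimum of range().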
evaluation(src_dict: Dict[str, Any], cache_dict: Dict[str, Any], **_) float[source]#

Evaluate the similarity score of the pair.

Parameters
  • src_dict (Dict) – the query dictionary to evaluate with cache.

  • cache_dict (Dict) – the cache dictionary.

Returns

evaluation score.

range() Tuple[float, float][source]#

Range of similarity score.

Returns

minimum and maximum of similarity score.

distance#

class gptcache.similarity_evaluation.distance.SearchDistanceEvaluation(max_distance=4.0, positive=False)[source]#

Bases: gptcache.similarity_evaluation.similarity_evaluation.SimilarityEvaluation

Using search distance to evaluate sentence pair similarity.

This evaluator compares two embeddings according to the distance computed in the embedding retrieval stage. In the retrieval stage, search_result is the distance used for approximate nearest neighbor search and has already been put into cache_dict. max_distance bounds this distance to the interval [0, max_distance]. positive indicates that the distance is directly proportional to the similarity of the two entities; if positive is False, the distance is subtracted from max_distance to obtain the final score.

Parameters
  • max_distance (float) – the upper bound of the distance.

  • positive (bool) – True if a larger distance indicates greater similarity between two entities; otherwise False.

Example

from gptcache.similarity_evaluation import SearchDistanceEvaluation

evaluation = SearchDistanceEvaluation()
score = evaluation.evaluation(
    {},
    {
        "search_result": (1, None)
    }
)
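Here search_result carries a distance of 1; with the defaults (max_distance=4.0, positive=False), the distance is subtracted from max_distance, so the returned score is 3.0.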
evaluation(src_dict: Dict[str, Any], cache_dict: Dict[str, Any], **_) float[source]#

Evaluate the similarity score of the pair.

Parameters
  • src_dict (Dict) – the query dictionary to evaluate with cache.

  • cache_dict (Dict) – the cache dictionary.

Returns

evaluation score.

range() Tuple[float, float][source]#

Range of similarity score.

Returns

minimum and maximum of similarity score.

exact_match#

class gptcache.similarity_evaluation.exact_match.ExactMatchEvaluation[source]#

Bases: gptcache.similarity_evaluation.similarity_evaluation.SimilarityEvaluation

Using exact matching to evaluate sentence pair similarity.

This evaluator directly compares the text of two questions. If the questions match character for character, it returns 1; otherwise it returns 0.

Example

from gptcache.similarity_evaluation import ExactMatchEvaluation

evaluation = ExactMatchEvaluation()
score = evaluation.evaluation(
    {
        "question": "What is the color of sky?"
    },
    {
        "question": "What is the color of sky?"
    }
)
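Because the two questions match character for character, the returned score is 1.0; any difference would yield 0.0.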
evaluation(src_dict: Dict[str, Any], cache_dict: Dict[str, Any], **_) float[source]#

Evaluate the similarity score of the pair.

Parameters
  • src_dict (Dict) – the query dictionary to evaluate with cache_dict.

  • cache_dict (Dict) – the cache dictionary.

Returns

evaluation score.

range() Tuple[float, float][source]#

Range of similarity score.

Returns

minimum and maximum of similarity score.

cohere_rerank#

class gptcache.similarity_evaluation.cohere_rerank.CohereRerank(model: str = 'rerank-english-v2.0', api_key: Optional[str] = None)[source]#

Bases: gptcache.similarity_evaluation.similarity_evaluation.SimilarityEvaluation

Use the Cohere Rerank API to evaluate relevance of question and answer.

Reference: https://docs.cohere.com/reference/rerank-1

Parameters
  • model (str) – model name, defaults to 'rerank-english-v2.0'; a multilingual option, 'rerank-multilingual-v2.0', is also available.

  • api_key (str) – cohere api key, defaults to None.

Example

from gptcache.similarity_evaluation.cohere_rerank import CohereRerank

evaluation = CohereRerank()
score = evaluation.evaluation(
    {
        'question': 'What is the color of sky?'
    },
    {
        'answer': 'the color of sky is blue'
    }
)
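The Cohere Rerank endpoint returns a relevance score for each query/answer pair, typically in [0, 1], with higher scores meaning the cached answer is more relevant to the query; the exact bounds are reported by range().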
evaluation(src_dict: Dict[str, Any], cache_dict: Dict[str, Any], **kwargs) float[source]#

Evaluate the similarity score of the user request and cache request pair.

Parameters
  • src_dict (Dict) – the user request params.

  • cache_dict (Dict) – the cache request params.

range() Tuple[float, float][source]#

Range of similarity score.

Returns

the range of the similarity score, i.e. its minimum and maximum values.

Return type

Tuple[float, float]

np#

class gptcache.similarity_evaluation.np.NumpyNormEvaluation(enable_normal: bool = True, question_embedding_function=None)[source]#

Bases: gptcache.similarity_evaluation.similarity_evaluation.SimilarityEvaluation

Using the Numpy norm to evaluate sentence pair similarity.

This evaluator calculates the L2 distance between two embeddings for the similarity check. If enable_normal is True, both the query embedding and the cache embedding are normalized. Note that the normalized distance is subtracted from the maximum distance, so the final score is positively correlated with similarity.
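A minimal sketch of the described computation (a hypothetical helper, not the library internals; note that for unit-normalized vectors the L2 distance is bounded by 2.0):

import numpy as np

def l2_score_sketch(src_emb: np.ndarray, cache_emb: np.ndarray, max_distance: float = 2.0) -> float:
    # normalize both embeddings to unit length
    src_emb = src_emb / np.linalg.norm(src_emb)
    cache_emb = cache_emb / np.linalg.norm(cache_emb)
    # subtract the L2 distance from the maximum so that a higher
    # score means a more similar pair
    return max_distance - float(np.linalg.norm(src_emb - cache_emb))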

Parameters
  • enable_normal (bool) – whether to normalize the embedding, defaults to True.

  • question_embedding_function (function) – optional, a function to generate the question embedding.

Example

from gptcache.similarity_evaluation import NumpyNormEvaluation
import numpy as np

evaluation = NumpyNormEvaluation()
score = evaluation.evaluation(
    {
        'question': 'What is color of sky?',
        'embedding': np.array([-0.5, -0.5])
    },
    {
        'question': 'What is the color of sky?',
        'embedding': np.array([-0.49, -0.51])
    }
)
static normalize(vec: numpy.ndarray)[source]#

Normalize the input vector.

Parameters

vec (numpy.array) – the numpy vector to normalize.

Returns

normalized vector.

evaluation(src_dict: Dict[str, Any], cache_dict: Dict[str, Any], **_) float[source]#

Evaluate the similarity score of the pair.

Parameters
  • src_dict (Dict) – the query dictionary to evaluate with cache.

  • cache_dict (Dict) – the cache dictionary.

Returns

evaluation score.

range() Tuple[float, float][source]#

Range of similarity score.

Returns

minimum and maximum of similarity score.

onnx#

gptcache.similarity_evaluation.onnx.pad_sequence(input_ids_list: List[numpy.ndarray], padding_value: int = 0)[source]#
class gptcache.similarity_evaluation.onnx.OnnxModelEvaluation(model: str = 'GPTCache/albert-duplicate-onnx')[source]#

Bases: gptcache.similarity_evaluation.similarity_evaluation.SimilarityEvaluation

Using an ONNX model to evaluate sentence pair similarity.

This evaluator uses an ONNX model to evaluate the similarity of two sentences.

Parameters

model (str) – model name of OnnxModelEvaluation. Default is 'GPTCache/albert-duplicate-onnx'.

Example

from gptcache.similarity_evaluation import OnnxModelEvaluation

evaluation = OnnxModelEvaluation()
score = evaluation.evaluation(
    {
        'question': 'What is the color of sky?'
    },
    {
        'question': 'hello'
    }
)
evaluation(src_dict: Dict[str, Any], cache_dict: Dict[str, Any], **_) float[source]#

Evaluate the similarity score of the pair.

Parameters
  • src_dict (Dict) – the query dictionary to evaluate with cache.

  • cache_dict (Dict) – the cache dictionary.

Returns

evaluation score.

range() Tuple[float, float][source]#

Range of similarity score.

Returns

minimum and maximum of similarity score.

inference(reference: str, candidates: List[str]) numpy.ndarray[source]#

Inference the ONNX model.

Parameters
  • reference (str) – reference sentence.

  • candidates (List[str]) – candidate sentences.

Returns

probability scores indicating how similar the reference is to each candidate.
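A short sketch of calling inference() directly, assuming the default model can be downloaded:

from gptcache.similarity_evaluation import OnnxModelEvaluation

evaluation = OnnxModelEvaluation()
# one probability per candidate, in the same order as the input list
scores = evaluation.inference(
    reference='What is the color of sky?',
    candidates=['What color is the sky?', 'hello'],
)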

similarity_evaluation#

class gptcache.similarity_evaluation.similarity_evaluation.SimilarityEvaluation[source]#

Bases: object

Similarity Evaluation interface. It determines the similarity between the input request and requests from the Vector Store, and based on this similarity decides whether the request matches the cache.

Example

from gptcache import cache
from gptcache.similarity_evaluation import SearchDistanceEvaluation

cache.init(
    similarity_evaluation=SearchDistanceEvaluation()
)
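Custom evaluators implement the two abstract methods below. A minimal sketch (a hypothetical word-overlap scorer, not part of the library):

from typing import Any, Dict, Tuple

from gptcache.similarity_evaluation.similarity_evaluation import SimilarityEvaluation

class WordOverlapEvaluation(SimilarityEvaluation):
    # hypothetical example: Jaccard overlap of question words
    def evaluation(self, src_dict: Dict[str, Any], cache_dict: Dict[str, Any], **_) -> float:
        src_words = set(src_dict['question'].lower().split())
        cache_words = set(cache_dict['question'].lower().split())
        if not src_words or not cache_words:
            return 0.0
        return len(src_words & cache_words) / len(src_words | cache_words)

    def range(self) -> Tuple[float, float]:
        return 0.0, 1.0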
abstract evaluation(src_dict: Dict[str, Any], cache_dict: Dict[str, Any], **kwargs) float[source]#

Evaluate the similarity score of the user request and cache request pair.

Parameters
  • src_dict (Dict) – the user request params.

  • cache_dict (Dict) – the cache request params.

abstract range() Tuple[float, float][source]#

Range of similarity score.

Returns

the range of the similarity score, i.e. its minimum and maximum values.

Return type

Tuple[float, float]

time#

class gptcache.similarity_evaluation.time.TimeEvaluation(evaluation: str, evaluation_config=None, time_range: float = 86400.0)[source]#

Bases: gptcache.similarity_evaluation.similarity_evaluation.SimilarityEvaluation

Adds a time-dimension restriction on top of another evaluation: for example, only use cache entries created within one day of the current time and filter out older ones.

Parameters
  • evaluation – name of the underlying similarity evaluation, e.g. 'distance' or 'onnx'.

  • evaluation_config – config for the underlying similarity evaluation.

  • time_range – time range in seconds, defaults to 86400 (one day).

Example

import datetime

from gptcache.manager.scalar_data.base import CacheData
from gptcache.similarity_evaluation import TimeEvaluation

evaluation = TimeEvaluation(evaluation="distance", time_range=86400)

similarity = evaluation.evaluation(
    {},
    {
        "search_result": (3.5, None),
        "cache_data": CacheData("a", "b", create_on=datetime.datetime.now()),
    },
)
# 0.5
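With the underlying 'distance' evaluation at its defaults (max_distance=4.0, positive=False), the search distance 3.5 is subtracted from 4.0, giving 0.5; an entry created more than time_range seconds ago would instead score the minimum of range().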
evaluation(src_dict: Dict[str, Any], cache_dict: Dict[str, Any], **kwargs) float[source]#

Evaluate the similarity score of the user request and cache request pair.

Parameters
  • src_dict (Dict) – the user request params.

  • cache_dict (Dict) – the cache request params.

range() Tuple[float, float][source]#

Range of similarity score.

Returns

the range of the similarity score, i.e. its minimum and maximum values.

Return type

Tuple[float, float]