Similarity Evaluation#

kreciprocal#

gptcache.similarity_evaluation.kreciprocal.euclidean_distance_calculate(vec_l: numpy.array, vec_r: numpy.array)[source]#
class gptcache.similarity_evaluation.kreciprocal.KReciprocalEvaluation(vectordb: gptcache.manager.vector_data.base.VectorBase, top_k: int = 3, max_distance: float = 4.0, positive: bool = False)[source]#

Bases: gptcache.similarity_evaluation.distance.SearchDistanceEvaluation

Using K-reciprocal re-ranking to evaluate sentence pair similarity.

This evaluator borrows the popular K-reciprocal re-ranking method for similarity evaluation. A K-reciprocal relation refers to the mutual nearest-neighbor relationship between two embeddings: each embedding is among the K nearest neighbors of the other under a given distance metric. This evaluator checks whether the query embedding is among the candidate cache embedding's top_k nearest neighbors; if it is not, the pair is considered dissimilar. Otherwise, the distance is kept and passed on to a SearchDistanceEvaluation check. max_distance bounds this distance to the interval [0, max_distance]. positive indicates that the distance is directly proportional to the similarity of the two entities; if positive is False, the distance is subtracted from max_distance to obtain the final score.
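A minimal sketch of the decision rule described above (a hypothetical helper for illustration; query_in_top_k would come from the vector database lookup, which is not shown):

def k_reciprocal_score_sketch(distance, query_in_top_k, max_distance=4.0, positive=False):
    # If the query is not among the candidate's top_k neighbors,
    # treat the pair as maximally distant, i.e. dissimilar.
    if not query_in_top_k:
        distance = max_distance
    # Bound the distance to [0, max_distance].
    distance = max(0.0, min(distance, max_distance))
    # With positive=False, a smaller distance yields a higher score.
    return distance if positive else max_distance - distance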

Parameters
  • vectordb (gptcache.manager.vector_data.base.VectorBase) – the vector database used to retrieve embeddings for the k-reciprocal test.

  • top_k (int) – for each retrieved candidate, this method tests whether the query is among the candidate's top-k nearest neighbors.

  • max_distance (float) – the upper bound of the distance.

  • positive (bool) – True if a larger distance indicates greater similarity between two entities; otherwise False.

Example

from gptcache.similarity_evaluation import KReciprocalEvaluation
from gptcache.manager.vector_data.faiss import Faiss
from gptcache.manager.vector_data.base import VectorData
import numpy as np

# a local Faiss index: index file path, dimension 3, top_k 10
faiss = Faiss('./none', 3, 10)
cached_data = np.array([0.57735027, 0.57735027, 0.57735027])
faiss.mul_add([VectorData(id=0, data=cached_data)])
evaluation = KReciprocalEvaluation(vectordb=faiss, top_k=2, max_distance=4.0, positive=False)
query = np.array([0.61396013, 0.55814557, 0.55814557])
score = evaluation.evaluation(
    {
        'question': 'question1',
        'embedding': query
    },
    {
        'question': 'question2',
        'embedding': cached_data
    }
)
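With positive=False and max_distance=4.0 as configured above, the returned score lies in [0, 4], and a higher score indicates a more similar pair.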
static normalize(vec: numpy.ndarray)[source]#

Normalize the input vector.

Parameters

vec (numpy.array) – the numpy vector to normalize.

Returns

normalized vector.

evaluation(src_dict: Dict[str, Any], cache_dict: Dict[str, Any], **_) float[source]#

Evaluate the similarity score of the pair.

Parameters
  • src_dict (Dict) – the query dictionary to evaluate with cache.

  • cache_dict (Dict) – the cache dictionary.

Returns

evaluation score.

sequence_match#

gptcache.similarity_evaluation.sequence_match.euclidean_distance_calculate(vec_l: numpy.array, vec_r: numpy.array)[source]#
gptcache.similarity_evaluation.sequence_match.reweight(weights, length)[source]#
class gptcache.similarity_evaluation.sequence_match.SequenceMatchEvaluation(weights: List[float], embedding_extractor: str, embedding_config=None)[source]#

Bases: gptcache.similarity_evaluation.similarity_evaluation.SimilarityEvaluation

Evaluate sentence pair similarity using SequenceMatchEvaluation.

Parameters
  • weights (List[float]) – List of weights corresponding to each sequence element for calculating the weighted distance.

  • embedding_extractor (str) – name of the embedding extractor used to obtain embeddings from the text content, e.g. 'onnx'.

Example

from gptcache.similarity_evaluation import SequenceMatchEvaluation
from gptcache.embedding import Onnx

weights = [0.5, 0.3, 0.2]
evaluation = SequenceMatchEvaluation(weights, 'onnx')

query = {
    'question': 'USER: "foo2" USER: "foo4"',
}

cache = {
    'question': 'USER: "foo6" USER: "foo8"',
}

score = evaluation.evaluation(query, cache)
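Conceptually, each dialogue turn is embedded separately and the per-turn distances are combined using the configured weights (with reweight() presumably renormalizing them when the number of turns differs). A minimal illustration of the weighting step, with hypothetical values rather than the library internals:

import numpy as np

# one embedding distance per dialogue turn (hypothetical values)
turn_distances = np.array([0.8, 0.3])
# weights renormalized for two turns
weights = np.array([0.6, 0.4])
weighted_distance = float(np.dot(weights, turn_distances))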
static normalize(vec: numpy.ndarray)[source]#

Normalize the input vector.

Parameters

vec (numpy.array) – the numpy vector to normalize.

Returns

normalized vector.

evaluation(src_dict: Dict[str, Any], cache_dict: Dict[str, Any], **_) float[source]#

Evaluate the similarity score of the pair.

Parameters
  • src_dict (Dict) – the query dictionary to evaluate with cache.

  • cache_dict (Dict) – the cache dictionary.

Returns

evaluation score.

range() Tuple[float, float][source]#

Range of similarity score.

Returns

minimum and maximum of similarity score.

sbert_crossencoder#

class gptcache.similarity_evaluation.sbert_crossencoder.SbertCrossencoderEvaluation(model: str = 'cross-encoder/quora-distilroberta-base')[source]#

Bases: gptcache.similarity_evaluation.similarity_evaluation.SimilarityEvaluation

Using SBERT cross-encoders to evaluate sentence pair similarity.

This evaluator uses a cross-encoder model to evaluate the similarity of two sentences.

Parameters

model (str) – model name of SbertCrossencoderEvaluation, defaults to 'cross-encoder/quora-distilroberta-base'. For more models, refer to https://www.sbert.net/docs/pretrained_cross-encoders.html#quora-duplicate-questions.

Example

from gptcache.similarity_evaluation import SbertCrossencoderEvaluation

evaluation = SbertCrossencoderEvaluation()
score = evaluation.evaluation(
    {
        'question': 'What is the color of sky?'
    },
    {
        'question': 'hello'
    }
)
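The cross-encoder scores the question pair jointly rather than comparing independent embeddings, so an unrelated pair like the one above should score near the minimum of range().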
evaluation(src_dict: Dict[str, Any], cache_dict: Dict[str, Any], **_) float[source]#

Evaluate the similarity score of the pair.

Parameters
  • src_dict (Dict) – the query dictionary to evaluate with cache.

  • cache_dict (Dict) – the cache dictionary.

Returns

evaluation score.

range() Tuple[float, float][source]#

Range of similarity score.

Returns

minimum and maximum of similarity score.

distance#

class gptcache.similarity_evaluation.distance.SearchDistanceEvaluation(max_distance=4.0, positive=False)[source]#

Bases: gptcache.similarity_evaluation.similarity_evaluation.SimilarityEvaluation

Using search distance to evaluate sentence pair similarity.

This evaluator compares two embeddings according to the distance computed in the embedding retrieval stage. In the retrieval stage, search_result is the distance used for approximate nearest neighbor search and has already been put into cache_dict. max_distance bounds this distance to the interval [0, max_distance]. positive indicates that the distance is directly proportional to the similarity of the two entities; if positive is False, the distance is subtracted from max_distance to obtain the final score.

Parameters
  • max_distance (float) – the upper bound of the distance.

  • positive (bool) – True if a larger distance indicates greater similarity between two entities; otherwise False.

Example

from gptcache.similarity_evaluation import SearchDistanceEvaluation

evaluation = SearchDistanceEvaluation()
score = evaluation.evaluation(
    {},
    {
        "search_result": (1, None)
    }
)
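Here search_result carries a distance of 1; with the defaults (max_distance=4.0, positive=False), the distance is subtracted from max_distance, so the returned score is 3.0.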
evaluation(src_dict: Dict[str, Any], cache_dict: Dict[str, Any], **_) float[source]#

Evaluate the similarity score of the pair.

Parameters
  • src_dict (Dict) – the query dictionary to evaluate with cache.

  • cache_dict (Dict) – the cache dictionary.

Returns

evaluation score.

range() Tuple[float, float][source]#

Range of similarity score.

Returns

minimum and maximum of similarity score.

exact_match#

class gptcache.similarity_evaluation.exact_match.ExactMatchEvaluation[source]#

Bases: gptcache.similarity_evaluation.similarity_evaluation.SimilarityEvaluation

Using exact matching to evaluate sentence pair similarity.

This evaluator directly compares the text of two questions. If the questions match character for character, it returns 1; otherwise it returns 0.

Example

from gptcache.similarity_evaluation import ExactMatchEvaluation

evaluation = ExactMatchEvaluation()
score = evaluation.evaluation(
    {
        "question": "What is the color of sky?"
    },
    {
        "question": "What is the color of sky?"
    }
)
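Because the two questions match character for character, the returned score is 1.0; any difference would yield 0.0.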
evaluation(src_dict: Dict[str, Any], cache_dict: Dict[str, Any], **_) float[source]#

Evaluate the similarity score of the pair.

Parameters
  • src_dict (Dict) – the query dictionary to evaluate with cache_dict.

  • cache_dict (Dict) – the cache dictionary.

Returns

evaluation score.

range() Tuple[float, float][source]#

Range of similarity score.

Returns

minimum and maximum of similarity score.

cohere_rerank#

class gptcache.similarity_evaluation.cohere_rerank.CohereRerank(model: str = 'rerank-english-v2.0', api_key: Optional[str] = None)[source]#

Bases: gptcache.similarity_evaluation.similarity_evaluation.SimilarityEvaluation

Use the Cohere Rerank API to evaluate relevance of question and answer.

Reference: https://docs.cohere.com/reference/rerank-1

Parameters
  • model (str) – model name, defaults to 'rerank-english-v2.0'; a multilingual option, 'rerank-multilingual-v2.0', is also available.

  • api_key (str) – cohere api key, defaults to None.

Example

from gptcache.similarity_evaluation.cohere_rerank import CohereRerank

evaluation = CohereRerank()
score = evaluation.evaluation(
    {
        'question': 'What is the color of sky?'
    },
    {
        'answer': 'the color of sky is blue'
    }
)
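The Cohere Rerank endpoint returns a relevance score for each query/answer pair, typically in [0, 1], with higher scores meaning the cached answer is more relevant to the query; the exact bounds are reported by range().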
evaluation(src_dict: Dict[str, Any], cache_dict: Dict[str, Any], **kwargs) float[source]#

Evaluate the similarity score of the user request and cache request pair.

Parameters
  • src_dict (Dict) – the user request params.

  • cache_dict (Dict) – the cache request params.

range() Tuple[float, float][source]#

Range of similarity score.

Returns

the range of the similarity score, i.e. its minimum and maximum values.

Return type

Tuple[float, float]

np#

class gptcache.similarity_evaluation.np.NumpyNormEvaluation(enable_normal: bool = True, question_embedding_function=None)[source]#

Bases: gptcache.similarity_evaluation.similarity_evaluation.SimilarityEvaluation

Using the Numpy norm to evaluate sentence pair similarity.

This evaluator calculates the L2 distance between two embeddings for the similarity check. If enable_normal is True, both the query embedding and the cache embedding are normalized. Note that the normalized distance is subtracted from the maximum distance, so the final score is positively correlated with similarity.
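A minimal sketch of the described computation (a hypothetical helper, not the library internals; note that for unit-normalized vectors the L2 distance is bounded by 2.0):

import numpy as np

def l2_score_sketch(src_emb: np.ndarray, cache_emb: np.ndarray, max_distance: float = 2.0) -> float:
    # normalize both embeddings to unit length
    src_emb = src_emb / np.linalg.norm(src_emb)
    cache_emb = cache_emb / np.linalg.norm(cache_emb)
    # subtract the L2 distance from the maximum so that a higher
    # score means a more similar pair
    return max_distance - float(np.linalg.norm(src_emb - cache_emb))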

Parameters
  • enable_normal (bool) – whether to normalize the embedding, defaults to True.

  • question_embedding_function (function) – optional, a function to generate the question embedding.

Example

from gptcache.similarity_evaluation import NumpyNormEvaluation
import numpy as np

evaluation = NumpyNormEvaluation()
score = evaluation.evaluation(
    {
        'question': 'What is color of sky?',
        'embedding': np.array([-0.5, -0.5])
    },
    {
        'question': 'What is the color of sky?',
        'embedding': np.array([-0.49, -0.51])
    }
)
static normalize(vec: numpy.ndarray)[source]#

Normalize the input vector.

Parameters

vec (numpy.array) – the numpy vector to normalize.

Returns

normalized vector.

evaluation(src_dict: Dict[str, Any], cache_dict: Dict[str, Any], **_) float[source]#

Evaluate the similarity score of the pair.

Parameters
  • src_dict (Dict) – the query dictionary to evaluate with cache.

  • cache_dict (Dict) – the cache dictionary.

Returns

evaluation score.

range() Tuple[float, float][source]#

Range of similarity score.

Returns

minimum and maximum of similarity score.

onnx#

gptcache.similarity_evaluation.onnx.pad_sequence(input_ids_list: List[numpy.ndarray], padding_value: int = 0)[source]#
class gptcache.similarity_evaluation.onnx.OnnxModelEvaluation(model: str = 'GPTCache/albert-duplicate-onnx')[source]#

Bases: gptcache.similarity_evaluation.similarity_evaluation.SimilarityEvaluation

Using an ONNX model to evaluate sentence pair similarity.

This evaluator uses an ONNX model to evaluate the similarity of two sentences.

Parameters

model (str) – model name of OnnxModelEvaluation. Default is 'GPTCache/albert-duplicate-onnx'.

Example

from gptcache.similarity_evaluation import OnnxModelEvaluation

evaluation = OnnxModelEvaluation()
score = evaluation.evaluation(
    {
        'question': 'What is the color of sky?'
    },
    {
        'question': 'hello'
    }
)
evaluation(src_dict: Dict[str, Any], cache_dict: Dict[str, Any], **_) float[source]#

Evaluate the similarity score of the pair.

Parameters
  • src_dict (Dict) – the query dictionary to evaluate with cache.

  • cache_dict (Dict) – the cache dictionary.

Returns

evaluation score.

range() Tuple[float, float][source]#

Range of similarity score.

Returns

minimum and maximum of similarity score.

inference(reference: str, candidates: List[str]) numpy.ndarray[source]#

Inference the ONNX model.

Parameters
  • reference (str) – reference sentence.

  • candidates (List[str]) – candidate sentences.

Returns

probability scores indicating how similar the reference is to each candidate.
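A short sketch of calling inference() directly, assuming the default model can be downloaded:

from gptcache.similarity_evaluation import OnnxModelEvaluation

evaluation = OnnxModelEvaluation()
# one probability per candidate, in the same order as the input list
scores = evaluation.inference(
    reference='What is the color of sky?',
    candidates=['What color is the sky?', 'hello'],
)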

similarity_evaluation#

class gptcache.similarity_evaluation.similarity_evaluation.SimilarityEvaluation[source]#

Bases: object

Similarity Evaluation interface. It determines the similarity between the input request and requests from the Vector Store, and based on this similarity decides whether the request matches the cache.

Example

from gptcache import cache
from gptcache.similarity_evaluation import SearchDistanceEvaluation

cache.init(
    similarity_evaluation=SearchDistanceEvaluation()
)
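Custom evaluators implement the two abstract methods below. A minimal sketch (a hypothetical word-overlap scorer, not part of the library):

from typing import Any, Dict, Tuple

from gptcache.similarity_evaluation.similarity_evaluation import SimilarityEvaluation

class WordOverlapEvaluation(SimilarityEvaluation):
    # hypothetical example: Jaccard overlap of question words
    def evaluation(self, src_dict: Dict[str, Any], cache_dict: Dict[str, Any], **_) -> float:
        src_words = set(src_dict['question'].lower().split())
        cache_words = set(cache_dict['question'].lower().split())
        if not src_words or not cache_words:
            return 0.0
        return len(src_words & cache_words) / len(src_words | cache_words)

    def range(self) -> Tuple[float, float]:
        return 0.0, 1.0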
abstract evaluation(src_dict: Dict[str, Any], cache_dict: Dict[str, Any], **kwargs) float[source]#

Evaluate the similarity score of the user request and cache request pair.

Parameters
  • src_dict (Dict) – the user request params.

  • cache_dict (Dict) – the cache request params.

abstract range() Tuple[float, float][source]#

Range of similarity score.

Returns

the range of the similarity score, i.e. its minimum and maximum values.

Return type

Tuple[float, float]

time#

class gptcache.similarity_evaluation.time.TimeEvaluation(evaluation: str, evaluation_config=None, time_range: float = 86400.0)[source]#

Bases: gptcache.similarity_evaluation.similarity_evaluation.SimilarityEvaluation

Adds a time-dimension restriction on top of another evaluation: for example, only use cache entries created within one day of the current time and filter out older ones.

Parameters
  • evaluation – name of the underlying similarity evaluation, e.g. 'distance' or 'onnx'.

  • evaluation_config – config for the underlying similarity evaluation.

  • time_range – time range in seconds, defaults to 86400 (one day).

Example

import datetime

from gptcache.manager.scalar_data.base import CacheData
from gptcache.similarity_evaluation import TimeEvaluation

evaluation = TimeEvaluation(evaluation="distance", time_range=86400)

similarity = evaluation.evaluation(
    {},
    {
        "search_result": (3.5, None),
        "cache_data": CacheData("a", "b", create_on=datetime.datetime.now()),
    },
)
# 0.5
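With the underlying 'distance' evaluation at its defaults (max_distance=4.0, positive=False), the search distance 3.5 is subtracted from 4.0, giving 0.5; an entry created more than time_range seconds ago would instead score the minimum of range().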
evaluation(src_dict: Dict[str, Any], cache_dict: Dict[str, Any], **kwargs) float[source]#

Evaluate the similarity score of the user request and cache request pair.

Parameters
  • src_dict (Dict) – the user request params.

  • cache_dict (Dict) – the cache request params.

range() Tuple[float, float][source]#

Range of similarity score.

Returns

the range of the similarity score, i.e. its minimum and maximum values.

Return type

Tuple[float, float]