API

sigalike.similarity.best_match(collection1: Collection | Mapping | str, collection2: Collection | Mapping, shift: int = 4, preprocess: bool = True) -> BestMatch | Dict[str, BestMatch]

Returns the best match(es) between two collections, or between a single string and a collection.

Parameters:
  • collection1 (Collection, Mapping, or str) -- The first collection to compare or a single string.

  • collection2 (Collection or Mapping) -- The second collection to compare.

  • shift (int, optional) -- The shift parameter for the shifted sigmoid similarity metric. Defaults to 4.

  • preprocess (bool, optional) -- Whether to preprocess the input strings. Defaults to True.

Returns:

If both inputs are collections, returns a dictionary whose keys are the strings from the first collection and whose values are named tuples holding the best-matching string from the second collection and its score. If collection1 is a string and collection2 is a collection, returns a single named tuple with the best-matching string and its score.

Return type:

BestMatch or Dict[str, BestMatch]

Raises:
  • ValueError -- If either input collection is empty.

  • TypeError -- If one or more inputs are not of the correct type (str, Collection, or Mapping).

Examples

>>> best_match("hello", ["hello", "world", "foo"])
BestMatch(match='hello', score=1.0)
>>> best_match(["hello", "world"], ["hello", "world", "foo"])
{'hello': BestMatch(match='hello', score=1.0), 'world': BestMatch(match='world', score=1.0)}
>>> best_match(["hello", "world"], ["foo", "bar"])
{'hello': BestMatch(match='', score=0.0), 'world': BestMatch(match='', score=0.0)}
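
The dictionary form of the result can be post-processed like any mapping. The sketch below is illustrative only: it assumes BestMatch is a named tuple with match and score fields, as the reprs above suggest, and defines a local stand-in so the snippet runs without the package installed. The helper confident_matches and the 0.8 threshold are hypothetical names and values, not part of sigalike, and the results dictionary is hand-written sample data.

```python
from typing import Dict, NamedTuple


class BestMatch(NamedTuple):
    """Local stand-in mirroring the fields shown in the examples above (assumption)."""
    match: str
    score: float


def confident_matches(results: Dict[str, BestMatch], threshold: float = 0.8) -> Dict[str, str]:
    """Keep only queries whose best match scores at or above the threshold.

    Hypothetical helper; the threshold value is arbitrary.
    """
    return {
        query: best.match
        for query, best in results.items()
        if best.score >= threshold
    }


# Hand-written sample shaped like the dictionary that best_match returns.
results = {
    "hello": BestMatch(match="hello", score=1.0),
    "world": BestMatch(match="", score=0.0),
}

print(confident_matches(results))  # {'hello': 'hello'}
```

Filtering on the score in this way also drops the empty-string placeholders that appear when nothing in the second collection matches.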

sigalike.similarity.shifted_sigmoid_similarity(str1: str, str2: str, shift: int = 4, preprocess: bool = True) -> float

Calculates the shifted sigmoid similarity score between two strings.

The shifted sigmoid similarity score acts as a fuzzy string matching metric. It starts from the fraction of unique tokens in the shorter string that also appear in the longer string, then subtracts a penalty computed from a shifted sigmoid function (the logistic function) and the difference in size between the two token sets.
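
To make the description concrete, here is a minimal sketch of one way such a score could be computed. It is not the library's implementation: the whitespace tokenization, the exact penalty form sigmoid(size difference - shift), the clamping at zero, and the name sketch_similarity are all assumptions made for illustration, so its values need not agree with what shifted_sigmoid_similarity returns.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def sketch_similarity(str1: str, str2: str, shift: int = 4) -> float:
    """Illustrative token-overlap score with a shifted sigmoid penalty.

    Assumed form, not taken from the sigalike source.
    """
    # Assumption: tokens are lowercased, whitespace-separated words.
    tokens1 = set(str1.lower().split())
    tokens2 = set(str2.lower().split())
    if not tokens1 or not tokens2:
        raise ValueError("input strings must contain at least one token")

    # Order the two token sets by size.
    shorter, longer = sorted((tokens1, tokens2), key=len)

    # Fraction of the shorter set's unique tokens found in the longer set.
    overlap = len(shorter & longer) / len(shorter)

    # Penalty grows with the size difference; shift moves the curve to the
    # right, so small differences are penalized only slightly.
    penalty = sigmoid(len(longer) - len(shorter) - shift)

    # Assumption: clamp so the score never goes negative.
    return max(0.0, overlap - penalty)


print(sketch_similarity("apple", "banana"))     # 0.0 (no shared tokens)
print(sketch_similarity("apple", "apple pie"))  # a little below 1.0
```

In this sketch a larger shift tolerates a bigger size gap between the token sets before the penalty becomes significant.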

Parameters:
  • str1 (str) -- The first string to compare.

  • str2 (str) -- The second string to compare.

  • shift (int, default 4) -- Amount by which to shift the sigmoid curve.

  • preprocess (bool, default True) -- Whether or not to run built-in preprocessing.

Raises:

ValueError -- If the input strings are empty after preprocessing.

Returns:

Shifted sigmoid similarity score.

Return type:

float

Examples

>>> shifted_sigmoid_similarity('apple', 'banana')
0.0
>>> shifted_sigmoid_similarity('apple', 'apple apple')
1.0