Several measures of string similarity in Python

String similarity comparison has many applications, such as spelling correction, text deduplication and context similarity.

The most common way to measure the similarity of two strings is to count the insertions, deletions and substitutions needed to turn one string into the other. The minimum number of such edits is the edit distance, also known as the Levenshtein distance. The Hamming distance is a special case of edit distance that allows only substitutions, so it can only be used to compare two strings of equal length.
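To make the two definitions concrete, here is a minimal pure-Python sketch of both distances. The function names levenshtein_distance and hamming_distance are chosen for illustration only; this is not how the Levenshtein library implements them, just the textbook dynamic-programming formulation.

```python
def levenshtein_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


print(levenshtein_distance("kitten", "sitting"))  # 3
print(hamming_distance("karolin", "kathrin"))     # 3
```

The Levenshtein sketch keeps only two rows of the table, so memory grows with the length of the second string rather than with the product of both lengths.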

Other commonly used measures include Jaccard distance, Jaro-Winkler (J-W) distance, cosine similarity and Euclidean distance.
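As a rough illustration of the set-based and vector-based measures, the sketch below compares two strings through their character bigrams. Using bigrams is an assumption made for this example (words or single characters work just as well), and the helper names are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Split text into overlapping character n-grams (bigrams by default)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def jaccard_similarity(a, b, n=2):
    """Size of the shared n-gram set divided by the size of the combined set."""
    set_a, set_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(a, b, n=2):
    """Cosine of the angle between the two n-gram count vectors."""
    vec_a, vec_b = Counter(char_ngrams(a, n)), Counter(char_ngrams(b, n))
    dot = sum(vec_a[g] * vec_b[g] for g in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b, n=2):
    """Straight-line distance between the two n-gram count vectors."""
    vec_a, vec_b = Counter(char_ngrams(a, n)), Counter(char_ngrams(b, n))
    return math.sqrt(sum((vec_a[g] - vec_b[g]) ** 2
                         for g in vec_a.keys() | vec_b.keys()))


print(round(jaccard_similarity("night", "nacht"), 3))  # 0.143
print(cosine_similarity("night", "nacht"))             # 0.25
print(round(euclidean_distance("night", "nacht"), 3))  # 2.449
```

The Jaccard distance is simply 1 minus the Jaccard similarity. Note that Euclidean distance is not bounded by 1, so it is not directly comparable with the other two scores.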

Using python-Levenshtein

Install the Levenshtein module with the command pip install python-Levenshtein

# -*- coding: utf-8 -*-

import difflib
# import jieba
import Levenshtein

str1 = "My bones are white, and I can't grow highland barley"
str2 = "I only want to go in the snow on a snowy day si"

# 1. difflib: ratio() returns a similarity score between 0 and 1
seq = difflib.SequenceMatcher(None, str1, str2)
ratio = seq.ratio()
print('difflib similarity1: ', ratio)

# The first argument (isjunk) is a function that marks characters difflib should ignore when matching
seq = difflib.SequenceMatcher(lambda x: x in ' My snow', str1, str2)
ratio = seq.ratio()
print('difflib similarity2: ', ratio)

# 2. Hamming distance: the number of positions at which two equal-length strings differ.
# str1 and str2 must have the same length, so the call is left commented out here.
# sim = Levenshtein.hamming(str1, str2)
# print('hamming similarity: ', sim)

# 3. Edit distance: the minimum number of insertions, deletions and substitutions
# needed to convert one string into the other (lower means more similar)
sim = Levenshtein.distance(str1, str2)
print('Levenshtein similarity: ', sim)

# 4. Levenshtein ratio: a normalized similarity score between 0 and 1
sim = Levenshtein.ratio(str1, str2)
print('Levenshtein.ratio similarity: ', sim)

# 5. Jaro similarity (1 means the strings are identical)
sim = Levenshtein.jaro(str1, str2)
print('Levenshtein.jaro similarity: ', sim)

# 6. Jaro-Winkler similarity: Jaro with extra weight given to a common prefix
sim = Levenshtein.jaro_winkler(str1, str2)
print('Levenshtein.jaro_winkler similarity: ', sim)

Output:

difflib similarity1:  0.246575342466
difflib similarity2:  0.0821917808219
Levenshtein similarity:  33
Levenshtein.ratio similarity:  0.27397260274
Levenshtein.jaro similarity:  0.490208958959
Levenshtein.jaro_winkler similarity:  0.490208958959
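The last two values are identical, which is what the usual definition of Jaro-Winkler would predict for strings that share no common prefix: the Winkler adjustment only adds a bonus for matching leading characters. The sketch below follows the standard textbook definition (prefix weight 0.1, prefix capped at 4 characters). The function names are illustrative, and the library's own implementation may differ in edge cases.

```python
def jaro_similarity(a, b):
    """Jaro similarity between 0 and 1, following the standard definition."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters only count as matching if they are at most `window` positions apart.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    matched_a = [False] * len(a)
    matched_b = [False] * len(b)
    matches = 0
    for i, char in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not matched_b[j] and b[j] == char:
                matched_a[i] = matched_b[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as half a transposition each.
    seq_a = [c for c, m in zip(a, matched_a) if m]
    seq_b = [c for c, m in zip(b, matched_b) if m]
    transpositions = sum(x != y for x, y in zip(seq_a, seq_b)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler_similarity(a, b, prefix_weight=0.1, max_prefix=4):
    """Jaro similarity boosted in proportion to the length of the common prefix."""
    jaro = jaro_similarity(a, b)
    prefix = 0
    for x, y in zip(a, b):
        if x != y or prefix == max_prefix:
            break
        prefix += 1
    return jaro + prefix * prefix_weight * (1.0 - jaro)


print(round(jaro_similarity("MARTHA", "MARHTA"), 4))          # 0.9444
print(round(jaro_winkler_similarity("MARTHA", "MARHTA"), 4))  # 0.9611
# With no common prefix the bonus is zero and both scores coincide.
print(jaro_similarity("abc", "xbc") == jaro_winkler_similarity("abc", "xbc"))  # True
```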

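For the two difflib figures, the standard library documents ratio() as 2.0 * M / T, where M is the number of matched characters and T is the combined length of both strings. The sketch below recomputes that value from get_matching_blocks() to show where the number comes from; ratio_from_blocks is a hypothetical helper written for this example.

```python
import difflib


def ratio_from_blocks(a, b):
    """Recompute SequenceMatcher.ratio() as 2 * matches / total length."""
    matcher = difflib.SequenceMatcher(None, a, b)
    # Each block is a (a, b, size) triple; the final dummy block has size 0.
    matches = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)
    return 2.0 * matches / total if total else 1.0


a, b = "abcd", "bcde"
print(ratio_from_blocks(a, b))                      # 0.75
print(difflib.SequenceMatcher(None, a, b).ratio())  # 0.75
```

Here the only matching block is "bcd" (3 characters) out of 8 characters in total, giving 2 * 3 / 8 = 0.75.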
Tags: Python pip Lambda

Posted on Tue, 05 Nov 2019 11:18:14 -0800 by michelledebeer