String similarity comparison has many applications, such as spelling correction, text deduplication, and context similarity.

The most common way to measure the similarity of two strings is to count how many single-character insertions, deletions, or substitutions are needed to turn one string into the other. The minimum number of such edits is the edit distance, also known as the Levenshtein distance. Hamming distance can be seen as a special case of edit distance that allows only substitutions, so it can only be used to measure the distance between two strings of equal length.
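To make the two definitions concrete, here is a minimal pure-Python sketch of both distances. The function names hamming_distance and levenshtein_distance are illustrative choices, not part of any library, and the Levenshtein version is the textbook two-row dynamic programming approach rather than the optimized routine a C extension would use.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions that turn a into b (two-row dynamic programming)."""
    # previous[j] is the distance between the first i-1 characters of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


print(levenshtein_distance("kitten", "sitting"))  # 3
print(hamming_distance("karolin", "kathrin"))     # 3
```

The sketch runs in time proportional to len(a) * len(b) and keeps only two rows of the table in memory, which is enough when only the distance (not the edit sequence) is needed.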

Other commonly used measures include Jaccard distance, Jaro-Winkler (J-W) distance, cosine similarity, and Euclidean distance.
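As a rough illustration of two of these measures, the sketch below computes Jaccard similarity and cosine similarity over the characters of each string. Treating a string as a set (or bag) of single characters is an assumption made here for brevity; in practice words or n-grams are often used instead. The function names are hypothetical. Jaccard distance is simply 1 minus the Jaccard similarity.

```python
import math
from collections import Counter


def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union,
    taken over the sets of characters in each string."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty strings are identical
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(a, b):
    """Cosine of the angle between the character-count vectors."""
    counts_a, counts_b = Counter(a), Counter(b)
    # A Counter returns 0 for characters it has not seen.
    dot = sum(counts_a[ch] * counts_b[ch] for ch in counts_a)
    norm_a = math.sqrt(sum(n * n for n in counts_a.values()))
    norm_b = math.sqrt(sum(n * n for n in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# "night" and "nacht" share the characters n, h, t out of 7 distinct ones.
print(jaccard_similarity("night", "nacht"))      # 3/7
print(1 - jaccard_similarity("night", "nacht"))  # Jaccard distance, 4/7
print(cosine_similarity("night", "nacht"))       # 3/5
```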

Using python-Levenshtein

Install the Levenshtein library with pip:

    pip install python-Levenshtein

    # -*- coding: utf-8 -*-
    import difflib
    # import jieba
    import Levenshtein

    str1 = "My bones are white, and I can't grow highland barley"
    str2 = "I only want to go in the snow on a snowy day si"

    # 1. difflib
    seq = difflib.SequenceMatcher(None, str1, str2)
    ratio = seq.ratio()
    print('difflib similarity1: %s' % ratio)

    # difflib can skip characters that should not be compared: the first
    # argument is a function that returns True for characters to ignore
    seq = difflib.SequenceMatcher(lambda x: x in ' My snow', str1, str2)
    ratio = seq.ratio()
    print('difflib similarity2: %s' % ratio)

    # 2. Hamming distance: the number of positions at which two equal-length
    # strings differ. str1 and str2 must have the same length, so the call
    # is left commented out here.
    # sim = Levenshtein.hamming(str1, str2)
    # print('hamming similarity: %s' % sim)

    # 3. Edit distance: the minimum number of insertions, deletions, and
    # substitutions needed to convert one string into the other
    sim = Levenshtein.distance(str1, str2)
    print('Levenshtein similarity: %s' % sim)

    # 4. Levenshtein ratio
    sim = Levenshtein.ratio(str1, str2)
    print('Levenshtein.ratio similarity: %s' % sim)

    # 5. Jaro similarity
    sim = Levenshtein.jaro(str1, str2)
    print('Levenshtein.jaro similarity: %s' % sim)

    # 6. Jaro-Winkler similarity
    sim = Levenshtein.jaro_winkler(str1, str2)
    print('Levenshtein.jaro_winkler similarity: %s' % sim)

Output:

    difflib similarity1: 0.246575342466
    difflib similarity2: 0.0821917808219
    Levenshtein similarity: 33
    Levenshtein.ratio similarity: 0.27397260274
    Levenshtein.jaro similarity: 0.490208958959
    Levenshtein.jaro_winkler similarity: 0.490208958959
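For reading the difflib figures, it may help to know how SequenceMatcher.ratio() is defined: it returns 2*M/T, where M is the number of matched characters and T is the combined length of both strings. The short sketch below checks this on a small pair of strings chosen only for illustration; it uses nothing beyond the standard library.

```python
import difflib

a, b = "abcd", "bcde"
matcher = difflib.SequenceMatcher(None, a, b)

# Each matching block is (start in a, start in b, size). The final block is
# a zero-length sentinel, so it adds nothing to the sum.
matched = sum(block.size for block in matcher.get_matching_blocks())
total = len(a) + len(b)

print(matched, total)         # 3 8
print(2.0 * matched / total)  # 0.75
print(matcher.ratio())        # 0.75
```

A ratio of 1.0 means the strings are identical and 0.0 means they have nothing in common, so unlike the edit distance, a larger value means a closer match.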