Levenshtein Distance – An Algorithm that gets you caught in your online exams

When I had to do similarity check between two strings in a project that I was working on, I came across this algorithm. My senior dev explained me this algorithm in very simple terms,

Levenshtein distance between two strings is nothing but the number of single-character edits the first string is away from becoming the second

Named after the soviet mathematician Vladimir Levenshtein, Levenshtein distance (a.k.a edit distance) can been seen in modern applications such as

  • Spell checking
  • DNA analysis
  • Speech recognition
  • Plagiarism detection

Yes, this guy is the reason you got caught when you copied something from your friend’s assignment. An algorithm developed in the 1960’s during the soviet era, is still in a way is being used in your google keyboard, smart devices and smart classroom frameworks.


Now enough of the details, here’s the mathematical notation of the algorithm

To the better understanding for all the coders out there, let’s define this as a python function, shall we?

<span>import</span> <span>numpy</span> <span>as</span> <span>np</span>
<span>def</span> <span>levi_dist</span><span>(</span><span>a</span><span>,</span><span>b</span><span>):</span>
<span>'''</span><span> takes two strings as arguments, need not be of same length - a,b returns the levenshtein distance between the two strings </span><span>'''</span>
<span># Initializing a matrix of zeros </span> <span>rows</span> <span>=</span> <span>len</span><span>(</span><span>a</span><span>)</span> <span>+</span> <span>1</span>
<span>cols</span> <span>=</span> <span>len</span><span>(</span><span>b</span><span>)</span> <span>+</span> <span>1</span>
<span>arr</span> <span>=</span> <span>np</span><span>.</span><span>zeros</span><span>((</span><span>rows</span><span>,</span><span>cols</span><span>),</span><span>dtype</span><span>=</span><span>int</span><span>)</span>
<span># Initializing the first row and first column with string indices </span> <span>for</span> <span>i</span> <span>in</span> <span>range</span><span>(</span><span>rows</span><span>):</span>
<span>for</span> <span>j</span> <span>in</span> <span>range</span><span>(</span><span>cols</span><span>):</span>
<span>arr</span><span>[</span><span>i</span><span>][</span><span>0</span><span>]</span> <span>=</span> <span>i</span>
<span>arr</span><span>[</span><span>0</span><span>][</span><span>j</span><span>]</span> <span>=</span> <span>j</span>
<span># Iterating over the elements of the matrix except index column and row </span> <span>for</span> <span>i</span> <span>in</span> <span>range</span><span>(</span><span>1</span><span>,</span><span>rows</span><span>):</span>
<span>for</span> <span>j</span> <span>in</span> <span>range</span><span>(</span><span>1</span><span>,</span><span>cols</span><span>):</span>
<span>if</span> <span>a</span><span>[</span><span>i</span><span>-</span><span>1</span><span>]</span> <span>==</span> <span>b</span><span>[</span><span>j</span><span>-</span><span>1</span><span>]:</span>
<span>cost</span> <span>=</span> <span>0</span>
<span>else</span><span>:</span>
<span>cost</span> <span>=</span> <span>1</span>
<span>arr</span><span>[</span><span>i</span><span>][</span><span>j</span><span>]</span> <span>=</span> <span>min</span><span>(</span><span>arr</span><span>[</span><span>i</span><span>-</span><span>1</span><span>][</span><span>j</span><span>]</span> <span>+</span> <span>1</span><span>,</span> <span># deletion </span> <span>arr</span><span>[</span><span>i</span><span>][</span><span>j</span><span>-</span><span>1</span><span>]</span> <span>+</span> <span>1</span><span>,</span> <span># insertion </span> <span>arr</span><span>[</span><span>i</span><span>-</span><span>1</span><span>][</span><span>j</span><span>-</span><span>1</span><span>]</span> <span>+</span> <span>cost</span><span>)</span> <span># replacement </span>
<span>return</span> <span>f</span><span>'</span><span>Levenshtein Distance between string A and string B is </span><span>{</span><span>arr</span><span>[</span><span>len</span><span>(</span><span>a</span><span>)][</span><span>len</span><span>(</span><span>b</span><span>)]</span><span>}</span><span>'</span>
<span>import</span> <span>numpy</span> <span>as</span> <span>np</span>

<span>def</span> <span>levi_dist</span><span>(</span><span>a</span><span>,</span><span>b</span><span>):</span>
    <span>'''</span><span> takes two strings as arguments, need not be of same length - a,b returns the levenshtein distance between the two strings </span><span>'''</span>

    <span># Initializing a matrix of zeros </span>    <span>rows</span> <span>=</span> <span>len</span><span>(</span><span>a</span><span>)</span> <span>+</span> <span>1</span>
    <span>cols</span> <span>=</span> <span>len</span><span>(</span><span>b</span><span>)</span> <span>+</span> <span>1</span>
    <span>arr</span> <span>=</span> <span>np</span><span>.</span><span>zeros</span><span>((</span><span>rows</span><span>,</span><span>cols</span><span>),</span><span>dtype</span><span>=</span><span>int</span><span>)</span>

    <span># Initializing the first row and first column with string indices </span>    <span>for</span> <span>i</span> <span>in</span> <span>range</span><span>(</span><span>rows</span><span>):</span>
        <span>for</span> <span>j</span> <span>in</span> <span>range</span><span>(</span><span>cols</span><span>):</span>
            <span>arr</span><span>[</span><span>i</span><span>][</span><span>0</span><span>]</span> <span>=</span> <span>i</span>
            <span>arr</span><span>[</span><span>0</span><span>][</span><span>j</span><span>]</span> <span>=</span> <span>j</span>

    <span># Iterating over the elements of the matrix except index column and row </span>    <span>for</span> <span>i</span> <span>in</span> <span>range</span><span>(</span><span>1</span><span>,</span><span>rows</span><span>):</span>
        <span>for</span> <span>j</span> <span>in</span> <span>range</span><span>(</span><span>1</span><span>,</span><span>cols</span><span>):</span>
            <span>if</span> <span>a</span><span>[</span><span>i</span><span>-</span><span>1</span><span>]</span> <span>==</span> <span>b</span><span>[</span><span>j</span><span>-</span><span>1</span><span>]:</span>
                <span>cost</span> <span>=</span> <span>0</span>
            <span>else</span><span>:</span>
                <span>cost</span> <span>=</span> <span>1</span>

            <span>arr</span><span>[</span><span>i</span><span>][</span><span>j</span><span>]</span> <span>=</span> <span>min</span><span>(</span><span>arr</span><span>[</span><span>i</span><span>-</span><span>1</span><span>][</span><span>j</span><span>]</span> <span>+</span> <span>1</span><span>,</span>     <span># deletion </span>                        <span>arr</span><span>[</span><span>i</span><span>][</span><span>j</span><span>-</span><span>1</span><span>]</span> <span>+</span> <span>1</span><span>,</span>         <span># insertion </span>                        <span>arr</span><span>[</span><span>i</span><span>-</span><span>1</span><span>][</span><span>j</span><span>-</span><span>1</span><span>]</span> <span>+</span> <span>cost</span><span>)</span>    <span># replacement </span>
    <span>return</span> <span>f</span><span>'</span><span>Levenshtein Distance between string A and string B is </span><span>{</span><span>arr</span><span>[</span><span>len</span><span>(</span><span>a</span><span>)][</span><span>len</span><span>(</span><span>b</span><span>)]</span><span>}</span><span>'</span> 
import numpy as np def levi_dist(a,b): ''' takes two strings as arguments, need not be of same length - a,b returns the levenshtein distance between the two strings ''' # Initializing a matrix of zeros rows = len(a) + 1 cols = len(b) + 1 arr = np.zeros((rows,cols),dtype=int) # Initializing the first row and first column with string indices for i in range(rows): for j in range(cols): arr[i][0] = i arr[0][j] = j # Iterating over the elements of the matrix except index column and row for i in range(1,rows): for j in range(1,cols): if a[i-1] == b[j-1]: cost = 0 else: cost = 1 arr[i][j] = min(arr[i-1][j] + 1, # deletion arr[i][j-1] + 1, # insertion arr[i-1][j-1] + cost) # replacement return f'Levenshtein Distance between string A and string B is {arr[len(a)][len(b)]}'

Enter fullscreen mode Exit fullscreen mode

Great! now that we’ve defined our function, We would need two strings to test this, Here in my case I’ll take monty as string A and python as string B. You can choose any two strings of your choice.

calling the function with our strings, we get

> levi_dist('monty','python')
Levenshtein Distance between string A and string B is 6
 > levi_dist('monty','python')
 Levenshtein Distance between string A and string B is 6
> levi_dist('monty','python') Levenshtein Distance between string A and string B is 6

Enter fullscreen mode Exit fullscreen mode

This means that monty is 6 single-character edits away from becoming python.

and here’s how the matrix would look like

There we go, we accomplished our task. Now, If you are lazy like me and you don’t want to write the algorithm on your own, you can just leverage some of the functions in text transformation modules like jellyfish, enchant etc.. which also does the same thing.

> from jellyfish import levenshtein_distance
> levenshtein_distance('monty','python')
6
 > from jellyfish import levenshtein_distance
 > levenshtein_distance('monty','python')
 6
> from jellyfish import levenshtein_distance > levenshtein_distance('monty','python') 6

Enter fullscreen mode Exit fullscreen mode

Cheers! will see you in the next post.

原文链接:Levenshtein Distance – An Algorithm that gets you caught in your online exams

© 版权声明
THE END
喜欢就支持一下吧
点赞15 分享
With the wonder of your love, the sun above always shines.
拥有你美丽的爱情,太阳就永远明媚
评论 抢沙发

请登录后发表评论

    暂无评论内容