Fuzzy string matching using cosine similarity

Lately I've been dealing quite a bit with mining unstructured data¹. This often involves determining the similarity of strings and blocks of text. One of the more interesting algorithms I came across was the cosine similarity algorithm. It's a pretty popular way of quantifying the similarity of sequences by treating them as vectors and calculating the cosine of the angle between them. This delivers a value between 0 and 1, where 0 means no similarity whatsoever and 1 means that both sequences are exactly the same.

A bit on vectors

If you're anything like me, it's been a while since you've had to deal with any of this vector stuff so here is a quick refresher to get you up to speed.

Vectors are geometric elements that have magnitude (length) and direction. We can perform mathematical operations on them just as we would on regular (scalar) numbers, as well as a few special operations, of which we'll need two: the dot product and the magnitude.

To keep things simple, let's consider two vectors

\[\overrightarrow{a} = (1, 2, 3)\]

and,

\[\overrightarrow{b} = (4, 5, 6)\]

Their dot product can be calculated as

\[\overrightarrow{a}\cdot \overrightarrow{b} = a_{1}b_{1} + a_{2}b_{2}+ \cdot\cdot\cdot + a_{n}b_{n} = \sum_{i=1}^n a_{i} b_{i}\]

in practice this would be,

\[\overrightarrow{a}\cdot\overrightarrow{b} = (1, 2, 3) \cdot (4, 5, 6) = 1 \times 4 + 2 \times 5 + 3 \times 6 = 32 \]

We can also calculate their respective magnitudes. Which are represented as,

\[\parallel \overrightarrow{a} \parallel = \sqrt{a_1^2 + a_2^2 + \cdot\cdot\cdot + a_n^2 }\]

in practice,

\[\parallel \overrightarrow{a} \parallel = \sqrt{ 1^{2}+2^{2}+3^{2} } = \sqrt{14} \approx 3.74166\]

and

\[\parallel \overrightarrow{b} \parallel = \sqrt{ 4^{2}+5^{2}+6^{2} } = \sqrt{77} \approx 8.77496\]
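The two operations above translate directly into a few lines of Java. Here's a minimal sketch (the class and method names are mine, not part of any library):

```java
public class VectorMath {

    // Dot product: sum of the pairwise products of the components
    static double dot(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Magnitude: square root of the dot product of a vector with itself
    static double magnitude(double[] a) {
        return Math.sqrt(dot(a, a));
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3};
        double[] b = {4, 5, 6};
        System.out.println(dot(a, b));    // 32.0
        System.out.println(magnitude(a)); // ≈ 3.74166
        System.out.println(magnitude(b)); // ≈ 8.77496
    }
}
```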

Calculating cosine similarity

Cosine similarity can be expressed mathematically as,

\[similarity(\overrightarrow{a} , \overrightarrow{b}) = \cos\theta = \frac{\overrightarrow{a} \cdot \overrightarrow{b} }{\parallel \overrightarrow{a} \parallel \parallel \overrightarrow{b} \parallel}\]

Expanding it a little further (I know... don't panic),

\[\frac{\overrightarrow{a} \cdot \overrightarrow{b}}{\parallel \overrightarrow{a} \parallel \parallel \overrightarrow{b} \parallel} = \frac{\sum_{i=1}^n a_{i} b_{i}}{\sqrt{\sum_{i=1}^n(a_{i})^2} \times {\sqrt{\sum_{i=1}^n(b_{i})^2}}}\]

This basically means,

  1. Get the dot product of vectors a and b
  2. Multiply magnitude a and magnitude b
  3. Divide the dot product of vectors a and b by the product of magnitude a and magnitude b.
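The three steps above can be sketched for plain numeric vectors as follows (a toy example; the class name is mine):

```java
public class CosineDemo {

    static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0, magA = 0, magB = 0;
        for (int i = 0; i < a.length; i++) {
            dot  += a[i] * b[i]; // step 1: dot product
            magA += a[i] * a[i];
            magB += b[i] * b[i];
        }
        // steps 2 and 3: divide by the product of the magnitudes
        return dot / (Math.sqrt(magA) * Math.sqrt(magB));
    }

    public static void main(String[] args) {
        // 32 / (√14 × √77) ≈ 0.9746
        System.out.println(cosineSimilarity(
                new double[]{1, 2, 3}, new double[]{4, 5, 6}));
    }
}
```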

Now that the math is out of the way, we can begin applying this algorithm to strings.

Working with strings

The first step is to turn our strings into vectors that we can work with. We do this by considering the Term Frequency of unique words or letters in a given string. Taking the following two strings as an example,

a. He is the hero Gotham deserves  
b. but not the one it needs right now.  

We can identify 13 unique words across these two strings. We build our vectors by noting how many times each word occurs in each string.

      Word       a        b
1.    the        1        1  
2.    he         1        0  
3.    is         1        0  
4.    it         0        1  
5.    right      0        1  
6.    needs      0        1  
7.    but        0        1  
8.    hero       1        0  
9.    deserves   1        0  
10.   not        0        1  
11.   now        0        1  
12.   one        0        1  
13.   gotham     1        0  

Now we have two vectors

\[\overrightarrow{a} = (1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1)\]

\[\overrightarrow{b} = (1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0)\]

With this we can easily calculate the dot product and magnitudes of the vectors using the method that was illustrated earlier.

 a . b = 1, ||a|| = 2.44948974278, ||b|| = 2.82842712475

Cosine similarity = 1 / (2.44948974278 × 2.82842712475)  
Cosine similarity = 0.144337567297  

Well, if it wasn't already obvious that those two strings were not very similar (they share only one word, 'the'), now we have data to prove it.

For word-level cosine similarity, we consider the unique characters in each string to construct our vectors,

a. bob  
b. rob  

I'm sure you already see where this is going, but I'll show you anyway.

      Letter       a        b
1.    b            2        1  
2.    o            1        1  
3.    r            0        1  

Now we just compute the cosine similarity using the vectors

a . b = 3, ||a|| = 2.2360679775, ||b|| = 1.73205080757

Cosine similarity = 3 / (2.2360679775 × 1.73205080757)  
Cosine similarity = 0.774596669241  

It turns out, 'bob' and 'rob' are indeed similar; approximately 77% according to the results.

It's important to note that this algorithm does not consider the order of the words; it only uses their frequency as a measure.
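To see why, consider two sentences with the same words in a different order: their term-frequency vectors are identical, so their cosine similarity is exactly 1 even though they mean opposite things. A minimal sketch (the helper below is my own illustration of the term-frequency idea):

```java
import java.util.HashMap;
import java.util.Map;

public class OrderDemo {

    // Count how often each word occurs in the text
    static Map<String, Integer> termFrequency(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : text.split("\\W+")) {
            tf.merge(term, 1, Integer::sum);
        }
        return tf;
    }

    public static void main(String[] args) {
        Map<String, Integer> a = termFrequency("man bites dog");
        Map<String, Integer> b = termFrequency("dog bites man");
        // Identical frequency maps -> cosine similarity of 1.0
        System.out.println(a.equals(b)); // true
    }
}
```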

Diving into the code

I'm sure you've guessed that it's actually pretty easy to turn this algorithm into working code. Here's my take on the solution².

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;

    /**
     * @param terms values to analyze
     * @return a map containing unique
     * terms and their frequency
     */
    public static Map<String, Integer> getTermFrequencyMap(String[] terms) {
        Map<String, Integer> termFrequencyMap = new HashMap<>();
        for (String term : terms) {
            Integer n = termFrequencyMap.get(term);
            n = (n == null) ? 1 : n + 1;
            termFrequencyMap.put(term, n);
        }
        return termFrequencyMap;
    }

    /**
     * @param text1 the first string to compare
     * @param text2 the second string to compare
     * @return cosine similarity of text1 and text2
     */
    public static double cosineSimilarity(String text1, String text2) {
        //Get vectors
        Map<String, Integer> a = getTermFrequencyMap(text1.split("\\W+"));
        Map<String, Integer> b = getTermFrequencyMap(text2.split("\\W+"));

        //Get the words common to both sequences
        HashSet<String> intersection = new HashSet<>(a.keySet());
        intersection.retainAll(b.keySet());

        double dotProduct = 0, magnitudeA = 0, magnitudeB = 0;

        //Calculate dot product
        for (String item : intersection) {
            dotProduct += a.get(item) * b.get(item);
        }

        //Calculate magnitude a
        for (String k : a.keySet()) {
            magnitudeA += Math.pow(a.get(k), 2);
        }

        //Calculate magnitude b
        for (String k : b.keySet()) {
            magnitudeB += Math.pow(b.get(k), 2);
        }

        //return cosine similarity
        return dotProduct / Math.sqrt(magnitudeA * magnitudeB);
    }

This code splits the given strings into words using the regex pattern \W+. To adapt this for characters, (?!^) should be used instead.
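A quick demonstration of the two split patterns (the sample strings are my own):

```java
import java.util.Arrays;

public class SplitDemo {
    public static void main(String[] args) {
        // \W+ splits on runs of non-word characters, yielding words
        System.out.println(Arrays.toString("rob, bob!".split("\\W+")));
        // prints [rob, bob]

        // (?!^) matches the zero-width position before every character
        // except the first, yielding individual characters
        System.out.println(Arrays.toString("bob".split("(?!^)")));
        // prints [b, o, b]
    }
}
```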

Using the algorithm for fuzzy string matching

As an experiment, I ran the algorithm against the OSX internal dictionary³, which contains about 235886 words, filtering out words with a similarity of less than 0.9. Running this against the keyword "hello" returned the following,

chello      index: 0.9354143466934853  
hell        index: 0.9258200997725514  
hellhole    index: 0.9799578870122228  
hello       index: 1.0  
helluo      index: 0.9354143466934853  
hole        index: 0.944911182523068  
holl        index: 0.9258200997725514  
holler      index: 0.9354143466934853  
molehill    index: 0.9091372900969896  
oilhole     index: 0.9116846116771036  
theelol     index: 0.9116846116771036  
wellhole    index: 0.944911182523068

12 of 235886 potential matches for 'hello'  

As an exact match, "hello" gets a value of 1, with second place going to "hellhole" and a tie for third between "wellhole" and "hole".
It's clear that this does a pretty good job of picking out potential matches. However, there are clear limitations to the approach. Since the order of the letters in a string is not considered, a word like 'wellhole' was ranked as high as 'hole'. This is because only the term frequency of the characters in the word is measured. The results will need to be further refined if finer granularity is required.

Got something to add? Join the conversation on reddit

  1. Data that does not have any form of predefined model.

  2. I know it's not the most efficient program in the world, but it gets the job done. If you have a better way of doing this, please reach out to me with your idea.

  3. Located in /usr/share/dict/words