String Similarity Made Easy

The setting

For a few years now I've had an itch to scratch, namely, finding a nice way of keeping a normalized bibliographic database to use with LaTeX [1].

BibTeX unfortunately lends itself to a lot of variability in how entries are stored, and in my experience, it's always been slightly painful to work with, especially when collaborating with other authors. Very often entries are incomplete, or repeated under different keys, or use inconsistent formatting (e.g., in how author names are handled, or titles capitalized).

I've been thinking a bit about how to lower the friction of dealing with these issues (and also finding a practical motivation to get some extra-curricular coding done), and part of the solution I have in mind is clj-bibtex, a small library for operating on BibTeX data.

Basically, the use case I have in mind for this is to read in a bunch of bibliography data, do some basic normalization on it (in particular on author names), and load it into a Datascript DB, where I can easily query the data.
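To give a rough idea of the Datascript part (the attribute names and schema below are hypothetical, not necessarily what clj-bibtex ends up using), the goal is to be able to do something like this:

  (require '[datascript.core :as d])

  ;; Hypothetical schema and attributes, just to illustrate the idea;
  ;; the actual clj-bibtex schema may differ.
  (def db
    (d/db-with
     (d/empty-db {:entry/authors {:db/valueType   :db.type/ref
                                  :db/cardinality :db.cardinality/many}})
     [{:db/id -1 :author/name "Varela, Martin"}
      {:db/id -2 :entry/title "Some paper title" :entry/authors [-1]}]))

  ;; All titles by a given (canonical) author name:
  (d/q '[:find ?title
         :where
         [?a :author/name "Varela, Martin"]
         [?e :entry/authors ?a]
         [?e :entry/title ?title]]
       db)
  ;;=> #{["Some paper title"]}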

It is often the case that author names in BibTeX entries are misspelled, or that a "Last, F." format is used instead of "Last, First", etc. I would like to quickly identify these cases, and later be able to merge those entities (in the DB) so that a canonical name is available for each author. Likewise, it is common for duplicate entries to find their way into .bib files, with different keys and small differences in the title fields (differing capitalization styles are a typical issue).

In order to solve these two issues, I had to look at string similarity metrics.

The problem

What I needed, then, was a way to determine, given two strings, how similar they are. This is a problem that has been solved many times over, so I was not about to invent a new similarity metric, but rather find one that worked well in this particular context, and was simple (and preferably easy!) to implement.

The canonical string similarity (or distance) metric is the Levenshtein distance, which is a so-called edit distance metric. Quoting the wiki, this metric is

"the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other."

There are a number of other edit distance metrics, which allow different operations to be considered, and hence work better or worse for different use cases. For this use case, the Damerau-Levenshtein metric could be a better fit, since it also allows for letter transposition (which is a common typo).
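For reference, here is a minimal Clojure sketch (not taken from clj-bibtex) of plain Levenshtein distance, computed row by row over the dynamic-programming table:

  ;; Minimal sketch, not from clj-bibtex: plain Levenshtein distance,
  ;; carrying one row of the DP table as the accumulator.
  (defn levenshtein
    "Minimum number of single-character edits turning `s1` into `s2`."
    [s1 s2]
    (loop [prev (vec (range (inc (count s2)))) ; previous row of the DP table
           i    1]
      (if (> i (count s1))
        (peek prev)
        (recur (reduce (fn [row j]
                         (conj row
                               (min (inc (peek row))      ; insertion
                                    (inc (nth prev j))    ; deletion
                                    (+ (nth prev (dec j)) ; substitution
                                       (if (= (nth s1 (dec i)) (nth s2 (dec j))) 0 1)))))
                       [i]
                       (range 1 (inc (count s2))))
               (inc i)))))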

Upon closer inspection, I noticed that none of these algorithms lent itself to a nice, efficient and TCO-friendly implementation. They can of course be implemented with loop or as a reduction, carrying a rather large accumulator, but I was feeling a bit lazy, and I suspected the end result would not be very nice to look at, so I decided to look around a bit more. A few more minutes of googling brought me to the Sørensen-Dice coefficient, which is, per the wiki:

"a statistic used to gauge the similarity of two samples"

and, given two sets $X$ and $Y$, is defined as

$$ SDC = \frac{2\vert X \cap Y \vert}{\vert X \vert + \vert Y \vert} $$

The solution

In order to make this work for strings, we need to somehow tokenize them. Using single letters, or whole words, is unlikely to yield the results we want, but if we consider digrams [2], we get very good results for the problem we are trying to solve. As a bonus, the implementation is trivial.
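As a quick worked example (an arbitrary pair of words, not taken from the bib data): the digrams of "night" are {"ni", "ig", "gh", "ht"} and those of "nacht" are {"na", "ac", "ch", "ht"}. Only "ht" is shared, so

$$ SDC = \frac{2 \cdot 1}{4 + 4} = 0.25 $$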

First, we need to turn our strings into a set of digrams:

  (require '[net.cgrand.xforms :as x])

  (defn digrams
    "Creates a set of all digrams in `s`"
    [s]
    (let [xform (comp
                 (x/partition 2 1)
                 (map #(apply str %)))]
      (into #{} xform s)))
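
A quick REPL check (the printed order of the set elements may of course differ):

  (digrams "hello")
  ;;=> #{"he" "el" "ll" "lo"}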


Then, we can simply use the resulting sets in the definition of the Sørensen-Dice coefficient (noting that it's undefined if both strings are empty, and that when either string has length 1, the coefficient is 1 if the strings are equal and 0 otherwise):

  (require '[clojure.set :as set])

  (defn sorensen-dice
      "Computes the Sørenson-Dice coefficient between `s1` and `s2`.
      At least one of the arguments must have length > 0."
    [s1 s2]
    {:pre [(or (> (count s1) 0)
               (> (count s2) 0))]}
    (if (or (= 1 (count s1))
            (= 1 (count s2)))
      (if (= s1 s2) 1.0 0.0)
      (let [b1 (digrams s1)
            b2 (digrams s2)]
        (/ (* 2.0 (count (set/intersection b1 b2)))
           (+ (count b1) (count b2))))))
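
As a quick sanity check, the worked example from before:

  (sorensen-dice "night" "nacht")
  ;;=> 0.25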

We can put this to use on an actual .bib file that has some of the aforementioned problems. Assuming we have all the author names and all the paper titles in all-authors and all-titles, respectively:

  (defn find-similar [coll]
    (let [xform (comp
                 (x/sort)
                 (x/partition 2 1)
                 (map (fn [[s1 s2]] [s1 s2 (sorensen-dice s1 s2)]))
                 (filter (fn [[_ _ sd]] (> sd 0.7))))]
      (into [] xform coll)))

  (find-similar all-authors)
  ;;=>
  ;;[["Agapiou, G." "Agapiou, S." 0.8]
  ;; ["Alcaraz-Calero, J. M." "Alcaraz-Calero, Jose M" 0.8292682926829268]
  ;; ["Awobuluyi, O." "Awobuluyi, Olatunde" 0.7333333333333333]
  ;; ["Chen, Zhan" "Chen, Zhibo" 0.7368421052631579]
  ;; ["Ghinea, George" "Ghinea, Gheorghita" 0.7857142857142857]
  ;; ["Gong, Yan" "Gong, Yi" 0.8]
  ;; ["Guyard, Frederic" "Guyard, Frédéric" 0.7333333333333333]
  ;; ["Heegaard, P. E." "Heegaard, Poul" 0.7407407407407407]
  ;; ["Heegaard, Poul" "Heegaard, Poul E" 0.9285714285714286]
  ;; ["Heegaard, Poul E" "Heegaard, Poul E." 0.967741935483871]
  ;; ["Hoßfeld, T." "Hoßfeld, Tobias" 0.75]
  ;; ["Kara, Peter A" "Kara, Peter A." 0.96]
  ;; ["Liu, Xi" "Liu, Xin" 0.9230769230769231]
  ;; ["Martini, Maria G" "Martini, Maria G." 0.9629629629629629]
  ;; ["Metzger, F." "Metzger, Florian" 0.72]
  ;; ["Nightingale, Edmund B" "Nightingale, J." 0.7058823529411765]
  ;; ["Nightingale, J." "Nightingale, James" 0.8387096774193549]
  ;; ["Schatz, R" "Schatz, Raimund" 0.7272727272727273]
  ;; ["Schatz, Raimund" "Schatz, Raimund." 0.9655172413793104]
  ;; ["Shafiq, M. Zubair" "Shafiq, Muhammad Zubair" 0.7567567567567568]
  ;; ["Skorin-Kapov, L." "Skorin-Kapov, Lea" 0.9032258064516129]
  ;; ["Varela, M." "Varela, Martin" 0.7619047619047619]
  ;; ["Varela, Martin" "Varela, Martn" 0.8695652173913043]
  ;; ["Wang, Ning" "Wang, Ping" 0.75]
  ;; ["Wang, Q." "Wang, Qi" 0.8571428571428571]
  ;; ["Yang, Zhe" "Yang, Zhen" 0.9411764705882353]
  ;; ["Zhang, Lin" "Zhang, Ping" 0.7777777777777778]]

  (find-similar all-titles)
  ;;=>
  ;;[["Adaptive psychometric scaling for video quality assessment"
  ;;  "Adaptive testing for video quality assessment"
  ;;  0.8085106382978723]
  ;; ["OTT-ISP Joint Service Management: A Customer Lifetime Value Based Approach"
  ;;  "OTT-ISP Joint service management: a customer lifetime value based approach"
  ;;  0.7786259541984732]
  ;; ["OTT-ISP Joint service management: a customer lifetime value based approach"
  ;;  "OTT-ISP joint service management: a Customer Lifetime Value based approach "
  ;;  0.8702290076335878]
  ;; ["Understanding the impact of network dynamics on mobile video user engagement"
  ;;  "Understanding the impact of video quality on user engagement"
  ;;  0.743801652892562]]

Even though there are some false positives, it does seem to capture the problematic fields pretty well. We can also play a bit with the threshold: based on my testing so far, titles and authors could use different thresholds, so the actual functions in clj-bibtex that take care of this accept an optional fuzziness argument.
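As a rough sketch (a hypothetical variant, not the actual clj-bibtex API), that just means threading the threshold through:

  ;; Hypothetical threshold-aware variant; the real clj-bibtex functions
  ;; may look different.
  (defn find-similar*
    ([coll] (find-similar* coll 0.7))
    ([coll fuzziness]
     (let [xform (comp
                  (x/sort)
                  (x/partition 2 1)
                  (map (fn [[s1 s2]] [s1 s2 (sorensen-dice s1 s2)]))
                  (filter (fn [[_ _ sd]] (> sd fuzziness))))]
       (into [] xform coll))))

  ;; e.g. a stricter threshold for titles:
  (find-similar* all-titles 0.85)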

Footnotes


  1. I am aware of Zotero, and other similar tools, but they never seem to stick, for some reason, and they definitely don't satisfy my NIH syndrome. I should try it again, though...
  2. This is the standard way to implement Sørensen-Dice for string similarity, by the way.