# The setting

For a few years now I've had an itch to scratch, namely, finding a nice way of
keeping a normalized bibliographic database to use with LaTeX ^{1}.

BibTeX unfortunately lends itself to a lot of variability in how entries are stored, and in my experience, it's always been slightly painful to work with, especially when collaborating with other authors. Very often entries are incomplete, or repeated under different keys, or use inconsistent formatting (e.g., in how author names are handled, or titles capitalized).

I've been thinking a bit about how to lower the friction of dealing with these issues (and also finding a practical motivation to get some extra-curricular coding done), and part of the solution I have in mind is clj-bibtex, a small library for operating on BibTeX data.

Basically, the use case I have in mind for this is to load up a bunch of bibliography data, do some basic normalization on it (in particular of author names), and store it in a Datascript DB, where I can easily query it.

It is often the case that author names in BibTeX entries are misspelled, or
that a *"Last, F."* format is used instead of *"Last, First"*, etc. I would like
to quickly identify these cases, and later be able to merge those entities (in
the DB) so that a canonical name is available for each author. Likewise, it is
common for duplicate entries to find their way into `.bib` files, with
different keys and small differences in the title fields (different
capitalization styles being a typical issue).

In order to solve these two issues, I had to look at string similarity metrics.

# The problem

What I needed, then, was a way to measure how similar two given strings are. This is a problem that has been solved many times over, so I was not about to invent a new similarity metric, but rather to find one that worked well in this particular context and was simple (and preferably easy!) to implement.

The canonical string similarity (or distance) metric is the
Levenshtein distance, which is a so-called *edit distance* metric. Quoting the
wiki, this metric is

*"the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other."*

There are a number of other *edit distance* metrics, which consider different
sets of operations, and hence work better or worse for different use
cases. For this use case, the Damerau-Levenshtein metric could be a better
fit, since it also allows for transposition of adjacent characters (a common typo).
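To make the definition concrete, here is a naive, memoized sketch of the Levenshtein distance in Clojure, following the recursive textbook definition directly (this is just an illustration, not code from `clj-bibtex`):

```
;; Naive recursive Levenshtein distance, memoized to avoid
;; exponential blowup. At each step we try deletion, insertion,
;; and substitution, and keep the cheapest.
(def levenshtein
  (memoize
   (fn [a b]
     (cond
       (empty? a) (count b)
       (empty? b) (count a)
       :else (min (inc (levenshtein (rest a) b))       ; deletion
                  (inc (levenshtein a (rest b)))       ; insertion
                  (+ (if (= (first a) (first b)) 0 1)  ; substitution
                     (levenshtein (rest a) (rest b))))))))

(levenshtein "kitten" "sitting")
;; => 3
```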

Upon closer inspection, I noticed that none of these algorithms lent itself to
a nice, efficient and TCO-friendly implementation. They can of course be
implemented with `loop`, or as a reduction carrying a rather large
accumulator, but I was feeling a bit lazy, and I suspected the end result
would not be very nice to look at, so I decided to look around a bit
more. A few more minutes of googling brought me to the
Sørensen-Dice coefficient, which is, per the wiki:

*"a statistic used to gauge the similarity of two samples"*

and is defined, for two sets $X$ and $Y$, as

$$ SDC = \frac{2\vert X \cap Y \vert}{\vert X \vert + \vert Y \vert} $$
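As a quick sanity check of the definition, computed on two small made-up sets:

```
(require '[clojure.set :as set])

;; SDC straight from the definition: two shared elements out of
;; sets of size 4 and 3 gives 2*2 / (4+3) = 4/7.
(let [X #{:a :b :c :d}
      Y #{:c :d :e}]
  (/ (* 2.0 (count (set/intersection X Y)))
     (+ (count X) (count Y))))
;; => 0.5714285714285714
```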

# The solution

In order to make this work for strings, we need to tokenize them somehow.
Using single letters or whole words is unlikely to yield the results we want, but
if we consider *digrams* ^{2}, we get very good results for the problem we are
trying to solve. As a bonus, the implementation is trivial.

First, we need to turn our strings into a set of digrams:

```
(require '[net.cgrand.xforms :as x])

(defn digrams
  "Creates a set of all digrams in `s`"
  [s]
  (let [xform (comp (x/partition 2 1)
                    (map #(apply str %)))]
    (into #{} xform s)))
```

Then, we can simply use the resulting sets in the definition of the Sørensen-Dice coefficient. Note that the coefficient is undefined when both strings are empty, and that when either string has length 1 (and thus yields no digrams), we take the coefficient to be 1 if the strings are equal and 0 otherwise:

```
(require '[clojure.set :as set])

(defn sorensen-dice
  "Computes the Sørensen-Dice coefficient between `s1` and `s2`.
  At least one of the arguments must have length > 0."
  [s1 s2]
  {:pre [(or (> (count s1) 0)
             (> (count s2) 0))]}
  (if (or (= 1 (count s1))
          (= 1 (count s2)))
    (if (= s1 s2) 1.0 0.0)
    (let [b1 (digrams s1)
          b2 (digrams s2)]
      (/ (* 2.0 (count (set/intersection b1 b2)))
         (+ (count b1) (count b2))))))
```
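As a quick check, the classic "night"/"nacht" example works out as expected. Here `digrams*` is a dependency-free stand-in for the xforms-based `digrams` above, using only `clojure.core/partition`:

```
(require '[clojure.set :as set])

;; Core-only digram helper, equivalent to `digrams` minus the
;; net.cgrand.xforms dependency.
(defn digrams* [s]
  (into #{} (map #(apply str %)) (partition 2 1 s)))

(let [b1 (digrams* "night")   ; #{"ni" "ig" "gh" "ht"}
      b2 (digrams* "nacht")]  ; #{"na" "ac" "ch" "ht"}
  (/ (* 2.0 (count (set/intersection b1 b2)))
     (+ (count b1) (count b2))))
;; => 0.25 (only "ht" is shared)
```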

We can put this to use on an actual `.bib` file that has some of the
aforementioned problems. Assuming we have all the author names and all the paper
titles in `all-authors` and `all-titles`, respectively:

```
(defn find-similar [coll]
  (let [xform (comp (x/sort)
                    (x/partition 2 1)
                    (map (fn [[s1 s2]] [s1 s2 (sorensen-dice s1 s2)]))
                    (filter (fn [[_ _ sd]] (> sd 0.7))))]
    (into [] xform coll)))

(find-similar all-authors)
(find-similar all-authors)
;;=>
;;[["Agapiou, G." "Agapiou, S." 0.8]
;; ["Alcaraz-Calero, J. M." "Alcaraz-Calero, Jose M" 0.8292682926829268]
;; ["Awobuluyi, O." "Awobuluyi, Olatunde" 0.7333333333333333]
;; ["Chen, Zhan" "Chen, Zhibo" 0.7368421052631579]
;; ["Ghinea, George" "Ghinea, Gheorghita" 0.7857142857142857]
;; ["Gong, Yan" "Gong, Yi" 0.8]
;; ["Guyard, Frederic" "Guyard, Frédéric" 0.7333333333333333]
;; ["Heegaard, P. E." "Heegaard, Poul" 0.7407407407407407]
;; ["Heegaard, Poul" "Heegaard, Poul E" 0.9285714285714286]
;; ["Heegaard, Poul E" "Heegaard, Poul E." 0.967741935483871]
;; ["Hoßfeld, T." "Hoßfeld, Tobias" 0.75]
;; ["Kara, Peter A" "Kara, Peter A." 0.96]
;; ["Liu, Xi" "Liu, Xin" 0.9230769230769231]
;; ["Martini, Maria G" "Martini, Maria G." 0.9629629629629629]
;; ["Metzger, F." "Metzger, Florian" 0.72]
;; ["Nightingale, Edmund B" "Nightingale, J." 0.7058823529411765]
;; ["Nightingale, J." "Nightingale, James" 0.8387096774193549]
;; ["Schatz, R" "Schatz, Raimund" 0.7272727272727273]
;; ["Schatz, Raimund" "Schatz, Raimund." 0.9655172413793104]
;; ["Shafiq, M. Zubair" "Shafiq, Muhammad Zubair" 0.7567567567567568]
;; ["Skorin-Kapov, L." "Skorin-Kapov, Lea" 0.9032258064516129]
;; ["Varela, M." "Varela, Martin" 0.7619047619047619]
;; ["Varela, Martin" "Varela, Martn" 0.8695652173913043]
;; ["Wang, Ning" "Wang, Ping" 0.75]
;; ["Wang, Q." "Wang, Qi" 0.8571428571428571]
;; ["Yang, Zhe" "Yang, Zhen" 0.9411764705882353]
;; ["Zhang, Lin" "Zhang, Ping" 0.7777777777777778]]
(find-similar all-titles)
;;=>
;;[["Adaptive psychometric scaling for video quality assessment"
;; "Adaptive testing for video quality assessment"
;; 0.8085106382978723]
;; ["OTT-ISP Joint Service Management: A Customer Lifetime Value Based Approach"
;; "OTT-ISP Joint service management: a customer lifetime value based approach"
;; 0.7786259541984732]
;; ["OTT-ISP Joint service management: a customer lifetime value based approach"
;; "OTT-ISP joint service management: a Customer Lifetime Value based approach "
;; 0.8702290076335878]
;; ["Understanding the impact of network dynamics on mobile video user engagement"
;; "Understanding the impact of video quality on user engagement"
;; 0.743801652892562]]
```

Even though there are some false positives, it does seem to capture the
problematic fields pretty well. We can also play a bit with the threshold:
based on my testing so far, titles and authors work best with different
thresholds, so the actual functions in `clj-bibtex` that take care of this
take an optional fuzziness argument.
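I won't reproduce the `clj-bibtex` API here, but a thresholded variant might look like the sketch below. The names `find-similar*`, `digrams*`, and `sorensen-dice*` are hypothetical, core-only stand-ins (no xforms dependency), and the actual library signature may differ:

```
(require '[clojure.set :as set])

;; Core-only stand-ins for the helpers defined earlier.
(defn digrams* [s]
  (into #{} (map #(apply str %)) (partition 2 1 s)))

(defn sorensen-dice* [s1 s2]
  (if (or (= 1 (count s1)) (= 1 (count s2)))
    (if (= s1 s2) 1.0 0.0)
    (let [b1 (digrams* s1)
          b2 (digrams* s2)]
      (/ (* 2.0 (count (set/intersection b1 b2)))
         (+ (count b1) (count b2))))))

;; `find-similar` with an optional similarity threshold,
;; defaulting to the 0.7 used above.
(defn find-similar*
  ([coll] (find-similar* coll 0.7))
  ([coll threshold]
   (->> (sort coll)
        (partition 2 1)
        (mapv (fn [[s1 s2]] [s1 s2 (sorensen-dice* s1 s2)]))
        (filterv (fn [[_ _ sd]] (> sd threshold))))))

(find-similar* ["Varela, M." "Varela, Martin" "Wang, Q."] 0.7)
;; => [["Varela, M." "Varela, Martin" 0.7619047619047619]]
```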