Jans Blog

Cleaning some data with Clojure 3

30.12.2014

The first post in this series was about the original problem: there is a data-set of strings with potential duplicates and many typos. The second was about computing the similarity of two strings. In this post we will focus on the question

Can we group multiple strings of a set by similarity?

Matching the names: The first step is to match the names. I hope, that the following code is documented good enough.

(defn- singleMap 
  "Produces a function that takes one argument str2 and computes
  [str1 str2 (f str1 str2)]."
  [f str1]
  (fn [str2]
    [str1 str2 (f str1 str2)]))

(defn- split? 
  "Produces a split predicate based on a."
  [str] 
  (fn [x] 
    (pos? (compare str x))))

(defn- getAfterStr 
  "Returns the elements after str of the supplied sequence"
  [str strings]
  (get (split-with (split? str) strings) 1))

(defn- singleMapWAll
  "Produces a producer function that maps the matching function
  with a to all elements after a."
  [f strings] 
  (fn 
    [str1] 
    (map (singleMap f str1) getAfterStr)))

(defn match
  "Uses the given similarity function f and matches all
  strings with each other string once"
  [f strings]
  ; Applys the function produced by singleMapWAll to strings
  ; and concatenates the result
  (mapcat (singleMapWAll f strings) strings)) 

The following table provides some more information. a, b, c, d, and e are the strings to be matched. “f str1” denotes that a function “f str1” is mapped onto the corresponding string in the last row (which will be used as str2 in the matching).

(f d)
(f c f c)
(f b f b f b)
(f a f a f a f a)
a b c d e

Whereas map would produce a mapping as above, mapcat applys the mapping and concatenates the results as displayed below.

(f a f a f a f a f b f b f b f c f c f d)
a b c d e c d e d e e

After all matching fs are applied, we get our result tuples in one list:

([a b (f a b)] [a c (f a c)] [d e (f d e)])

Filtering the results: This filtering step is just an intermediate step. We will use the results find good “grouping” values. Remember that a result vector has the form [str1 str2 (similarity str1 str2).

(defn- minSimFilterBuilder 
  [min-sim]
  (fn [x]
    (>= (get x 2) min-sim)))

(defn- unequalName? 
  [x]
  (not (= (get x 0) (get x 1))))

(defn cleanMatches 
  [matches min-sim]
   (filter (minSimFilterBuilder min-sim) 
    (filter unequalName? matches)))

minSimFilterBuilder is used to construct filter functions based on a minimum similarity value. The function unequalName? simply checks, if the names of the result tuple are different. The result of cleanMatches will then contain only matches with distinct strings that are “similar enough” for us.

Constructing clusters: sadg