A mid-sized insurance company based in Peoria, IL needed a technique for grouping the policies stored in its database by the unique client who owned each policy. Unfortunately, insurance regulations required that new policies be entered into the database manually. Consequently, many of the policies had misspelled names, transposed digits in their SSNs, malformed addresses, etc. To counter this challenge, the company needed a way to fuzzy-cluster its policies by client that accounts for these errors.
A common problem for database administrators is grouping unique entities when the integrity of the records is questionable. In other words, names might be missing or misspelled, and records might be incomplete yet still individually identifiable by human intuition. With tens of thousands of records to identify, human intuition must be supplemented with technical muscle. One approach to this problem is to apply Levenshtein distance across the fields of two records to determine whether they are related. Groups will be formed by similarity, and weights will be applied to the data fields according to their value to us (e.g., SSN is more identifying than suffix or ZIP code).
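To make the core operation concrete, here is a minimal pure-Python sketch of Levenshtein distance and a simple similarity ratio derived from it. The normalization in `lev_ratio` is illustrative only; libraries such as fuzzywuzzy use a related but slightly different formula.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def lev_ratio(a: str, b: str) -> float:
    """Similarity as a percentage: 100 = identical, 0 = nothing shared."""
    if not a and not b:
        return 100.0
    return 100.0 * (1.0 - levenshtein(a, b) / max(len(a), len(b)))

levenshtein("kitten", "sitting")   # -> 3
lev_ratio("HARVEY", "HARVEY")      # -> 100.0
```
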
The synthesized data that we will be studying comes in the form of records: I have simulated data fields with names, addresses, and SSNs. The dataset contains multiple identities, so our algorithm should identify multiple groups. Many of these identities are one-to-one, such that a single identity has a single record, but several identities are described by several records in a one-to-many scenario. It is my hope that the algorithm will successfully identify these relationships.
 | FIRST | MIDDLE | LAST | ADDRESS | CITY | STATE | ZIP | SSN |
---|---|---|---|---|---|---|---|---|
0 | HARVEY | B | MILK | 1930 RECRUIT YOU AVE | SAN FRANCISCO | CA | 18896.0 | 314159265.0 |
1 | HARVEY | B | MILK | 1930 RECRUIT YOU AVE | SF | CA | 18896.0 | 314159265.0 |
2 | H | B | MILK | 1930 RECRUIT YOU AVE | SF | CA | NaN | 314159265.0 |
3 | H | NaN | NaN | 1930 RECRUIT U | NaN | NaN | 18896.0 | NaN |
4 | RICK | P | SANCHEZ | EARTH C-137 | TACOMA | WA | 11893.0 | 929601596.0 |
5 | PYOTR | I | TCHAIKOVSKY | 1840 N ST PETER | VOTKINSK | GA | 11893.0 | 155633999.0 |
6 | RUPERT | L | DONLEY | 518 EVEREST BLVD | UDEMY | FL | 12358.0 | 123789235.0 |
7 | ELEANOR | P | TWINE | 7878 SEVEN EIGHT | SMALL TOWN | LA | 26916.0 | 148978105.0 |
8 | MICHELL | M | MENENDEZ | 58 CLOUDLESS SKY WAY | NaN | NaN | 32156.0 | 203254687.0 |
9 | CHARLES | F | KANE | 794 BROKEN HOME AVE | ROSEBUD | PA | 56465.0 | 489324789.0 |
10 | HOLDEN | K | CAULFIELD | 79 IRONICAL DRIVE | INNOCENCE | PA | 45687.0 | 102354548.0 |
11 | ATTICUS | J | FINCH | 1930 JUSTICE AVENUE | MAYCOMB | AL | 36460.0 | 121611201.0 |
12 | KIT | H | SHARBER | PARADISE PLACE | ALMA | OR | 10002.0 | 485943315.0 |
13 | PAUL | MUAD'DIB | ATREIDES | 314 PALACIAL ST | CALADAN | CA | 15587.0 | 111555888.0 |
14 | PAUL | MAWDIB | ATRAYDEZ | PALACIAL ST | CALADAN | CA | 15587.0 | 111555888.0 |
15 | PAUL | MAWDIB | ATRAYDEZ | 314 PALACIAL STREET | CALIDAN | CA | 15587.0 | 111555888.0 |
16 | MICHEAL | J | BURRY | 20400 STEVENS CREEK BLVD | CUPERTINO | CA | 95014.0 | 158971324.0 |
17 | MIKE | NaN | BURRY | NaN | CUPERTINO | CA | NaN | 158971324.0 |
My process for clustering will follow a standard card analogy. Imagine that we have a deck of cards and we wish to cluster them. By what? Suit? Royalty? Value? Say that we want to sort by suit, but the symbols for the suits have been obscured and disfigured. We know that hearts and diamonds, or clubs and spades, look similar, but there is still a way to compare their similarity. Our process follows:

1. Draw the first card and let it found the first group.
2. Draw the next card and compare it against the representative of each existing group.
3. If the card is sufficiently similar to its most similar representative, it joins that group; otherwise, it founds a new group.
4. Repeat until the deck is exhausted.

This process ensures that a new group can only be formed by failing to belong to an existing group.
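As a sketch, the card-sorting loop above might look like the following, with a hypothetical `similarity` function standing in for the suit comparison:

```python
def cluster(cards, similarity, threshold):
    """Greedy single-pass clustering: each card joins its most similar
    existing group, or founds a new group if nothing is close enough."""
    groups = []  # each group is a list of cards; its first card acts as representative
    for card in cards:
        best, best_score = None, -1.0
        for group in groups:
            score = similarity(card, group[0])   # compare against the representative
            if score > best_score:
                best, best_score = group, score
        if best is not None and best_score >= threshold:
            best.append(card)          # similar enough: join the existing group
        else:
            groups.append([card])      # failed to belong anywhere: found a new group
    return groups

# Toy run with a crude positional similarity on 3-character "cards":
cluster(["AAA", "AAB", "ZZZ"],
        lambda a, b: sum(x == y for x, y in zip(a, b)) / 3.0,
        0.6)
# -> [['AAA', 'AAB'], ['ZZZ']]
```
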
Obviously, the card analogy does not entirely apply to the specifics of this problem. We have records with features (entries for a column) to identify them, some of which matter more to us than others for the purpose of identification.
$R_n = \{ \; f_0, f_1, \ldots, f_N \; \}$
We express our system of priority by assigning weights to each feature by order of significance, like so:
$\vec{ \omega } = \left[ \omega_0, \omega_1, \ldots, \omega_N \right] \quad \text{s.t.} \quad \sum_{n = 0}^{N} \omega_n = 1.0$
Individual features of a record can be compared to the features of another record using Levenshtein distance. A ratio can be obtained which reflects the percentage of similarity between the features. To express the similarity between the records, however, we must apply the weights to the individual feature similarities. We can find the similarity $\text{s}$:
$R_a = \{ \; f_{a0}, f_{a1}, ... \; f_{aN} \; \} \\ R_b = \{ \; f_{b0}, f_{b1}, ... \; f_{bN} \; \}$
$\text{s} = \omega_0 \cdot \text{Lev}( f_{a0}, f_{b0} ) \; + \; \omega_1 \cdot \text{Lev}( f_{a1}, f_{b1} ) \; + \; \ldots \; + \; \omega_N \cdot \text{Lev}( f_{aN}, f_{bN} )$
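A sketch of this weighted similarity, with `difflib.SequenceMatcher` from the standard library standing in for the Levenshtein ratio (the final study uses fuzzywuzzy's `fuzz.ratio`; the ratios differ slightly but the weighting works the same way):

```python
from difflib import SequenceMatcher

def record_similarity(rec_a, rec_b, weights):
    """Weighted sum of per-feature similarity ratios on a 0-100 scale."""
    s = 0.0
    for fa, fb, w in zip(rec_a, rec_b, weights):
        ratio = 100.0 * SequenceMatcher(None, str(fa), str(fb)).ratio()
        s += w * ratio
    return s

rec = ("HARVEY", "B", "MILK")
record_similarity(rec, rec, [0.5, 0.2, 0.3])   # identical records -> 100.0
```
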
If $R_a$ is an existing group unto itself and $\text{s}$ is greater than some threshold, big $\text{S}$, then $R_b$ joins the group for $R_a$. Then a new record $R_c$ is compared to the center of $R_a$ (more on this later). The process repeats.
However, if $\text{s} < \text{S}$, then $R_b$ becomes its own group. Then record $R_c$ is compared to records $R_a$ and $R_b$ and the process repeats.
If a record has missing features, then the algorithm suffers a statistical nightmare. For example, imagine that we have two records $R_a$ and $R_b$ with four features which are equal in all respects except for one: $f_{a2} = \text{NaN} \neq f_{b2}$. If all weights are equal, then the best similarity these records could obtain is 75%, which might be too low to exceed the threshold. If we want these records to be equated, then we must either lower our standards or adjust the weights, but how?
When a feature is missing, the algorithm drops that feature from the comparison and adjusts the weights while preserving the bias. Say that feature number 2 is missing; then let $\vec{ \omega_{A} }$ represent the adjusted weights:
$\text{Let} \; \vec{ \omega } = \left[ \omega_0, \omega_1, \omega_2, \omega_3 \right] \; \text{s.t.} \; \sum_{n = 0}^{N} \omega_n = 1.0$
$\vec{ \omega_{A} } = [ \omega_0, \omega_1, 0, \omega_3 ] \longrightarrow \left[ \frac{\omega_0}{\omega_0 + \omega_1+ \omega_3}, \frac{\omega_1}{\omega_0 + \omega_1+ \omega_3}, 0, \frac{\omega_3}{\omega_0 + \omega_1+ \omega_3} \right]$
Using numeric values, let's say that $\vec{ \omega } = [ 0.20, 0.20, 0.50, 0.10 ]$ with feature number 2 missing, then:
$\vec{ \omega_{A} } = [ 0.20, 0.20, 0, 0.10 ] \longrightarrow \left[ \frac{0.20}{0.20 + 0.20 + 0.10}, \frac{0.20}{0.20 + 0.20 + 0.10}, 0, \frac{0.10}{0.20 + 0.20 + 0.10} \right] = [ 0.40, 0.40, 0, 0.20 ]$
The surviving weights maintain their proportions to one another (e.g., the first weight is still twice the final weight). Our algorithm has adjusted without losing its assigned biases.
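The adjustment can be sketched as a small helper (`adjust_weights` is a hypothetical name, not part of the final function set):

```python
import numpy as np

def adjust_weights(weights, missing):
    """Zero out the weights of missing features, then renormalize so the
    surviving weights keep their proportions and sum to 1."""
    w = np.asarray(weights, dtype=float).copy()
    w[list(missing)] = 0.0
    total = w.sum()
    if total == 0.0:
        raise ValueError("all features are missing; records cannot be compared")
    return w / total

adjust_weights([0.20, 0.20, 0.50, 0.10], missing=[2])
# -> array([0.4, 0.4, 0. , 0.2])
```
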
One issue facing our algorithm lies in the representation of groups. Storing groups programmatically is an issue unto itself, but comparing individual records against a group is the key challenge. With numeric data, there exists an idea of the centroid: an n-dimensional center of mass for a set of records. With text data, however, the centroid becomes more abstract.
There are two approaches for finding the best representative of a group identity:

1. Choose the most complete record, i.e., the one with the fewest missing values.
2. Compute the inner similarity between the records of the group and choose the record most similar to all the others.

We will use approach 1 for groups with two or fewer records and approach 2 for groups with three or more records.
Another issue facing this system is how to store the hierarchy. My solution to this problem is a separate table with columns for group number, record number, and a centroid flag. We can then quickly select a group number, obtain its records, and query the main dataframe for the record-level details. Because the hierarchy is managed without pulling the heavier records through each query, this approach should lower the computational requirements.
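A minimal sketch of this hierarchy table and a group lookup, using illustrative values:

```python
import pandas as pd

# The light hierarchy table: one row per record assignment.
grp_col = ['Group Number', 'Record Number', 'Centroid Flag']
Groups = pd.DataFrame({grp_col[0]: [0, 0, 1],
                       grp_col[1]: [1, 2, 4],
                       grp_col[2]: [True, False, True]})

# All records belonging to group 0 -- only this small table is scanned:
members = Groups.loc[Groups['Group Number'] == 0, 'Record Number'].tolist()
# members -> [1, 2]; the heavy record-level details would then be fetched
# from the main dataframe with e.g. Data.loc[members].
```
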
Prior to developing sets of functions to execute my proposed algorithm, I want to develop the process to prove that it works in its raw form. Once I have established the viability of this route I can proceed to the next stage.
As described in the previous section, missing fields in a record can complicate the comparison process. However, the weights can be readjusted to preserve the balance such that they remain proportional to each other and still sum to a whole. First, we set up the weights used to establish the bias, or preference, toward one field of a record over another. The balance of the weights is verified before we continue.
>>> import math
>>> import numpy as np
>>> Omega = [ 0.18, 0.15, 0.20, 0.05, 0.05, 0.10, 0.02, 0.25 ]
>>> # compare with a tolerance; the floating-point sum may not be exactly 1
>>> if not math.isclose( np.sum( Omega ), 1.0 ):
        print('Weights are not in balance.')
We select two records from the dataset for our example comparison: `rec_ind = [1, 2]`.
 | FIRST | MIDDLE | LAST | ADDRESS | CITY | STATE | ZIP | SSN |
---|---|---|---|---|---|---|---|---|
1 | HARVEY | B | MILK | 1930 RECRUIT YOU AVE | SF | CA | 18896.0 | 314159265.0 |
2 | H | B | MILK | 1930 RECRUIT YOU AVE | SF | CA | NaN | 314159265.0 |
The first record is not missing any fields, but the second record is missing one field (field 6, ZIP), which we detect below.
>>> for i in range( len(rec_ind) ):
        # np.where returns a tuple of index arrays; after the loop, true_ind
        # holds the missing-field indices of the last record inspected
        true_ind = np.where( Data.iloc[ rec_ind[i] ].isnull().values )
        status = 'Missing {}'.format( *true_ind )
        print( "R{} \t: {}".format( rec_ind[i], status ) )
R1 : Missing []
R2 : Missing [6]
The weights are adjusted according to the method described in the previous section: we zero out the weights of the missing fields and divide the remaining weights by their sum.
>>> Omega_A = np.array( Omega )   # an array, so it supports elementwise division
>>> print('Original: \t{}'.format( Omega ) )
>>> for i in true_ind[0]:
        Omega_A[ i ] = 0          # zero out the weight of each missing field
>>> Omega_A = Omega_A / np.sum( Omega_A )   # renormalize the surviving weights
>>> print('Adjusted: \t{}'.format( Omega_A ))
>>> if not math.isclose( np.sum( Omega_A ), 1.0 ):
        print('Weights are not in balance.')
Original: [0.18, 0.15, 0.2, 0.05, 0.05, 0.1, 0.02, 0.25]
Adjusted: [0.18367347 0.15306122 0.20408163 0.05102041 0.05102041 0.10204082 0.         0.25510204]
Checking these weights, we can see that dividing any two weights from the original set yields the same proportion ($p$) as dividing the corresponding two weights from the adjusted set: $\omega_{im} / \omega_{in} = p_{mn} = \omega_{am} / \omega_{an}$. The proportion of bias is preserved when fields are missing.
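This preservation of proportion can be confirmed numerically:

```python
import numpy as np

Omega = np.array([0.18, 0.15, 0.20, 0.05, 0.05, 0.10, 0.02, 0.25])
Omega_A = Omega.copy()
Omega_A[6] = 0.0               # ZIP (field 6) is missing
Omega_A /= Omega_A.sum()       # renormalize the surviving weights

# Every pairwise ratio among the surviving weights is unchanged:
kept = [i for i in range(len(Omega)) if i != 6]
for m in kept:
    for n in kept:
        assert np.isclose(Omega[m] / Omega[n], Omega_A[m] / Omega_A[n])
```
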
Now the similarity of the records can be calculated. The similarity threshold ($S$) is set and the Levenshtein ratios for the individual fields ($\vec{s}$) are calculated. The sum-product of the similarity scores and the weights reflects the aggregate similarity score ($s$) for the records. If $s \geq S$, then the records are said to be sufficiently similar according to the weights and threshold set for the calculation.
>>> from fuzzywuzzy import fuzz   # provides the Levenshtein-based ratio
>>> S = 85
>>> s = np.array([])
>>> for j in range( len(Data.columns) ):
        # compare field j of each record as strings so NaNs and numerics are handled
        s = np.append( s, fuzz.ratio( Data.iloc[ rec_ind[0] ].apply(str)[j], Data.iloc[ rec_ind[1] ].apply(str)[j] ) )
>>> print( 'Lev Ratios: {}'.format(s) )
>>> s = np.inner( s, Omega_A )
>>> print( 'Similarity: {}'.format(s) )
Lev Ratios: [ 29. 100. 100. 100. 100. 100. 0. 100.]
Similarity: 86.95918367346937
By human intuition,^{2} these records are clearly similar, and they meet the similarity criterion set for the study since $s > S$. However, let's compare records that are not similar by human intuition:
 | FIRST | MIDDLE | LAST | ADDRESS | CITY | STATE | ZIP | SSN |
---|---|---|---|---|---|---|---|---|
1 | HARVEY | B | MILK | 1930 RECRUIT YOU AVE | SF | CA | 18896.0 | 314159265.0 |
4 | RICK | P | SANCHEZ | EARTH C-137 | TACOMA | WA | 11893.0 | 929601596.0 |
Rerunning the analysis described above for `rec_ind = [1, 4]` yields no changes to the weights, since no fields in either record are missing. The aggregate similarity score comes out to $s = 24.42 < S$, and therefore these dissimilar records are correctly identified as dissimilar (a true negative).
It is worth noting that, for any sufficiently large and diverse dataset, false positives and false negatives will occur. Establishing the optimal weights and threshold for the study is a matter of art and, at this time, no analytical solution to this optimization problem has been found.
We will pretend, for simplicity's sake, that the first record in the comparison was a group centroid and that it was in the Groups dataframe the whole time as the only existing group. The groups dataframe is created and the first record is added as a group centroid.
grp_col = [ 'Group Number', 'Record Number', 'Centroid Flag' ]
Groups = pd.DataFrame( data = { grp_col[0] : 0, grp_col[1] : rec_ind[0], grp_col[2] : True }, index = [0] )
If a record is sufficiently similar to a group centroid then it is appended to that group.
if s >= S:
    grp_no = Groups.loc[ Groups[ grp_col[1] ] == rec_ind[0], grp_col[0] ].values[0]
    rec_no = rec_ind[1]
    Record = pd.DataFrame( data = { grp_col[0] : grp_no, grp_col[1] : rec_no, grp_col[2] : False }, index = [0] )
    # DataFrame.append was removed in pandas 2.0; concat is the supported route
    Groups = pd.concat( [ Groups, Record ], ignore_index = True )
Otherwise, it becomes the sole member and acting centroid of its own group with a new group number.
else:
    grp_no = Groups[ grp_col[0] ].values.max() + 1
    rec_no = rec_ind[1]
    Record = pd.DataFrame( data = { grp_col[0] : grp_no, grp_col[1] : rec_no, grp_col[2] : True }, index = [0] )
    Groups = pd.concat( [ Groups, Record ], ignore_index = True )
For `rec_ind = [1, 2]`, our sufficiently similar records, we are left with the single group:
 | Group Number | Record Number | Centroid Flag |
---|---|---|---|
0 | 0 | 1 | True |
1 | 0 | 2 | False |
And for `rec_ind = [1, 4]`, our dissimilar records, we are left with two groups:
 | Group Number | Record Number | Centroid Flag |
---|---|---|---|
0 | 0 | 1 | True |
1 | 1 | 4 | True |
Within the K-means algorithm, the center of a cluster adjusts when new records are assigned. Similarly, we may create individual groups due to uniqueness or incompleteness and then find a more accurate representative for a given identity. Consequently, we must periodically reevaluate which records are centroids.
In a previous section, I described two measurements for determining the centroid of a group. If the number of records in a group is two or fewer, then the centroid must be the most complete record: the record with the fewest missing values.^{3} If a group has three or more records, then we can take the inner similarity between those records and determine which record is most similar to the others within its set. This way we know that we have selected the best representative of the group.
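Both centroid rules can be sketched together (again with `SequenceMatcher` from the standard library standing in for the Levenshtein ratio; `pick_centroid` is a hypothetical helper, not one of the study's functions):

```python
from difflib import SequenceMatcher

def pick_centroid(records, weights):
    """Index of the group's best representative.

    Two or fewer records: the most complete record (fewest missing fields).
    Three or more: the record with the highest total similarity to the
    other records in the group."""
    if len(records) <= 2:
        return min(range(len(records)),
                   key=lambda i: sum(f is None for f in records[i]))

    def sim(a, b):
        # weighted per-field similarity, as in the main comparison
        return sum(w * SequenceMatcher(None, str(fa), str(fb)).ratio()
                   for fa, fb, w in zip(a, b, weights))

    totals = [sum(sim(records[i], records[j])
                  for j in range(len(records)) if j != i)
              for i in range(len(records))]
    return max(range(len(totals)), key=totals.__getitem__)
```
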
With the proof of process completed, functions were developed and the full algorithm can now be run against the mock dataset. The study is instantiated and global weights and the threshold are set.
Omega = [ 0.18, 0.15, 0.20, 0.05, 0.05, 0.10, 0.02, 0.25 ]
S = 75
verb = False
Master = Data.copy()
Deck = Data.copy()
grp_col = [ 'Group Number', 'Record Number', 'Centroid Flag' ]
Groups = pd.DataFrame( columns = grp_col )
card = iterator( Deck, True, False )
Groups.loc[0] = [ 0, card.name, 1 ]
Ergo, we have a single group consisting of the first record, which is assigned as the centroid by default.
 | Group Number | Record Number | Centroid Flag |
---|---|---|---|
0 | 0 | 17 | 1 |
We iterate through the deck of unassigned records and grade each record (card) against the centroid of every existing group. If the record is sufficiently similar to the centroid of a group, then the record becomes a member of said group. Otherwise, it becomes the acting centroid of its own group. After each iteration, the centroids for each group are reassessed according to the criteria described previously.^{4}
number_of_elements = len(Deck)
for c in range( number_of_elements ):
    card = iterator( Deck, True, False )
    scores = score_against_groups( Master, card, verb )
    max_score_ind = np.argmax( scores )
    if scores[ max_score_ind ] >= S:
        Groups = append_to_groups( Groups, max_score_ind, card.name, False, verb )
    else:
        Groups = append_to_groups( Groups, np.max( Groups[ Groups.columns[0] ].values ) + 1, card.name, True, verb )
    centroid_assessment( Master, verb )
The final groups dataframe is presented below.
 | Group Number | Record Number | Centroid Flag |
---|---|---|---|
0 | 0 | 17 | 0 |
1 | 0 | 16 | 1 |
2 | 1 | 15 | 0 |
3 | 1 | 14 | 1 |
4 | 1 | 13 | 0 |
5 | 2 | 12 | 1 |
6 | 3 | 11 | 1 |
7 | 4 | 10 | 1 |
8 | 5 | 9 | 1 |
9 | 6 | 8 | 1 |
10 | 7 | 7 | 1 |
11 | 8 | 6 | 1 |
12 | 9 | 5 | 1 |
13 | 10 | 4 | 1 |
14 | 11 | 3 | 0 |
15 | 11 | 2 | 1 |
16 | 12 | 1 | 1 |
17 | 12 | 0 | 0 |
For ease of reference, we can append the group number column to the original dataset.
 | GROUP NO | FIRST | MIDDLE | LAST | ADDRESS | CITY | STATE | ZIP | SSN |
---|---|---|---|---|---|---|---|---|---|
0 | 12 | HARVEY | B | MILK | 1930 RECRUIT YOU AVE | SAN FRANCISCO | CA | 18896.0 | 314159265.0 |
1 | 12 | HARVEY | B | MILK | 1930 RECRUIT YOU AVE | SF | CA | 18896.0 | 314159265.0 |
2 | 11 | H | B | MILK | 1930 RECRUIT YOU AVE | SF | CA | NaN | 314159265.0 |
3 | 11 | H | NaN | NaN | 1930 RECRUIT U | NaN | NaN | 18896.0 | NaN |
4 | 10 | RICK | P | SANCHEZ | EARTH C-137 | TACOMA | WA | 11893.0 | 929601596.0 |
5 | 9 | PYOTR | I | TCHAIKOVSKY | 1840 N ST PETER | VOTKINSK | GA | 11893.0 | 155633999.0 |
6 | 8 | RUPERT | L | DONLEY | 518 EVEREST BLVD | UDEMY | FL | 12358.0 | 123789235.0 |
7 | 7 | ELEANOR | P | TWINE | 7878 SEVEN EIGHT | SMALL TOWN | LA | 26916.0 | 148978105.0 |
8 | 6 | MICHELL | M | MENENDEZ | 58 CLOUDLESS SKY WAY | NaN | NaN | 32156.0 | 203254687.0 |
9 | 5 | CHARLES | F | KANE | 794 BROKEN HOME AVE | ROSEBUD | PA | 56465.0 | 489324789.0 |
10 | 4 | HOLDEN | K | CAULFIELD | 79 IRONICAL DRIVE | INNOCENCE | PA | 45687.0 | 102354548.0 |
11 | 3 | ATTICUS | J | FINCH | 1930 JUSTICE AVENUE | MAYCOMB | AL | 36460.0 | 121611201.0 |
12 | 2 | KIT | H | SHARBER | PARADISE PLACE | ALMA | OR | 10002.0 | 485943315.0 |
13 | 1 | PAUL | MUAD'DIB | ATREIDES | 314 PALACIAL ST | CALADAN | CA | 15587.0 | 111555888.0 |
14 | 1 | PAUL | MAWDIB | ATRAYDEZ | PALACIAL ST | CALADAN | CA | 15587.0 | 111555888.0 |
15 | 1 | PAUL | MAWDIB | ATRAYDEZ | 314 PALACIAL STREET | CALIDAN | CA | 15587.0 | 111555888.0 |
16 | 0 | MICHEAL | J | BURRY | 20400 STEVENS CREEK BLVD | CUPERTINO | CA | 95014.0 | 158971324.0 |
17 | 0 | MIKE | NaN | BURRY | NaN | CUPERTINO | CA | NaN | 158971324.0 |
Pretty great, honestly. Wizard, one could say, but perhaps one shouldn't until Vader brings it back.^{5} One interesting phenomenon can be seen in the Harvey Milk entries (records 0-3). The algorithm splits them into two distinct groups that human intuition would otherwise combine.^{6} This phenomenon is bound to occur for larger datasets, but methods may be developed to integrate such factions. For instance, after centroid assessment, centroids could be compared against each other and their groups merged according to the threshold.
Warning: some readers may be allergic to excessive use of headers starting with the letter 'p'. Caution and PPE are strongly advised. Remember your lock-out tag-out procedures. A stop work can be called at any time by any person. ↩
It can neither be shown nor will it be shown. Just deal with it. ↩
Additionally, we could find which record has the longest strings and is therefore the most detailed, but this approach has its drawbacks. ↩
This approach is highly inefficient, since the centroid assessment only needs to occur for the groups that have grown during the iteration. As the number of groups increases, so too does the computational load of reassessing every group at the end of each iteration. ↩
One is never too old for Robot Chicken Star Wars. Fight me. ↩
In the words of writer and LGBTQ activist Alexander Leon "Queer people don't grow up as ourselves, we grow up playing a version of ourselves that sacrifices authenticity to minimise humiliation & prejudice." If you don't think I set this up, for the Harvey Milk records to be split into two clearly identical but cohesive internal factions as a metaphor for the queer experience, then you clearly don't know me well enough. Someone takes her polisci minor a little too seriously. Then again, I'm still a footnote, but one endures. ↩