Data scientists have to handle data from many different sources. Many of these sources however aren’t very useful until they can be combined together (list of potential customers, volunteer signups at a campaign rally, public records, etc). When an organization is trying to understand people, it needs all the information it has about individuals linked together in a single record.
Often, we have one canonical data source – a voter file, or customer base – and need to match new data. We have to do be able to do this very quickly for two different use cases. First, for web apps that may need to look up a users’ friends in real-time, and second for matching massive new datasets to our existing ones. We also need this capacity to be scalable – we’re not running millions of records through 24/7, but when we are we want it done quickly.
This is a classic problem of record linkage. Besides the problem of how to decide if a pair of records are identical, the main challenge is that it is prohibitive to exhaustively compare all pairs. In order to find links in a reasonable amount of time, we need to quickly limit our search space. The most effective way to do this is by blocking: grouping records together based on a rough summary (a token). There are a number of sophisticated ways to generate these tokens. This IEEE journal article by Prof Peter Christen provides a useful survey of some of the newer methods. Clearly, there is a tradeoff between constraining the number of records returned (Pair Quality), and ensuring that the best-matching record is actually returned (Pair Completeness). Notably, he finds that traditional blocking performs well for person-based data. While some methods have slightly higher completeness, they suffer greatly in Pair Quality.
We chose Traditional Blocking for its balance of performance and simplicity. In order to improve Pair Completeness, we create entries in multiple blocks for each record, based on different combinations of fields. Some examples:
- First Name, Last Name
- First Name, Last Name, Geocode
- Last Name, Address
- Phone Number
If too many records share a token, we don’t bother storing it. If we don’t have more specific information when we’re trying to match a record, we won’t be able to distinguish between them anyway. On the other hand, for the tokens which are unique, we can find that record even if other information has changed…for example, a person with a relatively unique name regardless of where they have moved in the country.
This blocking narrows our search from the entire US population down to, on average, a couple hundred records. We then use a custom algorithm employing Levenshtein Distance and field-specific probabilities, derived from the population, to select the best possible record or determine that there is no match.
While we use a lot of Ruby at Civis, it was clear that we would need something faster for this application. Based on its reputation as a fast, programmer-friendly language, we chose Go. Golang is built for speed, and since we started with go1.1 it’s only getting faster.
While small databases may be stored in-memory, larger record sets require a database. We selected AWS’s Dynamo key-value store. It allows us to store >800M records and retrieve any individual one in <3.5ms. The pricing is primarily based on maximum available throughput, which can be adjusted several times per day. While there is no official AWS Go library, there is community has created the goamz library has done a good job of adding support. While it’s great to have community-based libraries, we’ve found it important to follow the Go FAQ advice of forking the ones we use. Without a standard library management system (like Ruby’s Bundler), it can otherwise be a bit too easy for external library changes to break your code.
Finally, for running the application, we use AWS’s Elastic MapReduce (EMR), Amazon’s on-demand implementation of Hadoop, the open-source platform for Map-Reduce tasks. We use the EMR infrastructure in two parts of our system. First, we use it to transform large record sets into key-value stores. Each record is mapped to several tokens, and then each token is reduced to a list of records. Each of these steps is a Go program. Second, we use it as a simple (if brutish) way to parallelize our bulk match jobs. In the Map/Reduce framework, we effectively just have a Map step (each record is mapped to its best match) with no reduce step. When our clients create a matching job with millions of rows of data, we create temporary clusters to run against the Dynamo stores, and only pay for them until the job is complete.
A few billions rows later, our infrastructure has proven solid. We’re continually working on improvements, such as eliminate unnecessary tokens, improving the pairwise scoring of records, incremental dynamo updates, and more efficient storage (ie, go’s blob encoding instead of JSON)