Supporter data is the engine driving nonprofit fundraising efforts, empowering organizations to identify key donor groups and formulate messages that resonate with their intended audiences.
But this supporter data can be incomplete, out of date, and siloed. The typical Civis Platform client works with about eight different databases or applications that use person-level data — an array that encompasses voter files, CRMs, donor management systems, digital marketing platforms, and social media, and often spans tens of millions of Americans. These different systems rarely communicate with each other, and frequently present integration headaches.
Unifying information on individual supporters poses myriad challenges as well. For example, if a supporter’s political affiliation is stored in a voter file and their donation email address is tucked away in a CRM, the organization would need some sort of linking customer ID to connect these pieces of data to tailor email communication by political affiliation. But organizations rarely if ever have a common universal identifier, meaning data analysts must spend hundreds of hours each year manually unifying their data or maintaining complex technical joining pipelines — with varying degrees of success.
Civis Analytics built our Identity Resolution (IDR) system to solve this problem. Discover how IDR provides your data teams a stable, universal identifier for each person across your systems — no matter how many duplicates or data sources your organization may have.
Civis Identity Resolution is a cloud-based application that links together person-level data from multiple sources to create a unique, stable identifier for each individual. Using a combination of advanced machine learning algorithms and customizable logic, Civis IDR gives teams across the organization a master key to understanding the people in their universe, enabling non-technical users to better recognize, segment, and reach the individuals who matter most.
At its core, Civis IDR takes one or more data sources about people (members, customers, donors, etc.) as input and locates matches among the person records they contain. It uses these matches to search within and across sources and to identify duplicates — sets of records that refer to the same individual, and should be linked together.
Civis IDR addresses common challenges including:
From a high-level perspective, Civis IDR encompasses five steps:
The IDR process begins with entering your person-level data into Civis Platform, our flexible, scalable data management solution that makes it easy for nonprofits, advocacy groups, and other mission-driven organizations to import, transform, analyze, and report on their data.
Using Civis Platform’s Imports and Data Enhancements functions, IDR can ingest data from virtually any source, including popular software solutions such as CRMs, CSVs and other flat files, and databases such as Amazon Redshift. Platform also provides Data Enhancements to standardize person datasets: jobs for standardizing addresses (CASS) and updating potentially outdated addresses (NCOA), as well as a custom Person Data Standardization job that detects and corrects formatting issues in phone numbers, email addresses, and similar fields.
Users can easily load data into the IDR app and designate which fields should be used in resolving identities, like names, emails, phone numbers, and other common pieces of personally identifiable information. Based on what they know about their data, users may also configure advanced options for how IDR resolves identities and creates master IDs across sources, including custom overrides to the machine learning-based algorithms. Users can additionally specify “confidence thresholds,” which set how certain the algorithm must be before it designates two records as a match (i.e., as pertaining to the same individual).
To find groups of similar person records among potentially millions of input records, Civis IDR next identifies pairs of similar person records. The number of possible pairs can be extremely large (roughly half a trillion for a million input records), which makes a detailed comparison of every pair prohibitively costly. Instead, we find candidate match pairs by looking for shared coarse-grained features: the same name and ZIP code, for example, or the same name and birthday.
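The candidate-pair step described above can be sketched as a simple blocking pass. This is a minimal illustration under stated assumptions, not Civis IDR's implementation: the record fields, the two blocking keys, and the function name `candidate_pairs` are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records; field names are illustrative only.
records = [
    {"id": 1, "name": "ana diaz", "zip": "60601", "dob": "1990-01-02"},
    {"id": 2, "name": "ana diaz", "zip": "60601", "dob": None},
    {"id": 3, "name": "ana diaz", "zip": "73301", "dob": "1990-01-02"},
    {"id": 4, "name": "bo chen", "zip": "60601", "dob": "1985-07-09"},
]

# Assumed blocking keys: each tuple lists fields that must all agree.
BLOCKING_KEYS = [("name", "zip"), ("name", "dob")]


def candidate_pairs(records, blocking_keys=BLOCKING_KEYS):
    """Return the set of record-id pairs that share at least one blocking key."""
    pairs = set()
    for key_fields in blocking_keys:
        blocks = defaultdict(list)
        for rec in records:
            values = tuple(rec[f] for f in key_fields)
            if any(v is None for v in values):
                continue  # a missing value cannot place a record in a block
            blocks[values].append(rec["id"])
        for ids in blocks.values():
            for a, b in combinations(sorted(ids), 2):
                pairs.add((a, b))
    return pairs


# Exhaustive comparison would need n * (n - 1) / 2 pairs.
all_pairs = len(records) * (len(records) - 1) // 2
candidates = candidate_pairs(records)
```

Here only two of the six possible pairs survive blocking, so the more expensive scoring step has far less work to do. The trade-off is that a pair sharing none of the chosen keys is never scored at all.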
Once we have a set of candidate match pairs, Civis IDR uses a statistical model employing a more detailed and computationally intensive set of features to generate a match score, ranging from 0 to 1, as an estimate of the probability that the two records match. This feature set includes the frequencies of names and various population statistics, so that matching on a rare name in a sparsely populated location results in a higher score than matching on a common name in a crowded metropolitan area. The model also distinguishes a mismatch on a piece of information (a phone number, for instance) from cases where one of the records is simply missing that piece of information.
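To make the two ideas in this paragraph concrete, here is a toy scoring function. It is only a sketch: the source does not describe the actual model, so the logistic form, the weights, the bias, and the name frequencies below are invented for illustration.

```python
import math

# Hypothetical name frequencies (share of population); not real statistics.
NAME_FREQ = {"smith": 0.01, "zelenko": 0.00001}

# Illustrative hand-picked weights; a real model would learn these from data.
PHONE_WEIGHT = {"agree": 3.0, "missing": 0.0, "disagree": -3.0}
BIAS = -6.0


def field_state(a, b):
    """Classify one field comparison as 'agree', 'disagree', or 'missing'."""
    if a is None or b is None:
        return "missing"
    return "agree" if a == b else "disagree"


def match_score(rec_a, rec_b):
    """Return a score in (0, 1) for a candidate pair of records."""
    logit = BIAS
    if field_state(rec_a["last"], rec_b["last"]) == "agree":
        # Agreement on a rarer name contributes more evidence.
        freq = NAME_FREQ.get(rec_a["last"], 0.001)
        logit += -math.log(freq)
    # A missing phone is neutral; a conflicting phone counts against the match.
    logit += PHONE_WEIGHT[field_state(rec_a["phone"], rec_b["phone"])]
    return 1.0 / (1.0 + math.exp(-logit))


common = match_score({"last": "smith", "phone": None},
                     {"last": "smith", "phone": "555-0100"})
rare = match_score({"last": "zelenko", "phone": None},
                   {"last": "zelenko", "phone": "555-0100"})
conflict = match_score({"last": "smith", "phone": "555-0199"},
                       {"last": "smith", "phone": "555-0100"})
```

With these made-up numbers, the rare-name pair scores higher than the common-name pair, and a conflicting phone number lowers the score more than a missing one does.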
Identifying pairwise matches between sources and within each source determines the most likely candidate matches for a record, if any exist. However, this list of matching pairs does not by itself group all the records corresponding to an individual, so we next convert the list of pairwise matches into a graph representation. Each vertex in the graph represents a record, and each edge is a match, weighted by the pairwise match score. We then use graph clustering algorithms to find clusters of records, where each cluster corresponds to a distinct individual.
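As a rough illustration of turning pairwise matches into groups, the sketch below keeps edges at or above a threshold and takes connected components with a union-find structure. Connected components is a simple stand-in chosen for brevity; the source does not say which graph clustering algorithms Civis IDR uses, and the function name and default threshold are assumptions.

```python
def cluster_records(record_ids, scored_pairs, threshold=0.5):
    """Group records into clusters via connected components.

    scored_pairs holds (record_a, record_b, match_score) tuples; only
    edges with score >= threshold link two records together.
    """
    parent = {r: r for r in record_ids}

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[max(root_a, root_b)] = min(root_a, root_b)

    clusters = {}
    for r in record_ids:
        clusters.setdefault(find(r), []).append(r)
    return sorted(sorted(members) for members in clusters.values())


record_ids = [1, 2, 3, 4, 5]
scored_pairs = [(1, 2, 0.97), (2, 3, 0.88), (3, 4, 0.20), (4, 5, 0.91)]
clusters = cluster_records(record_ids, scored_pairs)
```

Records 1 and 3 end up in the same cluster even though they were never directly matched, because both match record 2. That transitive linking is also why a single bad edge can merge two people, which is one reason weighted clustering methods are often preferred over plain connected components.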
From one or more sources of person data, Civis IDR produces two outputs:
After configuring their sources and advanced options, users can run their IDR pipeline to produce a Cluster Table, which can act as a crosswalk between all sources. For each record in the input data, the Cluster Table provides a “resolved ID” corresponding to an individual. The table keeps one row per source record: if 10 source records correspond to the same individual, the output table will contain 10 rows, all sharing the same resolved ID. Users can run the pipeline once, or add it to other workflows on a schedule.
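The crosswalk idea can be illustrated with the earlier example of tailoring email by political affiliation. The table layout, column order, IDs, and helper name below are hypothetical; the actual Cluster Table schema is not described in this article.

```python
# Hypothetical Cluster Table rows: (source, source_record_id, resolved_id).
cluster_table = [
    ("voter_file", "V-17", "R1"),
    ("crm", "C-204", "R1"),
    ("crm", "C-377", "R2"),
    ("voter_file", "V-42", "R2"),
    ("crm", "C-990", "R3"),  # no voter-file record was linked to this person
]

# Toy source data keyed by each system's own record ID.
party_by_voter_id = {"V-17": "party_x", "V-42": "party_y"}
email_by_crm_id = {
    "C-204": "a@example.org",
    "C-377": "b@example.org",
    "C-990": "c@example.org",
}


def emails_by_party(cluster_table, party_by_voter_id, email_by_crm_id):
    """Join CRM emails to voter-file affiliation through the resolved ID."""
    party_by_resolved = {}
    for source, source_id, resolved_id in cluster_table:
        if source == "voter_file" and source_id in party_by_voter_id:
            party_by_resolved[resolved_id] = party_by_voter_id[source_id]

    result = {}
    for source, source_id, resolved_id in cluster_table:
        if source == "crm" and resolved_id in party_by_resolved:
            result[email_by_crm_id[source_id]] = party_by_resolved[resolved_id]
    return result


audience = emails_by_party(cluster_table, party_by_voter_id, email_by_crm_id)
```

Neither source system needs to know about the other's identifiers; the resolved ID is the only shared key.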
Once we have identified which input records correspond to distinct individuals, we can create the Golden Table, which provides one record per distinct individual found by the IDR system. Each record contains the “best” information about that individual (name, street address, email address) based on the source records that IDR linked together. Users may configure how this information is selected (e.g., indicating whether one of their data sources should be given preference over others when selecting email addresses).
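One simple way to picture the selection of “best” values is a per-field source preference, sketched below. The selection rule, field names, and `golden_record` helper are assumptions made for illustration; the article does not specify how Civis IDR chooses values beyond saying the choice is configurable.

```python
FIELDS = ["name", "address", "email"]


def golden_record(cluster_records, source_priority):
    """Pick one value per field, preferring earlier sources in source_priority.

    Empty or missing values are skipped, so a lower-priority source can
    still fill a field that the preferred source leaves blank.
    """
    rank = {source: i for i, source in enumerate(source_priority)}
    ordered = sorted(
        cluster_records, key=lambda r: rank.get(r["source"], len(rank))
    )
    return {
        field: next((r[field] for r in ordered if r.get(field)), None)
        for field in FIELDS
    }


# Hypothetical source records that were linked to the same individual.
linked = [
    {"source": "crm", "name": "Ana Diaz", "address": None,
     "email": "ana@old.example"},
    {"source": "donor_system", "name": "A. Diaz", "address": "1 Main St",
     "email": "ana@new.example"},
]

prefer_donor = golden_record(linked, ["donor_system", "crm"])
prefer_crm = golden_record(linked, ["crm", "donor_system"])
```

Changing the priority order changes which name and email win, while the address comes from the donor system either way because the CRM record has none.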
In addition to the Civis IDR app’s core clustering capabilities, users can leverage a suite of features designed to make the process of identity management easier and more efficient, including:
In some cases, it can be very clear whether or not two records are a match — for example, two records with the exact same values for full name, email address, phone number, and other fields almost certainly represent the same person. But other times, matches can be more difficult to evaluate. Consider the following pair of records:
Determining whether these records correspond to the same individual depends on information like the frequency of the first and last names, the population of the ZIP code, and the likelihood of differences within a field (like email address), which a human reviewer rarely has on hand. Automated systems, on the other hand, can take frequency statistics and other data into account to estimate more accurately whether these records represent the same individual.
While many customers ask about match rates — the fraction of their input records that will be matched — a higher match rate is not necessarily better. Match rates can increase by including additional pairs of records that are unlikely to represent the same individuals, like records that just match on first and last name.
Another key factor is the rate and cost of false positives. In the context of matching for IDR, a false positive is a pair of records that are marked as a match but actually represent different people. While a high match rate and low false positive rate are best, the false positive rate is difficult to determine because it requires distinguishing false positives from true positives (correct matches). In lieu of a precise computation, it can be helpful to estimate the false positive rate by examining the personally identifiable information (PII) for a sample of several dozen or more matches and checking that most of them look reasonable, i.e., that records with wildly different PII values aren’t frequently matched together.
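The spot-check described above might look like the following sketch: draw a reproducible random sample of matched pairs, have a person review each one, and compute the share judged incorrect. The helper names, the sample size of 50, and the fixed seed are illustrative assumptions.

```python
import random


def sample_for_review(matched_pairs, k=50, seed=0):
    """Draw up to k matched pairs at random for manual PII review."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    return rng.sample(matched_pairs, min(k, len(matched_pairs)))


def share_judged_incorrect(verdicts):
    """Return the share of reviewed matches a reviewer marked as wrong.

    verdicts is a list of booleans, True meaning "these look like
    different people". Returns None when nothing was reviewed.
    """
    if not verdicts:
        return None
    return sum(verdicts) / len(verdicts)


# Hypothetical matched pairs, identified by record IDs.
matched_pairs = [(i, i + 1000) for i in range(200)]
review_batch = sample_for_review(matched_pairs)
```

The resulting share is only an estimate from a small sample, so it is best read as a sanity check rather than a precise false positive rate.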
It’s also important to consider the cost of false positives (matching records incorrectly) versus false negatives (missing a good match). Depending on the downstream business use case, false positives may be more or less expensive than false negatives: for example, if matching two people incorrectly results in annoying a customer with irrelevant promotional emails, then false positives might prove very costly, meaning an organization may wish to establish a higher threshold for what constitutes a match. But if it’s relatively harmless to send a few extra emails based on bad matches, the organization may wish to implement a lower threshold to reduce false negatives and hike the match rate.
That’s why Civis exposes a threshold parameter to control how strictly the IDR system should group different records together under the same resolved ID. We also provide functionality for running experiments with different thresholds and for examining sample outputs, so that users can establish their own threshold based on their tolerance for false positives vs. false negatives.
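A threshold experiment of the kind mentioned above can be sketched with a small hand-labeled sample. The scores and labels here are invented, and `evaluate` is a hypothetical helper rather than part of the Civis IDR product.

```python
# Hypothetical reviewed sample: (match_score, is_truly_same_person).
labeled = [
    (0.98, True),
    (0.91, True),
    (0.85, False),
    (0.72, True),
    (0.60, False),
    (0.40, True),
    (0.15, False),
]


def evaluate(labeled, threshold):
    """Count matches, false positives, and false negatives at a threshold."""
    matched = sum(1 for score, _ in labeled if score >= threshold)
    false_pos = sum(
        1 for score, same in labeled if score >= threshold and not same
    )
    false_neg = sum(
        1 for score, same in labeled if score < threshold and same
    )
    return {
        "threshold": threshold,
        "matched": matched,
        "false_positives": false_pos,
        "false_negatives": false_neg,
    }


results = [evaluate(labeled, t) for t in (0.5, 0.7, 0.9)]
for row in results:
    print(row)
```

On this toy sample, raising the threshold from 0.5 to 0.9 removes both false positives but doubles the false negatives and cuts the number of matches from five to two, which is the trade-off an organization weighs when picking its own threshold.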
What Sets Civis IDR Apart
We’ve refined our approach to identity resolution over multiple years of development and use, and by listening to feedback and requests from dozens of clients across multiple industries. The tentpoles of our philosophy are:
Accuracy and Quality. We continually release improvements to our IDR process, shaped by our experiences with customers. As we identify corner cases, customer best practices, and new use cases, we incorporate them into our algorithm for everyone to adopt.
Speed and Scalability. Civis IDR uses state-of-the-art distributed computing tools. It can process tens of millions of records in a few hours and can be scaled to hundreds of millions of records as needed. We engineered our product to scale with your needs and business processes.
Usefulness. A product is only as good as how seamlessly it fits into your existing processes. We strive to meet customers where they are, whether by adding additional API integrations for importing data, understanding how you’d like to export results, or simply helping make your job easier. We also assist with IDR setup and solve related problems; in addition, members of our ADS team can analyze data quality and suggest best practices to improve usefulness.
Trust and Transparency. We are proud of our work, and we want you to understand it. In turn, we want to hear from you. Do you have an idea we should incorporate? Let us know, and we’ll work on it. Do you have a question? Please reach out to us.
Civis IDR enables organizations to save time and money on tedious manual deduplication pipelines, increase conversion rates, and win new customers. One national nonprofit used its master resolved IDs to save hundreds of hours on membership reporting to its funders: by using data science to quickly identify 6.1 million duplicative members within and across systems, the organization was able to continuously and accurately measure and report membership growth for key stakeholders.