I’ve built a beautiful insights platform, but users are now telling me our data is rubbish.
Sadly, this is an all too familiar scenario. Organisations wisely invest to ensure their BI and Data Science projects are properly scoped, developed and deployed, but often never really test the quality of their data until it’s in the hands of their consumers. Consequently, on the first day they cut over to live, business users start to moan:
- There are duplicate records in my reports, so none of my KPIs can be trusted
- When I report on a person, there’s information missing that I know we hold (e.g. missing transactions)
- There are annoying data quality issues that make reports inaccurate and the filters hard to use
- The conclusions of my AI models just aren’t reliable – they’re no better than my gut instincts
Left unaddressed, business consumers soon conclude: “I don’t trust it, I’ll get my analysts to keep building my reports in Excel”.
1: So what’s causing this issue?
Well firstly, the issues that typically cause the symptoms above are not ones that traditional Data Warehousing and BI tools were designed to identify or fix, so they’re often not picked up when scoping data insights projects. Secondly, they can absolutely be fixed and your project rescued – don’t panic, read on!
The issues can occur anywhere business processes allow, typically where data entry validation rules in source systems are weak. However, 9 times out of 10, I’ve found the worst culprit for duplicate records is the CRM system (remember, that thing that was supposed to deliver a single customer view!).
Records decay at a rate of 12% per annum, as people change companies, roles, their name or address. These changes aren’t easily tracked and often a new record is created. Customer service agents, or field operatives under time pressure, struggle to find existing contacts and create duplicates too.
Fragmentation is where the information on a customer/citizen is recorded in more than one system, but can’t be easily joined up. Typically this occurs when multiple departments have separate engagements with the same individual but don’t often communicate. Each department creates and maintains their own record.
Councils provide a good example of this as each department operates largely independently and maintains its own dataset (Revs & Bens, Housing, Parking, Electoral Services, Planning…). Joining these up to produce a Single View of Citizen’s Debts or a Household View for Social Services is not straightforward.
The potential list of data quality issues is substantial, but there are a few that frequently surface:
- Test data and erroneous records
- Invalid entries (e.g. tel: 010101)
- Placeholder values (e.g. 1/1/1900)
- Conformity to standards (e.g. addresses aren’t formatted correctly, inconsistent gender code sets)
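As a rough illustration, checks like these can be scripted before any matching takes place. The field names, placeholder list and thresholds below are illustrative assumptions, not a prescription for any particular tool:

```python
import re
from datetime import date

# Illustrative placeholder dates often used as "unknown" in source systems
PLACEHOLDER_DATES = {date(1900, 1, 1), date(1970, 1, 1)}

def flag_issues(record: dict) -> list[str]:
    """Return a list of quality flags for a single contact record."""
    flags = []
    # Invalid entries: too few digits, or almost all the same digit (e.g. 010101)
    digits = re.sub(r"\D", "", record.get("tel", ""))
    if digits and (len(digits) < 7 or len(set(digits)) <= 2):
        flags.append("suspect_phone")
    # Placeholder values in date-of-birth
    if record.get("dob") in PLACEHOLDER_DATES:
        flags.append("placeholder_dob")
    # Test data left behind in live systems
    if record.get("last_name", "").strip().lower() in {"test", "zzz", "donotuse"}:
        flags.append("test_record")
    return flags

print(flag_issues({"tel": "010101", "dob": date(1900, 1, 1), "last_name": "Test"}))
# → ['suspect_phone', 'placeholder_dob', 'test_record']
```

Simple rules like these won’t catch everything, but they quickly quantify how widespread the obvious problems are.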
2: Why didn’t my Data Lake fix this?
The simplest answer: these records are generally “inconsistent”, and the technologies and techniques deployed to deliver data insights don’t handle inconsistencies well.
Let’s expand upon our council scenario and assume:
- Revs & Bens have a record for “Tom Hughes” who pays council tax
- Parking have a record for “T Hughes” who has a permit
- Electoral Services have a record for “Thomas Hughes”
- Customer Services have 3 records for “Thom H”, “Thomas Hughes” and “Tohmas H” from the times I’ve phoned up
You’ll notice three types of inconsistency in the above records:
- Synonyms: Tom and Thomas
- Phonetics: Tom & Thom
- Displaced values: Tohmas
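To see why “Tom” and “Thom” can be recognised as the same name, consider a phonetic code such as Soundex, which maps similar-sounding names to the same short key. This is a simplified sketch of the classic algorithm (it skips some edge cases), not the matching engine discussed later:

```python
def soundex(name: str) -> str:
    """Simplified Soundex: phonetically similar names share a 4-character code."""
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            codes[ch] = digit
    name = name.lower()
    # h and w are "transparent": codes either side of them count as adjacent
    letters = [ch for ch in name if ch not in "hw"]
    digits = [codes.get(ch) for ch in letters]
    # Collapse consecutive duplicate codes; vowels (None) break adjacency
    collapsed = [d for i, d in enumerate(digits)
                 if d is not None and (i == 0 or d != digits[i - 1])]
    # The first letter is kept verbatim, so drop the code it produced itself
    if collapsed and digits[0] is not None and codes.get(name[0]) == collapsed[0]:
        collapsed = collapsed[1:]
    return (name[0].upper() + "".join(collapsed) + "000")[:4]

print(soundex("Tom"), soundex("Thom"))  # → T500 T500 - a phonetic match
```

“Tom” and “Thom” both encode to T500, so a phonetic comparison links them even though an exact string comparison fails; “Thomas” (T520) still needs a synonym table, which is why matching engines combine several techniques.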
Traditional Data Integration approaches match records “deterministically”. They either identify attributes like the first name, last name and email address as exactly the same, or they use an ID or key that’s common across the records to join them.
In the council’s case, the records are inconsistent so don’t match exactly and there’s no identifier as each department has created its own record. They’re faced with three choices:
- Bring all the records together and fill reports with duplicates
- Cherry-pick one set of citizen records and ignore the rest
- Find a method to fix the problem (see 4: How can I solve the issue?)
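The deterministic approach can be sketched in a few lines: with no shared key and inconsistent names, an exact join simply returns nothing. The department record sets below are illustrative:

```python
# Illustrative department record sets - no common ID exists across them
revs_bens = [{"name": "Tom Hughes", "council_tax": True}]
parking   = [{"name": "T Hughes", "permit": "P-123"}]
electoral = [{"name": "Thomas Hughes", "registered": True}]

def deterministic_join(left, right, key="name"):
    """Join two record sets only where the key value matches exactly."""
    return [(l, r) for l in left for r in right if l[key] == r[key]]

print(deterministic_join(revs_bens, parking))    # → [] ("Tom Hughes" != "T Hughes")
print(deterministic_join(revs_bens, electoral))  # → [] ("Tom Hughes" != "Thomas Hughes")
```

Every join comes back empty, so the council ends up with either duplicates (option one) or gaps (option two) unless the inconsistency itself is tackled.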
3: When will this problem become critical?
Incomplete records and inaccurate details will always undermine decision making. The levels of duplication, missing information and quality issues are factors we can analyse and measure to gauge the impact, at least crudely.
Equally, we can consider analytical subject areas where incomplete and inaccurate information is keenly felt and ensure we test our data during project scoping. Below are a few from our council:
- Debt Management: Single view of citizen debt and propensity to pay models
- Social Services: Identifying vulnerable individuals, ensuring safeguarding and on-time interventions
- Fraud & Error: Correctly identifying entitlement issues (e.g. Single Person Discount)
- Customer Services: Both for customer response and also service redesign & targeting
- GDPR: Consents, DSARs
- Multi-Agency Collaboration: One shared record and longitudinal insights for collaborative service delivery
4: How can I solve the issue?
During project scoping, Simpson Associates work with our customers to identify where these challenges could undermine their data insights. Where required, we recommend and deploy Data Matching & Quality tools, which address these issues through the following approach:
- Analyse: Profile data sources to identify matching issues and assess their impact
- Address: Find and fix data quality problems
- Match: Use a probabilistic matching engine designed to match inconsistent records. It compares each record to every other, at speed, using sophisticated techniques (including Synonym, Phonetic and Displaced Value matching) and determines how likely it is that two records are the same person
- Master: Intelligently select the most up-to-date and accurate details from all the records to create a trusted “Golden Record”
- Enrich: Fill gaps in our records using third-party sources
- Merge: Bring records and their transactions together to deliver a complete view of a citizen’s engagements
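As a toy illustration of the Match and Master steps, the sketch below scores record pairs with a generic string-similarity measure plus a tiny synonym table, links pairs above a threshold, and keeps the most recently updated details as the golden record. The threshold, synonym table and survivorship rule are illustrative assumptions, far simpler than a production matching engine:

```python
from difflib import SequenceMatcher

records = [
    {"id": 1, "name": "Tom Hughes",    "updated": "2023-06-01"},
    {"id": 2, "name": "Thomas Hughes", "updated": "2024-02-10"},
    {"id": 3, "name": "Jane Smith",    "updated": "2022-01-05"},
]

SYNONYMS = {"tom": "thomas", "thom": "thomas"}  # tiny illustrative table

def normalise(name: str) -> str:
    """Lower-case the name and expand known synonyms before comparing."""
    return " ".join(SYNONYMS.get(p, p) for p in name.lower().split())

def match_score(a: dict, b: dict) -> float:
    """Likelihood (0..1) that two records describe the same person."""
    return SequenceMatcher(None, normalise(a["name"]), normalise(b["name"])).ratio()

# Match: compare each record with every other; link pairs above a threshold
links = [(a["id"], b["id"]) for i, a in enumerate(records)
         for b in records[i + 1:] if match_score(a, b) >= 0.8]

# Master: survivorship rule keeping the most recently updated linked record
cluster_ids = {i for pair in links for i in pair}
cluster = [r for r in records if r["id"] in cluster_ids]
golden = max(cluster, key=lambda r: r["updated"])
print(links, golden["name"])  # → [(1, 2)] Thomas Hughes
```

Even this toy version links “Tom Hughes” and “Thomas Hughes” while leaving “Jane Smith” alone, which is the behaviour a deterministic join cannot deliver.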
By taking this approach, supported by best-fit technology, the information in our reports and AI models will be complete and accurate, and we can drive decision making with confidence.
If you feel duplicates, fragmentation or quality issues are undermining your data insights, you might be interested in our Data Workshop.
Tom Hughes, Business Development Manager, Simpson Associates