Many organisations have data quality problems. Some have data quality problems so serious that they stop them from using key systems in any meaningful way. When that happens, sometimes the consultants get called. Having been the consultant on a few of these projects, I can say that it tends to be thankless work: effectively, you are trying to fix a serious problem in a running production system that no one feels accountable for and that everyone agrees should never have happened.
So why does it happen, then? Usually it boils down to the fact that little attention was paid to data quality in the first place. Everybody sits around the table and agrees that something should be done to ensure accurate data gets into the system, and in the end that usually amounts to making a few fields mandatory, putting in a couple of validation rules, and giving the admins a merge tool to sort out the duplicates whenever they get around to it.
As with so many ailments, the key to dealing with data quality issues lies in prevention, not treatment after the fact. That means treating data quality first and foremost as a design issue. The most important way to prevent data quality problems down the line is to create a user experience that makes it hard for users to enter poor-quality data and easy for them to enter exactly what you need, without burdening them with so much work that they start finding creative ways around the system. You get good data by making the user experience conducive to getting good data.
This is easier said than done, but from experience, here are a few good places to start:
- Have the user follow a flow that minimizes errors. That means, for instance, having users search for information up front (say, an existing person record in your database) to reduce duplicates, or asking a few key questions early on so that the following forms only need to collect a handful of data points and users aren't tempted to skip important fields out of cognitive overload. (There's a sketch of the search-first flow after this list.)
- Only collect what you need right now. One key reason you get bad data is that users tire of filling out large forms and start gaming the system to avoid boring, repetitive work. Limiting the data you collect to the few points that are meaningful in the current context will greatly improve data quality.
- Limit options. Don't give more choice than you need to. Keep controlled lists short, and prefer controlled lists or checkboxes to open-ended answers when you can. Where free-text fields are unavoidable, enforce strong validation rules on them (see the sketch after this list).
- Pre-populate meaningful defaults. Don't just go with the first item in the list. Have a guess at the user's country based on their IP address. If you've already got their address, for heaven's sake don't make them type it again. If they selected a given option last time, make that the default this time (sketched below).
- Automate data gathering. Even better: if you can infer the answer to a question from other information in the user's context or from previous steps of the journey, don't ask the user at all; generate the data point automatically (also sketched below). Remember, the guess your algorithm makes needs to be better than the data quality you get from manual entry, but it doesn't need to be any better than that.
- Use good form design principles. Once you have gathered as much data as you can through simpler means, make sure to follow good design principles for the inevitable form. Don’t overwhelm the user. Keep similar things close together. Use established conventions for form controls and the choices you provide. Use terminology familiar to the user.
- Avoid loopholes. That includes ad-hoc flows for marketing campaigns, special admin tools with looser validation requirements, and third-party integrations that don't enforce the validation rules. Some loopholes may be unavoidable, but each one you create compounds the design problem that is already there. (One way to keep them in check is sketched at the end of this list.)
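To make the search-first idea from the flow bullet concrete, here is a minimal sketch in TypeScript. The in-memory `people` array and the matching logic are stand-ins for your real database and search; the point is simply that "create new" is only offered once the search has come back empty.

```typescript
// Sketch of a "search before create" flow. The in-memory array stands in
// for your real database; the shape of the flow is what matters.
interface Person {
  id: string;
  name: string;
  email: string;
}

const people: Person[] = [];

function searchPeople(query: string): Person[] {
  const q = query.trim().toLowerCase();
  return people.filter(
    p => p.email.toLowerCase() === q || p.name.toLowerCase().includes(q)
  );
}

// Only create a new record once the search has come back empty.
// In a real UI you would show the matches and let the user pick one;
// returning the first match keeps the sketch short.
function findOrCreatePerson(name: string, email: string): Person {
  const matches = searchPeople(email);
  if (matches.length > 0) {
    return matches[0];
  }
  const created: Person = { id: String(people.length + 1), name, email };
  people.push(created);
  return created;
}
```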
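For the "limit options" bullet, here's roughly what that looks like in code: a controlled list instead of free text, and a pattern rule on the one free-text field that remains. The field names and the UK-style postcode pattern are illustrative, not a recommendation for your domain.

```typescript
// Sketch: prefer controlled lists, and validate the free-text fields you keep.
// Field names and the UK-style postcode pattern are examples only.
const CONTACT_METHODS = ["email", "phone", "post"] as const;
type ContactMethod = (typeof CONTACT_METHODS)[number];

interface ContactPreferences {
  method: ContactMethod; // controlled list, not free text
  postcode: string;      // free text, but validated
}

const POSTCODE_PATTERN = /^[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}$/i;

function validateContactPreferences(input: ContactPreferences): string[] {
  const errors: string[] = [];
  if (!CONTACT_METHODS.includes(input.method)) {
    errors.push(`method must be one of: ${CONTACT_METHODS.join(", ")}`);
  }
  if (!POSTCODE_PATTERN.test(input.postcode.trim())) {
    errors.push("that does not look like a valid postcode");
  }
  return errors;
}
```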
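The defaults bullet, sketched: the country guess comes from a hypothetical IP lookup, and everything else falls back to whatever the user chose last time. `lookupCountryByIp` and the in-memory `lastChoices` map are placeholders for your geolocation provider and wherever you persist previous answers.

```typescript
// Sketch: pre-populate defaults instead of starting from a blank form.
// `lookupCountryByIp` is a placeholder for a real geolocation service, and
// `lastChoices` stands in for wherever previous answers are persisted.
interface FormDefaults {
  country: string;
  contactMethod: string;
}

async function lookupCountryByIp(ip: string): Promise<string | null> {
  // Call your geolocation provider here; return null when unsure.
  return null;
}

const lastChoices = new Map<string, FormDefaults>(); // keyed by user id

async function buildDefaults(userId: string, ip: string): Promise<FormDefaults> {
  const previous = lastChoices.get(userId);
  return {
    // Best guess wins, but the user can still change it on the form.
    country: previous?.country ?? (await lookupCountryByIp(ip)) ?? "GB",
    contactMethod: previous?.contactMethod ?? "email",
  };
}
```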
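And the automation bullet: when the journey already tells you something, derive the data point instead of asking for it. The product-to-segment mapping below is invented purely for illustration; the useful part is returning `undefined` when there is no confident guess, so the caller can fall back to asking the user rather than storing a bad value.

```typescript
// Sketch: derive a data point from context the journey already provides,
// instead of asking the user. The mapping and segment names are invented.
type Segment = "consumer" | "small-business" | "enterprise";

const PRODUCT_SEGMENTS: Record<string, Segment> = {
  "personal-banking": "consumer",
  "business-banking": "small-business",
  "treasury": "enterprise",
};

// Returns undefined when there is no confident guess, so the caller can
// fall back to asking the user instead of storing a bad value.
function inferSegment(productSlug: string): Segment | undefined {
  return PRODUCT_SEGMENTS[productSlug];
}
```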
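Finally, on loopholes: one way to stop them from multiplying is to keep the validation rules in a single module and make every entry point, the web form, the admin tools and the integrations alike, go through the same save path. A rough sketch, with invented field names and rules:

```typescript
// Sketch: one save path and one set of validation rules for every entry point,
// so admin tools and integrations can't quietly relax them.
interface NewLead {
  email: string;
  country: string; // ISO 3166-1 alpha-2, e.g. "GB"
}

type LeadSource = "web-form" | "admin-tool" | "integration";

function validateLead(lead: NewLead): string[] {
  const errors: string[] = [];
  if (!/^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(lead.email)) {
    errors.push("invalid email address");
  }
  if (!/^[A-Z]{2}$/.test(lead.country)) {
    errors.push("country must be a two-letter ISO code");
  }
  return errors;
}

function saveLead(lead: NewLead, source: LeadSource): void {
  const errors = validateLead(lead);
  if (errors.length > 0) {
    throw new Error(`rejected lead from ${source}: ${errors.join("; ")}`);
  }
  // ...persist the lead...
}
```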