What Does Dirty Data Even Mean? Does It Need Washing?

by Jeffrey Brian Thompson

When we talk about data, you might hear terms like "clean" and "dirty." But no, we're not suggesting your data has been playing in the mud, though our accompanying image might have you thinking otherwise!

Clean Data Explained

First up, let's define "clean data." This is data that's neatly organized and ready for analysis. It means:

  • No duplicate records, so each row is unique.
  • Headers are clearly formatted and consistent.
  • Data fields (columns) stick to one data type, like text or numbers.
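  • Each category gets its own field, so no single column mixes several kinds of information.
  • Dates are consolidated into a single column where possible, rather than spread across separate day, month, and year columns.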
  • Rows are as complete as possible, minimising blank spaces.
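
To make the checklist above a little more concrete, here is a minimal sketch in plain Python that tests a tiny dataset against three of the points: duplicate rows, mixed data types, and blank cells. The sample rows, column names, and function names are all invented for illustration; they don't come from any particular tool.

```python
from collections import Counter

# Hypothetical sample rows; the column names are illustrative only.
rows = [
    {"customer": "Ana", "city": "Leeds", "amount": 120},
    {"customer": "Ben", "city": "York", "amount": "95"},  # number stored as text
    {"customer": "Ana", "city": "Leeds", "amount": 120},  # exact duplicate of row 1
    {"customer": "Cy", "city": "", "amount": 40},         # blank city
]


def duplicate_count(rows):
    """Count rows that exactly repeat an earlier row."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return sum(n - 1 for n in counts.values())


def mixed_type_columns(rows):
    """Return the columns whose non-blank values use more than one type."""
    types_seen = {}
    for row in rows:
        for column, value in row.items():
            if value not in ("", None):
                types_seen.setdefault(column, set()).add(type(value).__name__)
    return sorted(col for col, kinds in types_seen.items() if len(kinds) > 1)


def blank_cell_count(rows):
    """Count cells that are empty strings or None."""
    return sum(1 for row in rows for value in row.values() if value in ("", None))


print("duplicates:", duplicate_count(rows))
print("mixed-type columns:", mixed_type_columns(rows))
print("blank cells:", blank_cell_count(rows))
```

A dataset that comes back with zero duplicates, no mixed-type columns, and few or no blank cells is already most of the way to "clean" in the sense described here.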

So, What's Dirty Data?

Dirty data is essentially the opposite of clean data: a dataset that hasn't been tidied up yet. It might include duplicate records, jumbled headers, inconsistent data types, and more. It's like having a file on your computer that's somehow gotten covered in mud, metaphorically speaking.

Cleaning Up: What Are Your Options?

Cleaning dirty data means transforming it into clean data. Tools like Tableau Prep are excellent for this task. They let you:

  • Rename headers to something more understandable.
  • Split columns for better categorisation.
  • Filter out unnecessary information.
  • Correct spelling errors.
  • Create new data fields or remove irrelevant ones.
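
If you're curious what those steps look like outside a visual tool, here is a rough sketch of the same five operations in plain Python. It is not how Tableau Prep works under the hood; it simply mirrors the list above. The raw rows, the header names, and the spelling fix are hypothetical examples made up for this illustration.

```python
# Hypothetical raw rows, invented for illustration.
raw_rows = [
    {"cust_nm": "Ana Silva", "loc": "Leeds, UK", "status": "actve", "notes": "n/a"},
    {"cust_nm": "Ben Okoro", "loc": "York, UK", "status": "active", "notes": ""},
    {"cust_nm": "Test User", "loc": "Nowhere, ZZ", "status": "test", "notes": "ignore"},
]

# Assumed mappings: cryptic header -> readable header, misspelling -> correction.
HEADER_NAMES = {"cust_nm": "customer_name", "loc": "location"}
SPELLING_FIXES = {"actve": "active"}


def clean(rows):
    """Apply the five cleaning steps to each row and return the cleaned rows."""
    cleaned = []
    for row in rows:
        # 1. Rename headers to something more understandable.
        row = {HEADER_NAMES.get(key, key): value for key, value in row.items()}
        # 2. Filter out unnecessary information (here, test records).
        if row["status"] == "test":
            continue
        # 3. Split one column into two for better categorisation.
        city, country = [part.strip() for part in row.pop("location").split(",", 1)]
        row["city"], row["country"] = city, country
        # 4. Correct known spelling errors.
        row["status"] = SPELLING_FIXES.get(row["status"], row["status"])
        # 5. Create a new field and remove an irrelevant one.
        row["is_active"] = row["status"] == "active"
        del row["notes"]
        cleaned.append(row)
    return cleaned


clean_rows = clean(raw_rows)
for row in clean_rows:
    print(row)
```

The order of the steps is a design choice: renaming first means every later step can use the readable header names, and filtering early avoids doing work on rows that will be thrown away anyway.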

With software like Tableau Prep, you can take even the muddiest dataset and clean it up until it's sparkling, ready for analysis.

Conclusion

While dirty data doesn't need soap and water, it does require a bit of TLC with data cleaning tools. By understanding what makes data clean or dirty, you can ensure your datasets are in top shape for any analysis or insights you're planning to derive. So next time you hear "dirty data," just remember: it's nothing a good data cleaning tool can't fix!