Data Quality & Tidiness

Data assessment involves examining data quality and tidiness. The following definitions for these terms are taken from Udacity coursework:

Quality issues pertain to the content of data. Low quality data is also known as dirty data. There are four dimensions of quality data:

  • Completeness: do we have all of the records that we should? Do we have missing records or not? Are there specific rows, columns, or cells missing?
  • Validity: we have the records, but they’re not valid, i.e., they don’t conform to a defined schema. A schema is a defined set of rules for data. These rules can be real-world constraints (e.g. negative height is impossible) and table-specific constraints (e.g. unique key constraints in tables).
  • Accuracy: inaccurate data is wrong data that is valid. It adheres to the defined schema, but it is still incorrect. Example: a patient’s weight that is 5 lbs too heavy because the scale was faulty.
  • Consistency: inconsistent data is both valid and accurate, but there are multiple correct ways of referring to the same thing. Consistency, i.e., a standard format, in columns that represent the same data across tables and/or within tables is desired.

Tidiness issues pertain to the structure of data. These structural problems generally prevent easy analysis. Untidy data is also known as messy data. The requirements for tidy data are:

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.

Content for this article is taken from information I learned while pursuing the Udacity Data Analyst Nanodegree. Udacity is an online tech education service that offers both free and paid, tech-focused training. Learn more.