Tidy Data Defined

Data assessment involves examining:

  • data quality and
  • data tidiness.

Quality

Quality issues pertain to the content of data. Low-quality data is also known as dirty data. There are four dimensions of data quality:

  • Completeness
    • Do we have all of the records that we should, or are some missing?
    • Are specific rows, columns, or cells missing?
  • Validity
    • Invalid data doesn’t conform to a pre-defined schema. A schema is a defined set of rules for data.
    • These rules can be real-world constraints and table-specific constraints:
      • Example real-world constraint: People cannot be -60 inches tall.
      • Example table-specific constraint: The table requires each observation to have a unique key, but one does not. For example, a single social security number has multiple names associated with it.
  • Accuracy
    • Inaccurate data is wrong data that is valid. It adheres to the defined schema, but it is still incorrect.
      • Example: a patient’s weight that is recorded 5 lbs too high because the scale was faulty.
  • Consistency
    • Inconsistent data is both valid and accurate, but there are multiple correct ways of referring to the same thing.
      • Example: Gender is indicated as both M and Male in the same table.
    • A standard format is desired for columns that represent the same data, both within a table and across tables.
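
The checks above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records, the field names (ssn, name, height_in, gender), the "height must be positive" rule, and the GENDER_CODES mapping are all hypothetical and chosen to mirror the examples in the list.

```python
# Hypothetical patient records; every field name and rule here is an assumption.
records = [
    {"ssn": "111-11-1111", "name": "Ann Lee", "height_in": 64, "gender": "F"},
    {"ssn": "222-22-2222", "name": "Bob Ray", "height_in": -60, "gender": "M"},
    {"ssn": "222-22-2222", "name": "Rob Ray", "height_in": 70, "gender": "Male"},
    {"ssn": "333-33-3333", "name": "Cy Poe", "height_in": None, "gender": "M"},
]

# Completeness: which rows have at least one missing cell?
incomplete = [r["name"] for r in records if any(v is None for v in r.values())]

# Validity, real-world constraint: a recorded height must be positive.
invalid_height = [
    r["name"]
    for r in records
    if r["height_in"] is not None and r["height_in"] <= 0
]

# Validity, table-specific constraint: each SSN should identify one name only.
names_by_ssn = {}
for r in records:
    names_by_ssn.setdefault(r["ssn"], set()).add(r["name"])
duplicate_keys = sorted(ssn for ssn, names in names_by_ssn.items() if len(names) > 1)

# Consistency: is the same category spelled in more than one way?
GENDER_CODES = {"M": "M", "Male": "M", "F": "F", "Female": "F"}  # assumed mapping
spellings = {}
for r in records:
    spellings.setdefault(GENDER_CODES[r["gender"]], set()).add(r["gender"])
inconsistent = {code: sorted(forms) for code, forms in spellings.items() if len(forms) > 1}

print(incomplete, invalid_height, duplicate_keys, inconsistent)
```

Accuracy has no check in this sketch on purpose: an inaccurate value passes every schema rule, so it can usually be detected only by comparing against an outside source, such as a second measurement.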

Tidiness

Tidiness issues pertain to the structure of data. These structural problems generally prevent easy analysis. Untidy data is also known as messy data. The requirements for tidy data are:

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.
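
As a rough illustration of the first two requirements, the sketch below reshapes a small messy table into a tidy one. The data and the melt helper are hypothetical. The table is messy because its column headers "2019" and "2020" are values of a variable (year) rather than variable names.

```python
# Messy: one row per patient, with the year stored in the column headers.
wide = [
    {"name": "Ann", "2019": 150, "2020": 148},
    {"name": "Bob", "2019": 180, "2020": 176},
]


def melt(rows, id_key, var_name, value_name):
    """Turn every non-id column of each row into its own (variable, value) row."""
    tidy_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_key:
                continue  # the identifier is repeated on each output row instead
            tidy_rows.append({id_key: row[id_key], var_name: column, value_name: value})
    return tidy_rows


# Tidy: each variable (name, year, weight_lb) is a column,
# and each observation (one patient in one year) is a row.
tidy = melt(wide, id_key="name", var_name="year", value_name="weight_lb")

for row in tidy:
    print(row)
```

The year values stay as strings here because they come from column headers; converting them to integers would be a separate cleaning step.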

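Relational theory describes “tidy data” in more precise terms through normal forms. The first two requirements roughly correspond to First Normal Form, in which a table forms a relation in the technical sense. Wickham, who introduced the term, describes the three requirements together as Codd’s third normal form framed in statistical language.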