# Tidy Data Defined

Data assessment involves examining:

• data quality and
• data tidiness.

## Quality

Quality issues pertain to the content of data. Low-quality data is also known as dirty data. Data quality has four dimensions:

• Completeness
  • Do we have all of the records that we should, or are some missing?
  • Are there specific rows, columns, or cells missing?
• Validity
  • Invalid data doesn’t conform to a pre-defined schema. A schema is a defined set of rules for data.
  • These rules can be real-world constraints or table-specific constraints:
    • Example real-world constraint: people cannot be -60 inches tall.
    • Example table-specific constraint: a column that is required to be a unique key contains duplicates. For example, a single Social Security number has multiple names associated with it.
• Accuracy
  • Inaccurate data is wrong data that is valid. It adheres to the defined schema, but it is still incorrect.
  • Example: a patient’s weight that is recorded 5 lbs too heavy because the scale was faulty.
• Consistency
  • Inconsistent data is both valid and accurate, but there are multiple correct ways of referring to the same thing.
  • Example: gender is indicated as both M and Male in the same table.
  • Columns that represent the same data should use one standard format, both within a table and across tables.
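
Three of the four dimensions above can be checked mechanically. The following is a minimal sketch in plain Python, assuming a small table stored as a list of dictionaries; the records, column names (`ssn`, `name`, `height_in`, `gender`), and helper functions are all invented for illustration and are not part of any library.

```python
# Hypothetical patient records; every name and value here is made up.
records = [
    {"ssn": "111-11-1111", "name": "Ann Lee", "height_in": 64.0, "gender": "F"},
    {"ssn": "222-22-2222", "name": "Bob Roy", "height_in": -60.0, "gender": "M"},
    {"ssn": "222-22-2222", "name": "Rob Roy", "height_in": 70.0, "gender": "Male"},
    {"ssn": "333-33-3333", "name": "Cy Moss", "height_in": None, "gender": "M"},
]


def missing_cells(rows):
    """Completeness: return (row index, column) pairs whose value is None."""
    return [
        (i, col)
        for i, row in enumerate(rows)
        for col, val in row.items()
        if val is None
    ]


def invalid_heights(rows):
    """Validity (real-world constraint): a recorded height must be positive."""
    return [
        i
        for i, row in enumerate(rows)
        if row["height_in"] is not None and row["height_in"] <= 0
    ]


def duplicate_keys(rows, key):
    """Validity (table-specific constraint): key values must be unique."""
    seen = {}
    for i, row in enumerate(rows):
        seen.setdefault(row[key], []).append(i)
    return {k: idxs for k, idxs in seen.items() if len(idxs) > 1}


def distinct_values(rows, col):
    """Consistency: list the distinct spellings used in one column."""
    return sorted({row[col] for row in rows if row[col] is not None})


print(missing_cells(records))              # cells with no value
print(invalid_heights(records))            # rows breaking the height rule
print(duplicate_keys(records, "ssn"))      # keys shared by several rows
print(distinct_values(records, "gender"))  # "M" and "Male" both appear
```

Accuracy is deliberately absent from the sketch: inaccurate values are valid by definition, so finding them usually requires an external source of truth, such as re-measuring with a calibrated scale.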

## Tidiness

Tidiness issues pertain to the structure of data. These structural problems generally prevent easy analysis. Untidy data is also known as messy data. The requirements for tidy data are:

• Each variable forms a column.
• Each observation forms a row.
• Each type of observational unit forms a table.
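
As an illustration of the first two requirements, the sketch below reshapes a small untidy table in which one variable (the treatment) is spread across column headers. It uses plain Python; the data and the `melt` helper are hypothetical and written only for this example.

```python
# Hypothetical untidy table: the treatment variable lives in the column headers.
wide = [
    {"person": "John Smith", "treatment_a": None, "treatment_b": 2},
    {"person": "Jane Doe", "treatment_a": 16, "treatment_b": 11},
]


def melt(rows, id_col, value_cols, var_name, value_name):
    """Turn each (row, value column) pair into one observation row."""
    tidy_rows = []
    for row in rows:
        for col in value_cols:
            tidy_rows.append(
                {id_col: row[id_col], var_name: col, value_name: row[col]}
            )
    return tidy_rows


# After reshaping, each variable (person, treatment, result) is a column
# and each person-treatment observation is a row.
tidy = melt(wide, "person", ["treatment_a", "treatment_b"], "treatment", "result")
for row in tidy:
    print(row)
```

The two wide rows become four observation rows, one per person and treatment, which is the shape most analysis tools expect.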

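In relational database terms, these requirements correspond to Codd’s Third Normal Form, restated in statistical language and applied to a single dataset. First Normal Form is the weaker condition that each cell holds a single value, so that the table forms a relation in the technical sense.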
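
In relational terms, giving each type of observational unit its own table means that facts about a unit are stored once and referenced by a key. The following is a minimal sketch in plain Python, assuming a hypothetical visit log that mixes patient-level and visit-level facts; all names and values are invented.

```python
# Hypothetical log mixing two observational units: patients and visits.
mixed = [
    {"patient_id": 1, "name": "Ann Lee", "visit_date": "2020-01-05", "weight_lb": 130},
    {"patient_id": 1, "name": "Ann Lee", "visit_date": "2020-02-09", "weight_lb": 128},
    {"patient_id": 2, "name": "Bob Roy", "visit_date": "2020-01-07", "weight_lb": 180},
]

# Patients table: one entry per patient, keyed by patient_id.
patients = {row["patient_id"]: {"name": row["name"]} for row in mixed}

# Visits table: one row per visit, referring to the patient by key only.
visits = [
    {
        "patient_id": row["patient_id"],
        "visit_date": row["visit_date"],
        "weight_lb": row["weight_lb"],
    }
    for row in mixed
]

print(patients)
print(visits)
```

Because each name now appears in exactly one place, a correction to a patient’s name cannot leave the visit rows inconsistent with one another.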