Tidy Data Defined
Data assessment involves examining:
- data Quality and
- data Tidiness.
Quality
Quality issues pertain to the content of data. Low-quality data is also known as dirty data. There are four dimensions of data quality:
- Completeness
- Do we have all of the records that we should, or are some missing?
- Are there specific rows, columns, or cells missing?
- Validity
- Invalid data doesn’t conform to a pre-defined schema. A schema is a defined set of rules for data.
- These rules can be real-world constraints and table-specific constraints:
- Example real-world constraint: People cannot be -60 inches tall.
- Example table-specific constraint: Every observation must have a unique key. A single social security number with multiple names associated with it violates this rule.
- Accuracy
- Inaccurate data is wrong data that is valid. It adheres to the defined schema, but it is still incorrect.
- Example: a patient’s weight that is 5 lbs too heavy because the scale was faulty.
- Consistency
- Inconsistent data is both valid and accurate, but there are multiple correct ways of referring to the same thing.
- Example: Gender is indicated as both M and Male in the same table.
- Columns that represent the same data, whether across tables or within a single table, should use one standard format.
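Several of these quality dimensions can be checked programmatically. The following is a minimal sketch using pandas, not a complete assessment procedure. The patients table, its column names, and its values are all invented for illustration. Accuracy is left out on purpose: a value such as a weight that is 5 lbs too heavy passes every schema check, so it usually cannot be detected from the table alone.

```python
import pandas as pd

# Hypothetical patient table, invented for illustration only.
patients = pd.DataFrame(
    {
        "patient_id": [1, 2, 2, 4],
        "height_in": [65.0, -60.0, 70.0, None],
        "gender": ["M", "Male", "F", "Female"],
    }
)

# Completeness: count the missing cells in each column.
missing_per_column = patients.isna().sum()

# Validity (real-world constraint): a height must be positive.
invalid_height = patients[patients["height_in"] <= 0]

# Validity (table-specific constraint): patient_id must be unique.
duplicate_ids = patients[patients["patient_id"].duplicated(keep=False)]

# Consistency: list the distinct spellings used in one column.
gender_labels = sorted(patients["gender"].unique())

# One possible fix: map every spelling to a single standard label.
patients["gender"] = patients["gender"].replace({"M": "Male", "F": "Female"})
```

Each check returns the offending rows or counts rather than changing the data, so the issues can be documented first and cleaned afterward. Only the last line modifies the table.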
Tidiness
Tidiness issues pertain to the structure of data. These structural problems generally prevent easy analysis. Untidy data is also known as messy data. The requirements for tidy data are:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
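As a rough illustration of the first two requirements, the sketch below reshapes a small messy table with pandas. The table, its column names, and its values are hypothetical. The year columns are really values of a single variable, so they are melted into one year column and one cases column.

```python
import pandas as pd

# Hypothetical messy table: the year columns hold values of one variable.
messy = pd.DataFrame(
    {
        "country": ["A", "B"],
        "2019": [10, 20],
        "2020": [11, 21],
    }
)

# Reshape so each variable forms a column and each observation forms a row.
tidy = messy.melt(id_vars="country", var_name="year", value_name="cases")

# The melted column names are strings, so convert the years to integers.
tidy["year"] = tidy["year"].astype(int)
```

After reshaping, each row describes one country in one year, which is the shape most grouping and plotting functions expect.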
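Relational theory describes tidy data in more precise terms. The three requirements above correspond closely to Codd's third normal form, restated in statistical language. First Normal Form alone only guarantees that the data forms a relation in the technical sense.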
Content for these notes was taken from information I learned while pursuing the Udacity Data Analyst Nanodegree. Udacity is an online tech education service that offers tech-focused training.