Getting Clean Data: Raw data vs. Tidy data

Definition of data:

    “Data are values of qualitative or quantitative variables, belonging to a set of items.”

    The raw data are the original source of data. They’re often very hard to use for data analysis because they’re complicated, hard to parse, or hard to analyze. Data analysis actually includes the processing or cleaning of the data. In fact, a huge component of a data scientist’s job is performing those sorts of processing operations. A critical component is that all steps should be recorded. Pre-processing often ends up being the most important component of the data analysis in terms of its effect on the downstream data, so a careful data scientist needs to understand what’s really happening in the entire data processing pipeline.

    Raw data

  • The original source of the data
  • Often hard to use for data analyses
  • Data analysis includes processing
  • Raw data may only need to be processed once

    Processed data

  • Data that is ready for analysis
  • Processing can include merging, subsetting, transforming, etc.
  • There may be standards for processing
  • All steps should be recorded

The four things you should have:

  1. The raw data
  2. A tidy data set
  3. A code book describing each variable and its values in the tidy data set
  4. An explicit and exact recipe you used to go from 1 -> 2,3

You know the raw data is in the right format if you:

  1. Ran no software on the data
  2. Did not manipulate any of the numbers in the data
  3. Did not remove any data from the data set
  4. Did not summarize the data in any way

Final form of tidy data:

  1. Each variable you measure should be in one column
  2. Each different observation of that variable should be in a different row
  3. There should be one table for each “kind” of variable
  4. If you have multiple tables, they should include a column in the table that allows them to be linked
  5. Include a row at the top of each file with variable names
  6. Make variable names human readable
  7. In general data should be saved in one file per table
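
As a rough illustration of rules 1, 2, 5 and 6, here is a minimal Python sketch that reshapes a "wide" table into tidy form. The data, the column names (subject_id, weight_kg_visit1, ...) and the helper functions are all made up for the example; they are not from the course material.

```python
import csv
import io

# Hypothetical wide-format data: one row per subject, one column per visit.
WIDE_CSV = """subject_id,weight_kg_visit1,weight_kg_visit2
1,70.5,71.0
2,82.3,81.9
"""


def wide_to_tidy(wide_text):
    """Return tidy rows: one observation (subject, visit, weight) per row."""
    tidy_rows = []
    for row in csv.DictReader(io.StringIO(wide_text)):
        for column, value in row.items():
            if column == "subject_id":
                continue
            # "weight_kg_visit1" -> visit number 1
            visit = int(column.rsplit("visit", 1)[1])
            tidy_rows.append({
                "subject_id": int(row["subject_id"]),
                "visit": visit,
                "weight_kg": float(value),
            })
    return tidy_rows


def write_tidy(rows):
    """Serialize tidy rows as CSV text, with variable names in the first row."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=["subject_id", "visit", "weight_kg"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


tidy = wide_to_tidy(WIDE_CSV)
print(write_tidy(tidy))
```

Each variable (subject, visit, weight) ends up in its own column, and each measurement gets its own row. The subject_id column is the kind of column that would let this table be linked to another one, for example a table of subject details.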

The Code Book:

  1. Information about the variables (including units) in the data set not contained in the tidy data
  2. Information about the summary choices you made
  3. Information about the experimental study design you used
  4. Common format: Word/text file
  5. “Study design” section: a thorough description of how you collected the data
  6. “Code book” section: describes each variable and its units
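
A code book is usually written by hand, but a small sketch can show one possible layout for the text file. Everything below (the variable list, the placeholder study design text, and the render_code_book helper) is hypothetical and only meant to illustrate the two sections described above.

```python
# Hypothetical variable descriptions for a tidy table.
VARIABLES = [
    {"name": "subject_id", "units": "none",
     "description": "Anonymous identifier for each participant"},
    {"name": "visit", "units": "none",
     "description": "Visit number, starting at 1"},
    {"name": "weight_kg", "units": "kilograms",
     "description": "Body weight measured at the visit"},
]

# Placeholder text: a real code book would describe how the data were collected.
STUDY_DESIGN = "Describe here how the data were collected."


def render_code_book(study_design, variables):
    """Build a plain-text code book with a study design section and a variable list."""
    lines = ["Study design", "", study_design, "", "Code book", ""]
    for var in variables:
        lines.append(f"{var['name']} ({var['units']}): {var['description']}")
    return "\n".join(lines) + "\n"


code_book_text = render_code_book(STUDY_DESIGN, VARIABLES)
print(code_book_text)
```

The printed text could be saved next to the tidy data set, for example as a plain text file.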

The Instruction List:

  1. Ideally a computer script (R or Python or …)
  2. The input for the script is the raw data
  3. The output is the processed, tidy data
  4. There are no parameters to the script
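
A script along these lines might look like the following Python sketch. The file names (raw_data.csv, tidy_data.csv) and the raw column names ("Subject ID", "Weight (kg)") are assumptions made for the example. The paths are fixed constants so that the script takes no parameters and every processing decision is recorded in the code.

```python
import csv
from pathlib import Path

# Fixed, hypothetical locations: the script takes no parameters.
RAW_PATH = Path("raw_data.csv")
TIDY_PATH = Path("tidy_data.csv")


def clean_row(row):
    """Turn one raw record into one tidy record with readable variable names."""
    return {
        "subject_id": row["Subject ID"].strip(),
        "weight_kg": str(float(row["Weight (kg)"])),
    }


def main():
    """Read the raw file and write the tidy file; the raw file is never modified."""
    with RAW_PATH.open(newline="") as raw, TIDY_PATH.open("w", newline="") as tidy:
        writer = csv.DictWriter(tidy, fieldnames=["subject_id", "weight_kg"])
        writer.writeheader()
        for row in csv.DictReader(raw):
            writer.writerow(clean_row(row))


if __name__ == "__main__":
    if RAW_PATH.exists():
        main()
    else:
        print(f"Raw data not found at {RAW_PATH}; nothing to do.")
```

Rerunning the script on the same raw file should always produce the same tidy file, which is what makes it an exact recipe.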
