GETTING CLEAN DATA: Reading local flat files

Reading  local CSV files

  if (!file.exists(“data”)){



    fileUrl <- “https://web_address&#8221;

    download.file(fileUrl, destfile = “cameras,csv”, method = “curl”)

    dataDownloaded <- data()

So now the data have been downloaded from the website, and actually is sitting on my computer, it’s local data to my computer.

The most common way that they’re loaded is with the read.table function:

Loading flat files – read.table():

  • The main function for reading data into R
  • Flexible and robust but requires more parameter
  • Reads the data into RAM – big data can cause problems
  • Important parameter: file, header, sep, row.names, nrows
  • Related: read.csv(), read.csv2()


cameraData <- read.table(“./data/cameras.csv”, sep = “,”, header = TRUE)

some important parameters:

  • quote: tell R whether there are any quoted values, quote = “” means no quotes
  • na.strings: set the character that represents a missing value
  • nrows: how many rows to read of the file
  • skip: number of lines to skip before starting to read

Reading Excel files

Download the excel file to load:


    fileUrl <- “https://web_address&#8221;

    download.file(fileUrl, destfile = “./data/cameras.xlsx”, method = “curl”)

    dateDownloaded <- data()

The R library that is useful for this is the xlsx package.


    cameraData <- read.xlsx(“./data/cameras.xlsx”, sheetIndex = 1, header = TRUE)

You can read specific rows and specific columns.

colIndex <- 2:3

    rowIndex <- 1:4

    cameraDataSubset <- read.xlsx(“./data/cameras.xlsx”, sheetIndex = 1, colIndex = colIndex, rowIndex = rowIndex)


  • The write.xlsx function will write out an Excel file with similar arguments
  • read.xlsx2  is much faster than read.xlsx but for reading subsets of rows may be slightly unstable
  • The XLConnect package has more options for writing and manipulating Excel files
  • The XLConnect vignette is a good place to start for that package
  • In general it is advised to store your data in either  a database or in comma separated files (.csv) or tab separated files (.tab/.txt) se they are easier to distribute


GETTING CLEAN DATA: Downloading files

Knowing your working directory:

getwd() : gets the working directory, tells you what directory you’re currently in

    setwd(): sets a different working directory that you might want to move to.

Checking for and creating directories:

file.exists(“directoryName”): will check to see if the directory exists

dir.create(“directoryName”): will create a directory if it doesn’t exist

example (checking for a “data” directory and creating it if it doesn’t exist):

if (!file.exists(“data”)) {



Getting data from the internet – download.file():

Downloads a file from the internet

parameters: url: the place that you’re going to be getting data from.

destfile: the destinaiton file where the data is going to go.

method: needs to be specified particularly when dealing with https.

Useful for downloading tab-limited, CSV files, Excel files.

Download a file from the web:

fileUrl <- “https://address&#8221;

download.file(fileUrl, destfile = “./data/cameras.csv”, method = “curl”)



  • If the url starts with http you can use download.file()
  • If the url starts with https on Mac you may need to set method = “curl”
  • If the file is big, this might take a while
  • Be sure to record when you downloaded





Getting Clean Data: Raw data vs. Tidy data

Definition of data:

    “Data are values of qualitative or quantitative variables, belonging to a set of items.”

    The raw data are the original source of data. They’re often very hard to use for data analysis, because they’re complicated or they’re complicated or they’re hard to parse, or they’re very hard to analyze. Data analysis actually includes the processing or the cleaning of the data. In fact, a huge component of a data scientist’s job is performing those sorts of processing operations. A critical component is that all steps should be recorded. Pre-processing often ends up being the most important component of the data analysis in terms of effect on the downstream data. If you’re going to be a data scientist who’s careful about understanding what’s really happening in the entire data processing pipeline.

    Raw data

  • The original source of the data
  • Often hard to use for data analyses
  • Data analysis includes processing
  • Raw data may only need to be processed once

    Processed data

  • Data that is ready for analysis
  • Processing can include merging, subsetting, transforming, etc.
  • There may be standards for processing
  • All steps should be recorded

The four things you should have:

  1. The raw data
  2. A tidy data set
  3. A code book describing each variable and its values in the tidy data set
  4. An explicit and exact recipe you used to go from 1 -> 2,3

You know the raw data is in the right format if you:

  1. Ran no software on the data
  2. Did not manipulate any of the numbers in the data
  3. You did not remove any data from the data set
  4. You did not summarize the data in any way

Final form of tidy data:

  1. Each variable you measure should be in one column
  2. Each different observation of that variable should be in a different row
  3. There should be one table for each “kind” of variable
  4. If you have multiple tables, they should include a column in the table that allows them to be linked
  5. Include a row at the top of each file with variable names
  6. Make variable names human readable
  7. In general data should be saved in one file per table

The Code Book:

  1. Information about the variables (including units) in the data set not contained in the tidy data
  2. Information about the summary choices you made
  3. Information about the experimental study design you used
  4. Common format: Word/text file
  5. “Study design” section: a thorough description of how you collected the data
  6. “Code book: section: describes each variable and its units

The Instruction List:

  1. Ideally a computer script (R or Python or …)
  2. The input for the script is the raw data
  3. The output is the processed, tidy data
  4. There are no parameters to the script

Job Trends in the Analytics Market: New, Improved, now Fortified with C, Java, MATLAB, Python, Julia and Many More!

I’m expanding the coverage of my article, The Popularity of Data Analysis Software. This is the first installment, which includes a new opening and a greatly expanded analysis of the analytics job market. Here it is, from the abstract onward through the first section…

Abstract: This article presents various ways of measuring the popularity or market share of software for analytics including: Alteryx, Angoss, C / C++ / C#, BMDP, Cognos, Java, JMP, Lavastorm, MATLAB, Minitab, NCSS, Oracle Data Mining, Python, R, SAP Business Objects, SAP HANA, SAS, SAS Enterprise Miner, Salford Predictive Modeler (SPM) etc., TIBCO Spotfire, SPSS, Stata, Statistica, Systat, Tableau, Teradata Miner, WEKA / Pentaho. I don’t attempt to differentiate among variants of languages such as R vs. Revolution…

View original post 2,161 more words

R note: quantiles, averages, standard deviations

    To get a summary from most basic R statistics you may enter

> summary(dataset$variable)

The typical output the summary() function gives include:

Min, 1st Qu, Median, Mean, 3rd Qu. Max.

Minimum, Maximum and Range in R









There are 4 basic quantiles in every data collection.

>quantile(dataset$variable, 1/4)  #Gives the first quantile

>quantile(dataset$variable, 2/4)  Gives the second quantile

>quantile(dataset$variable, 3/4)  Gives the third quantile

>quantile(dataset$variable, 4/4)  Gives the fourth quantile

Mean Absolute Deviation in R


Median Absolute Deviation (MAD) or Absolute Deviation Around the Median is a robust measure of central tendency (the most common measures of central tendency are the arithmetic mean, the median and the mode).

Robust statistics are statistics with good performance for data drawn from a wide range of non-normally distributed probability distributions. Unlike the standard mean/standard deviation combo, MAD is not sensitive to the presence of outliers. The interquartile range is also resistant to the influence of outliers, although the mean and median absolute deviation are better in that they can be converted into values that approximate the standard deviation.

Essentially the breakdown point for a parameter (median, mean, variance, etc.) is the proportion or number of arbitrarily small or large extreme values that must be introduced into a sample to cause the estimate to yield an arbitrarily bad result. The median’s breakdown point is .5 or half (the mean’s is 0). This means that the median only becomes “bad” when more than 50% of the observations are infinite.


set <- c(2, 6, 6, 12, 17, 25, 32)

The median is 12 and the mean is 14.28.


Constant “b” in the formula above is depending on the distribution. b=1.4826 when dealing with normally distributed data, but we’ll need to calculate a new “b” if a different underlying distribution is assumed:

b = 1/Q(0.75) (0.75 quantile of that underlying distribution)

To calculate the MAD, we find the median of absolute deviations from the median. In other words, the MAD is the median of the absolute values of the residuals (deviations) from the data’s median.

Using the same set from earlier:

  1. [(2 – 12), (6 – 12), (6 – 12), (12 – 12), (17 – 12), (25 – 12) ,(32 – 12)] Subtract median from each i
  2. |[-10, -6, -6, 0, 5, 13, 20]| Take the absolute value of the list
  3. [10, 6, 6, 0, 5, 13, 20] Find the median
  4. [10, 6, 6, 0, 5, 13, 20] -> [0, 5, 6, 6, 10, 13, 20] -> 6
  5. 6 * b ->  6 * 1.4826 = 8.8956

We now have our MAD (8.8956) to use in our predetermined threshold. Going back to our example set’s median of 12 we can use +/- 2 or 2.5 or 3 MAD. For example:
12 + 2*8.8956 = 29.7912 as out upper threshold
12 – 2*8.8956 = -5.7912 as out lower threshold

Using this criteria we can identify 32 as an outlier in our example set of [2, 6, 6, 12, 17, 25 ,32].

R code for MAD

mad(x, center = median(x), constant = 1.4826, na.rm = FALSE, low = FALSE, high = FALSE)


Standard Deviation and Variation in R



First Coursera Course: Data Analysis and Statistical Inference

When I decided to add “Learning R and Python” into this year’s to-do list, I started searching for learning materials. There is one open source text book called “Open Intro Stat” in my opinion is well organized and come with several labs that will walk you through fundamentals of R. And it’s very interesting how new resources and informations just pops up in front of you when you’ve tuned yourself in certain frequency (I guess that what people called the Law of Secret).

There is this Data Analysis and Statistical Inference class offered by the online education website called Coursera. Online education is a pretty new experience to me. So far I’m enjoying it. Usually there are few common components in a course. Video lectures divided into weeks, a quiz that briefly test your knowledge based on each week’s lecture, assignment, labs, or project.

We also have to do a project for this class. The due date for the proposal is March 10th. So I still have about a week or so to make up my mind what to do. We can use either the dataset this course provided or choose our own. Kaggle has some very interesting competitions and I probably could dig some fun topic and data. A more real-life case should be more beneficial to me, but I’m not sure how much business or economics knowledge is required.