Job Trends in the Analytics Market: New, Improved, now Fortified with C, Java, MATLAB, Python, Julia and Many More!

r4stats.com

I’m expanding the coverage of my article, The Popularity of Data Analysis Software. This is the first installment, which includes a new opening and a greatly expanded analysis of the analytics job market. Here it is, from the abstract onward through the first section…

Abstract: This article presents various ways of measuring the popularity or market share of software for analytics including: Alteryx, Angoss, C / C++ / C#, BMDP, Cognos, Java, JMP, Lavastorm, MATLAB, Minitab, NCSS, Oracle Data Mining, Python, R, SAP Business Objects, SAP HANA, SAS, SAS Enterprise Miner, Salford Predictive Modeler (SPM) etc., TIBCO Spotfire, SPSS, Stata, Statistica, Systat, Tableau, Teradata Miner, WEKA / Pentaho. I don’t attempt to differentiate among variants of languages such as R vs. Revolution…

R note: quantiles, averages, standard deviations

To get a summary of the most basic statistics in R, enter

> summary(dataset$variable)

The summary() function typically reports:

Min., 1st Qu., Median, Mean, 3rd Qu., Max.
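As a minimal sketch of what this looks like in practice, the snippet below calls summary() on a small hypothetical vector (the name x and its values are assumptions for illustration, not a dataset from this post) and pulls single entries out of the result by name.

```r
# Hypothetical sample data; any numeric vector behaves the same way
x <- c(2, 6, 6, 12, 17, 25, 32)

# summary() returns a named object holding the six statistics
s <- summary(x)
print(s)

# Individual statistics can be extracted by name
med <- s[["Median"]]
avg <- s[["Mean"]]
print(med)
print(avg)
```

Extracting values by name is handy when the summary feeds into later calculations rather than just being printed.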

Minimum, Maximum and Range in R

>min(dataset$variable)

>max(dataset$variable)

>range(dataset$variable)

Mean

>mean(dataset$variable)

Median

>median(dataset$variable)

Quantiles

Three quartiles (the 0.25, 0.5 and 0.75 quantiles) split any data set into four equal-sized parts; the 4/4 quantile is simply the maximum.

>quantile(dataset$variable, 1/4)  #Gives the first quartile

>quantile(dataset$variable, 2/4)  #Gives the second quartile (the median)

>quantile(dataset$variable, 3/4)  #Gives the third quartile

>quantile(dataset$variable, 4/4)  #Gives the maximum (100th percentile)
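quantile() also accepts a vector of probabilities, so several quantiles can be requested in one call. The sketch below uses a hypothetical vector x (an assumption for illustration) and R's default quantile algorithm (type 7).

```r
# Hypothetical sample data
x <- c(2, 6, 6, 12, 17, 25, 32)

# Request the three quartiles in a single call
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
print(q)

# The interquartile range is the third quartile minus the first
iqr <- IQR(x)
print(iqr)
```

Note that other values of the type argument of quantile() use different interpolation rules and can give slightly different results on small samples.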

Median Absolute Deviation in R

Median Absolute Deviation (MAD), or Absolute Deviation Around the Median, is a robust measure of dispersion. It plays the role for the median that the standard deviation plays for the mean (the most common measures of central tendency are the arithmetic mean, the median and the mode).

Robust statistics are statistics with good performance for data drawn from a wide range of non-normally distributed probability distributions. Unlike the standard mean/standard deviation combo, MAD is not sensitive to the presence of outliers. The interquartile range is also resistant to the influence of outliers, although the mean and median absolute deviation are better in that they can be converted into values that approximate the standard deviation.

Essentially, the breakdown point of an estimator (median, mean, variance, etc.) is the proportion of arbitrarily small or large extreme values that must be introduced into a sample to make the estimate arbitrarily bad. The median’s breakdown point is 0.5, or half, whereas the mean’s is 0: a single extreme observation is enough to ruin the mean. This means that the median only becomes “bad” once about 50% or more of the observations are pushed toward infinity.
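A quick way to see the difference in breakdown points is to corrupt a single observation and compare the two estimators. This is only an illustrative sketch; the vector x and the extreme value 1e6 are assumptions chosen for the demonstration.

```r
# Hypothetical sample data
x <- c(2, 6, 6, 12, 17, 25, 32)

# Replace one observation with an arbitrarily large value
x_bad <- x
x_bad[7] <- 1e6

# The mean is dragged far away by the single corrupted value
mean_clean <- mean(x)
mean_bad <- mean(x_bad)

# The median does not move at all
median_clean <- median(x)
median_bad <- median(x_bad)

print(c(mean_clean, mean_bad))
print(c(median_clean, median_bad))
```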

Example

set <- c(2, 6, 6, 12, 17, 25, 32)

The median is 12 and the mean is approximately 14.29.

MAD = b * median(|x_i - median(x)|)

The constant “b” in the formula above depends on the distribution. b = 1.4826 when dealing with normally distributed data, but we’ll need to calculate a new “b” if a different underlying distribution is assumed:

b = 1/Q(0.75), where Q(0.75) is the 0.75 quantile of that underlying distribution. For the standard normal distribution Q(0.75) ≈ 0.6745, which gives b ≈ 1.4826.
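For the normal case, the constant can be checked directly in R, since qnorm() returns quantiles of the standard normal distribution. The variable name b_normal is just an illustrative choice.

```r
# 0.75 quantile of the standard normal distribution
q75 <- qnorm(0.75)

# Scale constant for normally distributed data
b_normal <- 1 / q75

print(q75)
print(b_normal)   # approximately 1.4826
```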

To calculate the MAD, we find the median of absolute deviations from the median. In other words, the MAD is the median of the absolute values of the residuals (deviations) from the data’s median.

Using the same set from earlier:

  1. Subtract the median from each value: [(2 - 12), (6 - 12), (6 - 12), (12 - 12), (17 - 12), (25 - 12), (32 - 12)] = [-10, -6, -6, 0, 5, 13, 20]
  2. Take the absolute value of each deviation: [10, 6, 6, 0, 5, 13, 20]
  3. Sort the absolute deviations: [0, 5, 6, 6, 10, 13, 20]
  4. Find their median: 6
  5. Multiply by b: 6 * 1.4826 = 8.8956
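The steps above can be sketched in R and compared against the built-in mad() function. The variable names are illustrative, and x holds the same values as the example set.

```r
# Same values as the example set
x <- c(2, 6, 6, 12, 17, 25, 32)
b <- 1.4826   # scale constant for normally distributed data

# Step 1: subtract the median from each value
deviations <- x - median(x)

# Step 2: take absolute values
abs_dev <- abs(deviations)

# Steps 3 and 4: median of the absolute deviations (median() sorts internally)
raw_mad <- median(abs_dev)

# Step 5: scale by b
mad_manual <- b * raw_mad

# The built-in function uses constant = 1.4826 by default
mad_builtin <- mad(x)

print(mad_manual)
print(mad_builtin)
```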

We now have our MAD (8.8956) to use in a predetermined threshold. Going back to our example set’s median of 12, we can use a cutoff of +/- 2, 2.5 or 3 MAD. For example, with 2 MAD:
12 + 2*8.8956 = 29.7912 as our upper threshold
12 - 2*8.8956 = -5.7912 as our lower threshold

Using this criterion we can identify 32 as an outlier in our example set of [2, 6, 6, 12, 17, 25, 32].
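The same thresholding can be written as a short R sketch. The multiplier k and the variable names are assumptions for illustration; k = 2 matches the worked example, while a larger k gives a more conservative rule.

```r
# Same values as the example set
x <- c(2, 6, 6, 12, 17, 25, 32)
k <- 2   # number of MADs allowed on either side of the median

center <- median(x)
spread <- mad(x)   # scaled MAD, constant = 1.4826 by default

lower <- center - k * spread
upper <- center + k * spread

# Values outside [lower, upper] are flagged as outliers
outliers <- x[x < lower | x > upper]

print(c(lower, upper))
print(outliers)
```

With k = 3 the upper threshold rises above 32, so no value in this set would be flagged; the choice of k is a judgment call.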

R code for MAD

mad(x, center = median(x), constant = 1.4826, na.rm = FALSE, low = FALSE, high = FALSE)

 

Standard Deviation and Variance in R

>sd(dataset$variable)

>var(dataset$variable)
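Both functions use the sample (n - 1) denominator, and the standard deviation is the square root of the variance. A small sketch on a hypothetical vector x (an assumption for illustration) makes the relationship explicit.

```r
# Hypothetical sample data
x <- c(2, 6, 6, 12, 17, 25, 32)

v <- var(x)
s <- sd(x)

# Manual sample variance with the n - 1 denominator
n <- length(x)
v_manual <- sum((x - mean(x))^2) / (n - 1)

print(v)
print(s)
print(all.equal(s, sqrt(v)))
print(all.equal(v, v_manual))
```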

First Coursera Course: Data Analysis and Statistical Inference

When I decided to add “Learning R and Python” to this year’s to-do list, I started searching for learning materials. There is an open source textbook called “Open Intro Stat” that, in my opinion, is well organized and comes with several labs that walk you through the fundamentals of R. And it’s very interesting how new resources and information just pop up in front of you once you’ve tuned yourself to a certain frequency (I guess that’s what people call the Law of Secret).

There is a Data Analysis and Statistical Inference class offered by the online education website Coursera. Online education is a pretty new experience for me, and so far I’m enjoying it. A course usually has a few common components: video lectures divided into weeks, a quiz that briefly tests your knowledge of each week’s lectures, and assignments, labs, or a project.

We also have to do a project for this class. The due date for the proposal is March 10th, so I still have about a week to make up my mind about what to do. We can use either the dataset the course provides or choose our own. Kaggle has some very interesting competitions, and I could probably dig up a fun topic and data there. A more real-life case would be more beneficial to me, but I’m not sure how much business or economics knowledge is required.

 

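Shi Yigong’s Tips on How to Write a Good Scientific Paper

1. First, make a habit of reading English-language articles, aiming for 30-60 minutes every day.

2. Logic matters most. The logic comes from an overall analysis of the experimental data. You must first talk through a clear line of reasoning, then make the figures according to that reasoning, and only then begin to write.

3. When you actually write, first build an outline made up mainly of subheadings that follows your line of reasoning (that is, the figures), and then start writing in detail. In the first draft, do not chase a perfect sentence every time, still less ornate wording; focus instead on the logic flow, paying attention to the logical relationship between consecutive sentences and between adjacent paragraphs. While writing, give it your full effort and shut out outside distractions as far as possible (switch off your mobile phone and landline), so that you produce a first draft in the shortest possible time. One more point: do not let a sentence run too long.

4. Learn to follow a model. Nobody is born able to write an excellent research paper; everyone learns it from others. When studying other people’s papers, mind the differences between fields: some fields (including my own, structural biology) have their own writing conventions. Some phrases in research papers are set formulas, such as “to investigate the mechanism of…, we performed…”, “these results support the former, but not the latter, hypothesis…”, “despite recent progress, how … remains to be elucidated…”, and so on. After using them a couple of times, you gradually learn to apply them flexibly.

When learning from others, never plagiarize.

5. After finishing the first draft, give yourself no more than one day of rest before starting on the second draft. Revision should still focus on logic, but weigh every sentence, and choose the wording of the abstract and the key sentences in the main text with great care. Learn to use a thesaurus (for synonym substitution) to avoid excessive repetition.

6. Revisions after the second draft concentrate on specific words and sentences and no longer change the overall logic. Before submitting, be sure to read the whole paper through once and make small changes to individual words and sentences. [Academic journals generally do not reject a paper because of specific grammatical errors, but they will certainly reject one because of muddled logic.]

A paper is only a vehicle: it exists to announce your scientific findings to your peers and is an important tool for communication in science. So when writing a research paper, always keep in mind: express the clearest meaning in the simplest words, and make sure the logic is rigorous!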