library(nlme)
library("dygraphs")

Berlin School of Economics and Law

BSEL

Data Science Activities

BIPM at BSEL

Outline

Why R is the best data science language to learn today

R consistently ranks among the best languages

  • IEEE: R ranks #5
  • O’Reilly: R is arguably the most common data programming language
  • Redmonk: R is #12
  • TIOBE: R ranks high with consistent upward trend

R is excellent for learning data science

R is a true “data language”

R is a language that has statistics and data built into its DNA, so to speak.

In this sense, R is nearly unique among programming languages. It is a language that has been built for statistics. It’s been designed for data.

This has advantages when you’re learning data science, because almost any statistical test or technique can be found somewhere within base R or one of its packages.
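
As a small illustration of that point, the sketch below runs a two-sample t-test and a linear regression using only functions and data sets that ship with base R. The choice of the `sleep` and `cars` data sets is an assumption made for this example; it is not part of the original talk.

```r
# Both data sets ship with base R (package 'datasets'); no install needed.
data(sleep)
data(cars)

# Welch two-sample t-test from the 'stats' package:
# does the extra sleep differ between the two drug groups?
tt <- t.test(extra ~ group, data = sleep)
print(tt)

# Simple linear regression: stopping distance as a function of speed.
fit <- lm(dist ~ speed, data = cars)
print(coef(fit))
```

Neither call needs an add-on package, which is the practical meaning of statistics being "built in".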

The best books and resources use R

This is important. If you’re a beginner, and you’re just getting started in data science, you’ll have a lot to learn. To truly master data science, you’ll need to learn several sub-areas like probability, statistics, data visualization, data manipulation, and machine learning. All of these skill areas have theoretical foundations (which you’ll need to learn) but also practical techniques that you’ll need to execute by writing code.

Strong in Academics AND Industry

R is in heavy use at several of the best companies that hire data scientists.

As Revolution Analytics recently noted, “R is also the tool of choice for data scientists at Microsoft, who apply machine learning to data from Bing, Azure, Office, and the Sales, Marketing and Finance departments.”

Beyond tech giants like Google, Facebook, and Microsoft, R is used at a wide range of companies, including Bank of America, Ford, TechCrunch, Uber, and Trulia.

Media Exposure

Community

Reproducible Research

“The term reproducible research refers to the idea that the ultimate product of academic research is the paper along with the full computational environment used to produce the results in the paper such as the code, data, etc. that can be used to reproduce the results and create new work based on the research”

Science Magazine

Open Source

e.g. state of the art boosting library gbm

gbm Cpp sources


Enough “hot air”

Non-spatial data: examples

Ex1: Time Series Analysis

Interactive Data Exploration

dygraph(Global.ts) %>% dyRangeSelector()

Yearly Data

plot(Global.annual);grid()
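
The slides do not show how `Global.ts` and `Global.annual` were built. A minimal sketch of the usual pattern follows, assuming a monthly series that is averaged to yearly values; the simulated numbers and the names `sim.ts` and `sim.annual` are hypothetical stand-ins, not the temperature data used in the talk.

```r
set.seed(1)

# Hypothetical stand-in for monthly data: 30 years of a weak trend plus noise.
n_months <- 12 * 30
sim <- 0.0015 * seq_len(n_months) + rnorm(n_months, sd = 0.1)

# Monthly time series starting in January 1976.
sim.ts <- ts(sim, start = c(1976, 1), frequency = 12)

# Collapse to one value per year by averaging the 12 monthly values.
sim.annual <- aggregate(sim.ts, FUN = mean)

plot(sim.annual); grid()
```

Averaging within each year removes the seasonal component, so the yearly plot shows the trend more clearly than the monthly one.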

Global Warming?

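# Monthly series restricted to January 1970 through December 2005
Last35 <- window(Global.ts, start = c(1970, 1), end = c(2005, 12))
Last35Yrs <- time(Last35)
# Ordinary least squares trend
fitAD <- lm(Last35 ~ Last35Yrs)
# printShortsummary() is not part of base R or nlme; presumably a helper defined elsewhere in the deck
printShortsummary(summary(fitAD), TableOnly = TRUE)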
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -34.920409   1.164899  -29.98   <2e-16 ***
## Last35Yrs     0.017654   0.000586   30.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mar = c(8, 3, 0, 0)); plot(Last35); abline(fitAD, col = 2)

Standard Errors incorrect: OLS assumes independent residuals, but these residuals are autocorrelated

Generalized Least Squares

# AR(1) error structure; 0.8 is the starting value for the lag-1 correlation
x.gls <- gls(Last35 ~ Last35Yrs, correlation = corAR1(0.8))
confint(x.gls)
##                    2.5 %       97.5 %
## (Intercept) -39.80571504 -28.49659109
## Last35Yrs     0.01442275   0.02011148
par(mar = c(7, 3, 1, 1))
pacf(fitAD$residuals,lag.max = 10)
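
To make the effect of autocorrelation on the standard errors concrete, here is a self-contained sketch on simulated data. It assumes a linear trend with AR(1) errors; the slope, noise level, and AR coefficient are illustrative choices, and the names `fit_ols` and `fit_gls` are hypothetical. It compares the width of the slope's confidence interval under OLS and under GLS with an AR(1) correlation structure.

```r
library(nlme)
set.seed(42)

# Simulated series: linear trend plus AR(1) noise (illustrative values only).
n <- 300
t_idx <- seq_len(n)
e <- as.numeric(arima.sim(model = list(ar = 0.8), n = n, sd = 0.1))
d <- data.frame(y = 0.002 * t_idx + e, t_idx = t_idx)

# OLS treats the errors as independent.
fit_ols <- lm(y ~ t_idx, data = d)

# GLS estimates the lag-1 correlation alongside the trend;
# 0.8 is only the starting value for that estimate.
fit_gls <- gls(y ~ t_idx, data = d, correlation = corAR1(0.8))

# Width of the 95% confidence interval for the slope under each model.
width_ols <- unname(diff(confint(fit_ols)["t_idx", ]))
width_gls <- unname(diff(confint(fit_gls)["t_idx", ]))
print(c(ols = width_ols, gls = width_gls))
```

With positively autocorrelated errors, the OLS interval is typically too narrow, which is the same pattern seen when comparing the `lm` and `gls` output for the temperature series.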

Ex2: Interactive Charts

World Economic Data

#demo(WorldBank);save(M,file="worldBank.rda")
#load("worldBank.rda")
#plot(M)
#print(M, file="figures/WorldBank.html")

http://localhost:8000/WorldBank.html

Ex3: Crime Data

kaggle crime data

Ex3: Daily Pattern

kaggle GAM2

Weekhour Patterns

kaggle GAM1

Appendix

The IEEE ranking system uses a set of 12 metrics, including Google search volume, Google Trends, Twitter hits, GitHub repositories, Hacker News posts, and more.

Keep in mind that the TIOBE index is structured to be “an indicator of the popularity of programming languages. The index is updated once a month. The ratings are based on the number of skilled engineers world-wide, courses and third party vendors. Popular search engines such as Google, Bing, Yahoo!, Wikipedia, Amazon, YouTube and Baidu are used to calculate the ratings.”

Another frequently cited language ranking system is the Redmonk Programming Language Rankings, which are derived from popularity on GitHub (lines of code) and popularity on Stack Overflow (number of tags).