R: from data science language to complementing GIS analysis?

library(nlme)
library("dygraphs")

Berlin School of Economics and Law

BSEL

Data Science Activities

Outline

R intro/overview/advocacy
Examples of non-spatial data explorations
Light spatial data visualization
A more serious look at spatial data in R
(Optional) Interactive Apps with spatial data
Research ideas

Why R is the best data science language to learn today

R consistently ranks among the best languages

IEEE: R ranks #5
O’Reilly: R is arguably the most common data programming language
Redmonk: R is #12
TIOBE: R ranks high with consistent upward trend

consistent upward trend

R is excellent for learning data science

R is a true “data language”

R is a language that has statistics and data built into its DNA, so to speak.

In this sense, R is nearly unique among programming languages. It is a language that has been built for statistics. It’s been designed for data.

This has advantages when you’re learning data science, because almost any statistical test or technique can be found somewhere within base R or one of its packages.

The best books and resources use R

This is important. If you’re a beginner, and you’re just getting started in data science, you’ll have a lot to learn. To truly master data science, you’ll need to learn several sub-areas like probability, statistics, data visualization, data manipulation, and machine learning. All of these skill areas have theoretical foundations (which you’ll need to learn) but also practical techniques that you’ll need to execute by writing code.

The best books and resources use R

Learn Data Visualization in R
Learn Probability with R
Learn frequentist statistics with R
Learn Bayesian statistics with R
Learn machine learning with R

Strong in Academics AND Industry

R is in heavy use at several of the best companies that are hiring data scientists.

Google
Facebook
Microsoft
Bank of America, Ford, TechCrunch, Uber, and Trulia

As Revolution Analytics recently noted, “R is also the tool of choice for data scientists at Microsoft, who apply machine learning to data from Bing, Azure, Office, and the Sales, Marketing and Finance departments.”

Beyond tech giants like Google, Facebook, and Microsoft, R is in use at a wide range of companies, including Bank of America, Ford, TechCrunch, Uber, and Trulia.

Media Exposure

Community

Reproducible Research

“The term reproducible research refers to the idea that the ultimate product of academic research is the paper along with the full computational environment used to produce the results in the paper such as the code, data, etc. that can be used to reproduce the results and create new work based on the research”

Science Magazine

Open Source

e.g. the state-of-the-art boosting library gbm

gbm C++ sources

Enough “hot air”

Non-spatial data: examples

Ex1: Time Series Analysis

Interactive Data Exploration

dygraph(Global.ts) %>%  dyRangeSelector()

Yearly Data

plot(Global.annual);grid()
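
The objects `Global.ts` and `Global.annual` are not defined in this excerpt. A minimal sketch of how such objects might be built with base R, assuming the monthly values are available as a plain numeric vector. The vector `monthly_anomaly`, the start year 1900, and the simulated values are all assumptions for illustration, not the data used on the slides:

```r
# Simulated stand-in for monthly temperature anomalies (NOT real data)
set.seed(1)
n_months <- (2005 - 1900 + 1) * 12
monthly_anomaly <- cumsum(rnorm(n_months, mean = 0.0005, sd = 0.02))

# Monthly time series; the start date is an arbitrary assumption
Global.ts <- ts(monthly_anomaly, start = c(1900, 1), frequency = 12)

# Collapse to one value per year by averaging the 12 monthly values
Global.annual <- aggregate(Global.ts, FUN = mean)
```

With a `ts` object in hand, `dygraph()` and `plot()` can be called on it exactly as in the slides.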

Global Warming?

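# Restrict the monthly series to January 1970 through December 2005
Last35 <- window(Global.ts, start = c(1970, 1), end = c(2005, 12))
Last35Yrs <- time(Last35)
# Ordinary least squares fit of a linear trend
fitAD <- lm(Last35 ~ Last35Yrs)
# printShortsummary() is not defined in this excerpt (presumably the author's own helper)
printShortsummary(summary(fitAD), TableOnly = TRUE)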

##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -34.920409   1.164899  -29.98   <2e-16 ***
## Last35Yrs     0.017654   0.000586   30.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

par(mar = c(8, 3, 0, 0)); plot(Last35); abline(fitAD, col = 2)

Standard Errors incorrect

Generalized Least Squares

# Refit the trend allowing AR(1)-correlated errors (initial correlation 0.8)
x.gls <- gls(Last35 ~ Last35Yrs, correlation = corAR1(0.8))
confint(x.gls)

##                    2.5 %       97.5 %
## (Intercept) -39.80571504 -28.49659109
## Last35Yrs     0.01442275   0.02011148

# Partial autocorrelation of the OLS residuals
par(mar = c(7, 3, 1, 1))
pacf(residuals(fitAD), lag.max = 10)
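
The point of the "Standard Errors incorrect" slide can be illustrated on simulated data. A minimal sketch, assuming a linear trend with AR(1) errors (the trend, the AR coefficient 0.7, and the noise level are made-up values): ordinary least squares treats the residuals as independent and so reports a confidence interval for the slope that is too narrow, while `gls()` with `corAR1()` accounts for the serial correlation.

```r
library(nlme)

set.seed(42)
n <- 432                                  # 36 years of monthly values
t <- (seq_len(n) - 1) / 12                # years since the start of the series

# Linear trend plus AR(1) noise (simulated, not the temperature data)
err <- as.numeric(arima.sim(model = list(ar = 0.7), n = n, sd = 0.1))
sim <- data.frame(t = t, y = 0.0177 * t + err)

ols_fit <- lm(y ~ t, data = sim)
gls_fit <- gls(y ~ t, data = sim, correlation = corAR1(0.7))

# Lag-1 autocorrelation of the OLS residuals (clearly non-zero here)
r1 <- acf(resid(ols_fit), plot = FALSE)$acf[2]

# Width of the 95% confidence interval for the slope under each model
width_ols <- diff(confint(ols_fit)["t", ])
width_gls <- diff(confint(gls_fit)["t", ])
```

Comparing `width_ols` with `width_gls` shows how much the naive interval understates the uncertainty in the trend.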

Ex2: Interactive Charts

World Economic Data

#demo(WorldBank);save(M,file="worldBank.rda")
#load("worldBank.rda")
#plot(M)
#print(M, file="figures/WorldBank.html")

http://localhost:8000/WorldBank.html

Ex3: Crime Data

kaggle crime data

Overall Trends

kaggle trend

Ex3: Daily Pattern

kaggle GAM2

Weekhour Patterns

kaggle GAM1
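
The figure labels suggest that the daily and weekly patterns were estimated with generalized additive models, but the model code is not part of this excerpt. A minimal sketch of one way such a daily pattern might be fitted, assuming the `mgcv` package and simulated hourly counts (the data frame `crime`, its columns, and the Poisson family are assumptions, not the Kaggle data or the author's model):

```r
library(mgcv)

set.seed(7)
# Simulated incident counts per hour of day over 60 days (NOT the Kaggle data)
crime <- data.frame(hour = rep(0:23, times = 60))
rate <- exp(2 + 0.5 * sin(2 * pi * (crime$hour - 9) / 24))
crime$count <- rpois(nrow(crime), lambda = rate)

# Cyclic cubic spline: the fitted curve is constrained to join up at 0h and 24h
fit <- gam(count ~ s(hour, bs = "cc", k = 10), family = poisson,
           data = crime, knots = list(hour = c(0, 24)))

# Expected counts on a half-hourly grid, on the response scale
grid_hours <- data.frame(hour = seq(0, 24, by = 0.5))
grid_hours$fit <- as.numeric(predict(fit, newdata = grid_hours, type = "response"))
```

A week-hour pattern could be sketched the same way by replacing `hour` with an hour-of-week variable running from 0 to 168 and moving the boundary knots accordingly.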

Appendix

The IEEE ranking uses a set of 12 metrics, including Google search volume, Google Trends, Twitter hits, GitHub repositories, and Hacker News posts.

Keep in mind that the TIOBE index is structured to be “an indicator of the popularity of programming languages. The index is updated once a month. The ratings are based on the number of skilled engineers world-wide, courses and third party vendors. Popular search engines such as Google, Bing, Yahoo!, Wikipedia, Amazon, YouTube and Baidu are used to calculate the ratings.”

Another frequently cited language ranking system is the Redmonk Programming Language Rankings, which are derived from popularity on GitHub (lines of code) and popularity on Stack Overflow (number of tags).

R: from data science language to complementing GIS analysis?

M Loecher

30 January 2017
