<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>nielsenmark.us</title>
    <link>https://nielsenmark.us/</link>
    <description>Recent content on nielsenmark.us</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <copyright>Mark Nielsen. Powered by [Hugo](//gohugo.io). Theme by [PPOffice](http://github.com/ppoffice).</copyright>
    <lastBuildDate>Thu, 14 Mar 2019 00:00:00 +0000</lastBuildDate>
    
        <atom:link href="https://nielsenmark.us/index.xml" rel="self" type="application/rss+xml" />
    
    
    <item>
      <title>Mixing Up Your Office March Madness Competition</title>
      <link>https://nielsenmark.us/2019/03/14/mixing-up-march-madness/</link>
      <pubDate>Thu, 14 Mar 2019 00:00:00 +0000</pubDate>
      
      <guid>https://nielsenmark.us/2019/03/14/mixing-up-march-madness/</guid>
      <description>

&lt;p&gt;March Madness is almost here! Try mixing up your office bracket competition with a Calcutta auction instead. I think the original source of this idea came from a news article similar to &lt;a href=&#34;https://www.post-gazette.com/sports/marchmadness/2006/03/13/Calcutta-auction-Brainy-twist-on-traditional-NCAA-pool/stories/200603130129&#34;&gt;this one&lt;/a&gt;. Of course, you can also make this (mostly) risk-free by using points to bid on teams instead of money&amp;hellip; our office typically has the biggest loser buy the winner lunch.&lt;/p&gt;

&lt;p&gt;Here’s how it works:&lt;/p&gt;

&lt;p&gt;Get everyone together over lunch for 1 &lt;sup&gt;1&lt;/sup&gt;&amp;frasl;&lt;sub&gt;2&lt;/sub&gt; to 2 hours on Monday to bid on the teams that they want to represent them for the tournament.  Each person will start with 500 or so points (for example) to &amp;ldquo;purchase&amp;rdquo; teams at the auction.  Everyone will earn 1 point back for each point their team scores in the NCAA tournament, and the owner of the NCAA tournament championship team will earn 100 bonus points.  The person with the most points after the final will be the winner and get all the glory. And possibly lunch&amp;hellip; :)&lt;/p&gt;
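
&lt;p&gt;For example, here&amp;rsquo;s how one participant&amp;rsquo;s final score might be tallied (all of the numbers below are made up for illustration):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical scoring for one participant
starting_points &amp;lt;- 500
spent_at_auction &amp;lt;- 480     # total of their winning bids
team_points &amp;lt;- c(312, 145)  # points their teams scored in the tournament
champion_bonus &amp;lt;- 100       # only if they own the champion

starting_points - spent_at_auction + sum(team_points) + champion_bonus
# [1] 577
&lt;/code&gt;&lt;/pre&gt;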

&lt;p&gt;I&amp;rsquo;ve tried to make this auction process a little easier using an R Shiny app, which you can install using the following commands:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;install.packages(&#39;devtools&#39;)
devtools::install_github(&#39;nielsenmarkus11/NCAAcalcutta&#39;)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Once you&amp;rsquo;ve installed the package you&amp;rsquo;ll need to import the latest bracket data (which will be available this upcoming Sunday) into R. Because that data isn&amp;rsquo;t available yet, I&amp;rsquo;ve included example data from 2018.  You&amp;rsquo;ll want to mimic the 2018 CSV file to ensure that the app works properly. Once you are ready, you can import the teams and start the Calcutta Shiny app.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Input the 2018 teams
teams &amp;lt;- import_teams(system.file(&amp;quot;extdata&amp;quot;, &amp;quot;ncaa-teams.csv&amp;quot;, package = &amp;quot;NCAAcalcutta&amp;quot;))
start_auction(teams, randomize=TRUE)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/img/NCAAcalcutta-example.png&#34; alt=&#34;Shiny App&#34; /&gt;&lt;/p&gt;

&lt;p&gt;There you go, you&amp;rsquo;re all ready to get started!&lt;/p&gt;

&lt;p&gt;Here are some general guidelines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When the timer runs out, the bidding is over&lt;/li&gt;
&lt;li&gt;Only allow bids in 5 point increments&lt;/li&gt;
&lt;li&gt;In the app, each team has a minimum bid; this rule can be loosened at the end if nobody has points left to spend.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is one additional rule I made up to keep things fair if someone realizes they&amp;rsquo;ve spent too much after the fact:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you bid more than you have, 10 points per overspent point will be deducted from your final score (e.g., overspending by 20 points costs you 200).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I hope you all have as much fun with this as our team does.  Comment below for any additional clarification. And happy bidding!&lt;/p&gt;

&lt;h3 id=&#34;faq&#34;&gt;FAQ&lt;/h3&gt;

&lt;p&gt;Q: How do I determine how many points each participant starts with?&lt;/p&gt;

&lt;p&gt;A: Typically it works well to (1) take the total points scored in last year&amp;rsquo;s tournament, (2) multiply that by 0.8, and (3) divide by the number of participants. You can probably round up to the nearest 25 points and you should be okay.&lt;/p&gt;
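
&lt;p&gt;As a quick sketch in R (the totals here are hypothetical):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical: ~9000 total points scored last year, 12 participants
total_points_last_year &amp;lt;- 9000
participants &amp;lt;- 12

raw &amp;lt;- 0.8 * total_points_last_year / participants
25 * ceiling(raw / 25)  # round up to the nearest 25 points
# [1] 600
&lt;/code&gt;&lt;/pre&gt;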
</description>
    </item>
    
    <item>
      <title>Exploring Models with lime</title>
      <link>https://nielsenmark.us/2018/11/09/exploring-models-with-lime/</link>
      <pubDate>Fri, 09 Nov 2018 00:00:00 +0000</pubDate>
      
      <guid>https://nielsenmark.us/2018/11/09/exploring-models-with-lime/</guid>
<description>&lt;p&gt;Recently at work I’ve been asked to help some clinicians understand why my risk model classifies specific patients as high risk. Just prior to this work I stumbled across &lt;code&gt;lime&lt;/code&gt;, the work of some data scientists at the University of Washington. &lt;a href=&#34;https://github.com/marcotcr/lime&#34;&gt;LIME&lt;/a&gt; stands for “Local Interpretable Model-Agnostic Explanations”. The idea is that I can answer those questions I’m getting from clinicians for a specific patient by locally fitting a linear (aka “interpretable”) model in the parameter space just around my data point. I decided to pursue &lt;code&gt;lime&lt;/code&gt; as a solution, and for the last few months I’ve been focusing on implementing this explainer for my risk model. Happily, I also discovered an &lt;a href=&#34;https://github.com/thomasp85/lime&#34;&gt;R package&lt;/a&gt; that implements this solution, which originated in Python.&lt;/p&gt;
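&lt;p&gt;To make the idea concrete, here is a toy sketch of the LIME concept itself (my own illustration, not the &lt;code&gt;lime&lt;/code&gt; package): perturb the data around a single point, weight the perturbations by their proximity to that point, and fit a weighted linear model to the black-box predictions. The two variables and the stand-in model &lt;code&gt;f()&lt;/code&gt; below are entirely made up.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Toy illustration of the LIME idea (not the lime package itself)
set.seed(1)
x0 &amp;lt;- c(Age = 60, RestBP = 140)  # a hypothetical patient

# Perturb the data in a neighborhood around x0
perturbed &amp;lt;- data.frame(Age    = rnorm(500, x0[&amp;quot;Age&amp;quot;], 5),
                        RestBP = rnorm(500, x0[&amp;quot;RestBP&amp;quot;], 10))

# f() stands in for the black-box model (a made-up probability surface)
f &amp;lt;- function(d) plogis(0.05 * (d$Age - 55) + 0.02 * (d$RestBP - 130))

# Weight each perturbation by its proximity to x0
w &amp;lt;- exp(-rowSums(scale(perturbed, center = x0, scale = c(5, 10))^2))

# Fit the local, interpretable model to the black-box predictions
y &amp;lt;- f(perturbed)
local_fit &amp;lt;- lm(y ~ Age + RestBP, data = perturbed, weights = w)
coef(local_fit)&lt;/code&gt;&lt;/pre&gt;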
&lt;div id=&#34;sample-data&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Sample Data&lt;/h3&gt;
&lt;p&gt;The first step for this blog was to find some public data for illustration. I remembered an example used in &lt;a href=&#34;http://www-bcf.usc.edu/~gareth/ISL/index.html&#34;&gt;An Introduction to Statistical Learning by James, Witten, Hastie and Tibshirani&lt;/a&gt;. I will use the &lt;code&gt;Heart.csv&lt;/code&gt; data, which can be downloaded using the link in the code below:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(readr)
library(ranger)
library(tidyverse)
library(lime)

dat &amp;lt;- read_csv(&amp;quot;http://www-bcf.usc.edu/~gareth/ISL/Heart.csv&amp;quot;)
dat$X1 &amp;lt;- NULL&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now let’s take a quick look at the data:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Hmisc::describe(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## dat 
## 
##  14  Variables      303  Observations
## ---------------------------------------------------------------------------
## Age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      303        0       41    0.999    54.44     10.3       40       42 
##      .25      .50      .75      .90      .95 
##       48       56       61       66       68 
## 
## lowest : 29 34 35 37 38, highest: 70 71 74 76 77
## ---------------------------------------------------------------------------
## Sex 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##      303        0        2    0.653      206   0.6799   0.4367 
## 
## ---------------------------------------------------------------------------
## ChestPain 
##        n  missing distinct 
##      303        0        4 
##                                                               
## Value      asymptomatic   nonanginal   nontypical      typical
## Frequency           144           86           50           23
## Proportion        0.475        0.284        0.165        0.076
## ---------------------------------------------------------------------------
## RestBP 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      303        0       50    0.995    131.7    19.41      108      110 
##      .25      .50      .75      .90      .95 
##      120      130      140      152      160 
## 
## lowest :  94 100 101 102 104, highest: 174 178 180 192 200
## ---------------------------------------------------------------------------
## Chol 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      303        0      152        1    246.7    55.91    175.1    188.8 
##      .25      .50      .75      .90      .95 
##    211.0    241.0    275.0    308.8    326.9 
## 
## lowest : 126 131 141 149 157, highest: 394 407 409 417 564
## ---------------------------------------------------------------------------
## Fbs 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##      303        0        2    0.379       45   0.1485   0.2538 
## 
## ---------------------------------------------------------------------------
## RestECG 
##        n  missing distinct     Info     Mean      Gmd 
##      303        0        3     0.76   0.9901    1.003 
##                             
## Value          0     1     2
## Frequency    151     4   148
## Proportion 0.498 0.013 0.488
## ---------------------------------------------------------------------------
## MaxHR 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      303        0       91        1    149.6    25.73    108.1    116.0 
##      .25      .50      .75      .90      .95 
##    133.5    153.0    166.0    176.6    181.9 
## 
## lowest :  71  88  90  95  96, highest: 190 192 194 195 202
## ---------------------------------------------------------------------------
## ExAng 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##      303        0        2     0.66       99   0.3267   0.4414 
## 
## ---------------------------------------------------------------------------
## Oldpeak 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      303        0       40    0.964     1.04    1.225      0.0      0.0 
##      .25      .50      .75      .90      .95 
##      0.0      0.8      1.6      2.8      3.4 
## 
## lowest : 0.0 0.1 0.2 0.3 0.4, highest: 4.0 4.2 4.4 5.6 6.2
## ---------------------------------------------------------------------------
## Slope 
##        n  missing distinct     Info     Mean      Gmd 
##      303        0        3    0.798    1.601   0.6291 
##                             
## Value          1     2     3
## Frequency    142   140    21
## Proportion 0.469 0.462 0.069
## ---------------------------------------------------------------------------
## Ca 
##        n  missing distinct     Info     Mean      Gmd 
##      299        4        4    0.783   0.6722   0.9249 
##                                   
## Value          0     1     2     3
## Frequency    176    65    38    20
## Proportion 0.589 0.217 0.127 0.067
## ---------------------------------------------------------------------------
## Thal 
##        n  missing distinct 
##      301        2        3 
##                                            
## Value           fixed     normal reversable
## Frequency          18        166        117
## Proportion      0.060      0.551      0.389
## ---------------------------------------------------------------------------
## AHD 
##        n  missing distinct 
##      303        0        2 
##                       
## Value         No   Yes
## Frequency    164   139
## Proportion 0.541 0.459
## ---------------------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Our target variable in this data is &lt;code&gt;AHD&lt;/code&gt;. This flag identifies whether or not a patient has &lt;a href=&#34;https://g.co/kgs/hT5ibs&#34;&gt;Coronary Artery Disease&lt;/a&gt;. If we can predict this accurately, clinicians could probably better treat these patients and hopefully help them avoid the symptoms of AHD, like chest pain or, worse, heart attacks.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-wrangling&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Data Wrangling&lt;/h3&gt;
&lt;p&gt;For a predictive model I’ve opted to use a random forest model using the &lt;code&gt;ranger&lt;/code&gt; implementation, which parallelizes the random forest algorithm in R. But first, some data cleaning is necessary. After replacing missing values, I’m going to split the data into training and test dataframes.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Replace missing values
dat$Ca[is.na(dat$Ca)] &amp;lt;- -1
dat$Thal[is.na(dat$Thal)] &amp;lt;- &amp;quot;missing&amp;quot;

## 75% of the sample size
smp_size &amp;lt;- floor(0.75 * nrow(dat))

## set the seed to make your partition reproducible
set.seed(123)
train_ind &amp;lt;- sample(seq_len(nrow(dat)), size = smp_size)

train &amp;lt;- dat[train_ind, ]
test &amp;lt;- dat[-train_ind, ]

mod &amp;lt;- ranger(AHD~., data=train, probability = TRUE, importance = &amp;quot;permutation&amp;quot;)

mod$prediction.error&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.1326235&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Our quick and dirty check of the OOB prediction error tells us that our model appears to be doing okay at predicting &lt;code&gt;AHD&lt;/code&gt;. Now the trick is to describe to our physicians and nurses why we believe someone is at high risk for &lt;code&gt;AHD&lt;/code&gt;. Before I learned of &lt;code&gt;lime&lt;/code&gt;, I would have probably done something similar to the code below by first looking at which variables were most important in my trees.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_importance &amp;lt;- function(mod){
  tmp &amp;lt;- mod$variable.importance
  dat &amp;lt;- data.frame(variable=names(tmp),importance=tmp)
  ggplot(dat, aes(x=reorder(variable,importance), y=importance))+ 
    geom_bar(stat=&amp;quot;identity&amp;quot;, position=&amp;quot;dodge&amp;quot;)+ coord_flip()+
    ylab(&amp;quot;Variable Importance&amp;quot;)+
    xlab(&amp;quot;&amp;quot;)
}

# Plot the variable importance
plot_importance(mod)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-11-09-exploring-models-with-lime_files/figure-html/plot-importance-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;After this, I probably would have taken a look at some partial dependence plots to get an idea of how the prediction changes over the range of each important variable. However, the weakness of this approach is that I need to hold all other variables constant, and if I truly believe there are interactions between my variables, the partial dependence plot could change dramatically when the other variables change.&lt;/p&gt;
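&lt;p&gt;For reference, a bare-bones version of that partial dependence calculation might look something like the sketch below (my own quick illustration, using &lt;code&gt;MaxHR&lt;/code&gt; as an example):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Quick partial dependence sketch for MaxHR
grid &amp;lt;- seq(min(train$MaxHR), max(train$MaxHR), length.out = 25)
pdp &amp;lt;- sapply(grid, function(v) {
  tmp &amp;lt;- train
  tmp$MaxHR &amp;lt;- v  # set MaxHR to the grid value for every row
  mean(predict(mod, data = tmp)$predictions[, &amp;quot;Yes&amp;quot;])
})
qplot(grid, pdp, geom = &amp;quot;line&amp;quot;,
      xlab = &amp;quot;MaxHR&amp;quot;, ylab = &amp;quot;Mean predicted P(AHD)&amp;quot;)&lt;/code&gt;&lt;/pre&gt;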
&lt;/div&gt;
&lt;div id=&#34;explain-the-model-with-lime&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Explain the model with LIME&lt;/h3&gt;
&lt;p&gt;Enter &lt;code&gt;lime&lt;/code&gt;. As discussed above, the entire purpose of &lt;code&gt;lime&lt;/code&gt; is to provide a local interpretable model to help us understand how our prediction would change if we tweaked the other variables slightly in a lot of permutations. The first step to using &lt;code&gt;lime&lt;/code&gt; in this specific case is to add some functions so that the &lt;code&gt;lime&lt;/code&gt; package knows how to deal with the output of the &lt;code&gt;ranger&lt;/code&gt; package. Once I have these I can use the combination of the &lt;code&gt;lime()&lt;/code&gt; and &lt;code&gt;explain()&lt;/code&gt; functions to get what I need. As in all multivariate linear models, we still have an issue… correlated explanatory variables. And depending on the number of variables in our original model, we may need to pare down our models to only look at the most “influential” or “important” variables. By default lime is going to use either forward selection or pick the variables with the largest coefficients after correcting for multicollinearity using ridge regression or L2 penalization. As seen below, you can also select variables for the explanation using Lasso (aka L1 penalization) or use &lt;code&gt;xgboost&lt;/code&gt;’s most important variables using the &lt;code&gt;&amp;quot;tree&amp;quot;&lt;/code&gt; method.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Train LIME Explainer
expln &amp;lt;- lime(train, model = mod)


preds &amp;lt;- predict(mod,train,type = &amp;quot;response&amp;quot;)
# Add ranger to LIME
predict_model.ranger &amp;lt;- function(x, newdata, type, ...) {
  res &amp;lt;- predict(x, data = newdata, ...)
  switch(
    type,
    raw = data.frame(Response = ifelse(res$predictions[,&amp;quot;Yes&amp;quot;] &amp;gt;= 0.5,&amp;quot;Yes&amp;quot;,&amp;quot;No&amp;quot;), stringsAsFactors = FALSE),
    prob = as.data.frame(res$predictions[,&amp;quot;Yes&amp;quot;], check.names = FALSE)
  )
}

model_type.ranger &amp;lt;- function(x, ...) &amp;#39;classification&amp;#39;


reasons.forward &amp;lt;- explain(x=test[,names(test)!=&amp;quot;AHD&amp;quot;], explainer=expln, n_labels = 1, n_features = 4)
reasons.ridge &amp;lt;- explain(x=test[,names(test)!=&amp;quot;AHD&amp;quot;], explainer=expln, n_labels = 1, n_features = 4, feature_select = &amp;quot;highest_weights&amp;quot;)
reasons.lasso &amp;lt;- explain(x=test[,names(test)!=&amp;quot;AHD&amp;quot;], explainer=expln, n_labels = 1, n_features = 4, feature_select = &amp;quot;lasso_path&amp;quot;)
reasons.tree &amp;lt;- explain(x=test[,names(test)!=&amp;quot;AHD&amp;quot;], explainer=expln, n_labels = 1, n_features = 4, feature_select = &amp;quot;tree&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note: Using the current version of &lt;code&gt;lime&lt;/code&gt; you may have issues with the &lt;code&gt;feature_select = &amp;quot;lasso_path&amp;quot;&lt;/code&gt; option. To get the above code to run, you can install my tweaked version of &lt;code&gt;lime&lt;/code&gt; from &lt;a href=&#34;https://github.com/nielsenmarkus11/lime&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
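&lt;p&gt;If you hit that issue, the fork can be installed with &lt;code&gt;devtools&lt;/code&gt; like any other GitHub package:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Install the tweaked fork of lime
devtools::install_github(&amp;#39;nielsenmarkus11/lime&amp;#39;)&lt;/code&gt;&lt;/pre&gt;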
&lt;/div&gt;
&lt;div id=&#34;plotting-explanations&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Plotting explanations&lt;/h3&gt;
&lt;p&gt;Now that we have all the explanations, one of my favorite features in the &lt;code&gt;lime&lt;/code&gt; package is the &lt;code&gt;plot_explanations()&lt;/code&gt; function. You can easily show the most important variables for each of our selection methods above, and we can see that they are all very consistent in their choice of the top 4 most influential variables in predicting &lt;code&gt;AHD&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_explanations(reasons.forward)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-11-09-exploring-models-with-lime_files/figure-html/unnamed-chunk-1-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_explanations(reasons.ridge)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-11-09-exploring-models-with-lime_files/figure-html/unnamed-chunk-1-2.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_explanations(reasons.lasso)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-11-09-exploring-models-with-lime_files/figure-html/unnamed-chunk-1-3.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_explanations(reasons.tree)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-11-09-exploring-models-with-lime_files/figure-html/unnamed-chunk-1-4.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Thanks for reading this quick tutorial on &lt;code&gt;lime&lt;/code&gt;. There is much more of this package that I want to explore, particularly its use for image and text classification. Then the only real question left is… How do I get one of those cool hex stickers for &lt;code&gt;lime&lt;/code&gt;? ;)&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Connecting R to PostgreSQL on Linux</title>
      <link>https://nielsenmark.us/2018/07/07/connecting-r-to-postgresql-on-linux/</link>
      <pubDate>Sat, 07 Jul 2018 00:00:00 +0000</pubDate>
      
      <guid>https://nielsenmark.us/2018/07/07/connecting-r-to-postgresql-on-linux/</guid>
<description>&lt;p&gt;Connecting to databases is a critical piece of data analysis in R. In most analytic roles the data we consume is going to be found in databases. Some of the most common are SQL databases like MS SQL Server, PostgreSQL, and Oracle, among many others. In this how-to blog, I’ll walk you through the major steps of configuring your machine and R to be able to connect to a PostgreSQL database from R on Ubuntu using the &lt;code&gt;RPostgreSQL&lt;/code&gt;, &lt;code&gt;odbc&lt;/code&gt;, and &lt;code&gt;RJDBC&lt;/code&gt; packages in R. Similar steps can be followed to set up connections to other databases; however, driver installation and configuration will likely be slightly different.&lt;/p&gt;
&lt;div id=&#34;rpostgresql-package-setup&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;1 - RPostgreSQL Package Setup&lt;/h3&gt;
&lt;p&gt;The first step in setting up a connection to a PostgreSQL database is to first download the PostgreSQL header files and static library, &lt;code&gt;libpq-dev&lt;/code&gt;. In order to do this on Ubuntu open the terminal and install it using the following command:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;sudo apt-get install libpq-dev&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once the &lt;code&gt;libpq-dev&lt;/code&gt; package is installed, the next step is to install the &lt;code&gt;RPostgreSQL&lt;/code&gt; package in R. If you need to authenticate, I highly recommend the &lt;code&gt;getPass&lt;/code&gt; package, which will prompt you for your password. RStudio also has a &lt;code&gt;.rs.askForPassword()&lt;/code&gt; function that works similarly to the &lt;code&gt;getPass()&lt;/code&gt; function, but it relies on using RStudio. I’ve confirmed that &lt;code&gt;getPass&lt;/code&gt; works in bash, emacs, RStudio, and when knitting your Rmd files, so however you submit your R code, it will work the same.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Install the package in R
install.packages(&amp;quot;RPostgreSQL&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(RPostgreSQL)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Loading required package: DBI&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(getPass)
pgdrv &amp;lt;- dbDriver(drvName = &amp;quot;PostgreSQL&amp;quot;)

db &amp;lt;-DBI::dbConnect(pgdrv,
                    dbname=&amp;quot;postgres&amp;quot;,
                    host=&amp;quot;localhost&amp;quot;, port=5432,
                    user = &amp;#39;postgres&amp;#39;,
                    password = getPass(&amp;quot;Enter Password:&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Please enter password in TK window (Alt+Tab)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Write to database
DBI::dbWriteTable(db, &amp;quot;mtcars&amp;quot;, mtcars)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] TRUE&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;DBI::dbDisconnect(db)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] TRUE&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Perfect! Your database connection should work simply by passing the proper arguments to your &lt;code&gt;dbConnect()&lt;/code&gt; function. You may need to tweak the host, port, and user based on your PostgreSQL server setup.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;odbc-package-setup&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;2 - odbc Package Setup&lt;/h3&gt;
&lt;p&gt;In case you are a fan of &lt;code&gt;odbc&lt;/code&gt;, the next section will walk you through the steps of creating your database connection via &lt;code&gt;odbc&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;In the past I have used the &lt;code&gt;RODBC&lt;/code&gt; package, but recently I have found that the &lt;code&gt;odbc&lt;/code&gt; package plays much nicer with other database tools like &lt;code&gt;DBI&lt;/code&gt; and &lt;code&gt;dbplyr&lt;/code&gt;. Plus it has very similar syntax to the &lt;code&gt;RJDBC&lt;/code&gt; package, and for consistency’s sake I’ve made the switch.&lt;/p&gt;
&lt;p&gt;Once again, the first step is to install the necessary Debian packages. In this case we need to install the &lt;code&gt;unixodbc&lt;/code&gt; and &lt;code&gt;unixodbc-dev&lt;/code&gt; packages and the &lt;code&gt;odbc-postgresql&lt;/code&gt; driver.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;# Install the unixODBC library
apt-get install unixodbc unixodbc-dev

# PostgreSQL ODBC Drivers
apt-get install odbc-postgresql&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;set-up-connection-with-connection-string&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Set up connection with connection string&lt;/h4&gt;
&lt;p&gt;Okay we are now ready to connect via odbc. Note the slight difference in the names of the arguments of the &lt;code&gt;dbConnect()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;db &amp;lt;- DBI::dbConnect(odbc::odbc(),
                     Driver = &amp;quot;PostgreSQL Unicode&amp;quot;,
                     Database = &amp;quot;postgres&amp;quot;,
                     UserName = &amp;quot;postgres&amp;quot;,
                     Password = getPass(&amp;quot;Enter Password:&amp;quot;),
                     Servername = &amp;quot;localhost&amp;quot;,
                     Port = 5432)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Please enter password in TK window (Alt+Tab)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;set-up-connection-with-dsn&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Set up connection with DSN&lt;/h4&gt;
&lt;p&gt;If you don’t want to have to worry about defining each of these arguments each time you connect to PostgreSQL via odbc, you can define the configuration in your &lt;code&gt;odbcinst.ini&lt;/code&gt; file. The following steps walk you through the process:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;Make sure the &lt;code&gt;/etc/odbcinst.ini&lt;/code&gt; has the drivers set up. This should have been configured automatically when installing &lt;code&gt;odbc-postgresql&lt;/code&gt; with &lt;code&gt;apt-get&lt;/code&gt;. This is what it would look like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[PostgreSQL Unicode]
Driver = psqlodbca.so
Setup = libodbcpsqlS.so
Debug = 0
CommLog = 1
UsageCount = 1&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Now define your DSN by modifying the odbc.ini file:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[PostgreSQL]
Driver = PostgreSQL Unicode
Database = postgres
Servername = localhost
UserName = postgres
Password = postgres&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Connect to your database by referencing your DSN name specified in the square brackets of the &lt;code&gt;odbc.ini&lt;/code&gt; file:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Connect to the database
db &amp;lt;- dbConnect(odbc::odbc(), &amp;quot;PostgreSQL&amp;quot;)

# Pull the Data into an R dataframe
DBI::dbGetQuery(db,&amp;quot;SELECT * FROM MTCARS&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##              row.names  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1            Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## 2        Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## 3           Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## 4       Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## 5    Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## 6              Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## 7           Duster 360 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## 8            Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## 9             Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## 10            Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## 11           Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## 12          Merc 450SE 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## 13          Merc 450SL 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## 14         Merc 450SLC 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## 15  Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## 16 Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## 17   Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## 18            Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## 19         Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## 20      Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## 21       Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## 22    Dodge Challenger 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## 23         AMC Javelin 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## 24          Camaro Z28 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## 25    Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## 26           Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## 27       Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## 28        Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## 29      Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## 30        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## 31       Maserati Bora 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## 32          Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Close the Connection
DBI::dbDisconnect(db)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now you are ready to begin your analysis with your data!&lt;/p&gt;
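&lt;p&gt;For example, since we loaded &lt;code&gt;mtcars&lt;/code&gt; earlier, a quick &lt;code&gt;dbplyr&lt;/code&gt; sketch against the DSN connection might look like this:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dplyr)

db &amp;lt;- DBI::dbConnect(odbc::odbc(), &amp;quot;PostgreSQL&amp;quot;)

# Summarise in the database, then pull the result into R
tbl(db, &amp;quot;mtcars&amp;quot;) %&amp;gt;%
  group_by(cyl) %&amp;gt;%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %&amp;gt;%
  collect()

DBI::dbDisconnect(db)&lt;/code&gt;&lt;/pre&gt;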
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;rjdbc-package-setup&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;3 - RJDBC Package Setup&lt;/h3&gt;
&lt;p&gt;Finally, the last way to connect to the PostgreSQL database is via the &lt;code&gt;RJDBC&lt;/code&gt; package. The first step in this configuration is to download the JDBC jar file from &lt;a href=&#34;https://jdbc.postgresql.org/download.html&#34;&gt;here&lt;/a&gt;. I’ve put this in my home directory, &lt;code&gt;~&lt;/code&gt;, and will reference this file in the &lt;code&gt;JDBC()&lt;/code&gt; function below. Once you have the jar file you can install the &lt;code&gt;RJDBC&lt;/code&gt; package in R.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;install.packages(&amp;#39;RJDBC&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now you are ready to connect. Once again, notice the slight tweaks to the arguments of the &lt;code&gt;dbConnect()&lt;/code&gt; function. Because I’m defining the &lt;code&gt;url&lt;/code&gt; argument with the host, port and database name, there is no need for these additional arguments.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(RJDBC)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Loading required package: rJava&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;db &amp;lt;- DBI::dbConnect(RJDBC::JDBC(&amp;quot;org.postgresql.Driver&amp;quot;,&amp;quot;~/postgresql-42.2.2.jar&amp;quot;),
               url = &amp;quot;jdbc:postgresql://localhost:5432/postgres&amp;quot;,
               user = &amp;quot;postgres&amp;quot;,
               password = getPass(&amp;quot;Enter Password:&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Please enter password in TK window (Alt+Tab)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Pull the Data into an R dataframe
DBI::dbGetQuery(db,&amp;quot;SELECT * FROM MTCARS&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##              row.names  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1            Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## 2        Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## 3           Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## 4       Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## 5    Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## 6              Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## 7           Duster 360 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## 8            Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## 9             Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## 10            Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## 11           Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## 12          Merc 450SE 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## 13          Merc 450SL 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## 14         Merc 450SLC 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## 15  Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## 16 Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## 17   Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## 18            Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## 19         Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## 20      Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## 21       Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## 22    Dodge Challenger 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## 23         AMC Javelin 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## 24          Camaro Z28 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## 25    Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## 26           Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## 27       Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## 28        Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## 29      Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## 30        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## 31       Maserati Bora 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## 32          Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Close the Connection
DBI::dbDisconnect(db)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] TRUE&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Alright! We’ve walked through several different configurations in connecting to a PostgreSQL database on Ubuntu. You’ll only need one of these setups, but I think it’s nice to understand each of your options so you can create the best setup that works for you and/or your organization.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Setting up an ODBC connection with MS SQL Server on Windows</title>
      <link>https://nielsenmark.us/2018/06/01/odbc-ms-sql-server-windows/</link>
      <pubDate>Fri, 01 Jun 2018 00:00:00 +0000</pubDate>
      
      <guid>https://nielsenmark.us/2018/06/01/odbc-ms-sql-server-windows/</guid>
<description>&lt;p&gt;Connecting to databases is a critical piece of data analysis in R. In most analytic roles the data we consume is going to be found in databases. Some of the most common are SQL databases like MS SQL Server, PostgreSQL, and Oracle, among many others. In this how-to blog, I’ll walk you through the major steps of configuring your machine and R to be able to connect to an MS SQL Server database from R on Windows. Similar steps can be followed to set up connections to other databases; however, driver installation and configuration will likely be slightly different.&lt;/p&gt;
&lt;div id=&#34;downloading-and-installing-the-drivers&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Downloading and Installing the Drivers&lt;/h3&gt;
&lt;p&gt;The first step is to download the necessary ODBC drivers for your database. Because most Windows installations come with the MS SQL Server drivers installed, we’ll breeze over this step. If you don’t have them installed you can follow the directions &lt;a href=&#34;https://docs.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server?view=sql-server-2017&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;setting-up-a-dsn-for-your-odbc-connection&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Setting up a DSN for your ODBC Connection&lt;/h3&gt;
&lt;p&gt;This step is not necessary, but I have found that configuring a DSN (aka “Data Source Name”) can simplify your code configuration in R.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;STEP 1:&lt;/strong&gt; Search “ODBC” in the Start Menu search and open “ODBC Data Source Administrator (64-bit)”.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt; Select “Add” under the “User DSN” tab.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/img/ODBC%201.PNG&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 3:&lt;/strong&gt; Select the corresponding ODBC driver for which you wish to set up a data source and Click “Finish”.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/img/ODBC%202.PNG&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 4:&lt;/strong&gt; Give your DSN a “Name” and “Server” name/IP address and click “Next”.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/img/ODBC%203.PNG&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 5:&lt;/strong&gt; Define your default database and click “Next”.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/img/ODBC%204b.PNG&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 6:&lt;/strong&gt; Click “Next” through any remaining windows, then click “Finish”. A window should pop up to test the connection. Double check your options then click “Test Data Source”.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/img/ODBC%204.PNG&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 7:&lt;/strong&gt; If it was successful it should give you the following message. Click “OK”.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/img/ODBC%205.PNG&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 8:&lt;/strong&gt; Finally you should see your newly defined DSN listed under the “User DSN” tab. Click “OK” to exit the ODBC DSN configuration tool.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/img/ODBC%206.PNG&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;install-the-odbc-package-in-r&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Install the &lt;code&gt;odbc&lt;/code&gt; Package in R&lt;/h3&gt;
&lt;p&gt;In the past I have used the &lt;code&gt;RODBC&lt;/code&gt; package, but recently I have found that the &lt;code&gt;odbc&lt;/code&gt; package plays much nicer with other database tools like &lt;code&gt;DBI&lt;/code&gt; and &lt;code&gt;dbplyr&lt;/code&gt;. Plus it has very similar syntax to the &lt;code&gt;RJDBC&lt;/code&gt; package, and for consistency’s sake I’ve made the switch.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;install.packages(&amp;#39;odbc&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;connecting-to-the-database-from-r&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Connecting to the Database from R&lt;/h3&gt;
&lt;p&gt;Alright, we are ready to make our connection… drum-roll please. To start, let’s make our connection using the DSN configuration we set up earlier.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(odbc)
library(dplyr)
library(dbplyr)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Connect using the DSN
db &amp;lt;- DBI::dbConnect(odbc::odbc(), &amp;quot;SQL&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That was easy! Now we’re ready to roll with our data. If you opted out of creating a DSN, the code below is what you would use to connect. There are a lot more keystrokes, but the bonus is that no additional setup is needed outside of R, which can be handy when you are sharing your code with coworkers who want to connect to the database too.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Connect without a DSN
db &amp;lt;- DBI::dbConnect(odbc::odbc(),
                     Driver = &amp;#39;ODBC Driver 13 for SQL Server&amp;#39;,
                     Server = &amp;#39;localhost\\SQLEXPRESS&amp;#39;,
                     Database = &amp;quot;master&amp;quot;,
                     trusted_connection = &amp;#39;yes&amp;#39;,
                     Port = 1433
                     )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Okay, now that we are connected we are ready to get started on our analysis. We can read/write data to the database using the following commands:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Write iris data to MS SQL Server
# DBI::dbWriteTable(db,&amp;quot;iris&amp;quot;,iris)

# Read data from MS SQL Server
my.iris &amp;lt;- DBI::dbGetQuery(db,&amp;quot;SELECT * FROM IRIS&amp;quot;)
head(my.iris)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, use the &lt;code&gt;dbplyr&lt;/code&gt; package to extend the &lt;code&gt;dplyr&lt;/code&gt; functions to our database connection.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;smry &amp;lt;- tbl(db,&amp;quot;iris&amp;quot;) %&amp;gt;% collect
head(smry)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 6 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          &amp;lt;dbl&amp;gt;       &amp;lt;dbl&amp;gt;        &amp;lt;dbl&amp;gt;       &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## 6          5.4         3.9          1.7         0.4 setosa&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Don&amp;#39;t forget to disconnect
dbDisconnect(db)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Forecasting PM2.5 with forecast and prophet</title>
      <link>https://nielsenmark.us/2018/02/21/forecasting-pm2-5-with-forecast-and-prophet/</link>
      <pubDate>Wed, 21 Feb 2018 00:00:00 +0000</pubDate>
      
      <guid>https://nielsenmark.us/2018/02/21/forecasting-pm2-5-with-forecast-and-prophet/</guid>
      <description>&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/htmlwidgets/htmlwidgets.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/jquery/jquery.min.js&#34;&gt;&lt;/script&gt;
&lt;link href=&#34;https://nielsenmark.us/rmarkdown-libs/leaflet/leaflet.css&#34; rel=&#34;stylesheet&#34; /&gt;
&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/leaflet/leaflet.js&#34;&gt;&lt;/script&gt;
&lt;link href=&#34;https://nielsenmark.us/rmarkdown-libs/leafletfix/leafletfix.css&#34; rel=&#34;stylesheet&#34; /&gt;
&lt;link href=&#34;https://nielsenmark.us/rmarkdown-libs/leaflet-label/leaflet.label.css&#34; rel=&#34;stylesheet&#34; /&gt;
&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/leaflet-label/leaflet.label.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/Proj4Leaflet/proj4-compressed.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/Proj4Leaflet/proj4leaflet.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/leaflet-binding/leaflet.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/leaflet-providers/leaflet-providers.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/leaflet-providers-plugin/leaflet-providers-plugin.js&#34;&gt;&lt;/script&gt;
&lt;link href=&#34;https://nielsenmark.us/rmarkdown-libs/leaflet-awesomemarkers/leaflet.awesome-markers.css&#34; rel=&#34;stylesheet&#34; /&gt;
&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/leaflet-awesomemarkers/leaflet.awesome-markers.min.js&#34;&gt;&lt;/script&gt;
&lt;link href=&#34;https://nielsenmark.us/rmarkdown-libs/bootstrap/bootstrap.min.css&#34; rel=&#34;stylesheet&#34; /&gt;
&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/bootstrap/bootstrap.min.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;Time series: the course I often wish I had taken while completing my coursework in school. I finally got an excuse to do a comparative dive into the different time series models in the &lt;code&gt;forecast&lt;/code&gt; package in R, thanks to an invitation to present at a recent Practical Data Science Meetup in Salt Lake City.&lt;/p&gt;
&lt;p&gt;In the following exercises, I’ll be comparing OLS and Random Forest Regression to the time series models available in the &lt;code&gt;forecast&lt;/code&gt; package. In addition to this I’ll be taking a look at the fairly new &lt;code&gt;prophet&lt;/code&gt; package released by Facebook for R. Alright, let’s load some packages to get started.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(gridExtra)
library(lubridate)
library(leaflet)
library(randomForest)
library(forecast)
library(prophet)

load(&amp;quot;../../../time-series/data/ts-dat.Rdat&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;data-collection&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Data Collection&lt;/h3&gt;
&lt;p&gt;The pollution data I’ll be using for these examples comes from epa.gov and the weather data comes from ncdc.noaa.gov. You can access my R data object on &lt;a href=&#34;https://github.com/nielsenmarkus11/time-series&#34;&gt;my github&lt;/a&gt; page. For many years Salt Lake City has experienced population growth, which has exacerbated the inversion problem. Inversion creates a “cap” over Utah valleys, trapping pollutants on the valley floors, which creates many public health issues because of the thick smog.&lt;/p&gt;
&lt;p&gt;Below is a map indicating 4 sites where data is being collected on pollution levels. I will be focusing particularly on PM2.5 measures across the Salt Lake Valley. I’ve also downloaded weather data from both the valley floor at SLC International Airport and a meadow near Grand View peak in the Wasatch mountains. These two sites’ temperatures can be used to determine whether the temperatures are inverted.&lt;/p&gt;
&lt;div id=&#34;htmlwidget-1&#34; style=&#34;width:672px;height:480px;&#34; class=&#34;leaflet html-widget&#34;&gt;&lt;/div&gt;
&lt;script type=&#34;application/json&#34; data-for=&#34;htmlwidget-1&#34;&gt;{&#34;x&#34;:{&#34;options&#34;:{&#34;crs&#34;:{&#34;crsClass&#34;:&#34;L.CRS.EPSG3857&#34;,&#34;code&#34;:null,&#34;proj4def&#34;:null,&#34;projectedBounds&#34;:null,&#34;options&#34;:{}}},&#34;calls&#34;:[{&#34;method&#34;:&#34;addProviderTiles&#34;,&#34;args&#34;:[&#34;Esri.NatGeoWorldMap&#34;,null,null,{&#34;errorTileUrl&#34;:&#34;&#34;,&#34;noWrap&#34;:false,&#34;zIndex&#34;:null,&#34;unloadInvisibleTiles&#34;:null,&#34;updateWhenIdle&#34;:null,&#34;detectRetina&#34;:false,&#34;reuseTiles&#34;:false}]},{&#34;method&#34;:&#34;addAwesomeMarkers&#34;,&#34;args&#34;:[[40.83,40.7781],[-111.76,-111.9694],{&#34;icon&#34;:&#34;tint&#34;,&#34;markerColor&#34;:&#34;white&#34;,&#34;iconColor&#34;:&#34;blue&#34;,&#34;spin&#34;:false,&#34;squareMarker&#34;:false,&#34;iconRotate&#34;:0,&#34;font&#34;:&#34;monospace&#34;,&#34;prefix&#34;:&#34;glyphicon&#34;},null,null,{&#34;clickable&#34;:true,&#34;draggable&#34;:false,&#34;keyboard&#34;:true,&#34;title&#34;:&#34;&#34;,&#34;alt&#34;:&#34;&#34;,&#34;zIndexOffset&#34;:0,&#34;opacity&#34;:1,&#34;riseOnHover&#34;:false,&#34;riseOffset&#34;:250},null,null,null,null,null,null,null]},{&#34;method&#34;:&#34;addAwesomeMarkers&#34;,&#34;args&#34;:[[40.708611,40.736389,40.78422],[-112.094722,-111.872222,-111.931],{&#34;icon&#34;:&#34;cloud&#34;,&#34;markerColor&#34;:&#34;black&#34;,&#34;iconColor&#34;:&#34;gray&#34;,&#34;spin&#34;:false,&#34;squareMarker&#34;:false,&#34;iconRotate&#34;:0,&#34;font&#34;:&#34;monospace&#34;,&#34;prefix&#34;:&#34;glyphicon&#34;},null,null,{&#34;clickable&#34;:true,&#34;draggable&#34;:false,&#34;keyboard&#34;:true,&#34;title&#34;:&#34;&#34;,&#34;alt&#34;:&#34;&#34;,&#34;zIndexOffset&#34;:0,&#34;opacity&#34;:1,&#34;riseOnHover&#34;:false,&#34;riseOffset&#34;:250},null,null,null,null,[&#34;490351001&#34;,&#34;490353006&#34;,&#34;490353010&#34;],null,null]}],&#34;setView&#34;:[[40.7,-112],10,[]],&#34;limits&#34;:{&#34;lat&#34;:[40.708611,40.83],&#34;lng&#34;:[-112.094722,-111.76]}},&#34;evals&#34;:[],&#34;jsHooks&#34;:[]}&lt;/script&gt;
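&lt;p&gt;As a rough sketch, the &lt;code&gt;inversion&lt;/code&gt; flag used in the models below could be derived by comparing the two temperature series (the column names here are illustrative, not the actual names in my data object):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical column names: flag days when the valley floor (airport)
# is colder than the mountain meadow site
dat$inversion &amp;lt;- as.numeric(dat$temp_airport &amp;lt; dat$temp_mountain)&lt;/code&gt;&lt;/pre&gt;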
&lt;/div&gt;
&lt;div id=&#34;ols-regression&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;OLS Regression&lt;/h3&gt;
&lt;p&gt;First, let’s take a look at how well our weather regressors are at predicting PM2.5 levels without considering autocorrelation or seasonality. Below, we will fit our model and look at our residuals to make sure our assumptions of normality and independence are met:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit1 &amp;lt;- lm(sqrt(pm2.5)~inversion+wind+precip+fireworks,data=dat)
summary(fit1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Call:
## lm(formula = sqrt(pm2.5) ~ inversion + wind + precip + fireworks, 
##     data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1571 -0.5555 -0.1835  0.3608  4.4629 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(&amp;gt;|t|)    
## (Intercept)  3.322431   0.066358  50.068  &amp;lt; 2e-16 ***
## inversion    2.527237   0.130122  19.422  &amp;lt; 2e-16 ***
## wind        -0.040543   0.003255 -12.454  &amp;lt; 2e-16 ***
## precip      -0.515741   0.175563  -2.938  0.00336 ** 
## fireworks    0.545624   0.116089   4.700 2.85e-06 ***
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1
## 
## Residual standard error: 0.8791 on 1456 degrees of freedom
## Multiple R-squared:  0.3165, Adjusted R-squared:  0.3146 
## F-statistic: 168.5 on 4 and 1456 DF,  p-value: &amp;lt; 2.2e-16&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat$resid[!is.na(dat$pm2.5)] &amp;lt;- resid(fit1)

# Plot the residuals
ggplot(dat,aes(date,resid)) + 
  geom_point() + geom_smooth() +
  ggtitle(&amp;quot;Linear Regression Residuals&amp;quot;,
          subtitle = paste0(&amp;quot;RMSE: &amp;quot;,round(sqrt(mean(dat$resid^2,na.rm=TRUE)),2)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-02-19-forecasting-pm2-5-with-forecast-and-prophet_files/figure-html/reg-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Okay, so when we review the model we see that the variables are somewhat useful in predicting PM2.5 levels; however, our R-squared values are not that impressive. Also, looking at our residuals, we can see that there is still something going on that we haven’t accounted for. There appears to be a yearly pattern in the residuals. As for investigating dependence between the PM2.5 data points, let’s use the autocorrelation function, &lt;code&gt;Acf()&lt;/code&gt;, available in the &lt;code&gt;forecast&lt;/code&gt; package:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Acf(dat$resid, main=&amp;quot;ACF of OLS Residuals&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-02-19-forecasting-pm2-5-with-forecast-and-prophet_files/figure-html/acf-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Here we can see that the data is correlated up through 20 or more days in the past. This definitely violates our assumption of independence.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;random-forest-regression&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Random Forest Regression&lt;/h3&gt;
&lt;p&gt;Random Forest models don’t have as many assumptions as OLS Regression, so let’s try this model to see if we can do any better. Initially I’ll be using the training Root Mean Squared Error (RMSE) to compare models. However, later I will use time series cross-validation RMSE to compare each method’s ability to predict future PM2.5 levels.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit2 &amp;lt;- randomForest(sqrt(pm2.5)~inversion+wind+precip+fireworks,data=dat[!is.na(dat$pm2.5),], ntree=500)
dat$rf.resid[!is.na(dat$pm2.5)] &amp;lt;- fit2$predicted - sqrt(dat$pm2.5[!is.na(dat$pm2.5)])

# Plot the residuals
ggplot(dat,aes(date,rf.resid)) + 
  geom_point() + geom_smooth() +
  ggtitle(&amp;quot;Random Forest Residuals&amp;quot;,
          subtitle = paste0(&amp;quot;RMSE: &amp;quot;,round(sqrt(fit2$mse[500]),2)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-02-19-forecasting-pm2-5-with-forecast-and-prophet_files/figure-html/rf-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Better but we still have some odd things going on in our data&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once again, after looking at the residuals it looks like something is still going on here: there appears to be a seasonal trend in our residuals. Let’s zoom in on the residual plots over time and take a look:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Zoom In
p1 &amp;lt;- ggplot(dat,aes(date,rf.resid)) + 
  geom_point() + geom_line() +
  xlim(as.Date(c(&amp;quot;2014-01-01&amp;quot;,&amp;quot;2014-02-28&amp;quot;))) + 
  geom_abline(slope=0, intercept = 0, lty=2, col = &amp;quot;blue&amp;quot;, lwd = 1.25)

p2 &amp;lt;- ggplot(dat,aes(date,rf.resid)) + 
  geom_point() + geom_line() +
  xlim(as.Date(c(&amp;quot;2017-11-01&amp;quot;,&amp;quot;2017-12-31&amp;quot;))) + 
  geom_abline(slope=0, intercept = 0, lty=2, col = &amp;quot;blue&amp;quot;, lwd = 1.25)


grid.arrange(p1, p2, ncol=2, top=&amp;quot;Zoom-in of Random Forest Residuals&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-02-19-forecasting-pm2-5-with-forecast-and-prophet_files/figure-html/rf-zoom-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;If you look closely it appears that the residuals are all negative for a time and then move to being all positive. From this we see that we still haven’t adjusted our model for the autocorrelation. To do this we’ll need to take a look at some time series models.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;exponential-smoothing&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Exponential Smoothing&lt;/h3&gt;
&lt;p&gt;Okay, let’s get started with one of the simpler time series models, exponential smoothing. This is done by first converting our target column to a time series object using the &lt;code&gt;ts()&lt;/code&gt; function. The &lt;code&gt;ts()&lt;/code&gt; function also allows us to include a seasonal component in our data. We’ll start by setting &lt;code&gt;frequency = 7&lt;/code&gt; to include weekly seasonality in our daily PM2.5 measures. In this exercise, I will be fitting 3 different models. The default &lt;code&gt;model&lt;/code&gt; argument is set to &lt;code&gt;&#39;ZZZ&#39;&lt;/code&gt;, which will choose additive (&lt;code&gt;&#39;A&#39;&lt;/code&gt;), multiplicative (&lt;code&gt;&#39;M&#39;&lt;/code&gt;), or none (&lt;code&gt;&#39;N&#39;&lt;/code&gt;) for each of the error, trend, and seasonality components. Our automated model has chosen &lt;code&gt;&#39;MAN&#39;&lt;/code&gt;. Notice that this essentially removes the weekly seasonality, which can be seen in the forecast below. I also fit all-additive and all-multiplicative models for comparison.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Convert to time series data
dat.ts &amp;lt;- sqrt(ts(dat[,&amp;quot;pm2.5&amp;quot;], frequency = 7))

# Exponential smoothing model with weekly seasonality
fit3 &amp;lt;- ets(dat.ts) # model = &amp;quot;MAN&amp;quot;
# Fit models with all additive or all multiplicative components.
# The first letter is for errors, the second for trend, and the third for seasonality
fit4a &amp;lt;- ets(dat.ts, model = &amp;quot;AAA&amp;quot;)
fit4b &amp;lt;- ets(dat.ts, model = &amp;quot;MMM&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
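&lt;p&gt;If you’re curious which configuration the automated fit actually selected, the fitted &lt;code&gt;ets&lt;/code&gt; object stores it. A quick check (a small aside of mine, not part of the original output):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Inspect the automatically selected configuration
fit3$method   # e.g. &amp;quot;ETS(M,A,N)&amp;quot;
summary(fit3) # smoothing parameters and fit statistics&lt;/code&gt;&lt;/pre&gt;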
&lt;p&gt;Notice that, as with linear models, the &lt;code&gt;predict()&lt;/code&gt; function is available, and it can also forecast future values from previous values when you add an argument for the horizon, &lt;code&gt;h&lt;/code&gt;. Below, I’m using the automated &lt;code&gt;ets&lt;/code&gt; model to predict 25 days into the future:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Predict Future Values
plot(predict(fit3,h=25),xlim=c(200,215))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-02-19-forecasting-pm2-5-with-forecast-and-prophet_files/figure-html/ets-forecast-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now going back to our 3 models, we can take a look at the residuals now that we are adjusting for autocorrelation and weekly seasonality:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ets.mod &amp;lt;- rbind(data.frame(day=1:sum(!is.na(dat.ts)), resid=as.numeric(residuals(fit3)), type=&amp;quot;Auto&amp;quot;),
                 data.frame(day=1:sum(!is.na(dat.ts)), resid=as.numeric(residuals(fit4a)), type=&amp;quot;Additive&amp;quot;),
                 data.frame(day=1:sum(!is.na(dat.ts)), resid=as.numeric(residuals(fit4b)), type=&amp;quot;Multiplicative&amp;quot;))

# Compare the residuals of each model
ggplot(ets.mod,aes(day,resid)) + 
  geom_point() + geom_smooth() + 
  facet_grid(type~.,scales=&amp;quot;free&amp;quot;)+
  ggtitle(&amp;quot;ETS Residuals with Weekly Seasonality&amp;quot;,
          subtitle = paste0(&amp;quot;Auto RMSE: &amp;quot;,round(sqrt(fit3$mse),2),
                            &amp;quot;   Additive RMSE: &amp;quot;,round(sqrt(fit4a$mse),2),
                            &amp;quot;   Multiplicative RMSE: &amp;quot;,round(sqrt(fit4b$mse),2)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-02-19-forecasting-pm2-5-with-forecast-and-prophet_files/figure-html/ets-resid-1.png&#34; width=&#34;672&#34; /&gt; There we go! Our residuals look much better, though there still appears to be some yearly seasonality that we can incorporate using more sophisticated time series models. Let’s start with Rob Hyndman’s implementation of the TBATS model.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;tbats-trigonometric-regressors-box-cox-transformations-arma-errors-trend-seasonality&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;TBATS (Trigonometric regressors, Box-Cox transformations, ARMA errors, Trend, Seasonality)&lt;/h3&gt;
&lt;p&gt;Using the TBATS model is one way to incorporate multiple seasonalities in our model. It automates the choice of a Box-Cox transformation for our target variable, PM2.5. You may have noticed that I’ve been taking the square root of PM2.5 in each of the previous models; this was in part due to the recommended Box-Cox parameter of 0.5 that came out of this model when I was first playing around with the &lt;code&gt;tbats()&lt;/code&gt; function. The function also automatically chooses the ARMA parameters and the Fourier terms for the seasonal trends.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# TBATS model with weekly and yearly seasonality
dat.ts2 &amp;lt;- sqrt(msts(dat[!is.na(dat$pm2.5),&amp;quot;pm2.5&amp;quot;], seasonal.periods=c(7,365.25)))
fit5 &amp;lt;- tbats(dat.ts2)
# This method takes the longest to run of the approaches compared here.
# Downside: you cannot set specific Box-Cox, ARMA, and Fourier parameters.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This time series model is easy to use and can be extremely useful when modeling multiple seasonalities and autoregressive features. I do wish the &lt;code&gt;tbats()&lt;/code&gt; function would allow you to pass specific Box-Cox, ARMA, and Fourier parameters for your model; that would make cross-validation more convenient by letting me fix the model specification for each window.&lt;/p&gt;
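&lt;p&gt;In the meantime, you can at least inspect what was chosen after the fact. A quick peek at a few components of the fitted object (a sketch based on my reading of the &lt;code&gt;forecast&lt;/code&gt; package’s tbats object):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Inspect the parameters tbats() chose automatically
fit5$lambda           # Box-Cox parameter
fit5$seasonal.periods # seasonal periods used in the fit
fit5$k.vector         # number of Fourier term pairs per seasonal period&lt;/code&gt;&lt;/pre&gt;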
&lt;p&gt;Once again, you can see that predicting future values is made very easy with the &lt;code&gt;predict()&lt;/code&gt; function and &lt;code&gt;h&lt;/code&gt; parameter.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Predict future values
plot(predict(fit5, h=25),xlim=c(4.8,5.2))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-02-19-forecasting-pm2-5-with-forecast-and-prophet_files/figure-html/tbats-predict-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Lastly, let’s look at the residuals and see if adding both yearly and weekly seasonality has improved our predictions:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Plot the residuals
tbats.mod &amp;lt;- data.frame(day=1:sum(!is.na(dat.ts2)),resid=as.numeric(residuals(fit5)))
ggplot(tbats.mod,aes(day,resid)) + 
  geom_point() + geom_smooth() + 
  ggtitle(&amp;quot;TBATS Resids with Dual Seasonality&amp;quot;,
          subtitle = paste0(&amp;quot;Auto RMSE: &amp;quot;,round(sqrt(mean((residuals(fit5))^2)),2)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-02-19-forecasting-pm2-5-with-forecast-and-prophet_files/figure-html/tbats-resid-1.png&#34; width=&#34;672&#34; /&gt; Wow! This looks much better. A random cloud of points around the line &lt;code&gt;y = 0&lt;/code&gt; is typically what we are looking for in a good model fit. Notice also that the training RMSE is much better for this model.&lt;/p&gt;
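&lt;p&gt;If you want a more formal residual diagnostic than eyeballing the scatter, the &lt;code&gt;forecast&lt;/code&gt; package’s &lt;code&gt;checkresiduals()&lt;/code&gt; function is a handy aside here:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Residual ACF, histogram, and a Ljung-Box test in one call
checkresiduals(fit5)&lt;/code&gt;&lt;/pre&gt;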
&lt;/div&gt;
&lt;div id=&#34;arima-with-regressors-autoregressive-integraged-moving-average&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;ARIMA with Regressors (AutoRegressive Integrated Moving Average)&lt;/h3&gt;
&lt;p&gt;The last piece of the time series puzzle is being able to add regressors on top of the multiple-seasonality and autocorrelation adjustments. The &lt;code&gt;auto.arima()&lt;/code&gt; function can include all of these in the model via the &lt;code&gt;fourier()&lt;/code&gt; transform function and the &lt;code&gt;xreg&lt;/code&gt; argument.&lt;/p&gt;
&lt;p&gt;In this portion of the exercise, because my regressors are themselves time series, I need to forecast each of those regressors before using them to forecast the PM2.5 level.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# ARIMA with weekly and yearly seasonality with regressors
regs &amp;lt;- dat[!is.na(dat$pm2.5),c(&amp;quot;precip&amp;quot;,&amp;quot;wind&amp;quot;,&amp;quot;inversion&amp;quot;,&amp;quot;fireworks&amp;quot;)]

# Forecast weather regressors
weather.ts &amp;lt;- msts(dat[,c(&amp;quot;precip&amp;quot;,&amp;quot;wind&amp;quot;,&amp;quot;inversion_diff&amp;quot;)],seasonal.periods = c(7,365.25))
precip &amp;lt;- auto.arima(weather.ts[,1])
fprecip &amp;lt;- as.numeric(data.frame(forecast(precip,h=25))$Point.Forecast)
wind &amp;lt;- auto.arima(weather.ts[,2])
fwind &amp;lt;- as.numeric(data.frame(forecast(wind,h=25))$Point.Forecast)
inversion &amp;lt;- auto.arima(weather.ts[,3])
finversion &amp;lt;- as.numeric(data.frame(forecast(inversion,h=25))$Point.Forecast)

fregs &amp;lt;- data.frame(precip=fprecip,wind=fwind,inversion=as.numeric(finversion&amp;lt;0),fireworks=0)

# Seasonality
z &amp;lt;- fourier(dat.ts2, K=c(2,5))
zf &amp;lt;- fourier(dat.ts2, K=c(2,5), h=25)

# Fit the model
fit &amp;lt;- auto.arima(dat.ts2, xreg=cbind(z,regs), seasonal=FALSE)

# Predict Future Values
# This time we need future values of the regressors as well.
fc &amp;lt;- forecast(fit, xreg=cbind(zf,fregs), h=25)
plot(fc,xlim=c(4.8,5.2))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-02-19-forecasting-pm2-5-with-forecast-and-prophet_files/figure-html/unnamed-chunk-1-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Again, the residuals (plotted below) look much better than those from the OLS and Random Forest regression models.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Plot the residuals
arima.mod &amp;lt;- data.frame(day=1:sum(!is.na(dat.ts)),resid=as.numeric(residuals(fit)))

ggplot(arima.mod,aes(day,resid)) + 
  geom_point() + geom_smooth() + 
  ggtitle(&amp;quot;Arima Resids with Seasonality and Regressors&amp;quot;,
          subtitle = paste0(&amp;quot;RMSE: &amp;quot;,round(sqrt(mean((residuals(fit))^2)),2)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-02-19-forecasting-pm2-5-with-forecast-and-prophet_files/figure-html/arima-resid-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;prophet&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;prophet&lt;/h3&gt;
&lt;p&gt;And finally, let’s take a look at fitting a basic model using the &lt;code&gt;prophet&lt;/code&gt; package. The &lt;code&gt;prophet&lt;/code&gt; package uses Stan to fit an additive model that can include seasonality, autocorrelation, extra regressors, etc. One of the nice features of the &lt;code&gt;prophet()&lt;/code&gt; function is that it automatically chooses change points in your time series; the default number of change points is &lt;code&gt;25&lt;/code&gt;. This makes these time series models a little more robust compared to the others. Once again, I’m using &lt;code&gt;prophet&lt;/code&gt; to forecast the regressors that I then pass into the final &lt;code&gt;prophet&lt;/code&gt; model to predict PM2.5.&lt;/p&gt;
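&lt;p&gt;As an aside, the change-point flexibility is adjustable when you create the model. A toy illustration (not part of the analysis below):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Toy series, just to show the argument
toy &amp;lt;- data.frame(ds = seq(as.Date(&amp;quot;2016-01-01&amp;quot;), by = &amp;quot;day&amp;quot;, length.out = 200),
                  y = rnorm(200))
m &amp;lt;- prophet(toy, n.changepoints = 10) # default is 25&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, back to the PM2.5 data:&lt;/p&gt;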
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pdat &amp;lt;- data.frame(ds=dat$date,
                   y=sqrt(dat$pm2.5),
                   precip=dat$precip,
                   wind=dat$wind,
                   inversion_diff=dat$inversion_diff,
                   inversion=dat$inversion,
                   fireworks=dat$fireworks)

# Forecast weather regressors
pfdat &amp;lt;- data.frame(ds=max(dat$date) + 1:25)
pprecip &amp;lt;- pdat %&amp;gt;% 
  select(ds,y=precip) %&amp;gt;% 
  prophet() %&amp;gt;%
  predict(pfdat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Initial log joint probability = -5.77805
## Optimization terminated normally: 
##   Convergence detected: relative gradient magnitude is below tolerance&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pwind &amp;lt;- pdat %&amp;gt;% 
  select(ds,y=wind) %&amp;gt;% 
  prophet() %&amp;gt;%
  predict(pfdat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Initial log joint probability = -46.5575
## Optimization terminated normally: 
##   Convergence detected: relative gradient magnitude is below tolerance&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pinversion &amp;lt;- pdat %&amp;gt;% 
  select(ds,y=inversion_diff) %&amp;gt;% 
  prophet() %&amp;gt;%
  predict(pfdat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Initial log joint probability = -55.0515
## Optimization terminated normally: 
##   Convergence detected: relative gradient magnitude is below tolerance&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fdat &amp;lt;-  data.frame(ds=pfdat$ds,
                    precip=pprecip$yhat,
                    wind=pwind$yhat,
                    inversion=as.numeric(pinversion$yhat&amp;lt;0),
                    fireworks = 0)

# Fit the model (Seasonality automatically determined)
fit6 &amp;lt;- prophet() %&amp;gt;% 
  add_regressor(&amp;#39;precip&amp;#39;) %&amp;gt;% 
  add_regressor(&amp;#39;wind&amp;#39;) %&amp;gt;% 
  add_regressor(&amp;#39;inversion&amp;#39;) %&amp;gt;% 
  add_regressor(&amp;#39;fireworks&amp;#39;) %&amp;gt;% 
  fit.prophet(pdat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Initial log joint probability = -120.752
## Optimization terminated normally: 
##   Convergence detected: relative gradient magnitude is below tolerance&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We also see that the &lt;code&gt;predict()&lt;/code&gt; function can be used with the &lt;code&gt;prophet&lt;/code&gt; model object to forecast future values by passing the future dataframe as a second argument.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Forecast future values
# (named fc6 rather than `forecast` to avoid confusion with the forecast() function)
fc6 &amp;lt;- predict(fit6, fdat)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Looking at the residuals below, you can see hints of the original seasonal trend, the one we saw earlier in the OLS and Random Forest models, starting to show up again.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Get the residuals
fpred &amp;lt;- predict(fit6)
fpred$ds &amp;lt;- as.Date(fpred$ds)
fpred &amp;lt;- pdat %&amp;gt;% left_join(fpred,by=&amp;quot;ds&amp;quot;)
fpred$resid &amp;lt;- fpred$y - fpred$yhat

# Plot the residuals
ggplot(fpred,aes(ds,resid)) + 
  geom_point() + geom_smooth() + 
  ggtitle(&amp;quot;Prophet with Seasonality and Regressors&amp;quot;,
          subtitle = paste0(&amp;quot;RMSE: &amp;quot;,round(sqrt(mean(fpred$resid^2)),2)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-02-19-forecasting-pm2-5-with-forecast-and-prophet_files/figure-html/prophet-resid-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;cross-validation-comparison-of-models&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Cross-Validation Comparison of Models&lt;/h3&gt;
&lt;p&gt;Okay, now that we’ve gone over the basics of each model and assessed the fits, let’s compare how well the models predict future PM2.5 levels. The cross-validation uses a rolling window over the time series: each window is split into an “initial” period and a “horizon”, the model is fit on the initial period, and its predictions over the horizon are compared to the actual values. I picked RMSE as the loss function for evaluating predictive performance.&lt;/p&gt;
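&lt;p&gt;The code that builds my cross-validation results (the &lt;code&gt;all.cv&lt;/code&gt; data frame used below) isn’t shown here, but a minimal sketch of the rolling-window idea for a single model might look like this (window sizes and column names are illustrative assumptions, not the exact code behind the plots):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Rolling-origin cross-validation sketch for the ETS model
cdat &amp;lt;- dat[!is.na(dat$pm2.5),]
initial &amp;lt;- 730 # days used to fit each model
horizon &amp;lt;- 25  # days predicted past each cutoff
cutoffs &amp;lt;- seq(initial, nrow(cdat) - horizon, by = horizon)

ets.cv &amp;lt;- do.call(rbind, lapply(cutoffs, function(co) {
  train &amp;lt;- ts(sqrt(cdat$pm2.5[1:co]), frequency = 7)
  fc &amp;lt;- forecast(ets(train), h = horizon)
  data.frame(model = &amp;quot;ETS&amp;quot;,
             cutoff = co,
             day = 1:horizon,
             date = cdat$date[co + 1:horizon],
             y = sqrt(cdat$pm2.5[co + 1:horizon]),
             yhat = as.numeric(fc$mean))
}))
# Repeating this for each model and rbind-ing the results gives all.cv&lt;/code&gt;&lt;/pre&gt;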
&lt;p&gt;A typical comparison is to compute the RMSE for each of the days in your horizon by combining all the differences between ‘y’ and ‘yhat’ from each of your rolling validations:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# RMSE by horizon
all.cv %&amp;gt;% 
  group_by(model,day) %&amp;gt;% 
  summarise(rmse=sqrt(mean((y-yhat)^2))) %&amp;gt;% 
  ggplot(.,aes(x=day,y=rmse,group=model,color=model)) +
  geom_line(alpha=.75) + geom_point(alpha=.75)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-02-19-forecasting-pm2-5-with-forecast-and-prophet_files/figure-html/unnamed-chunk-3-1.png&#34; width=&#34;672&#34; /&gt; This is definitely an interesting result. Clearly the Exponential Smoothing model is not the best predictor for this data. Also, when comparing how well each model predicts future events, it appears that the OLS and Random Forest regression models perform just as well as the TBATS, ARIMA, and prophet models. In the plot below, we can also take a look at what each model’s forecasts look like for the year 2017.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Prediction behaviors of different methods
ggplot(all.cv,aes(date,yhat,group=as.factor(cutoff),color=as.factor(cutoff)))+
  geom_line()+
  geom_line(aes(y=y),color=&amp;quot;black&amp;quot;,alpha=.15)+#geom_point(aes(y=y),color=&amp;quot;black&amp;quot;,alpha=.15)+
  facet_wrap(~model)+ guides(color=&amp;quot;none&amp;quot;) +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-02-19-forecasting-pm2-5-with-forecast-and-prophet_files/figure-html/unnamed-chunk-4-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Some of the things that you’ll probably notice first off are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Why the Exponential Smoothing model didn’t perform so well.&lt;/li&gt;
&lt;li&gt;Since I don’t know the future regressors’ values for OLS and Random Forest regression, I just held them at their values from the end of each initial window, which resulted in straight-line forecasts (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;ARIMA appears to be less robust than the other methods.&lt;/li&gt;
&lt;/ul&gt;
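&lt;p&gt;For reference, holding the regressors flat amounts to repeating the last observed row out to the horizon. A sketch of what I mean, reusing the &lt;code&gt;regs&lt;/code&gt; data frame from the ARIMA section:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hold future regressors at their last observed values for a 25-day horizon
fregs.flat &amp;lt;- regs[rep(nrow(regs), 25), ]
rownames(fregs.flat) &amp;lt;- NULL&lt;/code&gt;&lt;/pre&gt;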
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;Of all these methods, I would probably choose either the TBATS or prophet model for forecasting future data. I hope you have enjoyed these exercises and this intro to time series in R!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;where-to-learn-more&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Where to learn more?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://www.futurelearn.com/courses/business-analytics-forecasting&#34;&gt;FutureLearn Forecasting MOOC&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://otexts.org/fpp/&#34;&gt;Forecasting: Principles and Practice&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://facebook.github.io/prophet/&#34;&gt;Prophet: Forecasting at Scale&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;References&lt;/h3&gt;
&lt;p&gt;Hyndman, R.J. and Athanasopoulos, G. (2013) Forecasting: principles and practice. OTexts: Melbourne, Australia. &lt;a href=&#34;http://otexts.org/fpp/&#34; class=&#34;uri&#34;&gt;http://otexts.org/fpp/&lt;/a&gt;. Accessed on February 11, 2018.&lt;/p&gt;
&lt;p&gt;National Center for Environmental Information. Climate Data Online available at &lt;a href=&#34;https://www.ncdc.noaa.gov/cdo-web&#34; class=&#34;uri&#34;&gt;https://www.ncdc.noaa.gov/cdo-web&lt;/a&gt;. Accessed February 11, 2018.&lt;/p&gt;
&lt;p&gt;Sean Taylor and Ben Letham (2017). prophet: Automatic Forecasting Procedure. R package version 0.2.1.9000. &lt;a href=&#34;https://facebook.github.io/prophet/&#34; class=&#34;uri&#34;&gt;https://facebook.github.io/prophet/&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;US Environmental Protection Agency. Air Quality System Data Mart [internet database] available at &lt;a href=&#34;http://www.epa.gov/ttn/airs/aqsdatamart&#34; class=&#34;uri&#34;&gt;http://www.epa.gov/ttn/airs/aqsdatamart&lt;/a&gt;. Accessed February 11, 2018.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Creating a Custom htmlwidget for Shiny</title>
      <link>https://nielsenmark.us/2018/01/02/creating-a-custom-htmlwidget/</link>
      <pubDate>Tue, 02 Jan 2018 00:00:00 +0000</pubDate>
      
      <guid>https://nielsenmark.us/2018/01/02/creating-a-custom-htmlwidget/</guid>
      <description>&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/htmlwidgets/htmlwidgets.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/mywidget-binding/mywidget.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/d3/d3.v3.min.js&#34;&gt;&lt;/script&gt;
&lt;link href=&#34;https://nielsenmark.us/rmarkdown-libs/hive/hive.css&#34; rel=&#34;stylesheet&#34; /&gt;
&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/hive/d3.hive.min.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/hive_no_int-binding/hive_no_int.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/hive-binding/hive.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;A year ago, &lt;code&gt;htmlwidgets&lt;/code&gt; were a mystery to me. I was first introduced to them at a conference years earlier, back when I used &lt;code&gt;rCharts&lt;/code&gt;, which I really liked for the ability to customize my interactive graphs in Shiny. I approached an instructor, explained my interest in &lt;code&gt;rCharts&lt;/code&gt;, and he pointed me toward &lt;code&gt;htmlwidgets&lt;/code&gt;. Last year I finally decided to take the leap and give it a try.&lt;/p&gt;
&lt;div id=&#34;setting-up-the-htmlwidget&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Setting Up the HTMLWidget&lt;/h3&gt;
&lt;p&gt;I started my learning with this &lt;a href=&#34;http://www.htmlwidgets.org/develop_intro.html&#34;&gt;tutorial&lt;/a&gt; from Ramnath V., Kenton R., and RStudio on creating htmlwidgets, which explains that “the htmlwidgets package provides a framework for creating R bindings to JavaScript libraries.” Following along with the tutorial, we can easily create our first htmlwidget.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;devtools::create(&amp;quot;mywidget&amp;quot;)
setwd(&amp;quot;mywidget&amp;quot;)
htmlwidgets::scaffoldWidget(&amp;quot;mywidget&amp;quot;)
devtools::install()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One thing to note about htmlwidgets is that they are always hosted in an R package to ensure full reproducibility.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;file-structure&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;File Structure&lt;/h3&gt;
&lt;p&gt;Next, let’s follow the tutorial further and take a look at the file structure.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
.
├── DESCRIPTION
├── inst
│   └── htmlwidgets
│       ├── mywidget.js
│       └── mywidget.yaml
├── mywidget.Rproj
├── NAMESPACE
└── R
    └── mywidget.R
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We see here that in order to bind our JavaScript library to our new R package we need to include both some R code (&lt;code&gt;mywidget.R&lt;/code&gt;) and JavaScript (&lt;code&gt;mywidget.js&lt;/code&gt;). All the JavaScript, YAML, and other dependencies live in the &lt;code&gt;inst/htmlwidgets&lt;/code&gt; folder, while the R code in the &lt;code&gt;R&lt;/code&gt; folder defines the inputs to the new function we are creating. Below you can see that the sample htmlwidget we have created takes a character string as input, creates an HTML page, and passes our character string through to the JavaScript code.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(mywidget)
mywidget(&amp;quot;Hello World&amp;quot;,height=&amp;quot;100px&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;htmlwidget-1&#34; style=&#34;width:672px;height:100px;&#34; class=&#34;mywidget html-widget&#34;&gt;&lt;/div&gt;
&lt;script type=&#34;application/json&#34; data-for=&#34;htmlwidget-1&#34;&gt;{&#34;x&#34;:{&#34;message&#34;:&#34;Hello World&#34;},&#34;evals&#34;:[],&#34;jsHooks&#34;:[]}&lt;/script&gt;
&lt;p&gt;Voilà! Your first htmlwidget AND the classic “Hello World”. Okay, okay… maybe this isn’t as awesome as you were thinking, but we can do even better. Are you ready to create your own htmlwidget?&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;step-1-adding-your-own-javascript-code&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 1: Adding your own JavaScript code&lt;/h3&gt;
&lt;p&gt;First let’s find some code for the popular JavaScript library D3. I am not a web developer, so I found mine in a blog post by Mike Bostock. I really liked the functionality and look of his D3 implementation of &lt;a href=&#34;https://bost.ocks.org/mike/hive/&#34;&gt;hive plots&lt;/a&gt;. Hive plots are credited to Martin Krzywinski; you’ll find his introduction to them &lt;a href=&#34;https://academic.oup.com/bib/article/13/5/627/412507/Hive-plots-rational-approach-to-visualizing&#34;&gt;here&lt;/a&gt;. A simpler version of Mike’s implementation is found &lt;a href=&#34;https://bl.ocks.org/mbostock/2066415&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Now that I’ve got my code, I’m going to replace the JavaScript code in &lt;code&gt;./inst/htmlwidgets/hive.js&lt;/code&gt; with this:&lt;/p&gt;
&lt;pre&gt;
&lt;code class=&#34;hljs&#34; data-trim&gt;
HTMLWidgets.widget({

  name: &#39;hive_no_int&#39;,

  type: &#39;output&#39;,

  factory: function(el, width, height) {

    // TODO: define shared variables for this instance

    return {

      renderValue: function(x) {

        // alias options
        var options = x.options;

        // convert links and nodes data frames to d3 friendly format
        var nodes = HTMLWidgets.dataframeToD3(x.nodes);
        var prelinks = HTMLWidgets.dataframeToD3(x.links);

        // create json of link sources and targets
        var links = [];
        prelinks.forEach(function(d){
          var tmp = {};
          tmp.source=nodes[d.source];
          tmp.target=nodes[d.target];
          links.push(tmp);
        });

        var innerRadius = options.innerRadius,
            outerRadius = options.outerRadius;

        var angle = d3.scale.ordinal().domain(d3.range(x.numAxis+1)).rangePoints([0, 2 * Math.PI]),
            radius = d3.scale.linear().range([innerRadius, outerRadius]),
            color = d3.scale.category10().domain(d3.range(20));

        // append an svg element and center the drawing group
        var svg = d3.select(el).append(&#34;svg&#34;)
          .attr(&#34;width&#34;, width)
          .attr(&#34;height&#34;, height)
          .append(&#34;g&#34;)
          .attr(&#34;transform&#34;, &#34;translate(&#34; + width / 2 + &#34;,&#34; + height / 2 + &#34;)&#34;);

        svg.selectAll(&#34;.axis&#34;)
            .data(d3.range(x.numAxis))
            .enter().append(&#34;line&#34;)
            .attr(&#34;class&#34;, &#34;axis&#34;)
            .attr(&#34;transform&#34;, function(d) {
              return &#34;rotate(&#34; + degrees(angle(d)) + &#34;)&#34;;
            })
            .attr(&#34;x1&#34;, radius.range()[0])
            .attr(&#34;x2&#34;, radius.range()[1]);

        // draw links
        var link = svg.selectAll(&#34;.link&#34;)
            .data(links)
            .enter().append(&#34;path&#34;)
            .attr(&#34;class&#34;, &#34;link&#34;)
            .attr(&#34;d&#34;, d3.hive.link()
              .angle(function(d) { return angle(d.x); })
              .radius(function(d) { return radius(d.y); }))
            .style(&#34;stroke&#34;, function(d) { return color(d.source.color); })
            .style(&#34;stroke-width&#34;, 1.5)
            .style(&#34;opacity&#34;, options.opacity);

        // draw nodes
        var node = svg.selectAll(&#34;.node&#34;)
            .data(nodes)
            .enter().append(&#34;circle&#34;)
            .attr(&#34;class&#34;, &#34;node&#34;)
            .attr(&#34;transform&#34;, function(d) {
              return &#34;rotate(&#34; + degrees(angle(d.x)) + &#34;)&#34;;
            })
            .attr(&#34;cx&#34;, function(d) { return radius(d.y); })
            .attr(&#34;r&#34;, 5)
            .style(&#34;fill&#34;, function(d) { return color(d.color); })
            .style(&#34;stroke&#34;, &#34;#000000&#34;);

        function degrees(radians) {
          return radians / Math.PI * 180 - 90;
        }

      }

    };
  }
});
&lt;/code&gt;
&lt;/pre&gt;
&lt;p&gt;Next, I copy supporting JS and CSS code into &lt;code&gt;./inst/htmlwidgets/lib/&lt;/code&gt; folder. For this project I’ll need &lt;code&gt;d3.js&lt;/code&gt; as well as some code from Mike’s post to create our visualization. Here’s what is now contained in the &lt;code&gt;./inst/htmlwidgets/lib/&lt;/code&gt; folder:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## d3-3.0/d3.v3.min.js
## hive-0.1/d3.hive.min.js
## hive-0.1/hive.css&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And finally, I define those dependencies in &lt;code&gt;./inst/htmlwidgets/hive.yaml&lt;/code&gt; as seen below:&lt;/p&gt;
&lt;pre&gt;
&lt;code class=&#34;hljs&#34; data-trim&gt;
# (uncomment to add a dependency)
dependencies:
  - name: d3
    version: 3.0
    src: htmlwidgets/lib/d3-3.0
    script:
      - d3.v3.min.js
  - name: hive
    version: 0.1
    src: htmlwidgets/lib/hive-0.1
    script:
      - d3.hive.min.js
    stylesheet:
      - hive.css
&lt;/code&gt;
&lt;/pre&gt;
&lt;p&gt;Now that our dependencies are defined we can now create the bindings between R and JavaScript.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;step-2-create-the-bindings&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 2: Create the Bindings&lt;/h3&gt;
&lt;p&gt;Okay, the goal in this next step is to get our R dataframe to look just like this dataset from the hive plot D3 code.&lt;/p&gt;
&lt;pre&gt;
&lt;code class=&#34;hljs&#34; data-trim&gt;
var nodes = [
  {x: 0, y: .1},
  {x: 0, y: .9},
  {x: 1, y: .2},
  {x: 1, y: .3},
  {x: 2, y: .1},
  {x: 2, y: .8}
];
var links = [
  {source: nodes[0], target: nodes[2]},
  {source: nodes[1], target: nodes[3]},
  {source: nodes[2], target: nodes[4]},
  {source: nodes[2], target: nodes[5]},
  {source: nodes[3], target: nodes[5]},
  {source: nodes[4], target: nodes[0]},
  {source: nodes[5], target: nodes[1]}
];
&lt;/code&gt;
&lt;/pre&gt;
&lt;p&gt;First, let’s tell R what it needs to pass through to our JavaScript library. This is done by creating a function that takes our data and options as arguments and combines them into a list. The list is then passed through the &lt;code&gt;htmlwidgets::createWidget&lt;/code&gt; function to be picked up by our JavaScript code. Below I used code provided in RStudio’s tutorial and also replicated the options &lt;code&gt;innerRadius&lt;/code&gt;, &lt;code&gt;outerRadius&lt;/code&gt;, and &lt;code&gt;opacity&lt;/code&gt; from Mike Bostock’s function:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;hive &amp;lt;- function(nodes, 
                 links, 
                 innerRadius = 40, 
                 outerRadius = 240, 
                 opacity = 0.7, 
                 width = NULL, 
                 height = NULL, 
                 elementId = NULL) {

  # sort in order of node id
  if(&amp;quot;id&amp;quot; %in% colnames(nodes)) {
    nodes &amp;lt;- nodes[order(nodes$id),]
    nodes$id &amp;lt;- NULL
  }

  # color by axis if no coloring is supplied
  if(!(&amp;quot;color&amp;quot; %in% colnames(nodes))) {
    nodes$color &amp;lt;- nodes$x
  }

  # forward options using x
  x = list(
    nodes = nodes,
    links = links,
    numAxis = max(nodes$x)+1,
    options = list(innerRadius=innerRadius,
                   outerRadius=outerRadius,
                   opacity=opacity)
  )

  # create widget
  htmlwidgets::createWidget(
    name = &amp;#39;hive&amp;#39;,
    x,
    width = width,
    height = height,
    package = &amp;#39;hiveD3&amp;#39;,
    elementId = elementId
  )
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice above that the objects &lt;code&gt;nodes&lt;/code&gt; and &lt;code&gt;links&lt;/code&gt; are R dataframes and that the final list &lt;code&gt;x&lt;/code&gt; is passed through to JS.&lt;/p&gt;
&lt;p&gt;Now that we’ve defined our R binding, let’s take a minute and set up the JavaScript binding in the &lt;code&gt;hive.js&lt;/code&gt; file. For d3, we use the &lt;code&gt;dataframeToD3()&lt;/code&gt; helper function. I’m not awesome with JavaScript, so I’m going to avoid making too many changes to this code:&lt;/p&gt;
&lt;pre&gt;
&lt;code class=&#34;hljs&#34; data-trim&gt;
// alias options
var options = x.options;

// convert links and nodes data frames to d3 friendly format
var nodes = HTMLWidgets.dataframeToD3(x.nodes);
var prelinks = HTMLWidgets.dataframeToD3(x.links);

// create json of link sources and targets
var links = [];
prelinks.forEach(function(d){
  var tmp = {};
  tmp.source=nodes[d.source];
  tmp.target=nodes[d.target];
  links.push(tmp);
});
&lt;/code&gt;
&lt;/pre&gt;
&lt;p&gt;To give you an understanding of what is under the hood of the &lt;code&gt;dataframeToD3&lt;/code&gt; function: on the R side, &lt;code&gt;jsonlite::toJSON&lt;/code&gt; serializes the dataframe column-wise, and &lt;code&gt;dataframeToD3&lt;/code&gt; converts that into the row-wise array of objects that D3 expects. When you look at the data you can see that recreating &lt;code&gt;nodes&lt;/code&gt; is easy. As for &lt;code&gt;links&lt;/code&gt;, we read the data in as &lt;code&gt;prelinks&lt;/code&gt; and then loop through each of its items to create &lt;code&gt;links&lt;/code&gt; just like it is in Mike’s JavaScript code.&lt;/p&gt;
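&lt;p&gt;To see the two shapes side by side from R, here’s a small illustration of my own using &lt;code&gt;jsonlite&lt;/code&gt; directly:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(jsonlite)
df &amp;lt;- data.frame(x = c(0, 1), y = c(0.1, 0.2))

# Column-wise, roughly how the widget data is shipped to the browser:
toJSON(as.list(df)) # {&amp;quot;x&amp;quot;:[0,1],&amp;quot;y&amp;quot;:[0.1,0.2]}

# Row-wise, the shape dataframeToD3() produces for D3:
toJSON(df)          # [{&amp;quot;x&amp;quot;:0,&amp;quot;y&amp;quot;:0.1},{&amp;quot;x&amp;quot;:1,&amp;quot;y&amp;quot;:0.2}]&lt;/code&gt;&lt;/pre&gt;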
&lt;/div&gt;
&lt;div id=&#34;step-3-putting-it-all-together&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 3: Putting it all together&lt;/h3&gt;
&lt;p&gt;All of our bindings are set up and once I’ve built and loaded my package, we’re ready to define some dataframes and test out our new htmlwidget.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(hiveD3)
nodes = data.frame(id=c(0,1,2,3,4,5,6,7,8),
                   x=c(0,0,1,1,2,2,3,3,4), 
                   y=c(.1,.9,.2,.3,.1,.8,.3,.5,.9))
links = data.frame(source=c(0,1,2,2,3,4,5,6,7,8,8),
                   target=c(2,3,4,5,5,6,7,8,8,0,1))


hive_no_int(nodes=nodes,links=links, width = &amp;quot;700px&amp;quot;, height = &amp;quot;500px&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When we run the &lt;code&gt;hive&lt;/code&gt; function we see our new visualization! (Note that, for demonstration purposes only, I’ve renamed this first version of the function &lt;code&gt;hive_no_int&lt;/code&gt;.) &lt;div id=&#34;htmlwidget-2&#34; style=&#34;width:700px;height:500px;&#34; class=&#34;hive_no_int html-widget&#34;&gt;&lt;/div&gt;
&lt;script type=&#34;application/json&#34; data-for=&#34;htmlwidget-2&#34;&gt;{&#34;x&#34;:{&#34;nodes&#34;:{&#34;x&#34;:[0,0,1,1,2,2,3,3,4],&#34;y&#34;:[0.1,0.9,0.2,0.3,0.1,0.8,0.3,0.5,0.9],&#34;color&#34;:[0,0,1,1,2,2,3,3,4]},&#34;links&#34;:{&#34;source&#34;:[0,1,2,2,3,4,5,6,7,8,8],&#34;target&#34;:[2,3,4,5,5,6,7,8,8,0,1]},&#34;numAxis&#34;:5,&#34;options&#34;:{&#34;innerRadius&#34;:40,&#34;outerRadius&#34;:240,&#34;opacity&#34;:0.7}},&#34;evals&#34;:[],&#34;jsHooks&#34;:[]}&lt;/script&gt;&lt;/p&gt;
&lt;p&gt;Alright! We’re ready to show off our work, but can you guess the first question that is going to be asked of you? Your friends may think it’s cool, but will say “Why doesn’t it do anything when I hover over it?” or “Why can’t I interact with it?” Well, so much for not having to tweak any JavaScript code. It’s time to dive in and add some interactivity.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;step-4-making-finishing-touches&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 4: Making Finishing Touches&lt;/h3&gt;
&lt;p&gt;Let’s look at some next steps in getting our htmlwidget ready for prime time:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Adding interaction&lt;/li&gt;
&lt;li&gt;Creating and sharing your package&lt;/li&gt;
&lt;li&gt;Creating R documentation using RStudio and &lt;code&gt;roxygen2&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Adding your package to the &lt;a href=&#34;http://gallery.htmlwidgets.org/&#34;&gt;htmlwidget gallery&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We’ve talked about adding interaction, and once that is ready you can share your new package in several ways. Make sure to create helpful documentation for your new package before sharing on Github or on the &lt;a href=&#34;http://gallery.htmlwidgets.org/&#34;&gt;htmlwidget gallery&lt;/a&gt;.&lt;/p&gt;
&lt;div id=&#34;the-final-product&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;The Final Product&lt;/h4&gt;
&lt;p&gt;Great! I’ve gone ahead and added my package to GitHub. Of course, I did this after making sure to create some documentation and interactivity… and finally, we can show it off.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(devtools)
install_github(&amp;#39;nielsenmarkus11/hiveD3&amp;#39;)

library(hiveD3)

nodes = data.frame(id=c(0,1,2,3,4,5,6,7,8),
                   x=c(0,0,1,1,2,2,3,3,4), 
                   y=c(.1,.9,.2,.3,.1,.8,.3,.5,.9))
links = data.frame(source=c(0,1,2,2,3,4,5,6,7,8,8),
                   target=c(2,3,4,5,5,6,7,8,8,0,1))

hive(nodes=nodes,links=links, width = &amp;quot;700px&amp;quot;, height = &amp;quot;500px&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;htmlwidget-3&#34; style=&#34;width:700px;height:500px;&#34; class=&#34;hive html-widget&#34;&gt;&lt;/div&gt;
&lt;script type=&#34;application/json&#34; data-for=&#34;htmlwidget-3&#34;&gt;{&#34;x&#34;:{&#34;nodes&#34;:{&#34;x&#34;:[0,0,1,1,2,2,3,3,4],&#34;y&#34;:[0.1,0.9,0.2,0.3,0.1,0.8,0.3,0.5,0.9],&#34;color&#34;:[0,0,1,1,2,2,3,3,4]},&#34;links&#34;:{&#34;source&#34;:[0,1,2,2,3,4,5,6,7,8,8],&#34;target&#34;:[2,3,4,5,5,6,7,8,8,0,1]},&#34;numAxis&#34;:5,&#34;options&#34;:{&#34;innerRadius&#34;:40,&#34;outerRadius&#34;:240,&#34;opacity&#34;:0.7}},&#34;evals&#34;:[],&#34;jsHooks&#34;:[]}&lt;/script&gt;
&lt;p&gt;Thanks for taking some time to check out my explorations with &lt;code&gt;htmlwidgets&lt;/code&gt;. What are the next steps for your project? Maybe someday I’ll put my stuff out on CRAN, and I definitely want to add some more interactivity and flexibility to my package. You can download and check it out by installing it from &lt;a href=&#34;https://github.com/nielsenmarkus11/hiveD3&#34;&gt;my GitHub page&lt;/a&gt;. Good luck!&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Bostock M, Morin R (2012). &lt;a href=&#34;https://bost.ocks.org/mike/hive/&#34;&gt;Hive Plots&lt;/a&gt;. Retrieved from &lt;a href=&#34;https://bost.ocks.org/mike/hive/&#34; class=&#34;uri&#34;&gt;https://bost.ocks.org/mike/hive/&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bostock M (2016). &lt;a href=&#34;https://bl.ocks.org/mbostock/2066415&#34;&gt;Hive Plot (Links)&lt;/a&gt;. Retrieved from &lt;a href=&#34;https://bl.ocks.org/mbostock/2066415&#34; class=&#34;uri&#34;&gt;https://bl.ocks.org/mbostock/2066415&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bostock M (2017). &lt;a href=&#34;https://d3js.org/&#34;&gt;D3 Data-Driven Documents&lt;/a&gt;. Retrieved from &lt;a href=&#34;https://d3js.org/&#34; class=&#34;uri&#34;&gt;https://d3js.org/&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Krzywinski M, Birol I, Jones S, Marra M (2011). &lt;a href=&#34;https://academic.oup.com/bib/article/13/5/627/412507/Hive-plots-rational-approach-to-visualizing&#34;&gt;Hive Plots — Rational Approach to Visualizing Networks&lt;/a&gt;. Briefings in Bioinformatics (early access 9 December 2011, doi: 10.1093/bib/bbr069).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Vaidyanathan R, Russell K, RStudio, Inc. (2014-2015). &lt;a href=&#34;http://www.htmlwidgets.org/develop_intro.html&#34;&gt;Creating a widget&lt;/a&gt;. Retrieved from &lt;a href=&#34;http://www.htmlwidgets.org/develop_intro.html&#34; class=&#34;uri&#34;&gt;http://www.htmlwidgets.org/develop_intro.html&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>First Post</title>
      <link>https://nielsenmark.us/2017/12/02/first-blog-post/</link>
      <pubDate>Sat, 02 Dec 2017 00:00:00 +0000</pubDate>
      
      <guid>https://nielsenmark.us/2017/12/02/first-blog-post/</guid>
      <description>&lt;p&gt;Hello World&lt;/p&gt;

&lt;p&gt;Yeah! I&amp;rsquo;ve finally got the blog up and running! &lt;a href=&#34;https://gohugo.io/&#34;&gt;Hugo&lt;/a&gt; has so far been great to use and easy to learn. There will definitely be more posts on programming and statistics with R to come in the future, so hang tight.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Blogroll</title>
      <link>https://nielsenmark.us/1/01/01/blogroll/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://nielsenmark.us/1/01/01/blogroll/</guid>
      <description>&lt;p&gt;Check out these other useful blogs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://www.r-bloggers.com&#34;&gt;R-Bloggers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.levithatcher.com&#34;&gt;Levi Thatcher&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
  </channel>
</rss>
