<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>nielsenmark.us</title>
    <link>https://nielsenmark.us/</link>
    <description>Recent content on nielsenmark.us</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <copyright>Mark Nielsen. Powered by [Hugo](//gohugo.io). Theme by [PPOffice](http://github.com/ppoffice).</copyright>
    <lastBuildDate>Thu, 14 Mar 2019 00:00:00 +0000</lastBuildDate>
    
        <atom:link href="https://nielsenmark.us/index.xml" rel="self" type="application/rss+xml" />
    
    
    <item>
      <title>Mixing Up Your Office March Madness Competition</title>
      <link>https://nielsenmark.us/2019/03/14/mixing-up-march-madness/</link>
      <pubDate>Thu, 14 Mar 2019 00:00:00 +0000</pubDate>
      
      <guid>https://nielsenmark.us/2019/03/14/mixing-up-march-madness/</guid>
      <description>

&lt;p&gt;March Madness is almost here! Try mixing up your office bracket competition with a Calcutta auction instead. I think the original source of this idea came from a news article similar to &lt;a href=&#34;https://www.post-gazette.com/sports/marchmadness/2006/03/13/Calcutta-auction-Brainy-twist-on-traditional-NCAA-pool/stories/200603130129&#34;&gt;this one&lt;/a&gt;. Of course, you can also make this (mostly) risk-free by using points to bid on teams instead of money&amp;hellip; our office typically has the biggest loser buy the winner lunch.&lt;/p&gt;

&lt;p&gt;Here’s how it works:&lt;/p&gt;

&lt;p&gt;Get everyone together over lunch for 1 &lt;sup&gt;1&lt;/sup&gt;&amp;frasl;&lt;sub&gt;2&lt;/sub&gt; to 2 hours on Monday to bid on the teams that they want to represent them for the tournament.  Each person will start with 500 or so points (for example) to &amp;ldquo;purchase&amp;rdquo; teams at the auction.  Everyone will earn 1 point back for each point their team scores in the NCAA tournament, and the owner of the NCAA tournament championship team will earn 100 bonus points.  The person with the most points after the final will be the winner and get all the glory. And possibly lunch&amp;hellip; :)&lt;/p&gt;
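
&lt;p&gt;For example, here&amp;rsquo;s how one participant&amp;rsquo;s final score might be tallied (all of the numbers below are made up for illustration):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical scoring for one participant
starting_points &amp;lt;- 500
spent_at_auction &amp;lt;- 480     # total of their winning bids
team_points &amp;lt;- c(312, 145)  # points their teams scored in the tournament
champion_bonus &amp;lt;- 100       # only if they own the champion

starting_points - spent_at_auction + sum(team_points) + champion_bonus
# [1] 577
&lt;/code&gt;&lt;/pre&gt;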

&lt;p&gt;I&amp;rsquo;ve tried to make this auction process a little easier using an R Shiny app, which you can install using the following commands:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;install.packages(&#39;devtools&#39;)
devtools::install_github(&#39;nielsenmarkus11/NCAAcalcutta&#39;)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Once you&amp;rsquo;ve installed the package you&amp;rsquo;ll need to import the latest bracket data (which will be available this upcoming Sunday) into R. Because that data isn&amp;rsquo;t available yet, I&amp;rsquo;ve included example data from 2018.  You&amp;rsquo;ll want to mimic the 2018 CSV file to ensure that the app works properly. Once you are ready, you can import the teams and start the Calcutta Shiny app.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Input the 2018 teams
teams &amp;lt;- import_teams(system.file(&amp;quot;extdata&amp;quot;, &amp;quot;ncaa-teams.csv&amp;quot;, package = &amp;quot;NCAAcalcutta&amp;quot;))
start_auction(teams, randomize=TRUE)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/img/NCAAcalcutta-example.png&#34; alt=&#34;Shiny App&#34; /&gt;&lt;/p&gt;

&lt;p&gt;There you go, you&amp;rsquo;re all ready to get started!&lt;/p&gt;

&lt;p&gt;Here are some general guidelines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When the timer runs out, the bidding is over&lt;/li&gt;
&lt;li&gt;Only allow bids in 5 point increments&lt;/li&gt;
&lt;li&gt;In the app, each team has a minimum bid; this rule can be loosened at the end if nobody has points left to spend.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is one additional rule I made up to keep things fair if someone realizes they&amp;rsquo;ve spent too much after the fact:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you bid more than you have, 10 points per overspent point will be deducted from your final score (e.g., overspending by 20 points costs you 200).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I hope you all have as much fun with this as our team does.  Comment below for any additional clarification. And happy bidding!&lt;/p&gt;

&lt;h3 id=&#34;faq&#34;&gt;FAQ&lt;/h3&gt;

&lt;p&gt;Q: How do I determine how many points each participant starts with?&lt;/p&gt;

&lt;p&gt;A: Typically it works well to (1) take the total points scored in last year&amp;rsquo;s tournament, (2) multiply that by 0.8, and (3) divide by the number of participants. You can probably round up to the nearest 25 points and you should be okay.&lt;/p&gt;
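
&lt;p&gt;As a quick sketch in R (the totals here are hypothetical):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical: ~9000 total points scored last year, 12 participants
total_points_last_year &amp;lt;- 9000
participants &amp;lt;- 12

raw &amp;lt;- 0.8 * total_points_last_year / participants
25 * ceiling(raw / 25)  # round up to the nearest 25 points
# [1] 600
&lt;/code&gt;&lt;/pre&gt;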
</description>
    </item>
    
    <item>
      <title>Exploring Models with lime</title>
      <link>https://nielsenmark.us/2018/11/09/exploring-models-with-lime/</link>
      <pubDate>Fri, 09 Nov 2018 00:00:00 +0000</pubDate>
      
      <guid>https://nielsenmark.us/2018/11/09/exploring-models-with-lime/</guid>
<description>&lt;p&gt;Recently at work I’ve been asked to help some clinicians understand why my risk model classifies specific patients as high risk. Just prior to this work I stumbled across &lt;code&gt;lime&lt;/code&gt;, the work of some data scientists at the University of Washington. &lt;a href=&#34;https://github.com/marcotcr/lime&#34;&gt;LIME&lt;/a&gt; stands for “Local Interpretable Model-Agnostic Explanations”. The idea is that I can answer those questions I’m getting from clinicians for a specific patient by locally fitting a linear (aka “interpretable”) model in the parameter space just around my data point. I decided to pursue &lt;code&gt;lime&lt;/code&gt; as a solution, and for the last few months I’ve been focusing on implementing this explainer for my risk model. Happily, I also discovered an &lt;a href=&#34;https://github.com/thomasp85/lime&#34;&gt;R package&lt;/a&gt; that implements this solution, which originated in Python.&lt;/p&gt;
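&lt;p&gt;To make the idea concrete, here is a toy sketch of the LIME concept itself (my own illustration, not the &lt;code&gt;lime&lt;/code&gt; package): perturb the data around a single point, weight the perturbations by their proximity to that point, and fit a weighted linear model to the black-box predictions. The two variables and the stand-in model &lt;code&gt;f()&lt;/code&gt; below are entirely made up.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Toy illustration of the LIME idea (not the lime package itself)
set.seed(1)
x0 &amp;lt;- c(Age = 60, RestBP = 140)  # a hypothetical patient

# Perturb the data in a neighborhood around x0
perturbed &amp;lt;- data.frame(Age    = rnorm(500, x0[&amp;quot;Age&amp;quot;], 5),
                        RestBP = rnorm(500, x0[&amp;quot;RestBP&amp;quot;], 10))

# f() stands in for the black-box model (a made-up probability surface)
f &amp;lt;- function(d) plogis(0.05 * (d$Age - 55) + 0.02 * (d$RestBP - 130))

# Weight each perturbation by its proximity to x0
w &amp;lt;- exp(-rowSums(scale(perturbed, center = x0, scale = c(5, 10))^2))

# Fit the local, interpretable model to the black-box predictions
y &amp;lt;- f(perturbed)
local_fit &amp;lt;- lm(y ~ Age + RestBP, data = perturbed, weights = w)
coef(local_fit)&lt;/code&gt;&lt;/pre&gt;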
&lt;div id=&#34;sample-data&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Sample Data&lt;/h3&gt;
&lt;p&gt;The first step for this blog was to find some public data for illustration. I remembered an example used in &lt;a href=&#34;http://www-bcf.usc.edu/~gareth/ISL/index.html&#34;&gt;An Introduction to Statistical Learning by James, Witten, Hastie and Tibshirani&lt;/a&gt;. I will use the &lt;code&gt;Heart.csv&lt;/code&gt; data, which can be downloaded using the link in the code below:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(readr)
library(ranger)
library(tidyverse)
library(lime)

dat &amp;lt;- read_csv(&amp;quot;http://www-bcf.usc.edu/~gareth/ISL/Heart.csv&amp;quot;)
dat$X1 &amp;lt;- NULL&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now let’s take a quick look at the data:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Hmisc::describe(dat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## dat 
## 
##  14  Variables      303  Observations
## ---------------------------------------------------------------------------
## Age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      303        0       41    0.999    54.44     10.3       40       42 
##      .25      .50      .75      .90      .95 
##       48       56       61       66       68 
## 
## lowest : 29 34 35 37 38, highest: 70 71 74 76 77
## ---------------------------------------------------------------------------
## Sex 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##      303        0        2    0.653      206   0.6799   0.4367 
## 
## ---------------------------------------------------------------------------
## ChestPain 
##        n  missing distinct 
##      303        0        4 
##                                                               
## Value      asymptomatic   nonanginal   nontypical      typical
## Frequency           144           86           50           23
## Proportion        0.475        0.284        0.165        0.076
## ---------------------------------------------------------------------------
## RestBP 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      303        0       50    0.995    131.7    19.41      108      110 
##      .25      .50      .75      .90      .95 
##      120      130      140      152      160 
## 
## lowest :  94 100 101 102 104, highest: 174 178 180 192 200
## ---------------------------------------------------------------------------
## Chol 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      303        0      152        1    246.7    55.91    175.1    188.8 
##      .25      .50      .75      .90      .95 
##    211.0    241.0    275.0    308.8    326.9 
## 
## lowest : 126 131 141 149 157, highest: 394 407 409 417 564
## ---------------------------------------------------------------------------
## Fbs 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##      303        0        2    0.379       45   0.1485   0.2538 
## 
## ---------------------------------------------------------------------------
## RestECG 
##        n  missing distinct     Info     Mean      Gmd 
##      303        0        3     0.76   0.9901    1.003 
##                             
## Value          0     1     2
## Frequency    151     4   148
## Proportion 0.498 0.013 0.488
## ---------------------------------------------------------------------------
## MaxHR 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      303        0       91        1    149.6    25.73    108.1    116.0 
##      .25      .50      .75      .90      .95 
##    133.5    153.0    166.0    176.6    181.9 
## 
## lowest :  71  88  90  95  96, highest: 190 192 194 195 202
## ---------------------------------------------------------------------------
## ExAng 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##      303        0        2     0.66       99   0.3267   0.4414 
## 
## ---------------------------------------------------------------------------
## Oldpeak 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      303        0       40    0.964     1.04    1.225      0.0      0.0 
##      .25      .50      .75      .90      .95 
##      0.0      0.8      1.6      2.8      3.4 
## 
## lowest : 0.0 0.1 0.2 0.3 0.4, highest: 4.0 4.2 4.4 5.6 6.2
## ---------------------------------------------------------------------------
## Slope 
##        n  missing distinct     Info     Mean      Gmd 
##      303        0        3    0.798    1.601   0.6291 
##                             
## Value          1     2     3
## Frequency    142   140    21
## Proportion 0.469 0.462 0.069
## ---------------------------------------------------------------------------
## Ca 
##        n  missing distinct     Info     Mean      Gmd 
##      299        4        4    0.783   0.6722   0.9249 
##                                   
## Value          0     1     2     3
## Frequency    176    65    38    20
## Proportion 0.589 0.217 0.127 0.067
## ---------------------------------------------------------------------------
## Thal 
##        n  missing distinct 
##      301        2        3 
##                                            
## Value           fixed     normal reversable
## Frequency          18        166        117
## Proportion      0.060      0.551      0.389
## ---------------------------------------------------------------------------
## AHD 
##        n  missing distinct 
##      303        0        2 
##                       
## Value         No   Yes
## Frequency    164   139
## Proportion 0.541 0.459
## ---------------------------------------------------------------------------&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Our target variable in this data is &lt;code&gt;AHD&lt;/code&gt;. This flag identifies whether or not a patient has &lt;a href=&#34;https://g.co/kgs/hT5ibs&#34;&gt;Coronary Artery Disease&lt;/a&gt;. If we can predict this accurately, clinicians could probably better treat these patients and hopefully help them avoid the symptoms of AHD, like chest pain or, worse, heart attacks.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-wrangling&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Data Wrangling&lt;/h3&gt;
&lt;p&gt;For a predictive model I’ve opted to use a random forest model using the &lt;code&gt;ranger&lt;/code&gt; implementation, which parallelizes the random forest algorithm in R. But first, some data cleaning is necessary. After replacing missing values, I’m going to split the data into training and test dataframes.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Replace missing values
dat$Ca[is.na(dat$Ca)] &amp;lt;- -1
dat$Thal[is.na(dat$Thal)] &amp;lt;- &amp;quot;missing&amp;quot;

## 75% of the sample size
smp_size &amp;lt;- floor(0.75 * nrow(dat))

## set the seed to make your partition reproducible
set.seed(123)
train_ind &amp;lt;- sample(seq_len(nrow(dat)), size = smp_size)

train &amp;lt;- dat[train_ind, ]
test &amp;lt;- dat[-train_ind, ]

mod &amp;lt;- ranger(AHD~., data=train, probability = TRUE, importance = &amp;quot;permutation&amp;quot;)

mod$prediction.error&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.1326235&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Our quick and dirty check of the OOB prediction error tells us that our model appears to be doing okay at predicting &lt;code&gt;AHD&lt;/code&gt;. Now the trick is to describe to our physicians and nurses why we believe someone is at high risk for &lt;code&gt;AHD&lt;/code&gt;. Before I learned of &lt;code&gt;lime&lt;/code&gt;, I would have probably done something similar to the code below by first looking at which variables were most important in my trees.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_importance &amp;lt;- function(mod){
  tmp &amp;lt;- mod$variable.importance
  dat &amp;lt;- data.frame(variable=names(tmp),importance=tmp)
  ggplot(dat, aes(x=reorder(variable,importance), y=importance))+ 
    geom_bar(stat=&amp;quot;identity&amp;quot;, position=&amp;quot;dodge&amp;quot;)+ coord_flip()+
    ylab(&amp;quot;Variable Importance&amp;quot;)+
    xlab(&amp;quot;&amp;quot;)
}

# Plot the variable importance
plot_importance(mod)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-11-09-exploring-models-with-lime_files/figure-html/plot-importance-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;After this, I probably would have taken a look at some partial dependence plots to get an idea of how the prediction changes over the range of each important variable. However, the weakness of this approach is that I need to hold all other variables constant, and if I truly believe there are interactions between my variables, the partial dependence plot could change dramatically when the other variables change.&lt;/p&gt;
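&lt;p&gt;For reference, a bare-bones version of that partial dependence calculation might look something like the sketch below (my own quick illustration, using &lt;code&gt;MaxHR&lt;/code&gt; as an example):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Quick partial dependence sketch for MaxHR
grid &amp;lt;- seq(min(train$MaxHR), max(train$MaxHR), length.out = 25)
pdp &amp;lt;- sapply(grid, function(v) {
  tmp &amp;lt;- train
  tmp$MaxHR &amp;lt;- v  # set MaxHR to the grid value for every row
  mean(predict(mod, data = tmp)$predictions[, &amp;quot;Yes&amp;quot;])
})
qplot(grid, pdp, geom = &amp;quot;line&amp;quot;,
      xlab = &amp;quot;MaxHR&amp;quot;, ylab = &amp;quot;Mean predicted P(AHD)&amp;quot;)&lt;/code&gt;&lt;/pre&gt;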
&lt;/div&gt;
&lt;div id=&#34;explain-the-model-with-lime&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Explain the model with LIME&lt;/h3&gt;
&lt;p&gt;Enter &lt;code&gt;lime&lt;/code&gt;. As discussed above, the entire purpose of &lt;code&gt;lime&lt;/code&gt; is to provide a local interpretable model to help us understand how our prediction would change if we tweaked the other variables slightly in a lot of permutations. The first step to using &lt;code&gt;lime&lt;/code&gt; in this specific case is to add some functions so that the &lt;code&gt;lime&lt;/code&gt; package knows how to deal with the output of the &lt;code&gt;ranger&lt;/code&gt; package. Once I have these I can use the combination of the &lt;code&gt;lime()&lt;/code&gt; and &lt;code&gt;explain()&lt;/code&gt; functions to get what I need. As in all multivariate linear models, we still have an issue… correlated explanatory variables. And depending on the number of variables in our original model, we may need to pare down our models to only look at the most “influential” or “important” variables. By default lime is going to use either forward selection or pick the variables with the largest coefficients after correcting for multicollinearity using ridge regression or L2 penalization. As seen below, you can also select variables for the explanation using Lasso (aka L1 penalization) or use &lt;code&gt;xgboost&lt;/code&gt;’s most important variables using the &lt;code&gt;&amp;quot;tree&amp;quot;&lt;/code&gt; method.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Train LIME Explainer
expln &amp;lt;- lime(train, model = mod)


preds &amp;lt;- predict(mod,train,type = &amp;quot;response&amp;quot;)
# Add ranger to LIME
predict_model.ranger &amp;lt;- function(x, newdata, type, ...) {
  res &amp;lt;- predict(x, data = newdata, ...)
  switch(
    type,
    raw = data.frame(Response = ifelse(res$predictions[,&amp;quot;Yes&amp;quot;] &amp;gt;= 0.5,&amp;quot;Yes&amp;quot;,&amp;quot;No&amp;quot;), stringsAsFactors = FALSE),
    prob = as.data.frame(res$predictions[,&amp;quot;Yes&amp;quot;], check.names = FALSE)
  )
}

model_type.ranger &amp;lt;- function(x, ...) &amp;#39;classification&amp;#39;


reasons.forward &amp;lt;- explain(x=test[,names(test)!=&amp;quot;AHD&amp;quot;], explainer=expln, n_labels = 1, n_features = 4)
reasons.ridge &amp;lt;- explain(x=test[,names(test)!=&amp;quot;AHD&amp;quot;], explainer=expln, n_labels = 1, n_features = 4, feature_select = &amp;quot;highest_weights&amp;quot;)
reasons.lasso &amp;lt;- explain(x=test[,names(test)!=&amp;quot;AHD&amp;quot;], explainer=expln, n_labels = 1, n_features = 4, feature_select = &amp;quot;lasso_path&amp;quot;)
reasons.tree &amp;lt;- explain(x=test[,names(test)!=&amp;quot;AHD&amp;quot;], explainer=expln, n_labels = 1, n_features = 4, feature_select = &amp;quot;tree&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note: Using the current version of &lt;code&gt;lime&lt;/code&gt; you may have issues with the &lt;code&gt;feature_select = &amp;quot;lasso_path&amp;quot;&lt;/code&gt; option. To get the above code to run, you can install my tweaked version of &lt;code&gt;lime&lt;/code&gt; from &lt;a href=&#34;https://github.com/nielsenmarkus11/lime&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
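&lt;p&gt;If you hit that issue, the fork can be installed with &lt;code&gt;devtools&lt;/code&gt; like any other GitHub package:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Install the tweaked fork of lime
devtools::install_github(&amp;#39;nielsenmarkus11/lime&amp;#39;)&lt;/code&gt;&lt;/pre&gt;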
&lt;/div&gt;
&lt;div id=&#34;plotting-explanations&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Plotting explanations&lt;/h3&gt;
&lt;p&gt;Now that we have all the explanations, one of my favorite features in the &lt;code&gt;lime&lt;/code&gt; package is the &lt;code&gt;plot_explanations()&lt;/code&gt; function. You can easily show the most important variables for each of our selection methods above, and we can see that they are all very consistent in their choice of the top 4 most influential variables in predicting &lt;code&gt;AHD&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_explanations(reasons.forward)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-11-09-exploring-models-with-lime_files/figure-html/unnamed-chunk-1-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_explanations(reasons.ridge)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-11-09-exploring-models-with-lime_files/figure-html/unnamed-chunk-1-2.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_explanations(reasons.lasso)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-11-09-exploring-models-with-lime_files/figure-html/unnamed-chunk-1-3.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_explanations(reasons.tree)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-11-09-exploring-models-with-lime_files/figure-html/unnamed-chunk-1-4.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Thanks for reading this quick tutorial on &lt;code&gt;lime&lt;/code&gt;. There is much more of this package that I want to explore, particularly its use for image and text classification. Then the only real question left is… How do I get one of those cool hex stickers for &lt;code&gt;lime&lt;/code&gt;? ;)&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Connecting R to PostgreSQL on Linux</title>
      <link>https://nielsenmark.us/2018/07/07/connecting-r-to-postgresql-on-linux/</link>
      <pubDate>Sat, 07 Jul 2018 00:00:00 +0000</pubDate>
      
      <guid>https://nielsenmark.us/2018/07/07/connecting-r-to-postgresql-on-linux/</guid>
<description>&lt;p&gt;Connecting to databases is a critical piece of data analysis in R. In most analytic roles the data we consume is going to be found in databases. Some of the most common are SQL databases like MS SQL Server, PostgreSQL, and Oracle, among many others. In this how-to blog, I’ll walk you through the major steps of configuring your machine and R to be able to connect to a PostgreSQL database from R on Ubuntu using the &lt;code&gt;RPostgreSQL&lt;/code&gt;, &lt;code&gt;odbc&lt;/code&gt;, and &lt;code&gt;RJDBC&lt;/code&gt; packages in R. Similar steps can be followed to set up connections to other databases; however, driver installation and configuration will likely be slightly different.&lt;/p&gt;
&lt;div id=&#34;rpostgresql-package-setup&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;1 - RPostgreSQL Package Setup&lt;/h3&gt;
&lt;p&gt;The first step in setting up a connection to a PostgreSQL database is to first download the PostgreSQL header files and static library, &lt;code&gt;libpq-dev&lt;/code&gt;. In order to do this on Ubuntu open the terminal and install it using the following command:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;sudo apt-get install libpq-dev&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once the &lt;code&gt;libpq-dev&lt;/code&gt; package is installed, the next step is to install the &lt;code&gt;RPostgreSQL&lt;/code&gt; package in R. If you need to authenticate, I highly recommend the &lt;code&gt;getPass&lt;/code&gt; package, which will prompt you for your password. RStudio also has a &lt;code&gt;.rs.askForPassword()&lt;/code&gt; function that works similarly to the &lt;code&gt;getPass()&lt;/code&gt; function, but it relies on using RStudio. I’ve confirmed that &lt;code&gt;getPass&lt;/code&gt; works in bash, emacs, RStudio, and when knitting your Rmd files, so however you submit your R code, it will work the same.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Install the package in R
install.packages(&amp;quot;RPostgreSQL&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(RPostgreSQL)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Loading required package: DBI&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(getPass)
pgdrv &amp;lt;- dbDriver(drvName = &amp;quot;PostgreSQL&amp;quot;)

db &amp;lt;-DBI::dbConnect(pgdrv,
                    dbname=&amp;quot;postgres&amp;quot;,
                    host=&amp;quot;localhost&amp;quot;, port=5432,
                    user = &amp;#39;postgres&amp;#39;,
                    password = getPass(&amp;quot;Enter Password:&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Please enter password in TK window (Alt+Tab)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Write to database
DBI::dbWriteTable(db, &amp;quot;mtcars&amp;quot;, mtcars)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] TRUE&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;DBI::dbDisconnect(db)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] TRUE&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Perfect! Your database connection should work simply by passing the proper arguments to your &lt;code&gt;dbConnect()&lt;/code&gt; function. You may need to tweak the host, port, and user based on your PostgreSQL server setup.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;odbc-package-setup&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;2 - odbc Package Setup&lt;/h3&gt;
&lt;p&gt;In case you are a fan of &lt;code&gt;odbc&lt;/code&gt;, the next section will walk you through the steps of creating your database connection via &lt;code&gt;odbc&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;In the past I have used the &lt;code&gt;RODBC&lt;/code&gt; package, but recently I have found that the &lt;code&gt;odbc&lt;/code&gt; package plays much nicer with other database tools like &lt;code&gt;DBI&lt;/code&gt; and &lt;code&gt;dbplyr&lt;/code&gt;. Plus it has very similar syntax to the &lt;code&gt;RJDBC&lt;/code&gt; package, and for consistency’s sake I’ve made the switch.&lt;/p&gt;
&lt;p&gt;Once again, the first step is to install the necessary Debian packages. In this case we need to install the &lt;code&gt;unixodbc&lt;/code&gt; and &lt;code&gt;unixodbc-dev&lt;/code&gt; packages and the &lt;code&gt;odbc-postgresql&lt;/code&gt; driver.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;# Install the unixODBC library
apt-get install unixodbc unixodbc-dev

# PostgreSQL ODBC Drivers
apt-get install odbc-postgresql&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;set-up-connection-with-connection-string&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Set up connection with connection string&lt;/h4&gt;
&lt;p&gt;Okay we are now ready to connect via odbc. Note the slight difference in the names of the arguments of the &lt;code&gt;dbConnect()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;db &amp;lt;- DBI::dbConnect(odbc::odbc(),
                     Driver = &amp;quot;PostgreSQL Unicode&amp;quot;,
                     Database = &amp;quot;postgres&amp;quot;,
                     UserName = &amp;quot;postgres&amp;quot;,
                     Password = getPass(&amp;quot;Enter Password:&amp;quot;),
                     Servername = &amp;quot;localhost&amp;quot;,
                     Port = 5432)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Please enter password in TK window (Alt+Tab)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;set-up-connection-with-dsn&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Set up connection with DSN&lt;/h4&gt;
&lt;p&gt;If you don’t want to have to worry about defining each of these arguments each time you connect to PostgreSQL via odbc, you can define the configuration in your &lt;code&gt;odbcinst.ini&lt;/code&gt; file. The following steps walk you through the process:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;Make sure the &lt;code&gt;/etc/odbcinst.ini&lt;/code&gt; has the drivers set up. This should have been configured automatically when installing &lt;code&gt;odbc-postgresql&lt;/code&gt; with &lt;code&gt;apt-get&lt;/code&gt;. This is what it would look like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[PostgreSQL Unicode]
Driver = psqlodbca.so
Setup = libodbcpsqlS.so
Debug = 0
CommLog = 1
UsageCount = 1&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Now define your DSN by modifying the odbc.ini file:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[PostgreSQL]
Driver = PostgreSQL Unicode
Database = postgres
Servername = localhost
UserName = postgres
Password = postgres&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Connect to your database by referencing your DSN name specified in the square brackets of the &lt;code&gt;odbc.ini&lt;/code&gt; file:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Connect to the database
db &amp;lt;- dbConnect(odbc::odbc(), &amp;quot;PostgreSQL&amp;quot;)

# Pull the Data into an R dataframe
DBI::dbGetQuery(db,&amp;quot;SELECT * FROM MTCARS&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##              row.names  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1            Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## 2        Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## 3           Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## 4       Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## 5    Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## 6              Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## 7           Duster 360 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## 8            Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## 9             Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## 10            Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## 11           Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## 12          Merc 450SE 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## 13          Merc 450SL 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## 14         Merc 450SLC 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## 15  Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## 16 Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## 17   Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## 18            Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## 19         Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## 20      Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## 21       Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## 22    Dodge Challenger 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## 23         AMC Javelin 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## 24          Camaro Z28 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## 25    Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## 26           Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## 27       Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## 28        Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## 29      Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## 30        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## 31       Maserati Bora 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## 32          Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Close the Connection
DBI::dbDisconnect(db)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now you are ready to begin your analysis with your data!&lt;/p&gt;
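&lt;p&gt;For example, since we loaded &lt;code&gt;mtcars&lt;/code&gt; earlier, a quick &lt;code&gt;dbplyr&lt;/code&gt; sketch against the DSN connection might look like this:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dplyr)

db &amp;lt;- DBI::dbConnect(odbc::odbc(), &amp;quot;PostgreSQL&amp;quot;)

# Summarise in the database, then pull the result into R
tbl(db, &amp;quot;mtcars&amp;quot;) %&amp;gt;%
  group_by(cyl) %&amp;gt;%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %&amp;gt;%
  collect()

DBI::dbDisconnect(db)&lt;/code&gt;&lt;/pre&gt;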
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;rjdbc-package-setup&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;3 - RJDBC Package Setup&lt;/h3&gt;
&lt;p&gt;Finally, the last way to connect to the PostgreSQL database is via the &lt;code&gt;RJDBC&lt;/code&gt; package. The first step in this configuration is to download the JDBC jar file from &lt;a href=&#34;https://jdbc.postgresql.org/download.html&#34;&gt;here&lt;/a&gt;. I’ve put this in my home directory, &lt;code&gt;~&lt;/code&gt;, and will reference this file in the &lt;code&gt;JDBC()&lt;/code&gt; function below. Once you have the jar file you can install the &lt;code&gt;RJDBC&lt;/code&gt; package in R.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;install.packages(&amp;#39;RJDBC&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now you are ready to connect. Once again, notice the slight tweaks to the arguments of the &lt;code&gt;dbConnect()&lt;/code&gt; function. Because I’m defining the &lt;code&gt;url&lt;/code&gt; argument with the host, port and database name, there is no need for these additional arguments.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(RJDBC)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Loading required package: rJava&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;db &amp;lt;- DBI::dbConnect(RJDBC::JDBC(&amp;quot;org.postgresql.Driver&amp;quot;,&amp;quot;~/postgresql-42.2.2.jar&amp;quot;),
               url = &amp;quot;jdbc:postgresql://localhost:5432/postgres&amp;quot;,
               user = &amp;quot;postgres&amp;quot;,
               password = getPass(&amp;quot;Enter Password:&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Please enter password in TK window (Alt+Tab)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Pull the Data into an R dataframe
DBI::dbGetQuery(db,&amp;quot;SELECT * FROM MTCARS&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##              row.names  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1            Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## 2        Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## 3           Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## 4       Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## 5    Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## 6              Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## 7           Duster 360 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## 8            Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## 9             Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## 10            Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## 11           Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## 12          Merc 450SE 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## 13          Merc 450SL 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## 14         Merc 450SLC 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## 15  Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## 16 Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## 17   Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## 18            Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## 19         Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## 20      Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## 21       Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## 22    Dodge Challenger 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## 23         AMC Javelin 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## 24          Camaro Z28 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## 25    Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## 26           Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## 27       Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## 28        Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## 29      Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## 30        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## 31       Maserati Bora 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## 32          Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Close the Connection
DBI::dbDisconnect(db)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] TRUE&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Alright! We’ve walked through several different configurations in connecting to a PostgreSQL database on Ubuntu. You’ll only need one of these setups, but I think it’s nice to understand each of your options so you can create the best setup that works for you and/or your organization.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Setting up an ODBC connection with MS SQL Server on Windows</title>
      <link>https://nielsenmark.us/2018/06/01/odbc-ms-sql-server-windows/</link>
      <pubDate>Fri, 01 Jun 2018 00:00:00 +0000</pubDate>
      
      <guid>https://nielsenmark.us/2018/06/01/odbc-ms-sql-server-windows/</guid>
<description>&lt;p&gt;Connecting to databases is a critical piece of data analysis in R. In most analytic roles the data we consume is going to be found in databases. Some of the most common are SQL databases like MS SQL Server, PostgreSQL, and Oracle, among many others. In this how-to blog, I’ll walk you through the major steps of configuring your machine and R to be able to connect to an MS SQL Server database from R on Windows. Similar steps can be followed to set up connections to other databases; however, driver installation and configuration will likely be slightly different.&lt;/p&gt;
&lt;div id=&#34;downloading-and-installing-the-drivers&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Downloading and Installing the Drivers&lt;/h3&gt;
&lt;p&gt;The first step is to download the necessary ODBC drivers for your database. Because most Windows installations come with the MS SQL Server drivers installed, we’ll breeze over this step. If you don’t have them installed you can follow the directions &lt;a href=&#34;https://docs.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server?view=sql-server-2017&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;setting-up-a-dsn-for-your-odbc-connection&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Setting up a DSN for your ODBC Connection&lt;/h3&gt;
&lt;p&gt;This step is not necessary, but I have found that configuring a DSN (aka “Data Source Name”) can simplify your code configuration in R.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;STEP 1:&lt;/strong&gt; Search “ODBC” in the Start Menu search and open “ODBC Data Source Administrator (64-bit)”.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt; Select “Add” under the “User DSN” tab.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/img/ODBC%201.PNG&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 3:&lt;/strong&gt; Select the corresponding ODBC driver for which you wish to set up a data source and Click “Finish”.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/img/ODBC%202.PNG&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 4:&lt;/strong&gt; Give your DSN a “Name” and “Server” name/IP address and click “Next”.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/img/ODBC%203.PNG&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 5:&lt;/strong&gt; Define your default database and click “Next”.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/img/ODBC%204b.PNG&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 6:&lt;/strong&gt; Click “Next” through any remaining windows, then click “Finish”. A window should pop up to test the connection. Double check your options then click “Test Data Source”.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/img/ODBC%204.PNG&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 7:&lt;/strong&gt; If it was successful it should give you the following message. Click “OK”.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/img/ODBC%205.PNG&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 8:&lt;/strong&gt; Finally you should see your newly defined DSN listed under the “User DSN” tab. Click “OK” to exit the ODBC DSN configuration tool.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/img/ODBC%206.PNG&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;install-the-odbc-package-in-r&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Install the &lt;code&gt;odbc&lt;/code&gt; Package in R&lt;/h3&gt;
&lt;p&gt;In the past I have used the &lt;code&gt;RODBC&lt;/code&gt; package, but recently I have found that the &lt;code&gt;odbc&lt;/code&gt; package plays much nicer with other database tools like &lt;code&gt;DBI&lt;/code&gt; and &lt;code&gt;dbplyr&lt;/code&gt;. Plus it has very similar syntax to the &lt;code&gt;RJDBC&lt;/code&gt; package, and for consistency’s sake I’ve made the switch.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;install.packages(&amp;#39;odbc&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;connecting-to-the-database-from-r&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Connecting to the Database from R&lt;/h3&gt;
&lt;p&gt;Alright, we are ready to make our connection… drum-roll please. To start, let’s make our connection using the DSN configuration we set up earlier.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(odbc)
library(dplyr)
library(dbplyr)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Connect using the DSN
db &amp;lt;- DBI::dbConnect(odbc::odbc(), &amp;quot;SQL&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That was easy! Now we’re ready to roll with our data. If you opted out of creating a DSN, the code below is what you would use to connect. There are a lot more keystrokes, but the bonus is that no additional setup is needed outside of R, which can be handy when you are sharing your code with coworkers who want to connect to the database too.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Connect without a DSN
db &amp;lt;- DBI::dbConnect(odbc::odbc(),
                     Driver = &amp;#39;ODBC Driver 13 for SQL Server&amp;#39;,
                     Server = &amp;#39;localhost\\SQLEXPRESS&amp;#39;,
                     Database = &amp;quot;master&amp;quot;,
                     trusted_connection = &amp;#39;yes&amp;#39;,
                     Port = 1433
                     )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Okay, now that we are connected we are ready to get started on our analysis. We can read/write data to the database using the following commands:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Write iris data to MS SQL Server
# DBI::dbWriteTable(db,&amp;quot;iris&amp;quot;,iris)

# Read data from MS SQL Server
my.iris &amp;lt;- DBI::dbGetQuery(db,&amp;quot;SELECT * FROM IRIS&amp;quot;)
head(my.iris)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, use the &lt;code&gt;dbplyr&lt;/code&gt; package to extend the &lt;code&gt;dplyr&lt;/code&gt; functions to our database connection.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;smry &amp;lt;- tbl(db,&amp;quot;iris&amp;quot;) %&amp;gt;% collect
head(smry)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 6 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          &amp;lt;dbl&amp;gt;       &amp;lt;dbl&amp;gt;        &amp;lt;dbl&amp;gt;       &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## 6          5.4         3.9          1.7         0.4 setosa&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Don&amp;#39;t forget to disconnect
dbDisconnect(db)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Forecasting PM2.5 with forecast and prophet</title>
      <link>https://nielsenmark.us/2018/02/21/forecasting-pm2-5-with-forecast-and-prophet/</link>
      <pubDate>Wed, 21 Feb 2018 00:00:00 +0000</pubDate>
      
      <guid>https://nielsenmark.us/2018/02/21/forecasting-pm2-5-with-forecast-and-prophet/</guid>
      <description>&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/htmlwidgets/htmlwidgets.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/jquery/jquery.min.js&#34;&gt;&lt;/script&gt;
&lt;link href=&#34;https://nielsenmark.us/rmarkdown-libs/leaflet/leaflet.css&#34; rel=&#34;stylesheet&#34; /&gt;
&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/leaflet/leaflet.js&#34;&gt;&lt;/script&gt;
&lt;link href=&#34;https://nielsenmark.us/rmarkdown-libs/leafletfix/leafletfix.css&#34; rel=&#34;stylesheet&#34; /&gt;
&lt;link href=&#34;https://nielsenmark.us/rmarkdown-libs/leaflet-label/leaflet.label.css&#34; rel=&#34;stylesheet&#34; /&gt;
&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/leaflet-label/leaflet.label.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/Proj4Leaflet/proj4-compressed.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/Proj4Leaflet/proj4leaflet.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/leaflet-binding/leaflet.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/leaflet-providers/leaflet-providers.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/leaflet-providers-plugin/leaflet-providers-plugin.js&#34;&gt;&lt;/script&gt;
&lt;link href=&#34;https://nielsenmark.us/rmarkdown-libs/leaflet-awesomemarkers/leaflet.awesome-markers.css&#34; rel=&#34;stylesheet&#34; /&gt;
&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/leaflet-awesomemarkers/leaflet.awesome-markers.min.js&#34;&gt;&lt;/script&gt;
&lt;link href=&#34;https://nielsenmark.us/rmarkdown-libs/bootstrap/bootstrap.min.css&#34; rel=&#34;stylesheet&#34; /&gt;
&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/bootstrap/bootstrap.min.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;Time series: the course I often wish I had taken while completing my coursework in school. I finally got an excuse to do a comparative dive into the different time series models in the &lt;code&gt;forecast&lt;/code&gt; package in R, thanks to an invitation to present at a recent Practical Data Science Meetup in Salt Lake City.&lt;/p&gt;
&lt;p&gt;In the following exercises, I’ll be comparing OLS and Random Forest Regression to the time series models available in the &lt;code&gt;forecast&lt;/code&gt; package. In addition to this I’ll be taking a look at the fairly new &lt;code&gt;prophet&lt;/code&gt; package released by Facebook for R. Alright, let’s load some packages to get started.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(gridExtra)
library(lubridate)
library(leaflet)
library(randomForest)
library(forecast)
library(prophet)

load(&amp;quot;../../../time-series/data/ts-dat.Rdat&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;data-collection&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Data Collection&lt;/h3&gt;
&lt;p&gt;The pollution data I’ll be using for these examples comes from epa.gov and the weather data comes from ncdc.noaa.gov. You can access my R data object on &lt;a href=&#34;https://github.com/nielsenmarkus11/time-series&#34;&gt;my github&lt;/a&gt; page. For many years Salt Lake City has experienced population growth, which has exacerbated the inversion problem. Inversion creates a “cap” over Utah valleys, trapping pollutants on the valley floors, which creates many public health issues because of the thick smog.&lt;/p&gt;
&lt;p&gt;Below is a map indicating 4 sites where data is being collected on pollution levels. I will be focusing particularly on PM2.5 measures across the Salt Lake Valley. I’ve also downloaded weather data from both the valley floor at SLC International Airport and a meadow near Grand View peak in the Wasatch mountains. These two sites’ temperatures can be used to determine whether the temperatures are inverted.&lt;/p&gt;
&lt;div id=&#34;htmlwidget-1&#34; style=&#34;width:672px;height:480px;&#34; class=&#34;leaflet html-widget&#34;&gt;&lt;/div&gt;
&lt;script type=&#34;application/json&#34; data-for=&#34;htmlwidget-1&#34;&gt;{&#34;x&#34;:{&#34;options&#34;:{&#34;crs&#34;:{&#34;crsClass&#34;:&#34;L.CRS.EPSG3857&#34;,&#34;code&#34;:null,&#34;proj4def&#34;:null,&#34;projectedBounds&#34;:null,&#34;options&#34;:{}}},&#34;calls&#34;:[{&#34;method&#34;:&#34;addProviderTiles&#34;,&#34;args&#34;:[&#34;Esri.NatGeoWorldMap&#34;,null,null,{&#34;errorTileUrl&#34;:&#34;&#34;,&#34;noWrap&#34;:false,&#34;zIndex&#34;:null,&#34;unloadInvisibleTiles&#34;:null,&#34;updateWhenIdle&#34;:null,&#34;detectRetina&#34;:false,&#34;reuseTiles&#34;:false}]},{&#34;method&#34;:&#34;addAwesomeMarkers&#34;,&#34;args&#34;:[[40.83,40.7781],[-111.76,-111.9694],{&#34;icon&#34;:&#34;tint&#34;,&#34;markerColor&#34;:&#34;white&#34;,&#34;iconColor&#34;:&#34;blue&#34;,&#34;spin&#34;:false,&#34;squareMarker&#34;:false,&#34;iconRotate&#34;:0,&#34;font&#34;:&#34;monospace&#34;,&#34;prefix&#34;:&#34;glyphicon&#34;},null,null,{&#34;clickable&#34;:true,&#34;draggable&#34;:false,&#34;keyboard&#34;:true,&#34;title&#34;:&#34;&#34;,&#34;alt&#34;:&#34;&#34;,&#34;zIndexOffset&#34;:0,&#34;opacity&#34;:1,&#34;riseOnHover&#34;:false,&#34;riseOffset&#34;:250},null,null,null,null,null,null,null]},{&#34;method&#34;:&#34;addAwesomeMarkers&#34;,&#34;args&#34;:[[40.708611,40.736389,40.78422],[-112.094722,-111.872222,-111.931],{&#34;icon&#34;:&#34;cloud&#34;,&#34;markerColor&#34;:&#34;black&#34;,&#34;iconColor&#34;:&#34;gray&#34;,&#34;spin&#34;:false,&#34;squareMarker&#34;:false,&#34;iconRotate&#34;:0,&#34;font&#34;:&#34;monospace&#34;,&#34;prefix&#34;:&#34;glyphicon&#34;},null,null,{&#34;clickable&#34;:true,&#34;draggable&#34;:false,&#34;keyboard&#34;:true,&#34;title&#34;:&#34;&#34;,&#34;alt&#34;:&#34;&#34;,&#34;zIndexOffset&#34;:0,&#34;opacity&#34;:1,&#34;riseOnHover&#34;:false,&#34;riseOffset&#34;:250},null,null,null,null,[&#34;490351001&#34;,&#34;490353006&#34;,&#34;490353010&#34;],null,null]}],&#34;setView&#34;:[[40.7,-112],10,[]],&#34;limits&#34;:{&#34;lat&#34;:[40.708611,40.83],&#34;lng&#34;:[-112.094722,-111.76]}},&#34;evals&#34;:[],&#34;jsHooks&#34;:[]}&lt;/script&gt;
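&lt;p&gt;As a rough sketch, the &lt;code&gt;inversion&lt;/code&gt; flag used in the models below could be derived by comparing the two temperature series (the column names here are illustrative, not the actual names in my data object):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical column names: flag days when the valley floor (airport)
# is colder than the mountain meadow site
dat$inversion &amp;lt;- as.numeric(dat$temp_airport &amp;lt; dat$temp_mountain)&lt;/code&gt;&lt;/pre&gt;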
&lt;/div&gt;
&lt;div id=&#34;ols-regression&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;OLS Regression&lt;/h3&gt;
&lt;p&gt;First, let’s take a look at how well our weather regressors are at predicting PM2.5 levels without considering autocorrelation or seasonality. Below, we will fit our model and look at our residuals to make sure our assumptions of normality and independence are met:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit1 &amp;lt;- lm(sqrt(pm2.5)~inversion+wind+precip+fireworks,data=dat)
summary(fit1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Call:
## lm(formula = sqrt(pm2.5) ~ inversion + wind + precip + fireworks, 
##     data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1571 -0.5555 -0.1835  0.3608  4.4629 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(&amp;gt;|t|)    
## (Intercept)  3.322431   0.066358  50.068  &amp;lt; 2e-16 ***
## inversion    2.527237   0.130122  19.422  &amp;lt; 2e-16 ***
## wind        -0.040543   0.003255 -12.454  &amp;lt; 2e-16 ***
## precip      -0.515741   0.175563  -2.938  0.00336 ** 
## fireworks    0.545624   0.116089   4.700 2.85e-06 ***
## ---
## Signif. codes:  0 &amp;#39;***&amp;#39; 0.001 &amp;#39;**&amp;#39; 0.01 &amp;#39;*&amp;#39; 0.05 &amp;#39;.&amp;#39; 0.1 &amp;#39; &amp;#39; 1
## 
## Residual standard error: 0.8791 on 1456 degrees of freedom
## Multiple R-squared:  0.3165, Adjusted R-squared:  0.3146 
## F-statistic: 168.5 on 4 and 1456 DF,  p-value: &amp;lt; 2.2e-16&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat$resid[!is.na(dat$pm2.5)] &amp;lt;- resid(fit1)

# Plot the residuals
ggplot(dat,aes(date,resid)) + 
  geom_point() + geom_smooth() +
  ggtitle(&amp;quot;Linear Regression Residuals&amp;quot;,
          subtitle = paste0(&amp;quot;RMSE: &amp;quot;,round(sqrt(mean(dat$resid^2,na.rm=TRUE)),2)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-02-19-forecasting-pm2-5-with-forecast-and-prophet_files/figure-html/reg-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Okay, so when we review the model we see that the variables are somewhat useful in predicting PM2.5 levels; however, our R-squared values are not that impressive. Also, looking at our residuals, we can see that there is still something going on that we haven’t accounted for. There appears to be a yearly pattern in the residuals. As for investigating dependence between the PM2.5 data points, let’s use the autocorrelation function, &lt;code&gt;Acf()&lt;/code&gt;, available in the &lt;code&gt;forecast&lt;/code&gt; package:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Acf(dat$resid, main=&amp;quot;ACF of OLS Residuals&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-02-19-forecasting-pm2-5-with-forecast-and-prophet_files/figure-html/acf-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Here we can see that the data is correlated up through 20 or more days in the past. This definitely violates our assumption of independence.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;random-forest-regression&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Random Forest Regression&lt;/h3&gt;
&lt;p&gt;Random Forest models don’t have as many assumptions as OLS Regression, so let’s try this model to see if we can do any better. Initially I’ll be using the training Root Mean Squared Error (RMSE) to compare models. However, later I will use time series cross-validation RMSE to compare each method’s ability to predict future PM2.5 levels.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit2 &amp;lt;- randomForest(sqrt(pm2.5)~inversion+wind+precip+fireworks,data=dat[!is.na(dat$pm2.5),], ntree=500)
dat$rf.resid[!is.na(dat$pm2.5)] &amp;lt;- fit2$predicted - sqrt(dat$pm2.5[!is.na(dat$pm2.5)])

# Plot the residuals
ggplot(dat,aes(date,rf.resid)) + 
  geom_point() + geom_smooth() +
  ggtitle(&amp;quot;Random Forest Residuals&amp;quot;,
          subtitle = paste0(&amp;quot;RMSE: &amp;quot;,round(sqrt(fit2$mse[500]),2)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-02-19-forecasting-pm2-5-with-forecast-and-prophet_files/figure-html/rf-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Better but we still have some odd things going on in our data&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once again, after looking at the residuals it looks like something is still going on here: there appears to be a seasonal trend in our residuals. Let’s zoom in on the residual plots over time and take a look:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Zoom In
p1 &amp;lt;- ggplot(dat,aes(date,rf.resid)) + 
  geom_point() + geom_line() +
  xlim(as.Date(c(&amp;quot;2014-01-01&amp;quot;,&amp;quot;2014-02-28&amp;quot;))) + 
  geom_abline(slope=0, intercept = 0, lty=2, col = &amp;quot;blue&amp;quot;, lwd = 1.25)

p2 &amp;lt;- ggplot(dat,aes(date,rf.resid)) + 
  geom_point() + geom_line() +
  xlim(as.Date(c(&amp;quot;2017-11-01&amp;quot;,&amp;quot;2017-12-31&amp;quot;))) + 
  geom_abline(slope=0, intercept = 0, lty=2, col = &amp;quot;blue&amp;quot;, lwd = 1.25)


grid.arrange(p1, p2, ncol=2, top=&amp;quot;Zoom-in of Random Forest Residuals&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-02-19-forecasting-pm2-5-with-forecast-and-prophet_files/figure-html/rf-zoom-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;If you look closely it appears that the residuals are all negative for a time and then move to being all positive. From this we see that we still haven’t adjusted our model for the autocorrelation. To do this we’ll need to take a look at some time series models.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;exponential-smoothing&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Exponential Smoothing&lt;/h3&gt;
&lt;p&gt;Okay, let’s get started with one of the simpler time series models, exponential smoothing. This is done by first converting our target column to a time series object using the &lt;code&gt;ts()&lt;/code&gt; function. The &lt;code&gt;ts()&lt;/code&gt; function also allows us to include a seasonal component in our data. We’ll start by setting &lt;code&gt;frequency = 7&lt;/code&gt; to include weekly seasonality in our daily PM2.5 measures. In this exercise, I will be fitting 3 different models. The default &lt;code&gt;model&lt;/code&gt; argument is set to &lt;code&gt;&#39;ZZZ&#39;&lt;/code&gt;, which will choose additive (&lt;code&gt;&#39;A&#39;&lt;/code&gt;), multiplicative (&lt;code&gt;&#39;M&#39;&lt;/code&gt;), or none (&lt;code&gt;&#39;N&#39;&lt;/code&gt;) for each of the error, trend, and seasonality components. Our automated model has chosen &lt;code&gt;&#39;MAN&#39;&lt;/code&gt;. Notice that this essentially removes the weekly seasonality, which can be seen in the forecast below. I also fit all-additive and all-multiplicative models for comparison.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Convert to time series data
dat.ts &amp;lt;- sqrt(ts(dat[,&amp;quot;pm2.5&amp;quot;], frequency = 7))

# Exponential smoothing model with weekly seasonality
fit3 &amp;lt;- ets(dat.ts) # model = &amp;quot;MAN&amp;quot;
# Fit models with all additive or all multiplicative components.
# The first letter is for errors, the second for trend, and the third for seasonality
fit4a &amp;lt;- ets(dat.ts, model = &amp;quot;AAA&amp;quot;)
fit4b &amp;lt;- ets(dat.ts, model = &amp;quot;MMM&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
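&lt;p&gt;If you’re curious which configuration the automated fit actually selected, the fitted &lt;code&gt;ets&lt;/code&gt; object stores it. A quick check (a small aside of mine, not part of the original output):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Inspect the automatically selected configuration
fit3$method   # e.g. &amp;quot;ETS(M,A,N)&amp;quot;
summary(fit3) # smoothing parameters and fit statistics&lt;/code&gt;&lt;/pre&gt;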
&lt;p&gt;Notice that, as with linear models, the &lt;code&gt;predict()&lt;/code&gt; function is available, and it can also forecast future values from previous values when you add an argument for the horizon, &lt;code&gt;h&lt;/code&gt;. Below, I’m using the automated &lt;code&gt;ets&lt;/code&gt; model to predict 25 days into the future:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Predict Future Values
plot(predict(fit3,h=25),xlim=c(200,215))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-02-19-forecasting-pm2-5-with-forecast-and-prophet_files/figure-html/ets-forecast-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now going back to our 3 models, we can take a look at the residuals now that we are adjusting for autocorrelation and weekly seasonality:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ets.mod &amp;lt;- rbind(data.frame(day=1:sum(!is.na(dat.ts)), resid=as.numeric(residuals(fit3)), type=&amp;quot;Auto&amp;quot;),
                 data.frame(day=1:sum(!is.na(dat.ts)), resid=as.numeric(residuals(fit4a)), type=&amp;quot;Additive&amp;quot;),
                 data.frame(day=1:sum(!is.na(dat.ts)), resid=as.numeric(residuals(fit4b)), type=&amp;quot;Multiplicative&amp;quot;))

# Compare the residuals of each model
ggplot(ets.mod,aes(day,resid)) + 
  geom_point() + geom_smooth() + 
  facet_grid(type~.,scales=&amp;quot;free&amp;quot;)+
  ggtitle(&amp;quot;ETS Residuals with Weekly Seasonality&amp;quot;,
          subtitle = paste0(&amp;quot;Auto RMSE: &amp;quot;,round(sqrt(fit3$mse),2),
                            &amp;quot;   Additive RMSE: &amp;quot;,round(sqrt(fit4a$mse),2),
                            &amp;quot;   Multiplicative RMSE: &amp;quot;,round(sqrt(fit4b$mse),2)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-02-19-forecasting-pm2-5-with-forecast-and-prophet_files/figure-html/ets-resid-1.png&#34; width=&#34;672&#34; /&gt; There we go! Our residuals look much better, though there still appears to be some yearly seasonality that we can incorporate using more sophisticated time series models. Let’s start with Rob Hyndman’s implementation of the TBATS model.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;tbats-trigonometric-regressors-box-cox-transformations-arma-errors-trend-seasonality&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;TBATS (Trigonometric regressors, Box-Cox transformations, ARMA errors, Trend, Seasonality)&lt;/h3&gt;
&lt;p&gt;Using the TBATS model is one way to incorporate multiple seasonalities in our model. It automates the choice of a Box-Cox transformation for our target variable, PM2.5. You may have noticed that I’ve been taking the square root of PM2.5 in each of the previous models; this was in part due to the recommended Box-Cox parameter of 0.5 that came out of this model when I was first playing around with the &lt;code&gt;tbats()&lt;/code&gt; function. The function also automatically chooses the ARMA parameters and the Fourier terms for the seasonal trends.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# TBATS model with weekly and yearly seasonality
dat.ts2 &amp;lt;- sqrt(msts(dat[!is.na(dat$pm2.5),&amp;quot;pm2.5&amp;quot;], seasonal.periods=c(7,365.25)))
fit5 &amp;lt;- tbats(dat.ts2)
# This method takes the longest to run of the approaches compared here.
# Downside: you cannot set specific Box-Cox, ARMA, and Fourier parameters.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This time series model is easy to use and can be extremely useful when modeling multiple seasonalities and autoregressive features. I do wish the &lt;code&gt;tbats()&lt;/code&gt; function would allow you to pass specific Box-Cox, ARMA, and Fourier parameters for your model; that would make cross-validation more convenient by letting me fix the model specification for each window.&lt;/p&gt;
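&lt;p&gt;In the meantime, you can at least inspect what was chosen after the fact. A quick peek at a few components of the fitted object (a sketch based on my reading of the &lt;code&gt;forecast&lt;/code&gt; package’s tbats object):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Inspect the parameters tbats() chose automatically
fit5$lambda           # Box-Cox parameter
fit5$seasonal.periods # seasonal periods used in the fit
fit5$k.vector         # number of Fourier term pairs per seasonal period&lt;/code&gt;&lt;/pre&gt;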
&lt;p&gt;Once again, you can see that predicting future values is made very easy with the &lt;code&gt;predict()&lt;/code&gt; function and &lt;code&gt;h&lt;/code&gt; parameter.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Predict future values
plot(predict(fit5, h=25),xlim=c(4.8,5.2))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-02-19-forecasting-pm2-5-with-forecast-and-prophet_files/figure-html/tbats-predict-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Lastly, let’s look at the residuals and see if adding both yearly and weekly seasonality has improved our predictions:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Plot the residuals
tbats.mod &amp;lt;- data.frame(day=1:sum(!is.na(dat.ts2)),resid=as.numeric(residuals(fit5)))
ggplot(tbats.mod,aes(day,resid)) + 
  geom_point() + geom_smooth() + 
  ggtitle(&amp;quot;TBATS Resids with Dual Seasonality&amp;quot;,
          subtitle = paste0(&amp;quot;Auto RMSE: &amp;quot;,round(sqrt(mean((residuals(fit5))^2)),2)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-02-19-forecasting-pm2-5-with-forecast-and-prophet_files/figure-html/tbats-resid-1.png&#34; width=&#34;672&#34; /&gt; Wow! This looks much better. A random cloud of points around the line &lt;code&gt;y = 0&lt;/code&gt; is typically what we are looking for in a good model fit. Notice also that the training RMSE is much better for this model.&lt;/p&gt;
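&lt;p&gt;If you want a more formal residual diagnostic than eyeballing the scatter, the &lt;code&gt;forecast&lt;/code&gt; package’s &lt;code&gt;checkresiduals()&lt;/code&gt; function is a handy aside here:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Residual ACF, histogram, and a Ljung-Box test in one call
checkresiduals(fit5)&lt;/code&gt;&lt;/pre&gt;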
&lt;/div&gt;
&lt;div id=&#34;arima-with-regressors-autoregressive-integraged-moving-average&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;ARIMA with Regressors (AutoRegressive Integrated Moving Average)&lt;/h3&gt;
&lt;p&gt;The last piece of the time series puzzle is being able to add regressors on top of the multiple-seasonality and autocorrelation adjustments. The &lt;code&gt;auto.arima()&lt;/code&gt; function can include all of these in the model via the &lt;code&gt;fourier()&lt;/code&gt; transform function and the &lt;code&gt;xreg&lt;/code&gt; argument.&lt;/p&gt;
&lt;p&gt;In this portion of the exercise, because my regressors are themselves time series, I need to forecast each of those regressors before using them to forecast the PM2.5 level.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# ARIMA with weekly and yearly seasonality with regressors
regs &amp;lt;- dat[!is.na(dat$pm2.5),c(&amp;quot;precip&amp;quot;,&amp;quot;wind&amp;quot;,&amp;quot;inversion&amp;quot;,&amp;quot;fireworks&amp;quot;)]

# Forecast weather regressors
weather.ts &amp;lt;- msts(dat[,c(&amp;quot;precip&amp;quot;,&amp;quot;wind&amp;quot;,&amp;quot;inversion_diff&amp;quot;)],seasonal.periods = c(7,365.25))
precip &amp;lt;- auto.arima(weather.ts[,1])
fprecip &amp;lt;- as.numeric(data.frame(forecast(precip,h=25))$Point.Forecast)
wind &amp;lt;- auto.arima(weather.ts[,2])
fwind &amp;lt;- as.numeric(data.frame(forecast(wind,h=25))$Point.Forecast)
inversion &amp;lt;- auto.arima(weather.ts[,3])
finversion &amp;lt;- as.numeric(data.frame(forecast(inversion,h=25))$Point.Forecast)

fregs &amp;lt;- data.frame(precip=fprecip,wind=fwind,inversion=as.numeric(finversion&amp;lt;0),fireworks=0)

# Seasonality
z &amp;lt;- fourier(dat.ts2, K=c(2,5))
zf &amp;lt;- fourier(dat.ts2, K=c(2,5), h=25)

# Fit the model
fit &amp;lt;- auto.arima(dat.ts2, xreg=cbind(z,regs), seasonal=FALSE)

# Predict Future Values
# This time we need future values of the regressors as well.
fc &amp;lt;- forecast(fit, xreg=cbind(zf,fregs), h=25)
plot(fc,xlim=c(4.8,5.2))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-02-19-forecasting-pm2-5-with-forecast-and-prophet_files/figure-html/unnamed-chunk-1-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Again, the residuals (plotted below) look much better than those from the OLS and Random Forest regression models.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Plot the residuals
arima.mod &amp;lt;- data.frame(day=1:sum(!is.na(dat.ts)),resid=as.numeric(residuals(fit)))

ggplot(arima.mod,aes(day,resid)) + 
  geom_point() + geom_smooth() + 
  ggtitle(&amp;quot;Arima Resids with Seasonality and Regressors&amp;quot;,
          subtitle = paste0(&amp;quot;RMSE: &amp;quot;,round(sqrt(mean((residuals(fit))^2)),2)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-02-19-forecasting-pm2-5-with-forecast-and-prophet_files/figure-html/arima-resid-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;prophet&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;prophet&lt;/h3&gt;
&lt;p&gt;And finally, let’s take a look at fitting a basic model using the &lt;code&gt;prophet&lt;/code&gt; package. The &lt;code&gt;prophet&lt;/code&gt; package uses Stan to fit an additive model that can include seasonality, autocorrelation, extra regressors, etc. One of the nice features of the &lt;code&gt;prophet()&lt;/code&gt; function is that it automatically chooses change points in your time series; the default number of change points is &lt;code&gt;25&lt;/code&gt;. This makes these time series models a little more robust compared to the others. Once again, I’m using &lt;code&gt;prophet&lt;/code&gt; to forecast the regressors that I then pass into the final &lt;code&gt;prophet&lt;/code&gt; model to predict PM2.5.&lt;/p&gt;
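&lt;p&gt;As an aside, the change-point flexibility is adjustable when you create the model. A toy illustration (not part of the analysis below):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Toy series, just to show the argument
toy &amp;lt;- data.frame(ds = seq(as.Date(&amp;quot;2016-01-01&amp;quot;), by = &amp;quot;day&amp;quot;, length.out = 200),
                  y = rnorm(200))
m &amp;lt;- prophet(toy, n.changepoints = 10) # default is 25&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, back to the PM2.5 data:&lt;/p&gt;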
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pdat &amp;lt;- data.frame(ds=dat$date,
                   y=sqrt(dat$pm2.5),
                   precip=dat$precip,
                   wind=dat$wind,
                   inversion_diff=dat$inversion_diff,
                   inversion=dat$inversion,
                   fireworks=dat$fireworks)

# Forecast weather regressors
pfdat &amp;lt;- data.frame(ds=max(dat$date) + 1:25)
pprecip &amp;lt;- pdat %&amp;gt;% 
  select(ds,y=precip) %&amp;gt;% 
  prophet() %&amp;gt;%
  predict(pfdat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Initial log joint probability = -5.77805
## Optimization terminated normally: 
##   Convergence detected: relative gradient magnitude is below tolerance&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pwind &amp;lt;- pdat %&amp;gt;% 
  select(ds,y=wind) %&amp;gt;% 
  prophet() %&amp;gt;%
  predict(pfdat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Initial log joint probability = -46.5575
## Optimization terminated normally: 
##   Convergence detected: relative gradient magnitude is below tolerance&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pinversion &amp;lt;- pdat %&amp;gt;% 
  select(ds,y=inversion_diff) %&amp;gt;% 
  prophet() %&amp;gt;%
  predict(pfdat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Initial log joint probability = -55.0515
## Optimization terminated normally: 
##   Convergence detected: relative gradient magnitude is below tolerance&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fdat &amp;lt;-  data.frame(ds=pfdat$ds,
                    precip=pprecip$yhat,
                    wind=pwind$yhat,
                    inversion=as.numeric(pinversion$yhat&amp;lt;0),
                    fireworks = 0)

# Fit the model (Seasonality automatically determined)
fit6 &amp;lt;- prophet() %&amp;gt;% 
  add_regressor(&amp;#39;precip&amp;#39;) %&amp;gt;% 
  add_regressor(&amp;#39;wind&amp;#39;) %&amp;gt;% 
  add_regressor(&amp;#39;inversion&amp;#39;) %&amp;gt;% 
  add_regressor(&amp;#39;fireworks&amp;#39;) %&amp;gt;% 
  fit.prophet(pdat)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Initial log joint probability = -120.752
## Optimization terminated normally: 
##   Convergence detected: relative gradient magnitude is below tolerance&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We also see that the &lt;code&gt;predict()&lt;/code&gt; function can be used with the &lt;code&gt;prophet&lt;/code&gt; model object to forecast future values by passing the future dataframe as a second argument.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Forecast future values
# (named fc6 rather than `forecast` to avoid confusion with the forecast() function)
fc6 &amp;lt;- predict(fit6, fdat)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Looking at the residuals below, you can see hints of the original seasonal trend, the one we saw earlier in the OLS and Random Forest models, starting to show up again.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Get the residuals
fpred &amp;lt;- predict(fit6)
fpred$ds &amp;lt;- as.Date(fpred$ds)
fpred &amp;lt;- pdat %&amp;gt;% left_join(fpred,by=&amp;quot;ds&amp;quot;)
fpred$resid &amp;lt;- fpred$y - fpred$yhat

# Plot the residuals
ggplot(fpred,aes(ds,resid)) + 
  geom_point() + geom_smooth() + 
  ggtitle(&amp;quot;Prophet with Seasonality and Regressors&amp;quot;,
          subtitle = paste0(&amp;quot;RMSE: &amp;quot;,round(sqrt(mean(fpred$resid^2)),2)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-02-19-forecasting-pm2-5-with-forecast-and-prophet_files/figure-html/prophet-resid-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;cross-validation-comparison-of-models&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Cross-Validation Comparison of Models&lt;/h3&gt;
&lt;p&gt;Okay, now that we’ve gone over the basics of each model and assessed the fits, let’s compare how well the models predict future PM2.5 levels. The cross-validation uses a rolling window over the time series: each window is split into an “initial” period and a “horizon”, the model is fit on the initial period, and its predictions over the horizon are compared to the actual values. I picked RMSE as the loss function for evaluating predictive performance.&lt;/p&gt;
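&lt;p&gt;The code that builds my cross-validation results (the &lt;code&gt;all.cv&lt;/code&gt; data frame used below) isn’t shown here, but a minimal sketch of the rolling-window idea for a single model might look like this (window sizes and column names are illustrative assumptions, not the exact code behind the plots):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Rolling-origin cross-validation sketch for the ETS model
cdat &amp;lt;- dat[!is.na(dat$pm2.5),]
initial &amp;lt;- 730 # days used to fit each model
horizon &amp;lt;- 25  # days predicted past each cutoff
cutoffs &amp;lt;- seq(initial, nrow(cdat) - horizon, by = horizon)

ets.cv &amp;lt;- do.call(rbind, lapply(cutoffs, function(co) {
  train &amp;lt;- ts(sqrt(cdat$pm2.5[1:co]), frequency = 7)
  fc &amp;lt;- forecast(ets(train), h = horizon)
  data.frame(model = &amp;quot;ETS&amp;quot;,
             cutoff = co,
             day = 1:horizon,
             date = cdat$date[co + 1:horizon],
             y = sqrt(cdat$pm2.5[co + 1:horizon]),
             yhat = as.numeric(fc$mean))
}))
# Repeating this for each model and rbind-ing the results gives all.cv&lt;/code&gt;&lt;/pre&gt;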
&lt;p&gt;A typical comparison is to compute the RMSE for each of the days in your horizon by combining all the differences between ‘y’ and ‘yhat’ from each of your rolling validations:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# RMSE by horizon
all.cv %&amp;gt;% 
  group_by(model,day) %&amp;gt;% 
  summarise(rmse=sqrt(mean((y-yhat)^2))) %&amp;gt;% 
  ggplot(.,aes(x=day,y=rmse,group=model,color=model)) +
  geom_line(alpha=.75) + geom_point(alpha=.75)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-02-19-forecasting-pm2-5-with-forecast-and-prophet_files/figure-html/unnamed-chunk-3-1.png&#34; width=&#34;672&#34; /&gt; This is definitely an interesting result. Clearly the Exponential Smoothing model is not the best predictor for this data. Also, when comparing how well each model predicts future events, it appears that the OLS and Random Forest regression models perform just as well as the TBATS, ARIMA, and prophet models. In the plot below, we can also take a look at what each model’s forecasts look like for the year 2017.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Prediction behaviors of different methods
ggplot(all.cv,aes(date,yhat,group=as.factor(cutoff),color=as.factor(cutoff)))+
  geom_line()+
  geom_line(aes(y=y),color=&amp;quot;black&amp;quot;,alpha=.15)+#geom_point(aes(y=y),color=&amp;quot;black&amp;quot;,alpha=.15)+
  facet_wrap(~model)+ guides(color=&amp;quot;none&amp;quot;) +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://nielsenmark.us/post/2018-02-19-forecasting-pm2-5-with-forecast-and-prophet_files/figure-html/unnamed-chunk-4-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Some of the things that you’ll probably notice first off are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Why the Exponential Smoothing model didn’t perform so well.&lt;/li&gt;
&lt;li&gt;Since I don’t know the future regressors’ values for OLS and Random Forest regression, I just held them at their values from the end of each initial window, which resulted in straight-line forecasts (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;ARIMA appears to be less robust than the other methods.&lt;/li&gt;
&lt;/ul&gt;
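&lt;p&gt;For reference, holding the regressors flat amounts to repeating the last observed row out to the horizon. A sketch of what I mean, reusing the &lt;code&gt;regs&lt;/code&gt; data frame from the ARIMA section:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hold future regressors at their last observed values for a 25-day horizon
fregs.flat &amp;lt;- regs[rep(nrow(regs), 25), ]
rownames(fregs.flat) &amp;lt;- NULL&lt;/code&gt;&lt;/pre&gt;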
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;Of all these methods, I would probably choose either the TBATS or prophet model for forecasting future data. I hope you have enjoyed these exercises and this intro to time series in R!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;where-to-learn-more&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Where to learn more?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://www.futurelearn.com/courses/business-analytics-forecasting&#34;&gt;FutureLearn Forecasting MOOC&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://otexts.org/fpp/&#34;&gt;Forecasting: Principles and Practice&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://facebook.github.io/prophet/&#34;&gt;Prophet: Forecasting at Scale&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;References&lt;/h3&gt;
&lt;p&gt;Hyndman, R.J. and Athanasopoulos, G. (2013) Forecasting: principles and practice. OTexts: Melbourne, Australia. &lt;a href=&#34;http://otexts.org/fpp/&#34; class=&#34;uri&#34;&gt;http://otexts.org/fpp/&lt;/a&gt;. Accessed on February 11, 2018.&lt;/p&gt;
&lt;p&gt;National Center for Environmental Information. Climate Data Online available at &lt;a href=&#34;https://www.ncdc.noaa.gov/cdo-web&#34; class=&#34;uri&#34;&gt;https://www.ncdc.noaa.gov/cdo-web&lt;/a&gt;. Accessed February 11, 2018.&lt;/p&gt;
&lt;p&gt;Sean Taylor and Ben Letham (2017). prophet: Automatic Forecasting Procedure. R package version 0.2.1.9000. &lt;a href=&#34;https://facebook.github.io/prophet/&#34; class=&#34;uri&#34;&gt;https://facebook.github.io/prophet/&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;US Environmental Protection Agency. Air Quality System Data Mart [internet database] available at &lt;a href=&#34;http://www.epa.gov/ttn/airs/aqsdatamart&#34; class=&#34;uri&#34;&gt;http://www.epa.gov/ttn/airs/aqsdatamart&lt;/a&gt;. Accessed February 11, 2018.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Creating a Custom htmlwidget for Shiny</title>
      <link>https://nielsenmark.us/2018/01/02/creating-a-custom-htmlwidget/</link>
      <pubDate>Tue, 02 Jan 2018 00:00:00 +0000</pubDate>
      
      <guid>https://nielsenmark.us/2018/01/02/creating-a-custom-htmlwidget/</guid>
      <description>&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/htmlwidgets/htmlwidgets.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/mywidget-binding/mywidget.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/d3/d3.v3.min.js&#34;&gt;&lt;/script&gt;
&lt;link href=&#34;https://nielsenmark.us/rmarkdown-libs/hive/hive.css&#34; rel=&#34;stylesheet&#34; /&gt;
&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/hive/d3.hive.min.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/hive_no_int-binding/hive_no_int.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://nielsenmark.us/rmarkdown-libs/hive-binding/hive.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;A year ago, &lt;code&gt;htmlwidgets&lt;/code&gt; were a mystery to me. I was first introduced to them at a conference years earlier, back when I used &lt;code&gt;rCharts&lt;/code&gt;, which I really liked for the ability to customize my interactive graphs in Shiny. I approached an instructor, explained my interest in &lt;code&gt;rCharts&lt;/code&gt;, and he pointed me toward &lt;code&gt;htmlwidgets&lt;/code&gt;. Last year I finally decided to take the leap and give it a try.&lt;/p&gt;
&lt;div id=&#34;setting-up-the-htmlwidget&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Setting Up the HTMLWidget&lt;/h3&gt;
&lt;p&gt;I started my learning with this &lt;a href=&#34;http://www.htmlwidgets.org/develop_intro.html&#34;&gt;tutorial&lt;/a&gt; from Ramnath V., Kenton R., and RStudio on creating htmlwidgets, which explains that “the htmlwidgets package provides a framework for creating R bindings to JavaScript libraries.” Following along with the tutorial, we can easily create our first htmlwidget.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;devtools::create(&amp;quot;mywidget&amp;quot;)
setwd(&amp;quot;mywidget&amp;quot;)
htmlwidgets::scaffoldWidget(&amp;quot;mywidget&amp;quot;)
devtools::install()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One thing to note about htmlwidgets is that they are always hosted in an R package to ensure full reproducibility.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;file-structure&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;File Structure&lt;/h3&gt;
&lt;p&gt;Next, let’s follow the tutorial further and take a look at the file structure.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
.
├── DESCRIPTION
├── inst
│   └── htmlwidgets
│       ├── mywidget.js
│       └── mywidget.yaml
├── mywidget.Rproj
├── NAMESPACE
└── R
    └── mywidget.R
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We see here that in order to bind our JavaScript library to our new R package we need to include both some R code (&lt;code&gt;mywidget.R&lt;/code&gt;) and JavaScript (&lt;code&gt;mywidget.js&lt;/code&gt;). All the JavaScript, YAML, and other dependencies live in the &lt;code&gt;inst/htmlwidgets&lt;/code&gt; folder, while the R code in the &lt;code&gt;R&lt;/code&gt; folder defines the inputs to the new function we are creating. Below you can see that the sample htmlwidget we have created takes a character string as input, creates an HTML page, and passes our character string through to the JavaScript code.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(mywidget)
mywidget(&amp;quot;Hello World&amp;quot;,height=&amp;quot;100px&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;htmlwidget-1&#34; style=&#34;width:672px;height:100px;&#34; class=&#34;mywidget html-widget&#34;&gt;&lt;/div&gt;
&lt;script type=&#34;application/json&#34; data-for=&#34;htmlwidget-1&#34;&gt;{&#34;x&#34;:{&#34;message&#34;:&#34;Hello World&#34;},&#34;evals&#34;:[],&#34;jsHooks&#34;:[]}&lt;/script&gt;
&lt;p&gt;Voilà! Your first htmlwidget AND the classic “Hello World”. Okay, okay… maybe this isn’t as awesome as you were thinking, but we can do even better. Are you ready to create your own htmlwidget?&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;step-1-adding-your-own-javascript-code&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 1: Adding your own JavaScript code&lt;/h3&gt;
&lt;p&gt;First let’s find some code for the popular JavaScript library D3. I am not a web developer, so I found mine in a blog post by Mike Bostock. I really liked the functionality and look of his D3 implementation of &lt;a href=&#34;https://bost.ocks.org/mike/hive/&#34;&gt;hive plots&lt;/a&gt;. Hive plots are credited to Martin Krzywinski; you’ll find his introduction to them &lt;a href=&#34;https://academic.oup.com/bib/article/13/5/627/412507/Hive-plots-rational-approach-to-visualizing&#34;&gt;here&lt;/a&gt;. A simpler version of Mike’s implementation is found &lt;a href=&#34;https://bl.ocks.org/mbostock/2066415&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Now that I’ve got my code, I’m going to replace the JavaScript code in &lt;code&gt;./inst/htmlwidgets/hive.js&lt;/code&gt; with this:&lt;/p&gt;
&lt;pre&gt;
&lt;code class=&#34;hljs&#34; data-trim&gt;
HTMLWidgets.widget({

  name: &#39;hive_no_int&#39;,

  type: &#39;output&#39;,

  factory: function(el, width, height) {

    // TODO: define shared variables for this instance

    return {

      renderValue: function(x) {

        // alias options
        var options = x.options;

        // convert links and nodes data frames to d3 friendly format
        var nodes = HTMLWidgets.dataframeToD3(x.nodes);
        var prelinks = HTMLWidgets.dataframeToD3(x.links);

        // create json of link sources and targets
        var links = [];
        prelinks.forEach(function(d){
          var tmp = {};
          tmp.source=nodes[d.source];
          tmp.target=nodes[d.target];
          links.push(tmp);
        });

        var innerRadius = options.innerRadius,
            outerRadius = options.outerRadius;

        var angle = d3.scale.ordinal().domain(d3.range(x.numAxis+1)).rangePoints([0, 2 * Math.PI]),
            radius = d3.scale.linear().range([innerRadius, outerRadius]),
            color = d3.scale.category10().domain(d3.range(20));

        // append an svg element and center the drawing group
        var svg = d3.select(el).append(&#34;svg&#34;)
          .attr(&#34;width&#34;, width)
          .attr(&#34;height&#34;, height)
          .append(&#34;g&#34;)
          .attr(&#34;transform&#34;, &#34;translate(&#34; + width / 2 + &#34;,&#34; + height / 2 + &#34;)&#34;);

        svg.selectAll(&#34;.axis&#34;)
            .data(d3.range(x.numAxis))
            .enter().append(&#34;line&#34;)
            .attr(&#34;class&#34;, &#34;axis&#34;)
            .attr(&#34;transform&#34;, function(d) {
              return &#34;rotate(&#34; + degrees(angle(d)) + &#34;)&#34;;
            })
            .attr(&#34;x1&#34;, radius.range()[0])
            .attr(&#34;x2&#34;, radius.range()[1]);

        // draw links
        var link = svg.selectAll(&#34;.link&#34;)
            .data(links)
            .enter().append(&#34;path&#34;)
            .attr(&#34;class&#34;, &#34;link&#34;)
            .attr(&#34;d&#34;, d3.hive.link()
              .angle(function(d) { return angle(d.x); })
              .radius(function(d) { return radius(d.y); }))
            .style(&#34;stroke&#34;, function(d) { return color(d.source.color); })
            .style(&#34;stroke-width&#34;, 1.5)
            .style(&#34;opacity&#34;, options.opacity);

        // draw nodes
        var node = svg.selectAll(&#34;.node&#34;)
            .data(nodes)
            .enter().append(&#34;circle&#34;)
            .attr(&#34;class&#34;, &#34;node&#34;)
            .attr(&#34;transform&#34;, function(d) {
              return &#34;rotate(&#34; + degrees(angle(d.x)) + &#34;)&#34;;
            })
            .attr(&#34;cx&#34;, function(d) { return radius(d.y); })
            .attr(&#34;r&#34;, 5)
            .style(&#34;fill&#34;, function(d) { return color(d.color); })
            .style(&#34;stroke&#34;, &#34;#000000&#34;);

        function degrees(radians) {
          return radians / Math.PI * 180 - 90;
        }

      }

    };
  }
});
&lt;/code&gt;
&lt;/pre&gt;
&lt;p&gt;Next, I copy supporting JS and CSS code into &lt;code&gt;./inst/htmlwidgets/lib/&lt;/code&gt; folder. For this project I’ll need &lt;code&gt;d3.js&lt;/code&gt; as well as some code from Mike’s post to create our visualization. Here’s what is now contained in the &lt;code&gt;./inst/htmlwidgets/lib/&lt;/code&gt; folder:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## d3-3.0/d3.v3.min.js
## hive-0.1/d3.hive.min.js
## hive-0.1/hive.css&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And finally, I define those dependencies in &lt;code&gt;./inst/htmlwidgets/hive.yaml&lt;/code&gt; as seen below:&lt;/p&gt;
&lt;pre&gt;
&lt;code class=&#34;hljs&#34; data-trim&gt;
# (uncomment to add a dependency)
dependencies:
  - name: d3
    version: 3.0
    src: htmlwidgets/lib/d3-3.0
    script:
      - d3.v3.min.js
  - name: hive
    version: 0.1
    src: htmlwidgets/lib/hive-0.1
    script:
      - d3.hive.min.js
    stylesheet:
      - hive.css
&lt;/code&gt;
&lt;/pre&gt;
&lt;p&gt;Now that our dependencies are defined we can now create the bindings between R and JavaScript.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;step-2-create-the-bindings&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 2: Create the Bindings&lt;/h3&gt;
&lt;p&gt;Okay, the goal in this next step is to get our R dataframe to look just like this dataset from the hive plot D3 code.&lt;/p&gt;
&lt;pre&gt;
&lt;code class=&#34;hljs&#34; data-trim&gt;
var nodes = [
  {x: 0, y: .1},
  {x: 0, y: .9},
  {x: 1, y: .2},
  {x: 1, y: .3},
  {x: 2, y: .1},
  {x: 2, y: .8}
];
var links = [
  {source: nodes[0], target: nodes[2]},
  {source: nodes[1], target: nodes[3]},
  {source: nodes[2], target: nodes[4]},
  {source: nodes[2], target: nodes[5]},
  {source: nodes[3], target: nodes[5]},
  {source: nodes[4], target: nodes[0]},
  {source: nodes[5], target: nodes[1]}
];
&lt;/code&gt;
&lt;/pre&gt;
&lt;p&gt;First, let’s tell R what it needs to pass through to our JavaScript library. This is done by creating a function that takes our data and options as arguments and combines them into a list. The list is then passed through the &lt;code&gt;htmlwidgets::createWidget&lt;/code&gt; function to be picked up by our JavaScript code. Below I used code provided in RStudio’s tutorial and also replicated the options &lt;code&gt;innerRadius&lt;/code&gt;, &lt;code&gt;outerRadius&lt;/code&gt;, and &lt;code&gt;opacity&lt;/code&gt; from Mike Bostock’s function:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;hive &amp;lt;- function(nodes, 
                 links, 
                 innerRadius = 40, 
                 outerRadius = 240, 
                 opacity = 0.7, 
                 width = NULL, 
                 height = NULL, 
                 elementId = NULL) {

  # sort in order of node id
  if(&amp;quot;id&amp;quot; %in% colnames(nodes)) {
    nodes &amp;lt;- nodes[order(nodes$id),]
    nodes$id &amp;lt;- NULL
  }

  # color by axis if no coloring is supplied
  if(!(&amp;quot;color&amp;quot; %in% colnames(nodes))) {
    nodes$color &amp;lt;- nodes$x
  }

  # forward options using x
  x = list(
    nodes = nodes,
    links = links,
    numAxis = max(nodes$x)+1,
    options = list(innerRadius=innerRadius,
                   outerRadius=outerRadius,
                   opacity=opacity)
  )

  # create widget
  htmlwidgets::createWidget(
    name = &amp;#39;hive&amp;#39;,
    x,
    width = width,
    height = height,
    package = &amp;#39;hiveD3&amp;#39;,
    elementId = elementId
  )
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice above that the objects &lt;code&gt;nodes&lt;/code&gt; and &lt;code&gt;links&lt;/code&gt; are R dataframes and that the final list &lt;code&gt;x&lt;/code&gt; is passed through to JS.&lt;/p&gt;
&lt;p&gt;Now that we’ve defined our R binding, let’s take a minute and set up the JavaScript binding in the &lt;code&gt;hive.js&lt;/code&gt; file. For d3, we use the &lt;code&gt;dataframeToD3()&lt;/code&gt; helper function. I’m not awesome with JavaScript, so I’m going to avoid making too many changes to this code:&lt;/p&gt;
&lt;pre&gt;
&lt;code class=&#34;hljs&#34; data-trim&gt;
// alias options
var options = x.options;

// convert links and nodes data frames to d3 friendly format
var nodes = HTMLWidgets.dataframeToD3(x.nodes);
var prelinks = HTMLWidgets.dataframeToD3(x.links);

// create json of link sources and targets
var links = [];
prelinks.forEach(function(d){
  var tmp = {};
  tmp.source=nodes[d.source];
  tmp.target=nodes[d.target];
  links.push(tmp);
});
&lt;/code&gt;
&lt;/pre&gt;
&lt;p&gt;To give you an understanding of what is under the hood of the &lt;code&gt;dataframeToD3&lt;/code&gt; function: on the R side, &lt;code&gt;jsonlite::toJSON&lt;/code&gt; serializes the dataframe column-wise, and &lt;code&gt;dataframeToD3&lt;/code&gt; converts that into the row-wise array of objects that D3 expects. When you look at the data you can see that recreating &lt;code&gt;nodes&lt;/code&gt; is easy. As for &lt;code&gt;links&lt;/code&gt;, we read the data in as &lt;code&gt;prelinks&lt;/code&gt; and then loop through each of its items to create &lt;code&gt;links&lt;/code&gt; just like it is in Mike’s JavaScript code.&lt;/p&gt;
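&lt;p&gt;To see the two shapes side by side from R, here’s a small illustration of my own using &lt;code&gt;jsonlite&lt;/code&gt; directly:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(jsonlite)
df &amp;lt;- data.frame(x = c(0, 1), y = c(0.1, 0.2))

# Column-wise, roughly how the widget data is shipped to the browser:
toJSON(as.list(df)) # {&amp;quot;x&amp;quot;:[0,1],&amp;quot;y&amp;quot;:[0.1,0.2]}

# Row-wise, the shape dataframeToD3() produces for D3:
toJSON(df)          # [{&amp;quot;x&amp;quot;:0,&amp;quot;y&amp;quot;:0.1},{&amp;quot;x&amp;quot;:1,&amp;quot;y&amp;quot;:0.2}]&lt;/code&gt;&lt;/pre&gt;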
&lt;/div&gt;
&lt;div id=&#34;step-3-putting-it-all-together&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 3: Putting it all together&lt;/h3&gt;
&lt;p&gt;All of our bindings are set up and once I’ve built and loaded my package, we’re ready to define some dataframes and test out our new htmlwidget.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(hiveD3)
nodes = data.frame(id=c(0,1,2,3,4,5,6,7,8),
                   x=c(0,0,1,1,2,2,3,3,4), 
                   y=c(.1,.9,.2,.3,.1,.8,.3,.5,.9))
links = data.frame(source=c(0,1,2,2,3,4,5,6,7,8,8),
                   target=c(2,3,4,5,5,6,7,8,8,0,1))


hive_no_int(nodes=nodes,links=links, width = &amp;quot;700px&amp;quot;, height = &amp;quot;500px&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When we run the &lt;code&gt;hive&lt;/code&gt; function we see our new visualization! (Note that, for demonstration purposes only, I’ve renamed this first version of the function &lt;code&gt;hive_no_int&lt;/code&gt;.) &lt;div id=&#34;htmlwidget-2&#34; style=&#34;width:700px;height:500px;&#34; class=&#34;hive_no_int html-widget&#34;&gt;&lt;/div&gt;
&lt;script type=&#34;application/json&#34; data-for=&#34;htmlwidget-2&#34;&gt;{&#34;x&#34;:{&#34;nodes&#34;:{&#34;x&#34;:[0,0,1,1,2,2,3,3,4],&#34;y&#34;:[0.1,0.9,0.2,0.3,0.1,0.8,0.3,0.5,0.9],&#34;color&#34;:[0,0,1,1,2,2,3,3,4]},&#34;links&#34;:{&#34;source&#34;:[0,1,2,2,3,4,5,6,7,8,8],&#34;target&#34;:[2,3,4,5,5,6,7,8,8,0,1]},&#34;numAxis&#34;:5,&#34;options&#34;:{&#34;innerRadius&#34;:40,&#34;outerRadius&#34;:240,&#34;opacity&#34;:0.7}},&#34;evals&#34;:[],&#34;jsHooks&#34;:[]}&lt;/script&gt;&lt;/p&gt;
&lt;p&gt;Alright! We’re ready to show off our work, but can you guess the first question that is going to be asked of you? Your friends may think it’s cool, but will say “Why doesn’t it do anything when I hover over it?” or “Why can’t I interact with it?” Well, so much for not having to tweak any JavaScript code. It’s time to dive in and add some interactivity.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;step-4-making-finishing-touches&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 4: Making Finishing Touches&lt;/h3&gt;
&lt;p&gt;Let’s look at some next steps in getting our htmlwidget ready for prime time:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Adding interaction&lt;/li&gt;
&lt;li&gt;Creating and sharing your package&lt;/li&gt;
&lt;li&gt;Creating R documentation using RStudio and &lt;code&gt;roxygen2&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Adding your package to the &lt;a href=&#34;http://gallery.htmlwidgets.org/&#34;&gt;htmlwidget gallery&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We’ve talked about adding interaction, and once that is ready you can share your new package in several ways. Make sure to create helpful documentation for your new package before sharing on Github or on the &lt;a href=&#34;http://gallery.htmlwidgets.org/&#34;&gt;htmlwidget gallery&lt;/a&gt;.&lt;/p&gt;
&lt;div id=&#34;the-final-product&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;The Final Product&lt;/h4&gt;
&lt;p&gt;Great! I’ve gone ahead and added my package to GitHub. Of course, I did this after making sure to create some documentation and interactivity… and finally, we can show it off.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(devtools)
install_github(&amp;#39;nielsenmarkus11/hiveD3&amp;#39;)

library(hiveD3)

nodes = data.frame(id=c(0,1,2,3,4,5,6,7,8),
                   x=c(0,0,1,1,2,2,3,3,4), 
                   y=c(.1,.9,.2,.3,.1,.8,.3,.5,.9))
links = data.frame(source=c(0,1,2,2,3,4,5,6,7,8,8),
                   target=c(2,3,4,5,5,6,7,8,8,0,1))

hive(nodes=nodes,links=links, width = &amp;quot;700px&amp;quot;, height = &amp;quot;500px&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;htmlwidget-3&#34; style=&#34;width:700px;height:500px;&#34; class=&#34;hive html-widget&#34;&gt;&lt;/div&gt;
&lt;script type=&#34;application/json&#34; data-for=&#34;htmlwidget-3&#34;&gt;{&#34;x&#34;:{&#34;nodes&#34;:{&#34;x&#34;:[0,0,1,1,2,2,3,3,4],&#34;y&#34;:[0.1,0.9,0.2,0.3,0.1,0.8,0.3,0.5,0.9],&#34;color&#34;:[0,0,1,1,2,2,3,3,4]},&#34;links&#34;:{&#34;source&#34;:[0,1,2,2,3,4,5,6,7,8,8],&#34;target&#34;:[2,3,4,5,5,6,7,8,8,0,1]},&#34;numAxis&#34;:5,&#34;options&#34;:{&#34;innerRadius&#34;:40,&#34;outerRadius&#34;:240,&#34;opacity&#34;:0.7}},&#34;evals&#34;:[],&#34;jsHooks&#34;:[]}&lt;/script&gt;
&lt;p&gt;Thanks for taking some time to check out my explorations with &lt;code&gt;htmlwidgets&lt;/code&gt;. What are the next steps for your project? Maybe someday I’ll put my stuff out on CRAN, and I definitely want to add some more interactivity and flexibility to my package. You can download and check it out by installing it from &lt;a href=&#34;https://github.com/nielsenmarkus11/hiveD3&#34;&gt;my GitHub page&lt;/a&gt;. Good luck!&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Bostock M, Morin R (2012). &lt;a href=&#34;https://bost.ocks.org/mike/hive/&#34;&gt;Hive Plots&lt;/a&gt;. Retrieved from &lt;a href=&#34;https://bost.ocks.org/mike/hive/&#34; class=&#34;uri&#34;&gt;https://bost.ocks.org/mike/hive/&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bostock M (2016). &lt;a href=&#34;https://bl.ocks.org/mbostock/2066415&#34;&gt;Hive Plot (Links)&lt;/a&gt;. Retrieved from &lt;a href=&#34;https://bl.ocks.org/mbostock/2066415&#34; class=&#34;uri&#34;&gt;https://bl.ocks.org/mbostock/2066415&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bostock M (2017). &lt;a href=&#34;https://d3js.org/&#34;&gt;D3 Data-Driven Documents&lt;/a&gt;. Retrieved from &lt;a href=&#34;https://d3js.org/&#34; class=&#34;uri&#34;&gt;https://d3js.org/&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Krzywinski M, Birol I, Jones S, Marra M (2011). &lt;a href=&#34;https://academic.oup.com/bib/article/13/5/627/412507/Hive-plots-rational-approach-to-visualizing&#34;&gt;Hive Plots — Rational Approach to Visualizing Networks&lt;/a&gt;. Briefings in Bioinformatics (early access 9 December 2011, doi: 10.1093/bib/bbr069).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Vaidyanathan R, Russell K, RStudio, Inc. (2014-2015). &lt;a href=&#34;http://www.htmlwidgets.org/develop_intro.html&#34;&gt;Creating a widget&lt;/a&gt;. Retrieved from &lt;a href=&#34;http://www.htmlwidgets.org/develop_intro.html&#34; class=&#34;uri&#34;&gt;http://www.htmlwidgets.org/develop_intro.html&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>First Post</title>
      <link>https://nielsenmark.us/2017/12/02/first-blog-post/</link>
      <pubDate>Sat, 02 Dec 2017 00:00:00 +0000</pubDate>
      
      <guid>https://nielsenmark.us/2017/12/02/first-blog-post/</guid>
      <description>&lt;p&gt;Hello World&lt;/p&gt;

&lt;p&gt;Yeah! I&amp;rsquo;ve finally got the blog up and running! &lt;a href=&#34;https://gohugo.io/&#34;&gt;Hugo&lt;/a&gt; has so far been great to use and easy to learn. There will definitely be more posts on programming and statistics with R to come in the future, so hang tight.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Blogroll</title>
      <link>https://nielsenmark.us/1/01/01/blogroll/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://nielsenmark.us/1/01/01/blogroll/</guid>
      <description>&lt;p&gt;Check out these other useful blogs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://www.r-bloggers.com&#34;&gt;R-Bloggers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.levithatcher.com&#34;&gt;Levi Thatcher&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
  </channel>
</rss>
