Box & Strip Plots with R

by Steve Miller, President, OpenBI
Tuesday, September 06, 2011

I recently completed an analysis of a time series data set of the daily number of live births in Quebec from 1977 through 1990, using the R Project for Statistical Computing software. The 5,113 observations were a lot for the standard time-series graphics (Figure 1) I usually use as a prelude to forecasting exercises. From the patterns I could discern, though, I guessed that there were both month-of-year and day-of-week signals that would be interesting to investigate. So I set out to look at both using trusty box and strip plots, and combinations of the two, from R's lattice graphics package.

Figure 1: Time series plot of daily live births in Quebec
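
For readers who want to follow along without the Quebec file, the setup can be sketched roughly as below. The counts are simulated, not the real data, and the column names (`date`, `wday`, `month`, `n`) are my own illustrative choices. The point is only to show how day-of-week and month factors might be derived from the dates before plotting with lattice.

```r
library(lattice)

# Synthetic stand-in for the Quebec series (NOT the real data):
# 14 years of daily dates give the 5,113 observations quoted above.
set.seed(42)
dates <- seq(as.Date("1977-01-01"), as.Date("1990-12-31"), by = "day")
lt    <- as.POSIXlt(dates)

births <- data.frame(
  date  = dates,
  # lt$wday runs 0 (Sunday) to 6 (Saturday); reorder so the week starts Monday
  wday  = factor(lt$wday, levels = c(1:6, 0),
                 labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
  # lt$mon runs 0 (January) to 11 (December)
  month = factor(lt$mon + 1, levels = 1:12, labels = month.abb)
)

# Made-up counts with a weekend dip and a mild summer bump
births$n <- round(250 - 40 * (births$wday %in% c("Sat", "Sun")) +
                  15 * sin(2 * pi * (lt$mon - 3) / 12) +
                  rnorm(nrow(births), sd = 20))

# Time series view in the spirit of Figure 1
p <- xyplot(n ~ date, data = births, type = "l",
            xlab = "Date", ylab = "Daily births")
# print(p) draws the plot; at the interactive console, typing p is enough
```

Building the weekday factor from `as.POSIXlt()` rather than `weekdays()` keeps the labels independent of the machine's locale.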

Figure 2 shows traditional box plots of births by day of week on the left and by month on the right. The box for each day of the week holds the middle 50% of the observations for that day, from the 25th to the 75th percentile, a span known as the interquartile range. The arms (whiskers) of the box plot extend to the most extreme observations that lie within 1.5 times the interquartile range above the 75th percentile and within the same distance below the 25th percentile. Points outside those bands are identified as outliers.

Figure 2: Box plots of births by day of week (left) and by month (right)

It's clear from these graphs that there was less activity on weekends than during the week. Tuesday through Friday are the biggest days, which is not surprising given that C-sections, which historically account for more than 25% of deliveries, are generally scheduled after Monday. There are a fair number of low outliers between Monday and Friday. There also appears to be seasonality in the data, with more deliveries in the warm months, no doubt reflecting cold weather activities.
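
The whisker-and-outlier rule described for the box plots can be checked by hand on a small vector. The sketch below uses made-up numbers and base R's `boxplot.stats()`, the function lattice's `bwplot()` relies on by default. Note that it works from Tukey's hinges, which are close to, but not always identical with, the 25th and 75th percentiles. The weekday data frame is simulated purely for illustration.

```r
library(lattice)

# Toy vector (made up) so the rule is easy to follow by hand
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 30)
s <- boxplot.stats(x, coef = 1.5)

s$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
         # here: 2 4 6 8 9  (hinges 4 and 8, so the box spans 4 units)
s$out    # points more than 1.5 * 4 = 6 units beyond the box: 30

# The same rule applied per group by bwplot(), on simulated weekday counts
set.seed(1)
days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
d <- data.frame(wday = factor(rep(days, each = 100), levels = days))
d$n <- round(rnorm(nrow(d),
                   mean = ifelse(d$wday %in% c("Sat", "Sun"), 210, 250),
                   sd = 20))

p <- bwplot(n ~ wday, data = d, ylab = "Daily births")
# print(p) draws one box per weekday
```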

Though the box plots by day of week and month clearly show both the periodicity and seasonality, they hide some of the detail about the overall shape of the distributions. The violin plots of Figure 3 overcome this weakness, with the shape and size of the violins denoting where the data points fall. The wider a “piece” of the violin, the greater the number of points that fall near that particular daily birth count. Along with many analysts, I prefer the simplicity of violin plots.

Figure 3: Violin plots of births by day of week and by month
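
A violin plot along these lines can be sketched in lattice by swapping the default box-plot panel for `panel.violin`. The data below are simulated, not the Quebec series, and the variable names are illustrative assumptions.

```r
library(lattice)

# Simulated weekday counts (NOT the real data)
set.seed(2)
days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
d <- data.frame(wday = factor(rep(days, each = 200), levels = days))
d$n <- round(rnorm(nrow(d),
                   mean = ifelse(d$wday %in% c("Sat", "Sun"), 210, 250),
                   sd = 20))

# panel.violin draws a mirrored kernel density estimate for each group
p <- bwplot(n ~ wday, data = d, panel = panel.violin,
            ylab = "Daily births")

# The width of a violin at a given count follows a density estimate like this
dens <- density(d$n[d$wday == "Mon"])
# print(p) draws the violins
```

Because the width comes from a smoothed density, the violin shape depends on the bandwidth; `panel.violin` accepts `bw` and `adjust` arguments for tuning it.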

If the data set isn't too large, an attractive visual option is the strip plot, which simply shows all data points. This can be especially helpful in visualizations of dispersed data with outliers. Think of a strip plot as a scatter plot where one of the variables is categorical. Figure 4 details strip plots of the Quebec data with horizontal “jittering”: points are nudged sideways at random so that overlapping observations stay visible, while all points at the same height still denote the same birth count for the particular day of week or month. The patterns of the raw data appear to confirm findings from the box and violin plots.

Figure 4: Strip plots of births with horizontal jittering
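
A jittered strip plot of this kind might look roughly as follows in lattice. Again the counts are simulated, and the point size and transparency settings are just one plausible choice for a few thousand points.

```r
library(lattice)

# Simulated weekday counts (NOT the real data)
set.seed(3)
days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
d <- data.frame(wday = factor(rep(days, each = 300), levels = days))
d$n <- round(rnorm(nrow(d),
                   mean = ifelse(d$wday %in% c("Sat", "Sun"), 210, 250),
                   sd = 20))

p <- stripplot(n ~ wday, data = d,
               jitter.data = TRUE,  # random sideways offsets within a category
               factor = 1.5,        # how far the points are spread
               pch = 16, cex = 0.4, alpha = 0.4,
               ylab = "Daily births")
# print(p) draws the plot
```

The jitter is applied only when the plot is drawn and only along the categorical axis, so the vertical position of every point remains its actual count.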

The hybrid plots of Figure 5 offer the benefits of both summarization and raw data display. These graphs start with the lighter-textured points of strip plots, then add segments and a dash to mark summary statistics. In this rendition, 95% of points fall between the outer ends of the top and bottom segments, while the minimum and maximum observations are indicated by the data points themselves. The middle dash denotes the group median, and the interquartile range is the gap between the two segments.

Figure 5: Hybrid strip plots with summary segments and a median dash
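
One way to approximate such a hybrid is a custom lattice panel function. The sketch below is my reading of the description, not the code behind Figure 5. It assumes the segments run from the 2.5th to the 25th percentile and from the 75th to the 97.5th. The helper names `summary_marks` and `panel_hybrid` are hypothetical, and the data are simulated.

```r
library(lattice)

# Simulated weekday counts (NOT the real data)
set.seed(5)
days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
d <- data.frame(wday = factor(rep(days, each = 300), levels = days))
d$n <- round(rnorm(nrow(d),
                   mean = ifelse(d$wday %in% c("Sat", "Sun"), 210, 250),
                   sd = 20))

# Hypothetical helper: the five quantiles marked on each strip
summary_marks <- function(y) {
  quantile(y, probs = c(0.025, 0.25, 0.5, 0.75, 0.975))
}

# Hypothetical panel function: light jittered points plus summary marks
panel_hybrid <- function(x, y, ...) {
  panel.stripplot(x, y, jitter.data = TRUE, horizontal = FALSE,
                  col = "grey70", pch = 16, cex = 0.4)
  pos <- as.numeric(x)  # category positions 1, 2, 3, ...
  for (g in sort(unique(pos))) {
    q <- summary_marks(y[pos == g])
    panel.segments(g, q[1], g, q[2], lwd = 2, col = "black")  # 2.5% to 25%
    panel.segments(g, q[4], g, q[5], lwd = 2, col = "black")  # 75% to 97.5%
    panel.segments(g - 0.15, q[3], g + 0.15, q[3],
                   lwd = 2, col = "black")                    # median dash
  }
}

p <- stripplot(n ~ wday, data = d, panel = panel_hybrid,
               ylab = "Daily births")
# print(p) draws the hybrid plot
```

Leaving the interquartile range as a gap keeps the median dash readable against the cloud of points.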

R's lattice graphics presented above are notable for the ease with which they can communicate dimensional relationships. Figure 6 expands the hybrid day of week and month plots to include quarter. In this graph, sub-plots are drawn separately for each quarter, with the functions automatically maintaining common axes and scales to facilitate inter-panel comparisons. The overall higher deliveries in Q2 and Q3 confirm previous observations. Adding “by” variable dimensions to expand analysis is trivial with lattice.

Figure 6: Hybrid day of week and month plots, by quarter
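
In lattice, a “by” variable goes after a `|` in the formula. The sketch below conditions a plain jittered strip plot on quarter, using simulated counts rather than the Quebec data. The column names are illustrative assumptions.

```r
library(lattice)

# Simulated daily counts over the same 14-year span (NOT the real data)
set.seed(6)
dates <- seq(as.Date("1977-01-01"), as.Date("1990-12-31"), by = "day")
lt    <- as.POSIXlt(dates)

d <- data.frame(
  wday    = factor(lt$wday, levels = c(1:6, 0),
                   labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
  quarter = factor(quarters(dates))  # "Q1" ... "Q4"
)
d$n <- round(250 - 40 * (d$wday %in% c("Sat", "Sun")) +
             rnorm(nrow(d), sd = 20))

# "| quarter" asks for one panel per quarter; by default all panels
# share the same axis limits, which eases comparison across panels
p <- stripplot(n ~ wday | quarter, data = d,
               jitter.data = TRUE, pch = 16, cex = 0.3, alpha = 0.3,
               layout = c(4, 1), ylab = "Daily births")
# print(p) draws the four panels side by side
```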

It's also simple to show even more dimensions using lattice, although that risks the plots becoming unwieldy. Figure 7 adds a period dimension to the mix, contrasting the first seven years with the last seven. Deliveries in Q3 and Q4 appear to be down in the 1984-1990 time frame.

Figure 7: Hybrid plots by quarter and period
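
A second conditioning variable is added with `+` after the `|`. The sketch below splits simulated data into the first and last seven years. The `period` labels and other names are my own assumptions, not the original code.

```r
library(lattice)

# Simulated daily counts over 14 years (NOT the real data)
set.seed(7)
dates <- seq(as.Date("1977-01-01"), as.Date("1990-12-31"), by = "day")
lt    <- as.POSIXlt(dates)

d <- data.frame(
  wday    = factor(lt$wday, levels = c(1:6, 0),
                   labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
  quarter = factor(quarters(dates)),
  # lt$year counts years since 1900
  period  = factor(ifelse(lt$year + 1900 <= 1983,
                          "First 7 years", "Last 7 years"))
)
d$n <- round(250 - 40 * (d$wday %in% c("Sat", "Sun")) +
             rnorm(nrow(d), sd = 20))

# Two conditioning variables give a 4 x 2 grid: quarters across, periods down
p <- stripplot(n ~ wday | quarter + period, data = d,
               jitter.data = TRUE, pch = 16, cex = 0.3, alpha = 0.3,
               layout = c(4, 2), ylab = "Daily births")
# print(p) draws all eight panels on shared scales
```

With eight panels the display is still readable, but each further conditioning variable multiplies the panel count, which is where the risk of unwieldy plots comes from.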

The lattice package provides powerful graphics for BI and analytics that can easily be extended programmatically. The ease of creating dimensional, lattice-like plots should make the package a top choice for analysts looking for analytical sophistication in their visuals. If Dashboard Insight readers are interested, I will continue to share some of the graphical insights I gather from my work.

About the Author:

Steve Miller is President of OpenBI, LLC, a Chicago-based services firm focused on delivering business intelligence solutions with open source software. A statistician/quantitative analyst by education, Steve has 30 years of BI experience. His charter, and OpenBI's, is to help customers manage performance through optimal deployment of analytics. Steve is a columnist for DMReview and also writes for BIReview and the B-Eye-Network. In addition to R, OpenBI specializes in the Pentaho and JasperSoft open source BI platforms and Weka data mining. Steve can be reached at steve.miller@openbi.com.
