I’ve contributed six “pieces” to Dashboard Insight (DI) since first meeting the site back in the fall. Of the six, five have related to the R Project for Statistical Computing. A numbers guy, I won’t change that pattern now. This column is the first in a series that focuses on using the R language, statistical procedures, and graphics capabilities to explore, analyze, and present data. The hope is that the articles might convince a few BI analysts to take the plunge with an outstanding, freely available product.
R’s open source lineage provides many benefits to users, not the least of which is that it’s easily accessible at no cost from the Comprehensive R Archive Network (CRAN), http://cran.r-project.org/. Perhaps the most gratifying but nevertheless challenging aspect of the R project for BI is the significant worldwide participation by a community that is constantly developing add-on procedures and packages. Gratifying in that R users have access to the very latest statistical and machine learning procedures long before they’re available commercially. Challenging in that it’s very difficult to keep track of the largesse. Fortunately, R web-based documentation now includes “views”, such as Multivariate (http://cran.r-project.org/web/views/Multivariate.html), Social Sciences, Econometrics, and Machine Learning, that track and catalog the latest advances by the community. You can’t tell the players without a scorecard!
In previous DI articles, I’ve noted that R’s graphics are unsexy in contrast with dashboard and visualization software currently available on the commercial market. At the same time, I see much of that commercial sexiness as gratuitous, serving no information-sharing role. It’s sizzle for sizzle’s sake. Unsexy R dashboards, on the other hand, can often reliably convey an incredible amount of information on a very compact canvas. Other things equal, I’ll take effective and economical communication over sexy graphics. Perhaps those who want both steak and sizzle can develop information-rich prototypes in R to challenge their favorite commercial vendors!
Graphics and visualization in R are the beneficiaries of R’s active worldwide community in much the same way as statistical and machine learning procedures. The R core team supports the base and lattice graphics packages, but community members have extended these foundations extensively. Many advanced graphical and visualization techniques in R are now available for download. R visuals are particularly adept at exploratory analysis, helping to make sense of complicated multivariate data. The graphics are also closely tied to statistical procedures, and are often the preferred approach for communicating results. I believe the same characteristics that endear R graphics to the statistical world are relevant for BI as well.
The Data
I struggled at first to find data for this series. In the end, rather than use data from past consulting, I decided to scour the Internet in search of a few pertinent data sets similar to those that BI practitioners might investigate in day-to-day practice. I found two, one of which is part of an R package that extends the core product.
cps_wages = list 11 (28896 bytes) (data.frame)
 1 Education = integer 534 = 8 9 12 12 12 13 ...
 2 South = integer 534 = category (2 levels) (factor) = Elsewhere Elsewhere ...
 3 Sex = integer 534 = category (2 levels) (factor) = Female Female Male ...
 4 Experience = integer 534 = 21 42 1 4 17 9 27 ...
 5 Union = integer 534 = category (2 levels) (factor) = Non-Union Non-Union ...
 6 Wage = double 534 = 5.1 4.95 6.67 4 ...
 7 Age = integer 534 = 35 57 19 22 35 28 ...
 8 Race = integer 534 = category (3 levels) (factor) = Hispanic White White ...
 9 Occupation = integer 534 = category (6 levels) (factor) = Other Other Other ...
10 Sector = integer 534 = category (3 levels) (factor) = Manufacturing Manufacturing ...
11 Marital = integer 534 = category (2 levels) (factor) = Married Married ...
A2 row.names = integer 534 = 1 2 3 4 5 6 7 8 ...
(Table 1a)
The first data set, cps_wages (http://lib.stat.cmu.edu/datasets/CPS_85_Wages), consists of 534 records and 11 variables (Table 1a). These data are used to analyze the determinants of wages in a 1985 population sample, with special focus on differences between the sexes. Among the predictors of Wage are integer variables Age, Experience, and Education, as well as factors (categorical variables) Sex, Race, Marital, Occupation, and Union. A major point of interest is whether there’s a difference in Wage by Sex after the other variables have been accounted for.
NewHavenResidential = list 8 (1386248 bytes) (data.frame)
 1 totalCurrVal = integer 18221 = 109200 107380 80500 ...
 2 livingArea = integer 18221 = 1480 1792 1239 984 ...
 3 dep = integer 18221 = 24 11 26 35 13 27 ...
 4 size = double 18221 = 0.25 0.18 0.36 0.32 ...
 5 zone = integer 18221 = category (3 levels) (factor) = RS RS RS RS RS RS ...
 6 acType = integer 18221 = category (2 levels) (factor) = No AC No AC No AC ...
 7 bedrms = integer 18221 = 3 3 3 3 5 3 2 1 ...
 8 bathrms = double 18221 = 2.5 2 1 1 2 1.5 ...
A3 row.names = character 18221 = 1 2 3 4 5 6 7 8 9 10 ...
(Table 1b)
The second data set, NewHavenResidential, is included as part of the add-on R package YaleToolkit (http://cran.r-project.org/web/packages/YaleToolkit/index.html), developed, not surprisingly, at Yale University. Table 1b describes NewHavenResidential, which consists of 18,221 records with 8 variables on the residential property value market for 2006 in New Haven, CT. The main dependent variable of interest is totalCurrVal, the assessed property value. Predictors include numerical attributes livingArea (square footage), bathrms, bedrms, size (property acreage), and dep (depreciation percent), as well as factors zone (residential area) and acType (whether or not the residence is air conditioned). Our task is to investigate totalCurrVal as a function of the predictors.
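For readers who want to follow along, here is a minimal sketch of how one might inspect the structure of data like this in R. The small data frame cps_sample is a hypothetical stand-in built only from the first three values of five columns shown in Table 1a; it is not the full 534-record file. The final step assumes the YaleToolkit package is installed and still ships NewHavenResidential, and is skipped otherwise.

```r
# Stand-in for cps_wages: the first three values of five columns from Table 1a.
# (cps_sample is a hypothetical name used only for this illustration.)
cps_sample <- data.frame(
  Education  = c(8L, 9L, 12L),
  Sex        = factor(c("Female", "Female", "Male")),
  Experience = c(21L, 42L, 1L),
  Wage       = c(5.10, 4.95, 6.67),
  Age        = c(35L, 57L, 19L)
)

# Compact listing of each column's type and leading values.
str(cps_sample)

# Separate the factors from the numeric predictors, as the text does.
is_factor    <- sapply(cps_sample, is.factor)
factor_vars  <- names(cps_sample)[is_factor]
numeric_vars <- names(cps_sample)[!is_factor]

# NewHavenResidential: load it only if YaleToolkit is installed (assumption).
if (requireNamespace("YaleToolkit", quietly = TRUE)) {
  data("NewHavenResidential", package = "YaleToolkit")
  str(NewHavenResidential)
}
```

Knowing up front which columns are factors and which are numeric matters because the plotting functions used later treat the two kinds of variables differently.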
Graphics
The first look at the data comes courtesy of the Hmisc package (http://cran.r-project.org/web/packages/Hmisc/index.html) developed by Frank Harrell of Vanderbilt University. Among its many goodies, Hmisc features a graphics function, hist.data.frame, which takes an R data frame as an argument and produces a graph detailing the distributions of each of the attributes.

(Figure 1a)
Figure 1a shows the output for cps_wages. For each numeric variable, hist.data.frame plots a histogram with about 20 bins. For each factor variable, hist.data.frame draws a dot plot detailing frequencies of the factor levels in sorted order. cps_wages consists primarily of white workers, with more males than females, more marrieds than singles, and a mix of Occupation and Sector, with “Other” the ominous leader in both cases. The histogram for Wage is skewed right, suggesting a transformation to logarithmic scale to facilitate subsequent analyses.

(Figure 1b)
NewHavenResidential is summarized graphically in Figure 1b. Most properties have no AC and more are in zone RM than not. Note the asymmetric distributions of dep and bathrms. The histograms for continuous variables totalCurrVal, livingArea, and size are skewed right, suggesting a logarithmic transformation.
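A call along the following lines should produce per-variable summaries like those in Figures 1a and 1b. This is only a sketch: the data frame demo is simulated, hypothetical stand-in data, not either of the article's data sets, and how factors are drawn may vary by Hmisc version. If Hmisc is not installed, the code falls back to a rough base-R approximation of the same idea.

```r
# Hypothetical stand-in data; substitute cps_wages or NewHavenResidential.
set.seed(1)
demo <- data.frame(
  wage  = rlnorm(200, meanlog = 2, sdlog = 0.5),  # right-skewed numeric
  age   = sample(18:64, 200, replace = TRUE),
  union = factor(sample(c("Non-Union", "Union"), 200, replace = TRUE))
)

if (requireNamespace("Hmisc", quietly = TRUE)) {
  # With Hmisc attached, hist() on a data frame dispatches to hist.data.frame.
  library(Hmisc)
  hist(demo)
} else {
  # Base-R fallback: one panel per column.
  op <- par(mfrow = c(1, ncol(demo)))
  for (v in names(demo)) {
    if (is.factor(demo[[v]])) {
      counts <- sort(table(demo[[v]]))           # level frequencies, sorted
      dotchart(as.numeric(counts), labels = names(counts), main = v)
    } else {
      hist(demo[[v]], breaks = 20, main = v, xlab = v)
    }
  }
  par(op)
}
```

The appeal of this kind of function is that a single call covers every column, so a first look at a new data set costs almost nothing.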

(Figure 2)
Figure 2 shows the density plots of these variables before and after the log transformation. Because the size variable includes values of 0, I added a small amount, 0.01, to it before taking the log. Since size represents lot acreage, this was not an unreasonable adjustment. The “rugs” at the bottom of each plot section are essentially strip plots of the individual observations that underlie the densities. The less extreme, more symmetric transformed variables are more tractable for graphics and prediction, and therefore replace the raw values in subsequent analyses.
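The transformation and the density-plus-rug display can be sketched in base R as follows. The size vector here is simulated, hypothetical data with a few zeros mixed in, and the 0.01 shift is applied to every value, which is one reading of the adjustment described above. The base 10 log matches the scale used later in the column.

```r
# Hypothetical stand-in for lot size, including a few zero-acre records.
set.seed(42)
size <- c(rep(0, 5), rlnorm(495, meanlog = -1.5, sdlog = 0.8))

# Shift by 0.01 so that log10() is defined for the zeros.
log_size <- log10(size + 0.01)

# Density before and after, each with a rug of the individual observations.
op <- par(mfrow = c(1, 2))
plot(density(size), main = "size")
rug(size)
plot(density(log_size), main = "log10(size + 0.01)")
rug(log_size)
par(op)

# Crude skewness check on the raw scale: a right-skewed variable
# has its mean pulled above its median.
mean(size) - median(size)
```

Without the shift, log10(0) returns -Inf and the density estimate fails, which is why the small constant is added before transforming.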
I turn to the YaleToolkit package to use its gpairs graph to show the relationships between each pair of attributes for both data sets. Core R also supports similar graphs, pairs and splom, but gpairs and its cousin corrgram from YaleToolkit add several extensions I like. It’s often the case in the R world that a package embellishes a popular function or core graphics routine, and I generally find the time investigating such freebies well spent. Interestingly, YaleToolkit also supports sparklines, the data-intense but conceptually simple visualizations introduced by Yale professor emeritus and “graphics Galileo” Edward Tufte.

(Figure 3a)
Figure 3a shows the gpairs graph for the cps_wages data. This rather menacing-looking creation is actually quite harmless and extremely informative. Look first at the diagonal of cells that runs from top-left to bottom-right. Within each cell is a summary for the named variable, with bar charts for factors and histograms for numerics. The one-way results confirm those from the hist.data.frame of Figure 1a. Each off-diagonal cell depicts the bivariate relationship between two variables: the one named in the diagonal cell of its column and the one named in the diagonal cell of its row. The second cell in the top row, for example, details the relationship between Education and South. Since Education is numeric while South is a factor, the graph presented is a strip or barcode plot with separate distributions of Education for each value of the factor (South and Elsewhere). Where both intersecting cells are numeric, the graph is a basic scatterplot. And if both attributes are factors, the graph is a mosaicplot in which the size of the grey rectangles indicates relative frequencies of the levels. The matrix is symmetric, so all bivariate relationships are given twice. gpairs acknowledges this repetition, attempting to show different spins on the graphs above and below the diagonal.
Note the strong relationship between Experience and Age, suggesting that one of the two may be redundant in a multivariate model of Wage. There’s a fairly strong positive relationship between Education and Wage, with modest positive correlations between Wage and both Experience and Age. Those not living in the South seem to have higher wages, as do Male, White, Married, Professional, and Union workers.
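A pairs-style display can be sketched as below. The data frame demo is simulated and hypothetical, with a positive education effect built in purely for illustration. pairs is core R. The gpairs call is guarded because it assumes a package named gpairs is installed; in some CRAN releases the function may be distributed there rather than in YaleToolkit, so check your own installation.

```r
# Hypothetical stand-in with two numeric variables and one factor.
set.seed(7)
n <- 150
education <- sample(8:18, n, replace = TRUE)
demo <- data.frame(
  education = education,
  log_wage  = 0.5 + 0.04 * education + rnorm(n, sd = 0.2),
  union     = factor(sample(c("Non-Union", "Union"), n, replace = TRUE))
)

# Core R scatterplot matrix; factors are converted to their integer codes.
pairs(demo)

# Generalized pairs plot, only if a package providing gpairs() is available.
if (requireNamespace("gpairs", quietly = TRUE)) {
  gpairs::gpairs(demo)
}

# Numeric counterpart of the education vs. log_wage panel.
cor(demo$education, demo$log_wage)
```

The plain pairs call treats the factor as a number, which is exactly the limitation a generalized pairs plot addresses by choosing a plot type suited to each combination of variable types.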

(Figure 3b)
Figure 3b details a gpairs graph for NewHavenResidential. totalCurrVal, now the base 10 log of assessed property value, shows a strong positive relationship with the base 10 log of livingArea, a modest negative correlation with dep (depreciation), and positive relationships with bathrms, bedrms, and the base 10 log of size. Higher values of totalCurrVal are associated with zone RS as well as acType AC. In all, totalCurrVal appears to have a pretty strong complement of individual predictors.
This column has introduced two useful R graphics for preliminary visualization of data sets. The first considers each attribute separately; the second shows pairwise relationships. In tandem with summary descriptive statistics for each variable, these plots give a solid high-level view of the data, providing hints of deeper associations to follow. Subsequent articles will examine multi-dimensional plots and more advanced statistical graphics for analyzing these data.