It seems I’m always second-guessing the writing in my monthly BI columns. Whenever I reread an analytics article from several months back, for example, I question not only the wordsmithing but also the choice of stats and visuals. Why that wording? That analytic? That graphic? On second look, there always appears to be a better way. I suppose it’s a good thing that a platform as comprehensive as R gives analysts so many choices.
There are certainly many ways to visualize a given set of data in R. R includes several separate graphics subsystems, one being the powerful Lattice, which provides the basis for many types of dimensional visuals. Lattice programmers can work at three levels of complexity, each one progressively expanding what the visuals can do. The first approach is to simply script the Lattice function syntax as prescribed, staying cognizant of all options and parameters. Even with simple Lattice programming, the data must often be presented in a “stacked” format, with one row per observation and the grouping held in a factor column, that suits the Lattice calling functions. In a slightly more complex mode, Lattice provides programmable access to the panel functions underpinning each of its graphics, such as xyplot, stripplot, bwplot (box-and-whisker plots), and dotplot, along with a set of panel primitives that let programmers embellish and change each graphic’s behavior. Finally, R provides access to its low-level grid graphics system, giving industrious programmers the ability to develop entirely new visuals and graphics subsystems. Though I work at all three levels, I often find programming panel functions to be quite productive for the effort involved, and I illustrate several possibilities in the graphs that follow.
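To make the “stacked” format concrete, here is a minimal sketch. The wide data frame, its values, and the column names logValue and sizeGroup are made up for illustration; they are not taken from the New Haven data.

```r
library(lattice)

# Hypothetical wide-format data: one column per group (values are invented).
wide <- data.frame(Q1 = c(5.0, 5.1, 5.2),
                   Q4 = c(5.6, 5.7, 5.9))

# Base R's stack() returns one row per observation, with the original
# column name carried along as a factor.
long <- stack(wide)
names(long) <- c("logValue", "sizeGroup")

# The stacked layout maps directly onto a Lattice formula: response ~ factor.
p <- stripplot(logValue ~ sizeGroup, data = long)
print(p)
```

The same long layout serves xyplot, bwplot, and dotplot, which is why reshaping tends to be the first step of even simple Lattice work.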
The data used for this column’s graphics come from the 18,221-case New Haven Residential data set detailed in “Data Analysis and R (the first in a series)”, http://dashboardinsight.com/articles/new-concepts-in-business-intelligence/et-data-analysis-r-the-first-series.aspx . Attributes of interest include appraised housing value (totalCurrVal) and its base 10 logarithm (logtotalCurrVal); square footage (livingArea), its log (loglivingArea), and a quartile grouping of livingArea (livingArea_cat); and zone, a three-valued factor denoting residential area.
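For readers who want to see how such derived attributes might be built, the following is a rough sketch on simulated data. The data frame homes and its values are synthetic stand-ins, not the New Haven records, and using base 10 for loglivingArea is my assumption here, since only logtotalCurrVal is described as base 10.

```r
set.seed(1)

# Synthetic stand-in for the housing data; values are simulated, not real.
n <- 200
homes <- data.frame(livingArea = round(exp(rnorm(n, mean = 7.3, sd = 0.35))))
homes$totalCurrVal <- round(homes$livingArea * exp(rnorm(n, mean = 4.6, sd = 0.3)))

# Log transforms (base 10 assumed for both).
homes$logtotalCurrVal <- log10(homes$totalCurrVal)
homes$loglivingArea   <- log10(homes$livingArea)

# Quartile grouping: cut livingArea at its own 25th, 50th, and 75th percentiles.
breaks <- quantile(homes$livingArea, probs = seq(0, 1, by = 0.25))
homes$livingArea_cat <- cut(homes$livingArea, breaks = breaks,
                            labels = c("Q1", "Q2", "Q3", "Q4"),
                            include.lowest = TRUE)

# Group sizes should be roughly equal; ties at the breaks can shift a few cases.
print(table(homes$livingArea_cat))
```

Setting include.lowest = TRUE keeps the smallest home in Q1 rather than dropping it as NA, an easy detail to miss with cut().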
Figure 1a is a basic stripplot detailing the distribution of logtotalCurrVal for each of four roughly equal-sized quartile categories of livingArea, with Q4 representing the quartile of largest homes and Q1 the smallest. The graph uses jittering to spread the data within each grouping band. Not surprisingly, there appears to be a strong association between livingArea_cat and logtotalCurrVal. Note also the wider variation of logtotalCurrVal in Q1 and Q4. The reader must mentally compute the antilogs of logtotalCurrVal to recover the original values.
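A plot in the spirit of Figure 1a can be sketched with a single stripplot call. The data below are simulated, and the jitter factor, point size, and single-panel layout are my guesses for illustration rather than the settings behind the published figure.

```r
library(lattice)
set.seed(2)

# Simulated data: log values drift upward with the size quartile.
n <- 400
sim <- data.frame(
  livingArea_cat = factor(sample(c("Q1", "Q2", "Q3", "Q4"), n, replace = TRUE))
)
sim$logtotalCurrVal <- 4.9 + 0.15 * as.integer(sim$livingArea_cat) +
  rnorm(n, sd = 0.15)

# jitter.data = TRUE spreads points sideways within each band so that
# overlapping values remain visible; factor controls the amount of spread.
fig1a <- stripplot(logtotalCurrVal ~ livingArea_cat, data = sim,
                   jitter.data = TRUE, factor = 1.5,
                   pch = 16, cex = 0.5,
                   xlab = "livingArea quartile",
                   ylab = "log10(totalCurrVal)")
print(fig1a)

# The "mental" antilog step, done by machine for a few axis values.
print(10^c(5.0, 5.5, 6.0))
```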
Figure 1b adds a number of enhancements to 1a. First, it uses color to show the zone of each strip point. RM appears to be a more expensive area than RS, with the smaller Other in between. The graph also introduces a set of grayed rectangles that give the eye an anchor when comparing the multiple panels. The darker, inner rectangle envelops the middle 50% (the interquartile range) of logtotalCurrVal values; the larger, enclosing rectangle surrounds the middle 95% of logtotalCurrVal observations. The irregularly spaced scale points are translated back from log values and mark the min, max, and median, in addition to the 25th, 75th, 2.5th, and 97.5th percentiles. In contrast to 1a, 1b is a lot to digest.
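The enhancements described for Figure 1b map naturally onto a custom panel function. What follows is a hedged sketch on simulated data, not the code behind the published figure: the gray shades, the zone proportions, and the choice of four bands inside a single panel are all assumptions of mine, and the original may be laid out differently.

```r
library(lattice)
set.seed(3)

# Simulated data with a zone factor (proportions and effects are invented).
n <- 600
sim <- data.frame(
  livingArea_cat = factor(sample(c("Q1", "Q2", "Q3", "Q4"), n, replace = TRUE)),
  zone = factor(sample(c("RM", "RS", "Other"), n, replace = TRUE,
                       prob = c(0.30, 0.55, 0.15)))
)
sim$logtotalCurrVal <- 4.9 + 0.15 * as.integer(sim$livingArea_cat) +
  0.05 * (sim$zone == "RM") + rnorm(n, sd = 0.15)

# Overall percentiles drive both the rectangles and the axis ticks:
# min, 2.5th, 25th, median, 75th, 97.5th, max.
q <- unname(quantile(sim$logtotalCurrVal,
                     probs = c(0, 0.025, 0.25, 0.5, 0.75, 0.975, 1)))

fig1b <- stripplot(
  logtotalCurrVal ~ livingArea_cat, data = sim, groups = zone,
  jitter.data = TRUE,
  par.settings = simpleTheme(pch = 16, cex = 0.5),
  auto.key = list(columns = 3),
  # Ticks sit at the log-scale percentiles but are labeled with antilogs.
  scales = list(y = list(at = q,
                         labels = format(round(10^q), big.mark = ","))),
  ylab = "totalCurrVal (log scale)",
  panel = function(x, y, ...) {
    xlim <- current.panel.limits()$xlim
    # Outer rectangle: middle 95% of observations.
    panel.rect(xleft = xlim[1], xright = xlim[2], ybottom = q[2], ytop = q[6],
               col = "grey90", border = "transparent")
    # Inner rectangle: interquartile range.
    panel.rect(xleft = xlim[1], xright = xlim[2], ybottom = q[3], ytop = q[5],
               col = "grey75", border = "transparent")
    # Draw the points last so they sit on top of the shading.
    panel.stripplot(x, y, ...)
  }
)
print(fig1b)
```

Drawing the rectangles before calling panel.stripplot is the key design choice: Lattice paints in call order, so the shading stays behind the points. Adjacent tick labels such as the min and the 2.5th percentile may crowd each other, which is part of why this style of graph takes longer to read.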