Data Analysis with R (the first in a series)
R and BI

by Steve Miller, President, OpenBI
Monday, March 10, 2008

I’ve contributed six “pieces” to Dashboard Insight (DI) since first discovering the site back in the fall. Of the six, five have related to the R Project for Statistical Computing. A numbers guy, I won’t change that pattern now. This column is the first in a series that focuses on using the R language, statistical procedures, and graphics capabilities to explore, analyze, and present data. The hope is that the articles might convince a few BI analysts to take the plunge with an outstanding, freely available product.

R’s open source lineage provides many benefits to users, not the least of which is that it’s easily accessible at no cost from the Comprehensive R Archive Network (CRAN), http://cran.r-project.org/. Perhaps the most gratifying yet frustrating aspect of the R project for BI is the significant worldwide participation by a community that is constantly developing add-on procedures and packages. Gratifying in that R users have access to the very latest statistical and machine learning procedures long before they’re available commercially. Frustrating in that it’s very difficult to keep track of the largesse. Fortunately, R’s web-based documentation now includes “views”, such as Multivariate (http://cran.r-project.org/web/views/Multivariate.html), Social Sciences, Econometrics, and Machine Learning, that track and catalog the latest advances by the community. You can’t tell the players without a scorecard!

In previous DI articles, I’ve noted that R’s graphics are unsexy in contrast with dashboard and visualization software currently available on the commercial market. At the same time, I see much of that commercial sexiness as gratuitous, serving no information-sharing role. It’s sizzle for sizzle’s sake. Unsexy R dashboards, on the other hand, can often reliably convey an incredible amount of information on a very compact canvas. Other things equal, I’ll take effective and economical communication over sexy graphics. Perhaps those who want both steak and sizzle can develop information-rich prototypes in R to challenge their favorite commercial vendors!

Graphics and visualization in R benefit from R’s active worldwide community in much the same way as statistical and machine learning procedures do. The R core team supports the base and lattice graphics packages, but community members have extended these foundations extensively. Many advanced graphical and visualization techniques in R are now available for download. R visuals are particularly well suited to exploratory analysis, helping make sense of complicated multivariate data. The graphics are also closely tied to statistical procedures, and are often the preferred approach for communicating results. I believe the same characteristics that endear R graphics to the statistical world are relevant for BI as well.

The Data

I struggled at first to find data to use for this series. In the end, rather than use data from past consulting, I decided to scour the Internet in search of a few pertinent data sets similar to those that BI practitioners might investigate in day-to-day practice. I found two, one of which is part of an R package that extends the core product.

cps_wages: data.frame (list of 11 variables, 534 records, 28896 bytes)

 #   Variable     Type                First values
 1   Education    integer             8 9 12 12 12 13 ...
 2   South        factor (2 levels)   Elsewhere Elsewhere ...
 3   Sex          factor (2 levels)   Female Female Male ...
 4   Experience   integer             21 42 1 4 17 9 27 ...
 5   Union        factor (2 levels)   Non-Union Non-Union ...
 6   Wage         double              5.1 4.95 6.67 4 ...
 7   Age          integer             35 57 19 22 35 28 ...
 8   Race         factor (3 levels)   Hispanic White White ...
 9   Occupation   factor (6 levels)   Other Other Other ...
 10  Sector       factor (3 levels)   Manufacturing Manufacturing ...
 11  Marital      factor (2 levels)   Married Married ...

     row.names    integer             1 2 3 4 5 6 7 8 ...
(Table 1a)


The first data set, cps_wages (http://lib.stat.cmu.edu/datasets/CPS_85_Wages), consists of 534 records and 11 variables (Table 1a). The data are used to analyze the determinants of wages in a 1985 population sample, with special focus on differences between the sexes. Among the predictors of Wage are the integer variables Age, Experience, and Education, as well as the factors (categorical variables) Sex, Race, Marital, Occupation, and Union. A major point of interest is whether there’s a difference in Wage by Sex after the other variables have been accounted for.
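To make the layout concrete, here is a minimal sketch that rebuilds just the first three records shown in Table 1a for a handful of the variables. The names cps_head and wage_formula are made up for illustration, and the formula is only one possible way to write down the Wage-by-Sex question, not necessarily the model used later in this series. Fitting it would require the full 534 records.

```r
# First three records as displayed in Table 1a (illustration only;
# the full data set has 534 records and 11 variables).
cps_head <- data.frame(
  Education  = c(8L, 9L, 12L),
  Sex        = factor(c("Female", "Female", "Male")),
  Experience = c(21L, 42L, 1L),
  Wage       = c(5.10, 4.95, 6.67),
  Age        = c(35L, 57L, 19L)
)

# Show the type of each column: integers, a double, and a factor
str(cps_head)

# Hypothetical model formula: Wage as a function of Sex, with
# Education and Experience accounted for. Not fitted here, because
# three records cannot support it.
wage_formula <- Wage ~ Education + Experience + Sex
all.vars(wage_formula)
```

With the complete data frame in hand, a formula of this kind could be passed to a modeling function such as lm().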

NewHavenResidential: data.frame (list of 8 variables, 18221 records, 1386248 bytes)

 #   Variable       Type                First values
 1   totalCurrVal   integer             109200 107380 80500 ...
 2   livingArea     integer             1480 1792 1239 984 ...
 3   dep            integer             24 11 26 35 13 27 ...
 4   size           double              0.25 0.18 0.36 0.32 ...
 5   zone           factor (3 levels)   RS RS RS RS RS RS ...
 6   acType         factor (2 levels)   No AC No AC No AC ...
 7   bedrms         integer             3 3 3 3 5 3 2 1 ...
 8   bathrms        double              2.5 2 1 1 2 1.5 ...

     row.names      character           1 2 3 4 5 6 7 8 9 10 ...
(Table 1b)


The second data set, NewHavenResidential, is included as part of the add-on R package YaleToolkit (http://cran.r-project.org/web/packages/YaleToolkit/index.html), developed, not surprisingly, at Yale University. Table 1b describes NewHavenResidential, which consists of 18,221 records with 8 variables on the residential property value market for 2006 in New Haven, CT. The main dependent variable of interest is totalCurrVal, the assessed property value. Predictors include the numerical attributes livingArea (square footage), bathrms, bedrms, size (property acreage), and dep (depreciation percent), as well as the factors zone (residential area) and acType (whether or not the residence is air conditioned). Our task is to investigate totalCurrVal as a function of the predictors.
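For readers who want to follow along, the sketch below shows one way to load and inspect this data set. It assumes the YaleToolkit package has already been installed from CRAN (for example with install.packages("YaleToolkit")) and that the installed version still ships NewHavenResidential. The have_pkg flag is a made-up name used only to skip the example gracefully when the package is missing.

```r
# Check for the add-on package without stopping with an error if it is absent.
have_pkg <- requireNamespace("YaleToolkit", quietly = TRUE)

if (have_pkg) {
  # Load the data frame from the package into the workspace
  data("NewHavenResidential", package = "YaleToolkit")

  # Structure of the data frame: variable names, types, first values
  str(NewHavenResidential)

  # Distribution of the dependent variable, the assessed property value
  print(summary(NewHavenResidential$totalCurrVal))

  # Median assessed value within each residential zone
  print(aggregate(totalCurrVal ~ zone, data = NewHavenResidential, FUN = median))
} else {
  message("YaleToolkit is not installed; skipping the example.")
}
```

The str() output is a quick check that the variables and types line up with Table 1b.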
