About five years ago, I went on a sales call that didn’t work out too well. The customer, a large direct marketer, was looking for someone to audit its propensity-to-buy program and invited several consulting vendors in to review the company’s analytics platform. The talks progressed well as we discussed data integration, databases, and query/reporting. But when the focus turned to statistical modeling, I made the mistake of asking to see what they did for graphics and visualization of the data underlying their propensity models. I was interested in both exploratory visualization, much like I’ve written about in other DI articles, and diagnostic graphics that offer evidence on the validity of the fitted models.
I realized immediately from the stony silence that my questions had not been well received. Finally, after the analysts seemingly came to grips with the third eye in the middle of my forehead, they opined that the company’s models were quite mature and properly functioning, and that they felt no need to visually review their findings. At that point, I’m sure they also felt no need for my consulting services.
I thought about that call as I started looking at a data set I downloaded as part of the R package AER, Applied Econometrics with R, wondering how best to combine predictive modeling with graphics. The raw data set consists of 7 attributes and 28,155 records from the 1988 Current Population Survey. Among the variables of interest are weekly wage, region of residence, experience in years, and education in years. A natural statistical inquiry attempts to predict wage from the remaining explanatory variables. After eliminating part-time employee records and deriving a few additional attributes, including the base 10 logarithm of wage, I set out to look at the remaining 25,631 records with an eye toward predicting wage from experience and education. I used the R platform for statistical computing for the analyses and graphs.
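For readers who want to follow along, the preparation can be sketched in R roughly as follows. I’m assuming the AER data set in question is CPS1988 and that full-time status is recorded in its `parttime` factor, per the package documentation:

```r
# Load the 1988 Current Population Survey data shipped with
# the AER package (install.packages("AER") if needed).
data("CPS1988", package = "AER")

# Keep full-time workers only; CPS1988 flags part-time
# status in the factor column `parttime`.
cps <- subset(CPS1988, parttime == "no")

# Derive the base 10 logarithm of weekly wage for modeling.
cps$log10wage <- log10(cps$wage)

nrow(CPS1988)  # records in the raw survey extract
nrow(cps)      # full-time records retained for analysis
```

The two row counts should correspond to the 28,155 raw and 25,631 retained records described above.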
Figure 1 is a scatter plot of experience versus the base 10 log of wage, with a straight-line least squares regression superimposed. The least squares line is the very simplest of regressions, and could just as easily have been fit by Excel or any popular BI tool. Though it’s called linear regression, standard least squares can also fit more complicated polynomial and exponential equations, provided the parameters to be estimated enter the model linearly.
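In R, a plot like Figure 1 amounts to a one-predictor call to lm followed by abline. A minimal sketch, again assuming the data set is AER’s CPS1988 with part-time records removed and a derived log10 wage column:

```r
# Prepare the data as described in the text.
data("CPS1988", package = "AER")
cps <- subset(CPS1988, parttime == "no")
cps$log10wage <- log10(cps$wage)

# Ordinary least squares: straight-line regression of
# log wage on experience.
fit_line <- lm(log10wage ~ experience, data = cps)
summary(fit_line)

# Scatter plot with the fitted line superimposed.
plot(log10wage ~ experience, data = cps, pch = ".",
     xlab = "experience (years)", ylab = "log10(weekly wage)")
abline(fit_line, lwd = 2)
```

The fit holds an intercept and a single slope coefficient; summary() reports both along with their standard errors.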
Log wage certainly increases with experience at first, as indicated by the plot and the fitted line. But there appears to be curvature in the relationship at higher values of experience that is not (and cannot be) captured by a straight-line fit. Figure 2a reproduces the plot and adds a cubic polynomial regression curve.
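A cubic fit like that of Figure 2a is still ordinary least squares, since the model remains linear in its coefficients. A sketch, under the same CPS1988 preparation assumed earlier:

```r
# Prepare the data as described in the text.
data("CPS1988", package = "AER")
cps <- subset(CPS1988, parttime == "no")
cps$log10wage <- log10(cps$wage)

# Cubic polynomial in experience, fit by least squares;
# poly() uses orthogonal polynomials for numerical stability.
fit_cubic <- lm(log10wage ~ poly(experience, 3), data = cps)

# Superimpose the fitted curve on the scatter.
plot(log10wage ~ experience, data = cps, pch = ".",
     xlab = "experience (years)", ylab = "log10(weekly wage)")
xs <- seq(min(cps$experience), max(cps$experience), length.out = 200)
lines(xs, predict(fit_cubic, newdata = data.frame(experience = xs)),
      lwd = 2)
```

The curve is drawn from predictions over a grid of experience values rather than with abline, which can only draw straight lines.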