I figured that a good place to start delving into data analysis in R is a short post on basic data preparation. So I grabbed a small Bureau of Labor Statistics data set of seasonally adjusted government hiring figures for 2003-2013. I imported the CSV file with the header argument set to TRUE to indicate that the first row contains variable names, and previewed the data with the head() function to confirm that it imported properly:

> jobs <- read.csv("govjobs10yr.csv", header=T)
> head(jobs)
  Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1 2003 329 284 287 285 263 301 331 242 257 325 279 310
2 2004 292 310 342 304 288 294 305 299 324 331 344 324
3 2005 336 316 311 320 318 298 349 310 302 306 338 328
4 2006 305 344 375 355 353 358 370 371 397 316 350 323
5 2007 363 373 373 378 378 364 342 389 366 321 342 358
6 2008 326 318 328 308 312 301 290 289 273 297 267 269
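
head() only shows the first few rows, though, so it can also be worth confirming that every column imported with the type you expect. A quick complementary sketch using base R (output omitted here):

> str(jobs)            # prints the dimensions plus each column's class and first values
> sapply(jobs, class)  # every month column should come back as integer/numeric, not factor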

I didn’t see any glaring problems (other than the fact that this is arranged as a wide table, one column per month, rather than raw data, which I’ll address in the next post on basic exploration). However, I did notice that the variable names contained capital letters. I confirmed this with the names() function, which can be especially useful for checking large numbers of variables:

> names(jobs)
 [1] "Year" "Jan"  "Feb"  "Mar"  "Apr"  "May"  "Jun"  "Jul"  "Aug"  "Sep" 
[11] "Oct"  "Nov"  "Dec"

It’s good that the names are relatively simple, but I converted them to all lowercase before moving on, since consistently lowercase names are slightly easier to type and less error-prone during analysis:

> names(jobs) <- tolower(names(jobs))
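
This data set’s names were already tidy, but for messier files (spaces, punctuation, mixed case) the same idea extends naturally. A hedged sketch using base R only; the exact pattern is an assumption, not something this data needs:

> # squash runs of anything that isn't a lowercase letter or digit into a dot
> names(jobs) <- gsub("[^a-z0-9]+", ".", tolower(names(jobs)))

(read.csv can also sanitize names itself at import time via its check.names argument, which is TRUE by default.)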

Based on the head() preview above, everything appeared to have imported correctly. But before analyzing the data, I wanted to check for missing values, so I tabulated how many values are and aren’t missing (NA):

> table(is.na(jobs))
FALSE  TRUE 
  142     1

One value is missing. A closer look at the data reveals that the missing value is December 2013, which makes sense given that December 2013 was just one month ago. I might be able to look up the hiring figure elsewhere to fill in the gap, but I would not know whether it was calculated under the same conditions as the rest of the series, so I will leave the value missing for now. When analyzing this data, it will be important to keep the NA in mind and to use the na.rm=TRUE argument where available so that it does not skew the results.
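
Since perusing a large table by eye does not scale, one way to pinpoint the NA programmatically is worth sketching here (base R only: is.na() on a data frame returns a logical matrix, and which() with arr.ind=TRUE converts that to row/column positions):

> which(is.na(jobs), arr.ind=TRUE)  # returns row 11, column 13 - the dec column of the 2013 row

And a quick illustration of why na.rm=TRUE matters:

> mean(jobs$dec)              # NA, because one December value is missing
> mean(jobs$dec, na.rm=TRUE)  # averages the ten non-missing December values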

Although this will be covered again as part of exploratory analysis, I think it’s also important to use a box-and-whisker plot during data preparation. Checking for outliers is a good way to catch mistakes or errors in the data before you begin your analysis. As I noted earlier, the first column contains the year because this is a table rather than raw data. The year should not be plotted alongside the other columns, because its values don’t represent hiring numbers. Thus, when I plot jobs, I drop the first column by indexing with [,-1]:

> boxplot(jobs[,-1])
[Figure: Data Preparation in R - Boxplot]

The plot has very little meaning on its own (especially since it collapses eleven years of data for each month into a single box), but it’s a great way to spot outliers. There is one extreme outlier in the May box, at what appears to be a value of over 700. To determine which year it belongs to, I plot only the May data by year:

> plot(jobs$may~as.factor(jobs$year))

[Figure: Data Preparation in R - Outlier]
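
The plot makes the year obvious, but a quick non-graphical cross-check is handy in a script (a base R sketch; which.max() skips NAs, though the may column has none anyway):

> jobs$year[which.max(jobs$may)]
[1] 2010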

The outlier occurred in May 2010. To determine whether this number was real or an error, I did a Google search for “May 2010 jobs” to see if there were any major spikes in hiring. In fact, May 2010 saw a record gain in hiring, driven largely by temporary workers brought on for the 2010 Census, so the outlier is likely legitimate. Later on, during exploration, a sensitivity analysis (running the analysis both with and without the outlier) will come in handy.
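
As a preview of that sensitivity check, a minimal sketch (the choice of statistic is an assumption; use whatever your analysis actually relies on):

> summary(jobs$may)                     # all eleven Mays, 2010 included
> summary(jobs$may[jobs$year != 2010])  # the same summary with the outlier year dropped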

That covers some of the main goals in data preparation. This was a fairly small and clean data set to begin with, but many will not be. I’ll consider covering more advanced data preparation in a future post. Before finishing up, I pose a couple of questions to more experienced analysts, and link you to my aRa GitHub repository if you’re interested in the data, R source, or plots.

Questions:

1. I had the convenience of a small data set when determining which value was missing, so perusing the data (or the which() sketch above) was enough to find it. However, that won’t always be efficient with large data sets, so: what approaches or packages do you prefer for pinpointing and handling NA values in larger data sets in R?

2. In what other ways do you conduct data preparation prior to analysis, and which R functions do you rely on during that stage?

File downloads:
Data, R Source, Plots