One of the most challenging aspects of being a data analyst is translating programmatic terms like “student mobility” into precise business rules. Almost any simple statistic involves a series of decisions that are often opaque to the ultimate users of that statistic.
Documentation of business rules is a critical aspect of a data analyst’s job that, in my experience, is often regrettably overlooked. If you have ever tried to reproduce someone else’s analysis, asked different people for the same statistic, or tried to compare data from multiple years, you have probably had difficulty getting a consistent answer on standard statistics: how many students were proficient in math, how many students graduated in four years, what proportion of students were chronically absent. All too often documentation of business rules is poor or non-existent. The result is that two analysts with the same data will produce inconsistent statistics. This is not because of something inherent in the quality of the data, nor is it an indictment of the analysts’ skills. In most cases, the undocumented business rules are essentially trivial, in that any single decision has a small impact on the final result and each of the choices the analysts made is equally defensible.
This major problem of lax or non-existent documentation is one of the main reasons I feel that analysts, and in particular analysts working in the public sector, should extensively use tools for code sharing and version control like GitHub, use free tools whenever possible, and generally adhere to best practices in reproducible research.
I am trying to put as much of my code on GitHub as I can these days. Much of what I write is still very disorganized and, frankly, embarrassing. A lot of what is in my GitHub repositories is old, abandoned code written as I was learning my craft. A lot of it is written to work with very specific, private data. Most of it is poorly documented because I am the only one who has ever had to use it, I don’t interact with anyone through practices like code reviews, and frankly I am lazy when pressed with a deadline. But that’s not really the point, is it? The worst documented code is code that is hidden away on a personal hard drive, written for an expensive proprietary environment most people and organizations cannot use, or worse, is not code at all but rather a series of destructive data edits and manipulations.[1]
One way that I have been trying to improve the quality and utility of the code I write is by contributing to an open source R package, eeptools. This is a package written and maintained by Jared Knowles, an employee of the Wisconsin Department of Public Instruction, whom I met at a Strategic Data Project convening. eeptools is consolidating several functions in R for common tasks education data analysts are faced with. Because this package is available on CRAN, the primary repository for R packages, any education analyst can have access to its functions in one line:
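Assuming the standard CRAN workflow, that line is the usual install call, followed by loading the package in each session:

```r
# Install once from CRAN (needs an internet connection),
# then load the package at the start of each R session.
install.packages("eeptools")
library(eeptools)
```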
Submitting code to a CRAN package reinforces several habits. First, I get to practice writing R documentation, explaining how to use a function and, therefore, articulating the assumptions and business rules I am applying. Second, I have to write my code with a wider tolerance for input data. One of the easy pitfalls for a beginning analyst is writing code that is too specific to the dataset in front of you. Most of the errors I have found in analyses during quality control stem from assumptions embedded in code that were perfectly reasonable for a single data set but led to serious errors when applied to different data. One way to avoid this issue is test-driven development: writing a good test suite that exercises a wide range of unexpected inputs. I am not quite there yet, personally, but thinking about how my code would have to work with arbitrary inputs and ensuring it fails gracefully[2] is an excellent side benefit of preparing a pull request[3]. Third, it is an opportunity to write code for someone other than myself. Because I am often the sole analyst with my skillset working on a project, it is easy to ignore things like style, optimization, and clarity. This can lead to large build-ups of technical debt, complacency toward learning new techniques, and general sloppiness. Submitting a pull request feels like publishing. The world has to read this, so it had better be something I am proud of that can stand up to the scrutiny of third-party users.
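To make "wider tolerance for input data" and "failing gracefully" concrete, here is a minimal sketch in base R. The helper as_date_gracefully is hypothetical and not part of eeptools; it assumes dates should arrive as Date objects or as YYYY-MM-DD strings. It quietly fixes an easily anticipated problem (factors), warns in plain language about values it cannot parse, and stops with a readable message for input it cannot use at all. The checks at the end stand in for a small test suite.

```r
# Hypothetical helper (not part of eeptools): coerce a column to Date,
# fixing what can be fixed and explaining plainly what cannot.
as_date_gracefully <- function(x, name = "date") {
  if (inherits(x, "Date")) return(x)
  # Factors are an easily anticipated input, so convert them silently.
  if (is.factor(x)) x <- as.character(x)
  if (!is.character(x)) {
    stop(sprintf("'%s' must be a Date or a character vector like \"2012-09-01\".",
                 name), call. = FALSE)
  }
  # With an explicit format, values that do not match become NA.
  parsed <- as.Date(x, format = "%Y-%m-%d")
  bad <- !is.na(x) & is.na(parsed)
  if (any(bad)) {
    warning(sprintf("%d value(s) in '%s' are not in YYYY-MM-DD format and were set to NA.",
                    sum(bad), name), call. = FALSE)
  }
  parsed
}

# A few checks on unexpected inputs, in the spirit of a test suite.
stopifnot(as_date_gracefully(factor("2012-09-01")) == as.Date("2012-09-01"))
stopifnot(is.na(suppressWarnings(as_date_gracefully("09/01/2012"))))
stopifnot(inherits(try(as_date_gracefully(20120901), silent = TRUE), "try-error"))
```

Whether to repair, warn, or stop for a given input is itself a business rule, so it is worth stating in the function's documentation.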
My first pull request, which was accepted into the package, calculates age in years, months, or days at an arbitrary date based on date of birth. While even a beginning R programmer can develop a similar function, it is the perfect example of an easily compartmentalized component with a broad set of applications that will be used frequently.
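As an illustration of how small such a component can be, here is a rough sketch of an age calculation in base R. It is not the function that ships with eeptools; the name age_at, its arguments, and the rule that only completed calendar months and years are counted are all assumptions made for this example.

```r
# Hypothetical sketch, not the function shipped in eeptools.
age_at <- function(dob, on = Sys.Date(), units = c("years", "months", "days")) {
  units <- match.arg(units)
  dob <- as.Date(dob)
  on  <- as.Date(on)
  if (any(on < dob, na.rm = TRUE)) {
    stop("Every 'on' date must fall on or after the matching date of birth.",
         call. = FALSE)
  }
  if (units == "days") return(as.numeric(on - dob))

  b <- as.POSIXlt(dob)
  e <- as.POSIXlt(on)
  # Completed calendar months: subtract one when the day of the month
  # of birth has not yet been reached in the final month.
  months <- 12 * (e$year - b$year) + (e$mon - b$mon) - (e$mday < b$mday)
  if (units == "months") months else months %/% 12
}

age_at("2000-03-15", on = "2013-03-14", units = "years")   # one day short of 13
age_at("2000-03-15", on = "2013-05-20", units = "months")
```

Even here there are business rules to document. Under this sketch, for example, someone born on February 29 does not complete a year until March 1 in non-leap years.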
Today I submitted my second pull request, which I hope will be accepted. This time I covered a much more complex task: calculating student mobility. To be honest, I am completely unaware of existing business rules and algorithms used to produce the mobility numbers that are federally reported. I wrote this function from scratch, thinking through how I would calculate the number of schools attended by a student in a given year. I am really proud of both the business rules I have developed and the code I wrote to apply those rules. My custom function can accept fairly arbitrary inputs, fails gracefully when it finds data it does not expect, and is pretty fast. The original version of my code took close to 10 minutes to run on ~30,000 rows of data. With a complete rewrite prior to submission, I reduced that to 16 seconds.
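To give a sense of the kind of decisions involved, here is a deliberately simplified sketch of counting schools attended within a school year. It is not the code from the pull request, and its rules are assumptions made for illustration: the column names (sid, schid, enroll_date, exit_date) are hypothetical, a missing exit date is treated as "still enrolled", and any enrollment spell overlapping the year counts.

```r
# Hypothetical sketch with assumed column names; not the moves_calc code.
schools_attended <- function(df, year_start, year_end) {
  needed <- c("sid", "schid", "enroll_date", "exit_date")
  missing_cols <- setdiff(needed, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing column(s): ", paste(missing_cols, collapse = ", "), call. = FALSE)
  }
  year_start <- as.Date(year_start)
  year_end   <- as.Date(year_end)
  enroll <- as.Date(df$enroll_date)
  exit   <- as.Date(df$exit_date)
  # Assumed rule: a missing exit date means the student was still enrolled.
  exit[is.na(exit)] <- year_end
  # Keep spells that overlap the school year; rows without an enroll date are dropped.
  in_year <- !is.na(enroll) & enroll <= year_end & exit >= year_start
  # Count distinct schools per student in one vectorized pass, no row-by-row loop.
  spells <- unique(df[in_year, c("sid", "schid")])
  counts <- table(spells$sid)
  data.frame(sid = names(counts),
             schools = as.integer(counts),
             moves = as.integer(counts) - 1L,
             stringsAsFactors = FALSE)
}

enrollments <- data.frame(
  sid         = c("A", "A", "A", "B", "C"),
  schid       = c("S1", "S2", "S2", "S1", "S3"),
  enroll_date = c("2012-09-01", "2013-01-15", "2013-03-01", "2012-09-01", "2011-09-01"),
  exit_date   = c("2013-01-10", "2013-02-20", NA, NA, "2012-06-15"),
  stringsAsFactors = FALSE
)
schools_attended(enrollments, "2012-09-01", "2013-06-30")
```

In this toy data, student A attends two schools (one move), student B attends one, and student C is left out because their only spell ended before the year began. Each of those outcomes reflects a rule that a real implementation would need to document, such as how to treat gaps between spells or re-enrollment at the same school.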
While I am not sure if this request will be accepted, I will be thrilled if it is. Mobility is a tremendously important statistic in education research, and a standard, reproducible way to calculate it would be a great help to researchers. How great would it be if eeptools became one of the first packages education data analysts load and my mobility calculations were used broadly by researchers and analysts? But even if it’s not accepted because it falls outside the package’s scope, the process of developing the business rules, writing an initial implementation of those rules, and then refining that code to be far simpler, faster, and less error prone was incredibly rewarding.
My next post will probably be a review of that process and some parts of my moves_calc function that I’m particularly proud of.
[1] Using a spreadsheet program, such as Excel, encourages directly manipulating and editing the source data. Each change permanently alters the data. Even if you keep an original version of the data, there is no record of exactly what was done to change the data to produce your results. Reproducing any significant analysis done in spreadsheet software is all but impossible.
[2] Instead of halting the function with a hard-to-understand error when things go wrong, I do my best to “correct” easily anticipated errors or report back to users in a plain way what needs to be fixed. See also fault-tolerant system.