Symlinking Your Data


I frequently work with private data. Sometimes it lives on my personal machine rather than on a database server. Other times, even when it lives on a remote database server, it is better to work from a locally cached copy than to query the database each time I want to analyze the data set. I have always dealt with this by creating encrypted disk images with secure passwords (stored in 1Password). This is a nice extra layer of protection for private data stored on a laptop, and it adds little complication to my workflow. I just have to remember to mount and unmount the disk images.

However, it can be inconvenient from a project perspective to refer to data in a distant location like /Volumes/ClientData/Entity/facttable.csv. In most cases, I would prefer that the data “reside” in data/ or cache/ inside my project directory.

Luckily, there is a way to point to data/facttable.csv in my R code without actually having facttable.csv reside there: symlinking.

A symlink (symbolic link) is a small file that sits in the preferred location and stores the path to the actual file. This way, when I refer to data/facttable.csv, the file system redirects all of that activity to the actual file at /Volumes/ClientData/Entity/facttable.csv.

From the command line, a symlink can be generated with a simple command:

ln -s target_path link_path
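
As a minimal sketch of what this looks like in practice, the script below uses a temporary directory as a stand-in for the mounted encrypted volume and another as a stand-in for the project. The directory names and file contents are hypothetical, and readlink is assumed to be available, as it is on macOS and Linux.

```shell
#!/bin/sh
set -eu

# Hypothetical stand-ins: temporary directories play the roles of the
# mounted volume and the project directory.
workdir=$(mktemp -d)
volume="$workdir/ClientData/Entity"
project="$workdir/project"
mkdir -p "$volume" "$project/data"

# The real file lives on the "volume".
printf 'id,value\n1,10\n' > "$volume/facttable.csv"

# ln -s target_path link_path
ln -s "$volume/facttable.csv" "$project/data/facttable.csv"

# Reading through the link returns the contents of the real file,
# and readlink shows the path that the link stores.
link_contents=$(cat "$project/data/facttable.csv")
link_target=$(readlink "$project/data/facttable.csv")
echo "$link_target"
```

Removing the link later with rm deletes only the link, not the file it points to.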

R offers a function that does the same thing:

file.symlink(target_path, link_path)

where target_path and link_path are both strings surrounded by quotation marks.

One of the first things I do when setting up a new analysis is add common data storage file extensions like .csv and .xls to my .gitignore file so that I do not mistakenly put any data in a remote repository. The second thing I do is set up symlinks to the mount location of the encrypted data.
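
The .gitignore step can be sketched as a small script. This is only an illustration: ignore_pattern is a hypothetical helper name, and a temporary directory stands in for the project. The helper appends a pattern only if that exact line is not already present, so it is safe to run more than once.

```shell
#!/bin/sh
set -eu

# Hypothetical project directory for the demonstration.
project=$(mktemp -d)
cd "$project"

# Append a pattern to .gitignore unless that exact line already exists.
ignore_pattern() {
    grep -qxF -- "$1" .gitignore 2>/dev/null || printf '%s\n' "$1" >> .gitignore
}

ignore_pattern '*.csv'
ignore_pattern '*.xls'
ignore_pattern '*.csv'   # repeated call leaves the file unchanged

cat .gitignore
```

Because these patterns match on file names, they also cover symlinks such as data/facttable.csv, so the links themselves stay out of the repository.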

This entry was tagged as rstats symlink data encryption privacy

Sep 17, 2013

A New Calculation for Student Mobility

A brief discussion on the complexity of determining the number of schools a student has attended within a single school year using a minimal set of information.

Sep 13, 2013

Documentation of Business Rules and Analysis

One of the most challenging aspects of being a data analyst is translating programmatic terms like “student mobility” into precise business rules. Almost any simple statistic involves a series of decisions that are often opaque to the ultimate users of that statistic.

Documentation of business rules is a critical aspect ...

Jul 12, 2012

Using the Common Core Data on NCES

My analysis on Nesi’s Notes depended entirely on the National Center for Education Statistics’ Common Core Data. The per pupil amounts reported to NCES may look a bit different from state sources of this information. There are several explanations for this. First, the enrollment counts used to generate per ...

Jul 10, 2012

Ranked Likert-Scale Visualization


See below for more information now that Ethan Brown has weighed in with some great code.

A recent post I came across on r-bloggers asked for input on visualizing ranked Likert-scale data.

I happen to be working on a substantial project using very similarly structured data so I thought ...