New study: A simple tool for building data journalism projects in the newsroom
By George King
A great deal of data journalism work today can be characterized as a process of deriving data from original sources. Depending on the size of the story and project, this process can quickly get messy. Practitioners often take many steps to clean data, and may use a variety of computational tools in a single project.
A new report from the Tow Center, “Muck: A Build Tool for Data Journalists,” explores one way to streamline the process: a tool for organizing and reliably reproducing data computations as a dataset grows or changes, updating outputs such as data visualizations or tables of statistical results along the way. Muck, a command line program, was developed after conversations with working data journalists and students, specifically to address the unruliness of journalistic data projects.
The problem Muck solves
Muck was born from a research question posed by Mark Hansen, a professor at the Columbia University Graduate School of Journalism: What would a language for data journalism look like?
To explore this question, we hosted a conversation with working data journalists in the fall of 2015. The discussion revealed several priorities among practitioners in the field.
- Chief among them was demand for a system that would be comprehensible not just to professional programmers but to less technical journalists as well. Data journalists often use different tools to process and analyze data than they do to produce the published story or interactive application. This makes the analysis less accessible to nontechnical collaborators because there are multiple systems involved, each requiring expert knowledge.
- Another shortcoming of existing methods is that unless the record of modifications to the data is perfect, auditing work from end to end is impossible. Small manual fixes to bad data are almost always necessary; such transformations often take place in spreadsheets or interactive programming sessions and are not recorded anywhere in the version-controlled code. While content management systems for sharing and versioning documents have existed for decades, as Sarah Cohen of The New York Times put it, “No matter what we do, at the end of a project the data is always in 30 versions of an Excel spreadsheet that got emailed back and forth, and the copy desk has to sort it all out . . . It’s what people know.”
- Furthermore, when multiple team members are involved, constantly adding to and tweaking a set of data, the effort required to maintain the correctness of the code, and consequently of the project’s conclusions, increases dramatically.
How the new system works
Broadly speaking, the basic programming task of data journalism can be characterized as a transformation of input data into some output document. Bearing practitioner feedback in mind, such transformations should be decomposed into simple, logical steps. Muck turns this conceptual model into an algorithmic system that automatically connects each step into a pipeline or network. This approach relieves programmers from constantly having to worry about keeping the various steps that compose a project up to date.
Any instance where one step refers to another as a prerequisite is termed a “dependency”; whenever the contents of a dependency change, its dependent is considered stale, or out of date. A whole project can then be described as a network of steps, where each node is a file and the links between them are the dependencies, denoted as file name references. In computer science this network structure is called a directed acyclic graph, and a graph representing dependency relationships is a dependency graph. A dependency graph is “acyclic” because there can be no circular relationships: it makes no sense to say that a file depends on itself, whether directly or indirectly, before it can be used. To see this logic in action, let’s rebuild an existing data visualization.
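Before turning to that example, it is worth seeing how mechanical this bookkeeping really is. The following is a toy sketch of the idea in Python, not how Muck itself is implemented, and the file names are hypothetical placeholders:

```python
import os

# Toy dependency graph: each output file maps to the inputs it is built from.
# The names are hypothetical placeholders, not files from a real project.
DEPS = {
    'table.csv': ['health.csv', 'life.csv'],
    'chart.svg': ['table.csv'],
}

def is_stale(path):
    """A file is stale if it is missing or older than any of its existing inputs."""
    if not os.path.exists(path):
        return True
    mtime = os.path.getmtime(path)
    return any(os.path.exists(dep) and os.path.getmtime(dep) > mtime
               for dep in DEPS.get(path, []))

def update(path):
    """Walk the graph depth-first: bring every input up to date, then the file itself."""
    for dep in DEPS.get(path, []):
        update(dep)
    if is_stale(path):
        print('stale, would rebuild:', path)  # a real tool would run the step's script here

update('chart.svg')
```

The “file name references” mentioned above are what let Muck take this further: instead of asking the author to maintain a table like DEPS by hand, it infers the graph from the file names each step refers to.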
A quick case study
Suppose you want to produce an article like this recent post on The Guardian’s Datablog, which shows life expectancy versus private health care spending per individual across wealthy countries. (A complete walkthrough that reproduces the article using Muck can be found here.)
To begin, the data is downloaded from the Organization for Economic Cooperation and Development (OECD), in the form of two CSV (comma separated values) files, and the outputs are a table and a chart. The basic steps needed to create the article are then as follows (the download steps are sketched in code after the list):
- download the health expenditures dataset
- download the life expectancies dataset
- extract the relevant data points into a table
- render the chart using the rows in the table
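The two download steps are the simplest to picture in code. In the sketch below, the URLs are placeholders rather than the real OECD export links, and the local file names are invented:

```python
import urllib.request

# Placeholder URLs: the actual OECD export links are not reproduced here.
SOURCES = {
    'health_spending.csv': 'https://example.org/oecd/health_spending.csv',
    'life_expectancy.csv': 'https://example.org/oecd/life_expectancy.csv',
}

for filename, url in SOURCES.items():
    urllib.request.urlretrieve(url, filename)
    print('saved', filename)
```

In a step-by-step project each of these fetches would typically live in its own small script, so that a build tool can re-download one source without re-running everything else.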
You could do this work by hand in a spreadsheet, and for simple jobs that is a perfectly reasonable approach. But in this example, the CSV files contain a great deal of extraneous information, and you don’t know which rows to select. One good tool for examining structured data like this is SQLite, a free database program that comes preinstalled on macOS and many Linux distributions. If you first load the data as tables into an SQLite database, you can then make investigative queries until you understand which data you want to select for your table. Once you determine the final query, you will use a Python script to generate the graphic in the SVG (Scalable Vector Graphics) format.
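One way to do that loading and exploration is with Python’s built-in sqlite3 module. The sketch below is only illustrative: the file, table, and column names are hypothetical, and the real OECD exports would dictate the actual investigative queries:

```python
import csv
import sqlite3

conn = sqlite3.connect('oecd.sqlite')

def load_csv(path, table):
    """Create a table from a CSV file, taking column names from the header row."""
    with open(path, newline='') as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]
    columns = ', '.join(f'"{name}"' for name in header)
    placeholders = ', '.join('?' for _ in header)
    conn.execute(f'CREATE TABLE {table} ({columns})')
    conn.executemany(f'INSERT INTO {table} VALUES ({placeholders})', data)
    conn.commit()

load_csv('health_spending.csv', 'spending')  # hypothetical file and table names
load_csv('life_expectancy.csv', 'life')

# An investigative query: what distinct measures does the spending table contain?
for row in conn.execute('SELECT DISTINCT "MEASURE" FROM spending LIMIT 10'):
    print(row)
```

The same loading can also be done without writing any code, using the .mode csv and .import commands in the sqlite3 command-line shell.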
You have now broken your job into several steps. Not including the exploratory queries, your dependency graph looks like this:
The import into the database takes some time, but the downstream steps are instantaneous. A programmer would typically do the import by hand on the command line or with a short shell script, and then write separate scripts for the table query and the chart rendering (these steps could be combined into a single script, but we chose to write them in different languages). Once you decide to build the product in steps, a subtle challenge emerges: you must remember to update the products appropriately as the sources change. This might sound easy, but as the chains of dependencies get more complex, the opportunities for error multiply.
Let’s consider a more elaborate version, where you generate the complete article from a Markdown file, which references the table and two charts (“private expenditures” and “total expenditures”). You also factor out the common code for the charts into a Python module called chart_rendering.py.
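To give a sense of what the shared module might contain, here is a hypothetical chart_rendering.py with a single hand-rolled SVG scatter function. The interface is invented for illustration and is not taken from the actual project:

```python
# chart_rendering.py -- hypothetical shared module; the real project's interface may differ.

def scatter_svg(points, width=640, height=400, radius=4):
    """Render (x, y, label) tuples as a bare-bones SVG scatter plot string."""
    xs = [x for x, _, _ in points]
    ys = [y for _, y, _ in points]
    x_min, x_span = min(xs), (max(xs) - min(xs)) or 1
    y_min, y_span = min(ys), (max(ys) - min(ys)) or 1

    def sx(x):  # scale a data x value into pixel space, leaving a 20px margin
        return (x - x_min) / x_span * (width - 40) + 20

    def sy(y):  # scale a data y value into pixel space, flipped so larger values sit higher
        return height - ((y - y_min) / y_span * (height - 40) + 20)

    circles = '\n'.join(
        f'<circle cx="{sx(x):.1f}" cy="{sy(y):.1f}" r="{radius}"><title>{label}</title></circle>'
        for x, y, label in points)
    return (f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">\n'
            f'{circles}\n</svg>\n')
```

Each chart script then only has to run its own query and call scatter_svg on the resulting rows; a styling fix lands in one place, and both charts pick it up on the next build.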
At this point you can appreciate that the relationships between steps are not always linear. For a project of this size, development never really proceeds in a straightforward fashion either — there are several possible starting points, and as you progress you can make changes to any step at any time. You could be working on the final styling of your charts, and then suddenly realize you have a bug in the query. Or perhaps you have been working all week and now that you are finished you want to pull the very latest version of the dataset. In the midst of such changes it can be difficult to tell whether or not a given file is actually stale, and sometimes you just make mistakes. The simplest solution is to rerun everything after any change, but when some steps are slow this is not feasible. This is exactly where build systems like Muck can help by orchestrating updates both correctly and efficiently.
To read more about Muck, check out the full report on CJR or download a PDF at Academic Commons.
George King is a senior research fellow at the Tow Center for Digital Journalism at Columbia University.