New study: A simple tool for building data journalism projects in the newsroom

The problem Muck solves

Muck was born from a research question posed by Mark Hansen, a professor at the Columbia University Graduate School of Journalism: What would a language for data journalism look like?

  1. Chief among the needs practitioners described was a system that would be comprehensible not just to professional programmers but to less technical journalists as well. Data journalists often use different tools to process and analyze data than they do to produce the published story or interactive application. Because multiple systems are involved, each requiring expert knowledge, the analysis becomes less accessible to nontechnical collaborators.
  2. Another shortcoming of existing methods is that unless the record of modifications to the data is perfect, auditing work from end to end is impossible. Small manual fixes to bad data are almost always necessary; such transformations often take place in spreadsheets or interactive programming sessions and are not recorded anywhere in the version-controlled code. Content management systems for sharing and versioning documents have existed for decades, yet as Sarah Cohen of The New York Times put it, “No matter what we do, at the end of a project the data is always in 30 versions of an Excel spreadsheet that got emailed back and forth, and the copy desk has to sort it all out . . . It’s what people know.”
  3. Furthermore, when multiple team members are involved, constantly adding to and tweaking a set of data, the effort required to keep the code correct, and consequently the project's conclusions correct, increases dramatically.

How the new system works

Broadly speaking, the basic programming task of data journalism can be characterized as a transformation of input data into some output document. Bearing practitioner feedback in mind, such transformations should be decomposed into simple, logical steps. Muck turns this conceptual model into an algorithmic system that automatically connects each step into a pipeline or network. This approach relieves programmers of the constant worry of keeping the various steps that compose a project up to date.
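One way to picture the idea is as a dependency graph that a build tool walks in order. The following is a minimal Python sketch under stated assumptions, not Muck's actual implementation: the file names and the `DEPS` mapping are hypothetical, and the standard library's `graphlib` module (Python 3.9+) stands in for whatever Muck does internally.

```python
from graphlib import TopologicalSorter

# Hypothetical project: each product maps to the set of files it depends on.
DEPS = {
    "input.csv": set(),
    "cleaned.csv": {"input.csv"},
    "summary.csv": {"cleaned.csv"},
    "article.html": {"summary.csv"},
}


def build_order(deps):
    """Return products ordered so every dependency precedes its dependents."""
    return list(TopologicalSorter(deps).static_order())


def stale_products(deps, changed):
    """Return every product that must be rebuilt after `changed` is modified."""
    stale = set(changed)
    # Walking in build order lets staleness propagate downstream in one pass.
    for product in build_order(deps):
        if deps.get(product, set()) & stale:
            stale.add(product)
    return stale


if __name__ == "__main__":
    print(build_order(DEPS))
    print(sorted(stale_products(DEPS, {"cleaned.csv"})))
```

In this sketch, editing `cleaned.csv` marks only the downstream steps as stale, which is roughly the bookkeeping a programmer would otherwise have to do by hand.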

A quick case study

Suppose you want to produce an article like this recent post on The Guardian’s Datablog, which shows life expectancy versus private health care spending per individual across wealthy countries. (A complete walkthrough that reproduces the article using Muck can be found here.)

The work breaks down into four steps:

  • download the health expenditures dataset
  • download the life expectancies dataset
  • extract the relevant data points into a table
  • render the chart using the rows in the table

Health costs example: conceptual dependencies.

“Source” dependencies show that a given product is produced by running the pointed-to source code file. Muck determines these relationships automatically via its naming convention. Each “data” dependency exists because the given source file opens and reads the pointed-to data file.

Muck implementation to build a complete web page. In addition to the source and data dependencies seen in the previous diagram, this implementation also features a code module that is shared by two scripts.
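The two kinds of dependency described above can be illustrated with a short Python sketch. It is a rough approximation under stated assumptions rather than Muck's real mechanism: the naming rule used here (a product `chart.svg` is produced by a source file named `chart.svg.py`) and the function names `source_for` and `data_dependencies` are hypothetical, and the data dependencies are found by statically scanning for `open("...")` calls, which may differ from how Muck observes file reads.

```python
import ast


def source_for(product, available_files):
    """Guess the source file for a product via a hypothetical naming rule."""
    candidate = product + ".py"  # assumed convention: "<product>.py"
    return candidate if candidate in available_files else None


def data_dependencies(source_code):
    """List file names passed as string literals to open() in a script."""
    found = set()
    for node in ast.walk(ast.parse(source_code)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "open"
            and node.args
            and isinstance(node.args[0], ast.Constant)
            and isinstance(node.args[0].value, str)
        ):
            found.add(node.args[0].value)
    return sorted(found)


if __name__ == "__main__":
    script = (
        'costs = open("expenditures.csv").read()\n'
        'ages = open("life_expectancy.csv").read()\n'
    )
    print(source_for("table.csv", {"table.csv.py", "chart.svg.py"}))
    print(data_dependencies(script))
```

A static scan like this misses file names that are computed at run time, so it should be read as an illustration of the idea only.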

Tow Center

Center for Digital Journalism at Columbia Graduate School of Journalism