The open.contractors database: Tracking Department of Defense spending

How we built it

When we started in 2016, we cleaned the data and worked hard to impose some consistency on it. Whenever we failed, we flagged the corresponding data points as non-reliable entries.

How it works

It is useful to have a general understanding of how all of the query pieces work together. Figure 1 describes the basic pipeline. First, a user submits a query to Amazon Athena, “an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.” Then, Amazon Athena accesses the data stored on Amazon S3, an online storage service.

Figure 1: Basic pipline.
Figure 2: Previous steps.

How to access open.contractors database

Here, we outline a roadmap to configure Amazon Athena on your computer to access the open.contractors database. For the steps below, we use an open-source software called SQL Workbench/J. It is easier to use SQL Workbench/J than Athena’s web interface, but it requires some work to set up.

Figure 3: Getting started on Athena.
  • Name: Athena JDBC Driver (Figure 4, arrow 3)
  • Library: Click on the Select the JAR file button (Figure 4, arrow 4) and select the file that you downloaded in Step 6.
  • Classname: com.amazonaws.athena.jdbc.AthenaDriver (Figure 4, arrow 5)
  • Sample URL: jdbc:awsathena://athena.us-west-2.amazonaws.com:443 (Figure 4, arrow 6)
Figure 4: Add driver.
  • Default group name: Athena (Figure 5, arrow 1)
  • Driver: Athena JDBC Driver (the one that we created in Step 7; Figure 5, arrow 2)
  • URL: jdbc:awsathena://athena.us-west-2.amazonaws.com:443 (Figure 5, arrow 3)
  • Username: access key ID from Step 2 (Figure 5, arrow 4)
  • Password: secret access key from Step 2 (Figure 5, arrow 5)
Figure 5: Configure connection.
Figure 6: Extended properties.
Figure 7: Open script.
Figure 8: Execute SQL.
Figure 9: Add tab.
Figure 10: Testing.

Understanding the data

The following requires a basic knowledge of relational databases.

Figure 11: General model of the database.

Auxiliary tables

Auxiliary tables are tables whose content never changes. They are usually referenced by foreign keys on contract and will allow you to query the description associated to codes present on table contract.

Consolidated tables

Distinct from auxiliary tables, consolidated tables have to be generated when new records are added to the database. Each time we add a new batch of data to contract_all, we recreate table contract to include the new data. Then, we regenerate the consolidated tables based on table contract. The consolidated tables are an effort to reverse-engineer the relational database behind usaspending.gov data.

Figure 12: Example of consolidated records for table contractor_number_of_employees.

Advice for querying the database

The aforementioned data dictionaries and the entity relationship diagram are the reference material for querying open.contractors database. Athena uses a quasi-standard version of SQL. There is a (poor) tutorial on that Athena’s SQL on AWS Athena web page. Athena is based on Presto and Apache Hive. For that reason, both documentations also offer good guidelines to the specificities of Athena’s SQL.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Tow Center

Tow Center

3.6K Followers

Center for Digital Journalism at Columbia Graduate School of Journalism