The open.contractors database: Tracking Department of Defense spending

Tow Center
12 min readMar 27, 2018

--

by Alex Gonçalves

The US Department of Defense (DoD) enjoys the largest discretionary part of the national budget. But like most government data, the data on its spending at usaspending.gov is plagued by inconsistencies that compromise its usability. It’s hard, for instance, to answer questions like how much money Hewlett Packard made in defense contracts over the past 15 years. Or, what is the biggest contract the DoD has procured as military aid to Egypt, and what was it for? How much money has the DoD spent on beryllium toxicity decontamination, and where did the decontamination happen?

My project partner, Allison McCartney, and I built the the open.contractors database as a free, consistent, relational, and up-to-date repository of Department of Defense contractor data that can be explored with ordinary SQL queries for the use of reporters with some experience in data journalism.

How we built it

When we started in 2016, we cleaned the data and worked hard to impose some consistency on it. Whenever we failed, we flagged the corresponding data points as non-reliable entries.

Moreover, we reverse engineered the original database. The site usaspending.gov exports data as a single, standalone table that brings together information about products, vendors, contracts, government agencies, etc. We broke that table into more than 90 tables that resemble the original government database. That way, it is much easier to query information associated to a specific vendor, product, or government agency.

It is worth noting that the open.contractors database is part of a larger project: open.contractors. In addition to the database, open.contractors is comprised of a web dashboard that allows journalists to analyse, visualize, and interact with contractor data from the DoD. Right now, we are looking for a cost-effective solution to keep the web dashboard up and running.

Meanwhile, we’ve made the database available through Amazon Athena at a very low cost both to us and its users. In fact, after talking to data journalists in prominent newsrooms, we realized that several of them were more interested in directly accessing our database than in exploring it through a web dashboard. They wanted the data granularity and query flexibility afforded by SQL, the standard query language for relational databases.

After some research, we chose Amazon Athena as the platform for open.contractors database because of its good response time and low cost — most queries amount to just a few cents. What follows describes how to configure Amazon Athena to access the open.contractors database, and offers valuable information on the data’s internal structure.

How it works

It is useful to have a general understanding of how all of the query pieces work together. Figure 1 describes the basic pipeline. First, a user submits a query to Amazon Athena, “an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.” Then, Amazon Athena accesses the data stored on Amazon S3, an online storage service.

Figure 1: Basic pipline.

The entire system is remarkably cheap. The database takes up fewer than 10 gigabytes on Amazon S3 and storage costs amount to less than a quarter of a dollar per month, paid by open.contractors. In addition to that, a query on Amazon Athena that scans the entire database costs less than two cents, paid by the user executing the query. The user also has to pay to store the results on their Amazon S3, but this likewise amounts to just a few cents.

Figure 2 shows the previous steps that prepared the Amazon S3 repository above. First, we downloaded all data about the DoD from usaspending.gov. Then we used a combination of Python scripts and database constraints to clean the data and fix some inconsistencies or, at least, flag them. Data is stored on a traditional, relational database (Postgresql) and subsequently exported as CSV files to an Amazon S3 bucket.

Figure 2: Previous steps.

How to access open.contractors database

Here, we outline a roadmap to configure Amazon Athena on your computer to access the open.contractors database. For the steps below, we use an open-source software called SQL Workbench/J. It is easier to use SQL Workbench/J than Athena’s web interface, but it requires some work to set up.

open.contractors database is stored on AWS region US West (Oregon). For that reason, make sure your Athena instance and S3 bucket is also on US West (Oregon). That will significantly reduce costs both for you and for open.contractors because nobody will have to pay for transfer costs between different regions.

Step 1. If you do not have an AWS account, follow these instructions to create one. Then send your AWS Account ID to opencontractorsdb@gmail.com and request access to the open.contractors database. If you do not know where to find your AWS Account ID, sign in to AWS and go to this web page. It is the first field (Account ID). Currently, access to open.contractors database is granted upon request to make sure that everyone is using the right AWS region.

Step 2. If you do not have an AWS access key, follow these instructions to create one. The AWS access key is comprised of an access key ID and a secret access key. They will be important in Step 8.

Step 3. All results from Athena will be saved as CSV files on an S3 bucket. For instance, if you request all government agencies that contracted services with a specific company, the result set will be uploaded as a spreadsheet (CSV) to an S3 bucket. For that reason, you’ll need to create an S3 bucket. If you do not know how to do this, follow these instructions. Just make sure to create your S3 bucket on region US West (Oregon).

Step 4. In case you have never worked with Athena before, sign in to AWS Management Console and visit Athena’s web page. Then click on Get Started (Figure 3, arrow 1). That will be enough to create an Athena environment associated with your AWS account.

Figure 3: Getting started on Athena.

Step 5. Download SQL Workbench/J from this web page and install it on your computer.

Step 6. Download Athena’s driver from this URL. You can keep it anywhere on your local drive but a good idea might be to leave it in the same folder as SQL Workbench/J.

Step 7. Execute SQL Workbench/J. Click on Manage Drivers (Figure 4, arrow 1). Then click on the New driver button (Figure 4, arrow 2) and fill out the fields with the following values:

  • Name: Athena JDBC Driver (Figure 4, arrow 3)
  • Library: Click on the Select the JAR file button (Figure 4, arrow 4) and select the file that you downloaded in Step 6.
  • Classname: com.amazonaws.athena.jdbc.AthenaDriver (Figure 4, arrow 5)
  • Sample URL: jdbc:awsathena://athena.us-west-2.amazonaws.com:443 (Figure 4, arrow 6)

Click OK (Figure 4, arrow 7).

Figure 4: Add driver.

Step 8. Now you can fill out the connection form according to these instructions:

  • Default group name: Athena (Figure 5, arrow 1)
  • Driver: Athena JDBC Driver (the one that we created in Step 7; Figure 5, arrow 2)
  • URL: jdbc:awsathena://athena.us-west-2.amazonaws.com:443 (Figure 5, arrow 3)
  • Username: access key ID from Step 2 (Figure 5, arrow 4)
  • Password: secret access key from Step 2 (Figure 5, arrow 5)
Figure 5: Configure connection.

Step 9. Now it is time to set up the S3 bucket where you’ll save your result sets. Click on Extended Properties (Figure 6, arrow 1). Then click on Create a new entry (Figure 6, arrow 2). Under property, type s3_staging_dir (Figure 6, arrow 3). Under value, type the S3 bucket that you created in Step 3 (Figure 6, arrow 4). Then click OK (Figure 6, arrow 5) and OK again (Figure 6, arrow 6).

Figure 6: Extended properties.

Step 10. Save this web page in your local drive. This is the code that will make the open.contractors database accessible from your AWS Athena.

Step 11. Now go back to SQL Workbench/J, click on File (Figure 7, arrow 1) and then on Open (Figure 7, arrow 2). Choose the file that you have just downloaded in Step 10.

Figure 7: Open script.

Step 12. Make sure your AWS account has already been granted access to the open.contractors database (Step 1). Otherwise, this step will not work. If you have received an email confirming that you have access to the database, click on SQL (Figure 8, arrow 1) and Execute All (Figure 8, arrow 2). After a few minutes, you will have access to all tables on the open.contractors database.

Figure 8: Execute SQL.

Step 13. Now perform a simple test to check that you are able to query the data. Click on View (Figure 9, arrow 1) and Add tab (Figure 9, arrow 2).

Figure 9: Add tab.

Step 14. Now type the following query on the new tab (Figure 10, arrow 1):

select count(*) from opencontractors.contractor;

It is a very simple query that returns the number of contractors in the database. Then click on Execute the selected SQL command (Figure 10, arrow 2). You should receive the response on the bottom panel (Figure 10, arrow 3).

Figure 10: Testing.

After setting up Athena to access the open.contractors database, keep reading to find out how the data is organized in the next section.

Understanding the data

The following requires a basic knowledge of relational databases.

At the center of our database, there is one table: contract_all. Monthly, when a new batch of data is made available at usaspending.gov, we upload the new records to that table. We never remove previous records; we only add them. Each row in that table represents a transaction associated to a contract.

contract_all mirrors the structure of the data published by usaspending.gov. For that reason, you should read the data dictionary provided by usaspending.gov to understand the basic structure of contract_all. Unfortunately, since the quality of that document is less than ideal, you might want to complement your reading with the data dictionary for the Federal Procurement Data System (FPDS). Most data on usaspending.gov actually comes from FPDS.

These two data dictionaries together provide a good description of each data field on contract_all. When building open.contractors, we only added one field to the original data structure: a serial id that serves as a primary key for that table. contract_all also has an original composite primary key comprised by two fields: unique_transaction_id and last_modified_date. Despite its name, unique_transaction_id is not enough to uniquely identify a transaction, as usaspending.gov is constantly updating old records to fix data errors. Those updated records have the same unique_transaction_id as the defective records that they fix, but with a different (and newer) last_modified_date. This is why we need both fields to uniquely identify a transaction.

Since contract_all stores all records (including the defective ones that have been updated), we created a table for valid records only (therefore excluding the defective ones): that table is called contract. Accordingly, unique_transaction_id in table contract only includes the most recent version of any transaction on contract_all (therefore the record with the newest last_modified_date for each unique_transaction_id).

Each field on table contract is usually a foreign key to an homonymous auxiliary or consolidated table. Figure 11 presents a general model for the whole database with contract_all, contract, auxiliary, and consolidated tables. You can find a comprehensive entity relationship model for the open.contractors database here. It was create with draw.io, a Google Drive extension. Please install the extension to make the diagram easier to navigate.

Figure 11: General model of the database.

It is important to understand the difference between auxiliary and consolidated tables.

Auxiliary tables

Auxiliary tables are tables whose content never changes. They are usually referenced by foreign keys on contract and will allow you to query the description associated to codes present on table contract.

For instance, a transaction on contract might refer to a contractor based on country TZA — but where is TZA? A query on auxiliary table country will tell you that country “TZA” (code) refers to “Tanzania” (description).

Similarly, let us say a transaction is related to a product whose code is “1127.” A query on table product_or_service_code will bring the following description for that code: “Nuclear Rockets.”

Consolidated tables

Distinct from auxiliary tables, consolidated tables have to be generated when new records are added to the database. Each time we add a new batch of data to contract_all, we recreate table contract to include the new data. Then, we regenerate the consolidated tables based on table contract. The consolidated tables are an effort to reverse-engineer the relational database behind usaspending.gov data.

For instance, each transaction on contract has several columns with information on contractors (DUNS number, name, number of employees, address, etc.). Accordingly, every time we generate the consolidated tables again, we create a table called contractor that stores all contractors mentioned on table contract.

This is a tricky procedure. Let us suppose that a contractor with 300 employees started working for the federal government on January 1, 2005. In December 2010, that same contractor hired 100 new employees. For that reason, some records on table contract will tell you that contractor has 300 employees (records added before December 1, 2010) while other records will state that the same contractor has 400 employees (records added after December 1, 2010). Which figure should we keep in a consolidated table for contractors? Maybe just the last one? If so, are we not missing data?

Our solution for consolidated tables allows us to keep both figures. We can use our example above to describe this solution. In addition to a consolidated table contractor, we also have a consolidated table called contractor_number_of_employees. Table contractor will have one record for the contractor in our example: a record identified by that contractor’s DUNS number (that serves as a primary key for that table). Table contractor_number_of_employees will have two records: both with the same DUNS number but differentiated by a second column called number_of_employees (in this case, 300 and 400).

Each record on contractor_number_of_employees will also have two ancillary columns that inform the first and last dates that a specific number_of_employees was associated to a DUNS number. In our case, for instance, the record that represents that contractor with 300 employees will have as the first date 01/01/2005 and as the last date a day before 12/01/2010. Accordingly, the record that represents that contractor with 400 employees will have as the first date 12/01/2010 and as the last date a day between 12/01/2010 and today.

There is also a third ancillary column representing the frequency that a certain number_of_employees is associated to a specific DUNS number. For instance, around 34 transactions state that the contractor in our example has 300 employees, while 46 transactions state that the same contractor has 400 employees. To sum up our example, Figure 12 shows the records for that contractor on contractor_number_of_employees.

Figure 12: Example of consolidated records for table contractor_number_of_employees.

Therefore, a query on contractor_number_of_employees suffices to answer the question: how many employees does the contractor in our example have? The result set will be Figure 12 and it shows that it had 300 employees until 11/30/2010 and 400 employees since then.

That is the basic structure for all consolidated tables. They all have those three ancillary columns: first date, last date, and frequency.

Advice for querying the database

The aforementioned data dictionaries and the entity relationship diagram are the reference material for querying open.contractors database. Athena uses a quasi-standard version of SQL. There is a (poor) tutorial on that Athena’s SQL on AWS Athena web page. Athena is based on Presto and Apache Hive. For that reason, both documentations also offer good guidelines to the specificities of Athena’s SQL.

Alex Gonçalves is a fellow at the Tow Center for Digital Journalism and a Ph.D. student in communication at Columbia University.

--

--

Tow Center

Center for Digital Journalism at Columbia Graduate School of Journalism