The Missing Link: An Evaluation of the Data.gov Catalog
My research questions were guided by the discovery that the Federal government shutdown left the Data.gov catalog inaccessible. Shortly after the catalog came back online, I discovered that some of the most popular datasets had broken links dating back to 2017. My central research questions moving forward were as follows: How much of the Data.gov catalog is actually accessible? How has the quality of data changed over the last few years? Through web scraping techniques, analysis, and resulting visualizations, this project seeks to evaluate the accessibility of the datasets made available on the website. This project is expected to be completed May 2019.
Skills: Python, Tableau, Data Collection, Data Cleanup and Curation, Data Analysis, Research, Writing, Visualization, Presentation
In order to carry out my evaluation of the site, I needed to obtain data on the current holdings of the catalog as well as any information on, and evaluations of, past holdings. A web scraping tool was used to collect metadata for the current holdings of the catalog. A function was written in Python to return status codes for links found within the catalog. Data for evaluating the quality of the catalog was gathered from the Project Open Data Dashboard, an open source project aimed at measuring how Federal agencies are “progressing on implementing M-13-13 Open Data Policy- Managing Information as an Asset”. Tableau was used to visualize the results.
Web scraping with a Python script proved unsuccessful because the structure of the catalog made it difficult to pull any information that way. A web scraping tool was used instead to obtain metadata for the information linked within the catalog. The areas of interest included: Dataset Title, Publisher, Corresponding Links to Data. The sitemap that was created with the scraping tool is shown to the right.
Data Cleanup and Curation
I knew that for my analysis I wanted to focus on text to identify dataset names and the corresponding publishing agencies, as well as links to available formats. For each entry, a link to a format is available without going to the homepage of that dataset. Certain entries also have a link leading to the homepage when there are additional format links that did not fit on the entry in the catalog. For example, there may be five links on one entry and an additional link that says “2 more in dataset”. The additional link redirects the user to the homepage for that dataset to view the links displayed in the catalog listing along with the two extra sets that were not listed initially. I scraped the “more in dataset” links but decided not to include those sets in my evaluation: there were more of these links than there were corresponding extra data links, and excluding them would not affect the evaluation. After web scraping, I also decided to exclude any files whose links used FTP; these links require the user to authorize their computer to connect and therefore will always return a “failure to connect” error message. I felt that including these errors could skew the results, so I opted not to include them in the corresponding analysis.
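The two exclusion rules described above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the cleanup actually performed: the function name `keep_link`, the idea that each scraped row carries a URL plus its link text, and the reading of "FTP" as the `ftp://` URL scheme are all hypothetical.

```python
from urllib.parse import urlparse

def keep_link(url, link_text=""):
    """Return True if a scraped link should stay in the evaluation set."""
    # Assumed rule 1: drop the "N more in dataset" overflow links.
    if "more in dataset" in link_text.lower():
        return False
    # Assumed rule 2: drop links that use the ftp:// scheme.
    if urlparse(url.strip()).scheme.lower() == "ftp":
        return False
    return True

# Made-up rows for illustration only; not taken from the catalog.
rows = [
    ("https://example.gov/data.csv", "CSV"),
    ("ftp://example.gov/data.zip", "ZIP"),
    ("https://example.gov/dataset/home", "2 more in dataset"),
]
kept = [url for url, text in rows if keep_link(url, text)]
```

Keeping the rules in one small predicate makes it easy to rerun the filter if either exclusion decision is revisited later.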
A function was written in Python to check the validity of links found within the catalog. The function returned status codes in the 200-500 range, which were later grouped by range. For instance, codes “201” and “204” were grouped together in a “200” group, signaling that these links worked properly. Below is the function that was used to return status codes.
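For readers who want to reproduce this step, a minimal sketch of such a check might look like the following. It is not the original function: the names `check_link` and `group_status`, the use of the standard-library `urllib`, the HEAD request, and the 10-second timeout are all assumptions made for illustration.

```python
import urllib.error
import urllib.request

def group_status(code):
    """Map a status code to its hundred-level group, e.g. 204 -> "200"."""
    return f"{code // 100 * 100}"

def check_link(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrives."""
    try:
        # Assumption: a HEAD request is enough to learn the status code.
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses are raised as HTTPError; keep their code.
        return err.code
    except (urllib.error.URLError, OSError, ValueError):
        # Malformed URL, DNS failure, refused connection, or timeout.
        return None
```

Note that `urlopen` follows redirects by default, so this sketch reports the status of the final destination rather than a 300-range code. Counting redirects separately would require a different client configuration.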
Project Open Data
To answer my second question, I focused my attention on the previous evaluations of the site that are available from the Project Open Data Dashboard. This resource is crucial because the catalog is monitored and evaluated every quarter by a staff member, and it appears to be the only public-facing and accessible evaluation tool for this site. I knew by looking at it that I could get a sense of how agencies have changed over the years in terms of the amount of data and the evaluation of how records are maintained and shared.
A snapshot is taken quarterly and then evaluated by a staff member to measure progress in six areas: Enterprise Data Inventory, Public Data Listing, Public Engagement, Privacy and Security, Human Capital, and Use and Impact. Each category has a pre-defined rubric, and when evaluated it may receive one of four color-coded ratings. Red indicates that an agency has failed to meet the requirements due to serious deficiencies. Yellow indicates that an agency had one or more minor deficiencies. Green indicates that the agency met all requirements for that area. If an agency has gone beyond the measures indicated, it may receive acknowledgement for exhibiting best practices; this is indicated by green with a bold star. A screenshot of a quarterly rating is shown below (left).
The best way that I could think of to make a formal comparison between quarters and years was to create a point scale to translate the marks to. I calculated a max score for a single quarter based on a three-point scale. I decided that this was appropriate because “Best Practice” recognition was seldom given out and I felt that using a four-point scale might skew the evaluation. I used this scale to code an Excel sheet, then divided the total for each quarter by the max score to calculate a percentage rating for each quarter. The scale and calculated “max score” are shown below (right).
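The same calculation could be expressed in Python instead of a spreadsheet. The sketch below rests on assumptions that the text does not confirm: the point values (red 1, yellow 2, green 3), counting “Best Practice” the same as green, scoring missing ratings as 0, and computing the max score as six areas times three points per agency. The names `POINTS` and `quarter_percentage` are hypothetical.

```python
# Assumed point values; the actual scale is only shown in a figure.
POINTS = {"red": 1, "yellow": 2, "green": 3, "best practice": 3}
AREAS = 6  # the six assessment areas listed above
MAX_PER_AGENCY = AREAS * max(POINTS.values())  # 18 under these assumptions

def quarter_percentage(ratings_by_agency):
    """Return the percentage of the maximum possible score earned in a quarter.

    ratings_by_agency maps an agency name to its list of ratings.
    A missing or unrecognized rating scores 0 (an assumption).
    """
    earned = sum(
        POINTS.get(rating, 0)
        for ratings in ratings_by_agency.values()
        for rating in ratings
    )
    possible = MAX_PER_AGENCY * len(ratings_by_agency)
    return 100 * earned / possible if possible else 0.0

# Made-up ratings for two hypothetical agencies; not dashboard data.
sample = {
    "Agency A": ["green"] * 6,
    "Agency B": ["yellow"] * 3 + ["red"] * 3,
}
```

Under these assumed values, an agency with unreported results contributes to the maximum but earns nothing, which is one way a quarter's percentage can fall sharply when assessments go unpublished.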
Resulting Analysis: Data Accessibility
- 40% of dataset links in the catalog are accessible, while another 29% of dataset links are broken or return error messages
- The ratio of accessible links related to the Federal Government has stayed relatively consistent, though there has been a spike in the number of broken, error, and redirected links in the last two years
- 53% of the publisher category “Other Federal Government” has broken links or no URL provided for the datasets listed. This category includes five separate sub-agencies of the Executive Office of the President
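Percentages like those above come from tallying status groups per publisher. The project used Tableau for this step; the Python sketch below is only an illustrative equivalent, and the function name `share_by_publisher`, the record layout, and the sample rows are all hypothetical.

```python
from collections import Counter, defaultdict

def share_by_publisher(records):
    """Return {publisher: {status_group: percent}} from (publisher, group) pairs."""
    counts = defaultdict(Counter)
    for publisher, group in records:
        counts[publisher][group] += 1
    shares = {}
    for publisher, groups in counts.items():
        total = sum(groups.values())
        shares[publisher] = {
            group: round(100 * n / total, 1) for group, n in groups.items()
        }
    return shares

# Made-up records for illustration; these are not the project's results.
records = [
    ("Publisher X", "200"),
    ("Publisher X", "200"),
    ("Publisher X", "400"),
    ("Publisher X", "no url"),
]
shares = share_by_publisher(records)
```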
Resulting Analysis: Catalog Quality Assessment
- In 2015 and 2016, Federal agencies were meeting 80% of the requirements as laid out by the Leading Indicators Strategy
- 2017 saw a 48% drop from 2016 assessment totals, with many agency assessment results being unreported
- As of today, any Leading Indicator assessment results or metadata information from 2018 have yet to be shared publicly
This drastic drop in the sharing of information is detrimental to the intended purpose of the Open Data Policy, resulting in a lack of accountability for maintaining open government data.
From the numbers reported, I can confidently say that, at best, the evaluations are being done but are not being shared on this platform, or potentially anywhere. At worst, these evaluations are not being carried out at all. This, coupled with the ratios of status code results, shows that while the ratios of accessible links haven’t changed, there isn’t anyone monitoring the quality of the data. If these evaluations aren’t being carried out, then we have no way of knowing how each department is doing, how many people are viewing the catalog, or how many datasets are even available. In essence, there is no accountability for what the government is and is not required to share as part of the Open Data Policy. Although there isn’t enough contextual information to make valid assertions as to why the amount of data available has decreased or become less accessible, there is enough to draw attention to this issue. While some may take open data for granted, others are working very hard to make sure that the public has access to these assets. While the presence of free, open, and accessible data may be taken for granted, the absence of it should not.