This morning @TribData released a visualization of all executions in the state of Texas since Gov. Rick Perry took office in late 2000. The interactive was developed to run alongside reporter Brandi Grissom's piece looking at Perry's controversial past with executions.
So how did we make it happen?
The data comes from the Texas Department of Criminal Justice's executed offenders page, which has an HTML table with one row per execution. Each row has basic information about the offender (name, age, race, date of execution, etc.) but also contains two cells with links to other pages: one with more in-depth offender information and another with the offender's last statement.
To scrape the page, we decided to give ScraperWiki a shot. ScraperWiki, one of the 2011 Knight News Challenge winners, is an online tool that makes it possible to write web scrapers in Python, Ruby or PHP in an online editor. Because the code and data are public, it is possible to collaborate with other programmers on the same scraper.
Tribune developer Noah Seger and I teamed up to build this scraper, which created an entry in a SQLite database for each offender from the information in that offender's row.
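The row-scraping step looks roughly like this in Python. This is a minimal sketch, not our actual ScraperWiki code: the two-row sample table and the column names are made up for illustration, and the real scraper ran against the live TDCJ page.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical two-row sample of the executed offenders table;
# the real page has more columns and hundreds of rows.
SAMPLE_HTML = """
<table>
  <tr><th>Execution</th><th>Last Name</th><th>First Name</th><th>TDCJ Number</th><th>Date</th></tr>
  <tr><td>500</td><td>Doe</td><td>John</td><td>999123</td><td>01/01/2011</td></tr>
  <tr><td>499</td><td>Roe</td><td>Jane</td><td>999456</td><td>12/15/2010</td></tr>
</table>
"""

def scrape_table(html, db_path=":memory:"):
    """Turn each table row into one SQLite record per offender."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS offenders
                    (execution INTEGER, last_name TEXT, first_name TEXT,
                     tdcj_number TEXT, date TEXT)""")
    table = ET.fromstring(html)
    rows = table.findall("tr")[1:]  # skip the header row
    for row in rows:
        cells = [td.text for td in row.findall("td")]
        conn.execute("INSERT INTO offenders VALUES (?, ?, ?, ?, ?)", cells)
    conn.commit()
    return conn

conn = scrape_table(SAMPLE_HTML)
count = conn.execute("SELECT COUNT(*) FROM offenders").fetchone()[0]
```

Keeping the TDCJ number in each record matters later: it is the key that ties everything else back to the right offender.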
So what about the two links? The last statement pages fortunately had a consistent layout. We had the scraper pull the page, grab the last batch of text from the table and add it to the database for that offender.
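"Grab the last batch of text" can be sketched like so. The markup below is a hypothetical stand-in for a last-statement page; the idea is simply that, with a consistent layout, the statement is always the final block of text in the page's table.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for a TDCJ last-statement page.
STATEMENT_HTML = """
<table>
  <tr><td>Last Statement of Offender</td></tr>
  <tr><td>Name: John Doe</td></tr>
  <tr><td>I would like to say goodbye to my family.</td></tr>
</table>
"""

def last_statement(html):
    """Return the last non-empty batch of text in the page's table."""
    table = ET.fromstring(html)
    texts = [td.text.strip() for td in table.iter("td")
             if td.text and td.text.strip()]
    return texts[-1]

statement = last_statement(STATEMENT_HTML)
```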
The in-depth offender information pages were a different story: more than 300 of them were represented as JPGs, so scraping was off the table, and going through them by hand and typing out the data was not an option if we wanted this done anytime soon. We had to find a way to get all of that information as text, and to do it fast.
Luckily, there is a service that makes it possible to get a very basic task like transcription done quickly. Amazon's Mechanical Turk lets users set up a workflow that pays workers for the completion of a task. With Turk, we offered 10 cents per completely typed page and had each page transcribed twice for redundancy. We set this up on Tuesday and hoped it would be done by Thursday. But we underestimated the speed of the workers: by 8 p.m. that night, it was complete.
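The point of having each page typed twice is that disagreement between the two transcriptions flags pages worth a human look. The post doesn't describe how we reconciled the pairs, so the function below is purely a hypothetical quality check: compare the two versions and flag any pair that diverges too much.

```python
from difflib import SequenceMatcher

def needs_review(transcript_a, transcript_b, threshold=0.95):
    """Flag a page for manual review when its two independent
    transcriptions are less similar than the threshold allows."""
    ratio = SequenceMatcher(None, transcript_a, transcript_b).ratio()
    return ratio < threshold

# Hypothetical transcriptions: two workers agree, a third diverges.
a = "Offender committed robbery in Dallas County."
b = "Offender committed robbery in Dallas County."
c = "Offender comitted robery in Dalas Cty."
```

With identical transcriptions, `needs_review(a, b)` passes the page through; a garbled pair like `(a, c)` gets flagged.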
By attaching each offender's TDCJ number to the data provided by Turk, we were able to easily merge that information with what we had successfully scraped. We returned to ScraperWiki and pulled in a CSV of the Turk data; with that, we had a complete database of all of the information found on the executed offenders page and could move forward with designing what we released today.
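The merge step is straightforward once both datasets share the TDCJ number as a key. A minimal sketch, with made-up column names and data standing in for the real Turk CSV and scraped records:

```python
import csv
import io

# Hypothetical Turk output: one transcription per row, keyed by TDCJ number.
TURK_CSV = """tdcj_number,transcription
999123,Prior offenses: robbery.
999456,Prior offenses: burglary.
"""

# Hypothetical scraped records, also keyed by TDCJ number.
scraped = {
    "999123": {"last_name": "Doe", "first_name": "John"},
    "999456": {"last_name": "Roe", "first_name": "Jane"},
}

def merge_turk_data(scraped, turk_csv):
    """Attach each Turk transcription to the scraped record
    that shares its TDCJ number."""
    for row in csv.DictReader(io.StringIO(turk_csv)):
        record = scraped.get(row["tdcj_number"])
        if record is not None:
            record["transcription"] = row["transcription"]
    return scraped

merged = merge_turk_data(scraped, TURK_CSV)
```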