For a recent MinnPost project, we wanted to scrape court dockets, so I figured I’d break out a Python script in the wonderful ScraperWiki. One of my favorite features is that you can schedule a scraper to run automatically. One of my least favorite features is that the limit on automatic scrapers is once per day. We needed something to run every half hour.
It seems that every news hacker is using Django for something these days, and why not? It’s fast, flexible and a major headache to deploy (I’ll expand on this in a later post).
Here’s the site we want to scrape (and a great example of how “open” government really isn’t): Minneapolis court dockets
The models.py file is very simple, containing only the fields we want to scrape.
This should be self-explanatory if you’re at all familiar with Django; if not, I highly recommend the official tutorial.
The scraping script
Using requests and lxml, scraping in Python is downright enjoyable. Look how easy it is to grab a site’s source and convert it into a useful object:
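Something like this - the HTML is inlined here so the example stands alone, with the live fetch shown in a comment:

```python
import lxml.html

# Against a live page, fetching and parsing is two lines:
#   import requests
#   tree = lxml.html.fromstring(requests.get(url).text)
# Inline HTML here so the example runs without a network connection:
html = "<html><body><p>Court convenes at <b>9:00 AM</b></p></body></html>"
tree = lxml.html.fromstring(html)

print(tree.xpath("//p")[0].text_content())  # -> Court convenes at 9:00 AM
```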
Take a look at the source code of the court dockets site, and you’ll see how fun it is to scrape most government sites. The information we want to get is nested in four tables (!), all without ids or classes.
Luckily, one of these tables has an attribute that we can immediately jump to. Here’s the code I’m using to get the contents we want:
The text_content() function returns the text inside an HTML element, minus the tags, and strip() removes the surrounding whitespace.
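Here’s that pattern as a sketch, with a stand-in for the real page (the “summary” attribute is hypothetical - the actual markup will differ):

```python
import lxml.html

# Stand-in for the dockets page: nested tables, one of which has an
# attribute we can jump straight to with XPath.
html = """
<table><tr><td>
  <table summary="dockets">
    <tr><td>  27-CR-12-1234  </td><td>  Hennepin County  </td></tr>
  </table>
</td></tr></table>
"""
tree = lxml.html.fromstring(html)

rows = tree.xpath('//table[@summary="dockets"]/tr')
for row in rows:
    # text_content() drops the tags; strip() drops the whitespace padding
    cells = [td.text_content().strip() for td in row.xpath("td")]
    print(cells)  # -> ['27-CR-12-1234', 'Hennepin County']
```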
The magic - Django commands
Django commands are scripts that can do whatever you like, easily invoked through the command line:
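For example, a command named scrapedockets (the name I use below) runs from the project root as:

```sh
python manage.py scrapedockets
```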
This is a great way to keep everything inside a Django project, and the scripts are easily accessible. These files are stored inside your app -> management -> commands (my full path is minnpost/dockets/management/commands/scrapedockets.py). If you don’t have these folders already, create them, but don’t forget to add __init__.py files so Python treats them as packages. I turned my scraping code into a command called scrapedockets.py - full code below.
Django commands require a Command class that extends BaseCommand and defines a handle() method, which is called when the command is invoked.
I wrote an (admittedly) bad function to convert the times into seconds so they can be stored in the database. I believe I went against the general rule of storing times in GMT, but I don’t completely understand how Django uses the timezone settings. Help?
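A rough sketch of the idea - naive seconds past midnight, with no timezone handling (exactly the part I’m unsure about):

```python
import re


def time_to_seconds(timestr):
    """Convert a clock time like '1:30 PM' to seconds past midnight."""
    match = re.match(r"(\d{1,2}):(\d{2})\s*([AP]M)", timestr.strip(),
                     re.IGNORECASE)
    hours = int(match.group(1)) % 12  # '12:xx AM' wraps to hour 0
    if match.group(3).upper() == "PM":
        hours += 12
    return hours * 3600 + int(match.group(2)) * 60


print(time_to_seconds("1:30 PM"))   # -> 48600
print(time_to_seconds("12:05 AM"))  # -> 300
```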
Anyway, I end up with variables for each piece of information I want to store. I check if a Case already exists with the same information, and if it doesn’t, I create it and save it to the database. I used Python’s slice operator to make sure the court and description aren’t too long (according to the database setup I created in models.py).
The magic, pt. 2 - Cron
To make this worthwhile, we need it to run on its own every half hour. Unix systems make this simple, with a daemon called Cron. If you’re using Ubuntu, here’s a nice guide (other distros will be very similar). Cron schedules scripts to be run at different intervals, and its uses are virtually limitless.
I created a script, scrapedockets.sh, which simply calls the Django command we just walked through.
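It’s only a couple of lines - the paths here are placeholders for your own setup:

```sh
#!/bin/bash
# scrapedockets.sh -- adjust the paths for your project
cd /path/to/minnpost
python manage.py scrapedockets
```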
Don’t forget to make it executable:
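```sh
chmod +x scrapedockets.sh
```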
I used a crontab on the default user to run the scrapedockets.sh script every half hour. Edit your crontab using the command:
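```sh
crontab -e
```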
Each line is something you want cron to do. Here’s what mine looks like:
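```sh
# run the scraper every half hour, logging output (paths are placeholders)
*/30 * * * * /path/to/scrapedockets.sh >> /path/to/scrapedockets.log 2>&1
```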
Cron will run the script scrapedockets.sh every 30 minutes (whenever the minute is evenly divisible by 30 - that is, at :00 and :30) and log output to scrapedockets.log. I encourage you to look at a guide to see how the fields are structured.
If everything is set up, your Django database should start filling up with information. Build some views, and show the world what you’ve found.
If you know a better way, please share!
I’m far from an expert, so if you see something fishy here, leave a comment or tweet at @kevinschaul.