About

Welcome to The Bay Citizen tech team's blog. Here, we talk about the messes we're happily making at our end of the office, from open-source Django development to jQuery map mashups to Illustrator hacks and beyond.


Shane Shifflett

Django/DocumentCloud integration? There’s an app for that!


When it comes to scrubbing important documents for an investigative project, there’s no better tool than DocumentCloud. But when it came to managing those documents through our own Django-powered CMS, as opposed to relying on the DocumentCloud web interface each time, we were out of luck. So we built a bridge and threw the code into GitHub. Now, other newsrooms running Django can fork, clone, and contribute to this project.

WHY

We were already using our CMS to upload important documents to Amazon S3 so they were accessible to all our reporters, but those files were flat and didn't facilitate collaboration. DocumentCloud has all the features a good newsroom needs when it comes to collectively scouring documents, but it wasn't an integrated experience. A reporter would have to jump back and forth between their newsroom’s CMS interface and the DocumentCloud web interface, writing the story in one place and uploading and accessing their documents in another. For a reporter on deadline, the document upload could be sacrificed for time.

So we created this app to let reporters upload and access their documents directly through their newsroom’s CMS, saving them time and paving the way for a smoother method of publishing and embedding documents with stories. Special thanks to Ben Welsh for building the python-documentcloud API wrapper, which made this app dead simple to implement.

HOW

1. git clone git://github.com/BayCitizen/django-doccloud.git
2. From the newly created django-doccloud directory: pip install .
3. Put 'django-doccloud' inside your INSTALLED_APPS list in settings.py
4. Add DOCUMENTCLOUD_USERNAME, DOCUMENTCLOUD_PASS, and DOCUMENTS_PATH to your settings.py file
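
As a rough illustration of steps 3 and 4, a settings.py fragment might look like the sketch below. The three setting names and the app label come from the steps above; reading the credentials from environment variables, the other INSTALLED_APPS entries, and the plain-string value for DOCUMENTS_PATH are assumptions made for the example, not something the app requires.

```python
# settings.py (fragment): a minimal sketch, not a complete Django configuration.
import os

INSTALLED_APPS = [
    "django.contrib.auth",          # assumed: your project's existing apps
    "django.contrib.contenttypes",
    "django-doccloud",              # label as written in step 3; verify it against the installed package
]

# Assumption: credentials are kept out of source control and read from the environment.
DOCUMENTCLOUD_USERNAME = os.environ.get("DOCUMENTCLOUD_USERNAME", "")
DOCUMENTCLOUD_PASS = os.environ.get("DOCUMENTCLOUD_PASS", "")

# Assumption: a plain relative path is acceptable here; "documents/" is a placeholder.
DOCUMENTS_PATH = "documents/"
```

Keeping the DocumentCloud password in an environment variable rather than in settings.py itself is one way to avoid committing it to a shared repository.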

The DOCUMENTS_PATH variable is used to save the file within your system. Because we use django-storages to save our files to Amazon S3, we map this variable to a function that generates the document’s path dynamically:
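
A path-generating function of this kind could look roughly like the following. This is a sketch only: it assumes the app hands DOCUMENTS_PATH to a Django FileField as its upload_to argument, in which case Django calls the function with the model instance and the original filename. The function name and the date-based layout are hypothetical.

```python
import datetime
import posixpath


def documents_path(instance, filename):
    """Build a dated storage key such as documents/2011/03/15/report.pdf.

    `instance` is the model instance being saved (unused in this sketch);
    `filename` is the name of the uploaded file.
    """
    today = datetime.date.today()
    # Drop any directory components so only the bare file name is kept.
    safe_name = posixpath.basename(filename)
    return "documents/{:%Y/%m/%d}/{}".format(today, safe_name)


# In settings.py (assumption: the app accepts a callable here).
DOCUMENTS_PATH = documents_path
```

Forward slashes are used throughout because S3 keys are not file-system paths.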

 

REFLECTIONS

The only harm suffered while building this app was caused by an innocuous line of code in a file-upload routine that sent me to the depths of Unicode hell. The code for uploading files to DocumentCloud, a MultipartPostHandler, used an ASCII string as its buffer to store file contents. This would be fine if our files were opened using Python 2.6’s open() function, because by default it returns a file’s contents as plain byte strings. But because we save our files on Amazon S3 using django-storages, the classes used to represent files store their contents as StringIO objects. Thus, a problem: when the MultipartPostHandler tried to append a Unicode string to an ASCII string as it prepared the package that would be sent to DocumentCloud, I received this error message:
 

'ascii' codec can't decode byte 0xe2 in position 12: ordinal not in range(128)
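
For readers who want to see the failure in isolation, here is a small sketch. It is written for Python 3, where the ASCII decode has to be requested explicitly; Python 2 attempted the same decode implicitly whenever a byte string was combined with a Unicode string. The sample bytes and the helper name are made up for the example.

```python
# Hypothetical file contents: 12 ASCII bytes followed by the UTF-8 encoding
# of a curly apostrophe, whose first byte is 0xe2.
raw = b"Bay Citizen \xe2\x80\x99"


def try_ascii_decode(data):
    """Return the decoded text, or the error message if the bytes are not ASCII."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)


message = try_ascii_decode(raw)
print(message)
```

Decoding the same bytes as UTF-8 succeeds, which is why the cure is to be explicit about encodings rather than rely on the ASCII default.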

The fix was simple enough: Update the MultipartPostHandler to append the contents of a file to a buffer that can handle several types of string (in our case, StringIO).  If you’re using the same MultipartPostHandler, you can find the Unicode support updates in the python-documentcloud project. Otherwise, keep Joel Spolsky’s post on Unicode and Kumar McMillan’s Unicode In Python presentation nearby just in case you, too, find yourself stuck in Unicode hell.
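
The general idea behind the fix can be sketched as follows. This is not the patch that went into python-documentcloud; it is a Python 3 illustration with a hypothetical function name, assuming UTF-8 is the right encoding for any text parts of the request body.

```python
import io


def build_body(parts, encoding="utf-8"):
    """Join text and byte parts into one byte string without implicit decoding."""
    buffer = io.BytesIO()
    for part in parts:
        # Encode text explicitly so the buffer only ever receives bytes.
        if isinstance(part, str):
            part = part.encode(encoding)
        buffer.write(part)
    return buffer.getvalue()


# Mixing a text header with raw file bytes no longer raises an error.
body = build_body(["filename=", b"\xe2\x80\x99report.pdf"])
```

Normalising everything to bytes at the moment it enters the buffer means the encoding decision is made once, in one visible place.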

NEXT STEPS

Though the app is simple right now, I have a few ideas about what it can become. For instance, it would be great to publish a document inline with a related story using DocumentCloud’s viewer. John Keefe already has a template for this, ready to be integrated into the django-doccloud app. Another helpful view for readers would be a searchable list of a site's published documents with their related story links. The LA Times’s document-stacker app, another app for pushing documents to DocumentCloud, already exposes a feed of recently published documents. Updating the django-doccloud app to include a similar view and track related content would be butter (that's right, delicious butter).

So leave feedback and ideas in the comments and start forking!

Shane Shifflett is a software developer and reporter who learned how to interrogate data while a student at Northwestern's Medill School. There, he wrote about a drug-addled prostitute's 300th arrest and the unforgiving criminal justice ...