An Experiment in Content Categorization: Human Judgment Still Best, For Now
By: Zoe Corneli
Guest blogger: Zoe Corneli. I’m an online news editor at The Bay Citizen, but I also have a little bit of the tech bug — no pun intended. I’ll be guest blogging here at The Sandbox periodically.
One of the many tasks involved in putting together a news website is figuring out how to categorize your content. Not only do topics help with search-engine optimization, but done right, they are also a critical component of site navigation and user experience.
When The Bay Citizen launched in May, we started out with a manual system, in which editors create a set of topics and assign them to stories by hand. But we were concerned that our topics might not match up with those that are standard on the Internet, or might get too messy as we added to the list over time. We also figured we could save ourselves a lot of work by finding a way to automate the process.
After a little digging, I came upon OpenCalais, a project of Thomson Reuters. OpenCalais uses categories from the International Press Telecommunications Council — the elusive “universal” topics used across the news business and the Web — and generates topics and tags using a free, open-source “extraction engine.” You can try it here — paste in any piece of text and hit “submit” to see what metadata it comes up with.
We ran a test to see how OpenCalais would work for The Bay Citizen. Dan McComas, our lead software architect, drew up a spreadsheet showing which topics OpenCalais would have assigned to all of our existing content. Some of the results were spot-on: the story “Smartphones Flunk for Blind Users” got the topic “Technology_Internet.” Others were just kind of weird: a piece called “The Mai Tai Was Invented in Oakland, Not Hawaii” was assigned to “Sports.” (Granted, that’s a tricky one. The piece covers a variety of subjects and mentions the Raiders and the Yankees.) And a surprising number of stories were given no recommended topic at all.
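For anyone who wants to try a similar audit, here is a rough Python sketch of the general idea: run every story through a categorizer, collect the suggested topics in spreadsheet-style rows, and count how many stories come back with nothing. This is not the code we used. The function names, the sample stories, and the keyword-matching `toy_categorize` stand-in are all made up for illustration, and the sketch does not call OpenCalais or assume anything about how its engine works.

```python
import csv
import io


def build_topic_report(stories, categorize):
    """Run each (title, body) pair through `categorize`.

    `categorize` is any callable that takes text and returns a topic
    string, or None when it has no suggestion. Returns the report rows
    and the number of stories that got no topic.
    """
    rows = []
    uncategorized = 0
    for title, body in stories:
        topic = categorize(body)
        if topic is None:
            uncategorized += 1
        rows.append({"title": title, "suggested_topic": topic or ""})
    return rows, uncategorized


def rows_to_csv(rows):
    """Serialize the report rows as CSV text for a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "suggested_topic"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def toy_categorize(text):
    """Hypothetical stand-in for a real service: naive keyword matching."""
    lowered = text.lower()
    if "raiders" in lowered or "yankees" in lowered:
        return "Sports"
    if "smartphone" in lowered:
        return "Technology_Internet"
    return None


if __name__ == "__main__":
    # Made-up sample stories, not real Bay Citizen content.
    sample_stories = [
        ("Story A", "A review of smartphone accessibility features."),
        ("Story B", "Cocktail history, with a nod to Raiders fans."),
        ("Story C", "The city council votes on a new budget."),
    ]
    rows, missing = build_topic_report(sample_stories, toy_categorize)
    print(rows_to_csv(rows))
    print(f"{missing} of {len(rows)} stories got no suggested topic")
```

Passing the categorizer in as a function keeps the report code separate from whatever service you test, so swapping in a real one should not require changing the audit itself.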
In the end, we decided that, while interesting, OpenCalais’ categorization wasn’t better than our in-house topic-assignment system. As we’ve produced more content, our library of topics has grown, so that now we rarely need to add new ones. I’m still hopeful that we might find a way to introduce more automation into the process — and to generate and use richer metadata in general — but for now, plain old human judgment is still our best tool when it comes to categorization.
I’d be interested to hear about how other publishers are making use of the OpenCalais technology, or finding other solutions to the categorization question.
