Welcome to The Bay Citizen tech team's blog. Here, we talk about the messes we're happily making at our end of the office, from open-source Django development to jQuery map mashups to Illustrator hacks and beyond.
Guest blogger: Zoe Corneli. I’m an online news editor at The Bay Citizen, but I also have a little bit of the tech bug — no pun intended. I’ll be guest blogging here at The Sandbox periodically.
One of the many tasks involved in putting together a news website is figuring out how to categorize your content. Not only do topics help with search-engine optimization, but done right, they are also a critical component of site navigation and user experience.
When The Bay Citizen launched in May, we started out with a manual system, in which editors create a set of topics and assign them to stories by hand. But we were concerned that our topics might not match up with those that are standard on the Internet, or might get too messy as we added to the list over time. We also figured we could save ourselves a lot of work by finding a way to automate the process.
After a little digging, I came upon OpenCalais, a project of Thomson Reuters. OpenCalais uses categories from the International Press Telecommunications Council — the elusive “universal” topics used across the news business and the Web — and generates topics and tags using a free, open-source “extraction engine.” You can try it here — paste in any piece of text and hit “submit” to see what metadata it comes up with.
We ran a test to see how OpenCalais would work for The Bay Citizen. Dan McComas, our lead software architect, drew up a spreadsheet of what topics the OpenCalais system would have assigned to all our existing content. Some of the results were pretty spot-on — the story “Smartphones Flunk for Blind Users” got the topic “Technology_Internet” — but others were just kind of weird: a piece called “The Mai Tai Was Invented in Oakland, Not Hawaii” was assigned to “Sports” (granted, that’s a tricky one — the piece covers a variety of different subjects and makes mention of the Raiders and the Yankees). And a surprising number of stories were given no recommended topic at all.
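For anyone who wants to run a similar test on their own archive, here is a minimal Python sketch of the idea: loop over existing stories, ask a categorizer for a suggested topic, and collect the results into a spreadsheet-style CSV. This is not the script Dan used. The names `build_topic_report`, `count_untagged`, `write_csv` and `keyword_stub` are hypothetical, and `keyword_stub` is only a toy stand-in for a real categorization service, so you would swap in your own call to whatever service you are evaluating.

```python
import csv
import io
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def keyword_stub(text: str) -> Optional[str]:
    """Toy stand-in for a categorization service (an assumption, not a real API).

    Returns a topic name if a known keyword appears in the text, else None.
    """
    keywords = {"game": "Sports", "smartphone": "Technology_Internet"}
    lowered = text.lower()
    for word, topic in keywords.items():
        if word in lowered:
            return topic
    return None


def build_topic_report(
    stories: Iterable[Tuple[str, str]],
    suggest_topic: Callable[[str], Optional[str]],
) -> List[Dict[str, str]]:
    """Build one row per (title, body) story with the suggested topic.

    Stories with no suggestion get an empty string so they are easy to spot.
    """
    rows = []
    for title, body in stories:
        topic = suggest_topic(body)
        rows.append({"title": title, "suggested_topic": topic or ""})
    return rows


def count_untagged(rows: List[Dict[str, str]]) -> int:
    """Count the stories that received no suggested topic at all."""
    return sum(1 for row in rows if not row["suggested_topic"])


def write_csv(rows: List[Dict[str, str]]) -> str:
    """Serialize the report as CSV text that can be opened as a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "suggested_topic"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Reading the resulting spreadsheet side by side with the topics editors assigned by hand is what makes the odd matches, and the stories with no topic, easy to see.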
In the end, we decided that, while interesting, OpenCalais’ categorization wasn’t better than our in-house topic-assignment system. As we’ve produced more content, our library of topics has grown, so that now we rarely need to add new ones. I’m still hopeful that we might find a way to introduce more automation into the process — and to generate and use richer metadata in general — but for now, plain old human judgment is still our best tool when it comes to categorization.
I’d be interested to hear about how other publishers are making use of the OpenCalais technology, or finding other solutions to the categorization question.
Tom Tague
Zoe:
Tom Tague from OpenCalais here.
First - thanks for the experiment and the feedback.
Right now Calais only codes to the top dozen or so IPTC codes - it's a pretty rough categorization. In fact the IPTC coding is really only a very small part of what OpenCalais does - the main thrust is the extraction of entities, facts and events.
You might be interested in taking a look at some of the extracted metadata and using this to refine your codes - or you may be right that for your volumes and use case the right approach is to manually code. A lot of what we're seeing is "assistive" solutions where the system provides suggestions on how to tag and code an article and the editor votes up / down.
Again, thanks for taking the time to explore OpenCalais.
Regards,
Zoe Corneli
Hi Tom, thanks for the pointers. I should add that we found the tags that OpenCalais generated to be useful, and we're considering adding them into our system in the "assistive" way you described.
I'm interested in learning more about how we could make use of the entities, facts and events that OpenCalais extracts -- perhaps incorporating linked data, as discussed at a recent Hacks/Hackers meetup in San Francisco.
Seth Grimes
Hi Zoe. This topic -- automated content categorization -- is one of several that are the subject of a conference I'm organizing, Smart Content, October 19 in New York. Check out the agenda, speakers, etc. at http://smartcontentconference.com . We'll have folks there who have done what you're trying to do, as well as others like yourself who are trying to find content-analysis solutions.
Seth Grimes
I should add: There are other content categorization / annotation services you might try. They include --
- http://www.alchemyapi.com/
- http://textwise.com/api
- http://openamplify.com/
- http://developer.zemanta.com/
-- off the top of my head.
Zoe Corneli
Thanks Seth!