On September 30th at AOL headquarters in Greenwich Village, 200 people gathered to learn how high-traffic information companies manage their vast stores of data. The subject was news, but the lessons apply to any organization with terabytes of content moving through its platforms every day. Pay attention if you’ve got a pile of data to sort through, whether you need actionable insight at a glance or must find the most recently documented pieces of a competitive intelligence puzzle.
The questions raised by the organizers were classic knowledge management questions with a news industry twist:
First, when faced with a giant pile of primary source material, how does a reporter intelligently and efficiently discover the newsworthy bits? Second, how should the organization index and expose the latest news and archival material to both consumers and reporters? (http://www.meetup.com/semweb-25/calendar/14317064/ Thursday, October 7, 2010)
Stuart Myles, Deputy Director of Schema Standards at the Associated Press, began the evening by sharing his experience in tackling these questions. His slides are available (http://prezi.com/dfznd7q6775-/information-management-at-the-associated-p...) but his insight is more compelling. There is still a need for a hybrid human-computer system in news classification and delivery. Three components surfaced during his talk: the need for classification schemas and vocabularies, the need to apply them to content, and the need to publish them with the content.
An Information Management team and platform work diligently to manage the AP’s metadata - which is every bit of information that is not the story proper: the byline, date, slug, format, media type, language, subject and more. Content is authored and collected, enriched by means of the AP’s classification system and stored for access and distribution.
The importance of classification schemes is evidenced by the scope of the AP’s schemas and vocabularies. Subjects, entities and geography form the three primary branches of the AP taxonomy. 4200 terms are collected in 16 subject categories. Over 70000 people, 34000 companies and 500 organizations are represented in the entity authority files, along with attributes for these such as relationships, addresses, and uniquely identifying indicators. Over 2000 geographic entities along with properties such as type, latitude and longitude round out the terms. Not to be forgotten are the technical and administrative metadata elements that indicate Media Type, Source, Copyright, Permissions and the like that drive the compilation of content into products. The tags are applied according to rules such as those published by the IPTC and the hNews microformat to improve precision in information retrieval, navigation and discovery.
Tom Torok of the New York Times stepped up next, sans slides, to differentiate news collection from news aggregation and dissemination. He reminded us that using computers is simply a new approach to an age-old method: shoe-leather reporting, legal pad assisted reporting, calculator assisted reporting, telephone assisted reporting and now Computer Assisted Reporting (CAR). He remains surprised that 26 years after Time named the PC as “Man of the Year” we are just getting to CAR.
Tom was first introduced to a digital computer via punchcards at the Philadelphia Inquirer in the early 1970s. For the remainder of the decade, journalists at several papers who were interested in using machine power had to beg time on company mainframes after hours to do more than word processing. When he finally purchased a PC - an Osborne with SuperCalc and dBase II (and an offer to upgrade to a 10MB external drive for $10k!) - he learned that newsrooms were asking their computer consultants not to “distract” the journalists with software other than word processing. These same journalists began asking the administrative assistants if they could ‘borrow’ their machines after hours to use spreadsheets. Hostility towards the power of PCs remained through the 1980s, even after Bill Dedman won the Pulitzer using a PC to analyze data and report on a pattern of racial discrimination in the Atlanta housing lending market.
Most reporters had a liberal arts background, understood the power of computers and wanted to make use of them, but were a bit lost as to how. Many taught themselves how to use the machine, learning Fortran and piecing the knowledge together from books and dialup BBSs. When Tom first began teaching student journalists how to use the computer to aid in their research and analysis, he was asked to do it without using math! Fortunately, thinking changed. In 1994 a group called Investigative Reporters and Editors was digging into a story on one of their own - a reporter who had been investigating the mob in Arizona. The group used computers to analyze their data, and with the help of the school of journalism at Missouri began NICAR (National Institute for Computer-Assisted Reporting) - an organization dedicated to researching and educating reporters on how to use computers in their work. Conferences, online resources and listservs all served to answer questions and encourage new and improved ways of leveraging the power of the machine.
Why is it so important? As Tom said, “you can’t pick up the phone book to find a crook.” A great deal of structured and unstructured data was always available, but it was a ridiculously lengthy process or nearly impossible for a human to analyze it all. Now the reporters are acquiring data sets and using NLP (natural language processing) tools and techniques to analyze unstructured data. One demo Tom gave was of the Times’ use of FastESP. Another was of OmniPage Capture SDK for OCR. They are writing simple scripts and programs to normalize the files small and large so they can begin asking questions and visualizing data, as they do with IBM’s “Many Eyes” tool.
What kinds of questions have they answered? Tom told us of a few compelling examples: Jo McGinty published “Breaking Down Hate Crime”, an analysis of hate crimes in New York. Adam Liptak examined campaign contributions. Another example showed the difference between the visible content of scanned pages and their metadata: more complete details of Hillary Clinton’s appointment calendar were available to the machine than to the human eye - including names, addresses and phone numbers which were redacted in print. (Corporate entities take note - you need to think about what metadata you are publishing!)
Tom has seen a great deal of change in capability and speed thanks to the PC, and it was a fascinating glimpse into the lives of reporters looking to uncover the facts.
Maurice “Mo” Tamman of The Wall Street Journal stepped up to share his experiences as a “2nd generation” CAR guy. He came to the practice in the 90s as a way to validate many of the tips, hunches and anecdotes reporters were basing their stories on. He is a strong believer in instinct - a gut feeling - but appreciates the value of having those hunches backed up by solid data analysis. Sometimes the hunches aren’t quite right, and the patterns a machine can elicit from a pile of data make all the difference in following a lead.
His first example was of the voting machine troubles seen in Florida. Several years back there was a “nasty, dirty, fun” election for a seat being given up by the incumbent, Katherine Harris, who had chosen to run for Senate. The race was decided by a small number of votes, and it was noticed that there was a 15-16% undervote in that race as opposed to the typical 2-3% undervote across the ballot. After the grief of the 2000 Presidential election there was concern that there was trouble with the process or the machines brought in to improve it. Voters were interviewed and the machine data was obtained. It was in a very messy ASCII file (anonymized of course) but it didn’t list which races were skipped. So the team of journalists used their computers to recreate every ballot, putting in placeholders for the races a voter may have skipped. After normalizing the data, they could accurately say what any given voter had done. But they still couldn’t figure out where 18k votes had gone. They checked for correlations among race, age, gender, machines, and early ballots vs. election day votes. The only, and admittedly WEAK, trend they saw was a slight skew towards precincts where voters were older. Being Florida, they tossed it. All of that work recreating the data in a structured, analyzable format, and they had nothing. So they trusted their guts - went back to being reporters instead of data analysts. They decided to score each voter based on whether they voted Republican or Democrat. They found that loyal voters don’t miss races - except this one, where they didn’t vote at all. They also noticed counties with similar voting equipment had similar layouts for the ballots and there was a correlation - the votes changed depending on where the page broke. If the page broke in the middle of the race, above it, or below it, the pattern was different.
To test the theory, they rebuilt the machine and the ballots and put it in Boston, asking random people to ‘vote’ in their fake election. The folks in Boston missed the same race as in Florida.
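The ballot reconstruction in the Florida example can be sketched roughly as follows. This is only an illustration of the idea, not the team's actual code: the race names, the record layout and the sample ballots are all invented here, and the real ASCII file was far messier. The sketch assumes each raw record lists only the races in which a choice was recorded, fills in a placeholder for every race that is missing, and then computes the undervote rate per race.

```python
from collections import Counter

# Hypothetical race order on the ballot; a real ballot lists many more races.
RACES = ["senate", "house_13", "governor"]

# Hypothetical raw records: each anonymized ballot contains only the races
# in which a choice was recorded, so skipped races are simply absent.
raw_ballots = [
    {"senate": "A", "house_13": "X", "governor": "M"},
    {"senate": "B", "governor": "M"},
    {"senate": "A", "governor": "N"},
    {"senate": "A", "house_13": "Y", "governor": "N"},
]

SKIPPED = None  # placeholder for a race the voter left blank


def rebuild_ballot(record, races=RACES):
    """Return a complete ballot with a placeholder for every skipped race."""
    return {race: record.get(race, SKIPPED) for race in races}


def undervote_rates(ballots, races=RACES):
    """Return the share of ballots on which each race was left blank."""
    skipped = Counter()
    for ballot in ballots:
        for race in races:
            if ballot[race] is SKIPPED:
                skipped[race] += 1
    total = len(ballots)
    if total == 0:
        return {race: 0.0 for race in races}
    return {race: skipped[race] / total for race in races}


full_ballots = [rebuild_ballot(record) for record in raw_ballots]
rates = undervote_rates(full_ballots)
for race, rate in rates.items():
    print(f"{race}: {rate:.0%} undervote")
```

Once every ballot has the same shape, a race with an unusually high rate stands out against the rest of the ballot, and the same structure can be grouped by precinct, machine or ballot layout to look for correlations.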
Is this merely speculation or interesting insight into Human-Machine Interaction or Usability/Interaction Design? Computer Assisted Reporting in this example brings to mind the notion of a hybrid system being best: use the machines to do what they do best - normalize data, identify connections, visualize patterns; but use the instincts of a human to present the final analysis. No amount of rules-writing can recreate the human mind. CAR and “shoe-leather” reporting can intersect quite nicely.
Mo gave several more examples of using server log data to identify web-surfing behavior in schools and anticipating bank stress test results. His presentation as given to the Knight Digital Media Center is available and contains many examples of his process. He referred to his work as Empirical Journalism - working on an empirical spine. Trusting his gut to identify leads and choose paths to follow, but backing up his gut with re-creatable data analysis, which lends greater credibility.
The evening’s presentations ended with Justin Cleary, a member of the product team at AOL News and our host for the event. AOL’s team is small compared to the other organizations presenting, but their techniques are quite similar to those enabled by Stuart’s team and the teams behind Tom and Mo at their respective employers (by way of disclaimer, I worked at Dow Jones and have consulted for the New York Times). They must be even more efficient with their resources but their effectiveness is clear, as they claim to be the 4th largest news site on the web, averaging 30M unique visitors a month. AOL News provides a blend of original reporting, news blogging and wire service aggregation and distribution.
Justin described a new type of process they are using dubbed “Surge Desk.” Story assignments originate from trending topics and search data and focus on delivering coverage of these to users as they demand it. The process starts in the writer’s mind.
First, the writer must choose a topic. They can consider what’s happening today, what’s trending, what consumers are looking for, and what’s left over after those categories of information are examined. AOL provides staff with this data primarily from their search logs, looking to highlight trends and relationships rising in the logs.
Next, the writer must create the content, and both humans and machines must classify it. What are the top level categories? You can see these on AOL’s site: subjects such as Health, Science, Entertainment, etc., all editorially selected. What tags apply? These are suggested based on a semantic analysis of the content. Taxonomies, entity extraction and rules-based classification are used to apply tags to the content.
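To make the rules-based part of that classification concrete, here is a minimal sketch, assuming a tiny hand-built taxonomy in which each tag is triggered by a few phrases. The taxonomy contents, the function name `suggest_tags` and the sample story are all invented for illustration. AOL's actual system was not shown in this detail, and it also relies on entity extraction and semantic analysis, which this sketch does not attempt.

```python
import re

# Hypothetical taxonomy: each tag maps to phrases that trigger it.
TAXONOMY = {
    "Health": ["vaccine", "hospital", "flu"],
    "Science": ["nasa", "telescope", "genome"],
    "Entertainment": ["box office", "album", "premiere"],
}


def suggest_tags(text, taxonomy=TAXONOMY):
    """Return tags whose trigger phrases occur in the text, most hits first."""
    lowered = text.lower()
    scores = {}
    for tag, phrases in taxonomy.items():
        # Count whole-word occurrences of every trigger phrase for this tag.
        hits = sum(
            len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
            for phrase in phrases
        )
        if hits:
            scores[tag] = hits
    # Sort by descending hit count, then alphabetically for stable output.
    return sorted(scores, key=lambda tag: (-scores[tag], tag))


story = "NASA says a new telescope will launch as flu season fills hospital wards."
tags = suggest_tags(story)
print(tags)
```

In a hybrid workflow the output would be a suggestion only: an editor accepts, rejects or adds tags before publication.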
Next, the team asks: which questions does the content answer? The new content is compared to the search database, and relevant queries are tagged with the new content, such that when a similar search is performed, the new content is suggested as a page to view in the search results. The content, tags, and suggestions are published to the news site.
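One simple way to picture the comparison of new content against stored searches is term overlap. The sketch below is an assumption about how such matching could work, not a description of AOL's search platform: the function names, the 0.6 threshold and the sample queries are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matching_queries(content, queries, min_overlap=0.6):
    """Return the queries whose terms mostly appear in the content.

    A query matches when at least `min_overlap` of its distinct terms
    occur somewhere in the content.
    """
    content_terms = tokenize(content)
    matched = []
    for query in queries:
        terms = tokenize(query)
        if not terms:
            continue  # ignore empty queries
        overlap = len(terms & content_terms) / len(terms)
        if overlap >= min_overlap:
            matched.append(query)
    return matched


# Hypothetical search log and article text.
search_log = [
    "florida ballot design",
    "bank stress test results",
    "hate crime statistics",
]
article = "Ballot design in Florida may explain the undervote, reporters found."
hits = matching_queries(article, search_log)
print(hits)
```

Each matched query can then be stored alongside the article so that a later, similar search surfaces it as a suggested page.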
Finally, the team tracks the performance of their content and classifications. In addition to real-time standard web metrics, they also track incoming searches and the page selection related to each: are the tags accurate to the users’ understanding of them? In other words, when a piece of content is provided in a search result, did it get clicked on because of its metadata or because of a pattern match against its full content? How accurate is the tagging? They also use non-search external referrals to assess the quality of the tagging. It’s important to analyze what users are searching on as well as what they’re clicking on - it’s not always the same.
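The metadata-versus-full-text question above can be illustrated with a small click-through calculation. The log layout and every row below are made up for the sketch; the only point is that splitting results by how they matched makes the two click-through rates comparable.

```python
from collections import defaultdict

# Hypothetical log rows: (query, page shown, how the page matched, clicked?)
impressions = [
    ("flu shot", "/health/flu-clinic", "tag", True),
    ("flu shot", "/news/city-budget", "full_text", False),
    ("ballot design", "/politics/undervote", "tag", True),
    ("ballot design", "/politics/recount", "tag", False),
    ("stress test", "/business/banks", "full_text", True),
]


def click_through_by_match(rows):
    """Return the click-through rate for each match type in the log."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for _query, _page, match_type, was_clicked in rows:
        shown[match_type] += 1
        if was_clicked:
            clicked[match_type] += 1
    return {match_type: clicked[match_type] / shown[match_type] for match_type in shown}


ctr = click_through_by_match(impressions)
for match_type, rate in sorted(ctr.items()):
    print(f"{match_type}: {rate:.0%} of shown results were clicked")
```

If results surfaced by tags are clicked noticeably more often than results surfaced by full-text matches, that is one rough signal that the tags reflect what users mean by their searches.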
Importantly, Justin shared with us the answer to the critical question - does it work? They believe the efforts are justified: the volume of content produced is up approximately 20%, and they’ve created thousands of long tail pages, increasing the opportunities for ad-based revenue generation. Natural search referrals are up “significantly.” They are pleased with the results.
The evening rounded out with Q&A. The gentlemen wished there were more and better datasets available. They grumbled, perhaps justifiably, that many of the datasets being released by the US Government’s various data projects are weak and simply a means of checking off a “must-do” box. Various levels of attention are being given to formats such as RDFa, a topic seemingly of greatest interest at the AP and IPTC. Greater semantic awareness among data topics is desired - great news for the ontologists in the audience! Better integration of metadata into search and user interfaces is desired. The usability and design challenges of contextual awareness are great, and still need smart thinking. Finally, improving the means by which smart humans and smart systems can work together is desired.
A great evening was had by both the Hacks and Hackers and Lotico New York Semantic Web communities. Practical knowledge and real-world use cases were provided by the speakers, and all of it is applicable to the newsroom AND the boardroom. If you have the chance, take a look at the presentations that are available, and then get your favorite journalist, hacker and ontologist in a room - you may be pleasantly surprised at what insights you can gather.