Thursday, February 23, 2012

Why multi-lingual generation of content from data is important

Multi-lingual generation of content from data has always been on Narrative Science's road map and has informed the modularization of the core platform. It is only after all of the analysis of the facts, evaluation of their importance, and the composition of the representation that the system generates language. Within this model, generating in Spanish, Japanese, German, etc. is no different than generating in English.

The system is not designed to translate, but to generate in multiple languages.
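As a rough illustration of why this separation matters, here is a minimal sketch in which the analysis stage emits a language-neutral fact and a final, per-language step turns it into text. Everything here is hypothetical: the `TrendFact` class, the `TEMPLATES` table, and the `realize` function are illustrative names, not part of the Narrative Science platform, and real generation would have to handle agreement, word order, and idiom far beyond simple templates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrendFact:
    """Language-neutral fact produced by the analysis stage (hypothetical)."""
    subject: str
    direction: str  # "up" or "down"
    days: int


# Per-language realization rules (assumed for illustration). In practice
# these would be written by native speakers of each target language.
TEMPLATES = {
    "en": {
        "up": "{subject} has risen for {days} days in a row.",
        "down": "{subject} has fallen for {days} days in a row.",
    },
    "es": {
        "up": "{subject} ha subido durante {days} días consecutivos.",
        "down": "{subject} ha bajado durante {days} días consecutivos.",
    },
}


def realize(fact: TrendFact, lang: str) -> str:
    """Generate a sentence directly from the fact; no translation step."""
    template = TEMPLATES[lang][fact.direction]
    return template.format(subject=fact.subject, days=fact.days)


fact = TrendFact(subject="Newt Gingrich", direction="up", days=4)
print(realize(fact, "en"))
print(realize(fact, "es"))
```

The point of the sketch is that the English and Spanish sentences are both produced from the same underlying fact, so neither is a translation of the other.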

In general, we are not ready to do this, mostly because of the composition of our client base, but doing so is a matter of pulling in native speakers who know how to write in non-English languages to configure the platform for the new language.

Occasionally we are asked why we even care, given the rise of translation services. Along with the theoretical answer, that translation requires hard-core natural language understanding to really get things right, we also see wonderful examples in the real world.

My current favorite is from a translation into English of a story in Japanese about Narrative Science and Storify. I have no idea what the initial concept was in the original Japanese, but it was rendered into English as:

Than what was in the automatic translation of English to Japanese over Google, has become a meaningful sentence smoothly through many times.

This is strikingly poetic but, more importantly, a clear argument that opportunities for automatic generation of multi-lingual content are still out there.

Tuesday, February 14, 2012

Generating stories from social media: Getting to the meat of the tweets

The problem with social media is that there is just so damned much of it. No matter how you want to slice and dice it, the sheer volume is overwhelming. Unless you are looking at a topic or entity for which there is only a trickle of traffic, there will certainly be more information in the stream than a human can deal with on an ongoing basis.

Curation, that is, filtering by topic or keyword, has its role, but the reality is that aggressive filtering using terms, sentiment, authority, and location only cuts the hundreds of millions down to tens of thousands, a number that is still unmanageable from a human perspective. And as to readability, short lists are still lists.

The question comes down to this: what is the goal? That is, what insight do we want to draw from the stream and how do we want to communicate it?

Of course, at Narrative Science, our view is that we want to transform the massive stream of data that flows through the firehose into stories that are human readable and express the insights that are hidden within the stream. In order to do this, we have to track, filter, tag and organize the unstructured stream into a semi-structured data asset that can then be used to support automatic narrative generation.

Our first foray into this work has been to look at the Twitter traffic related to the Republican primary candidates. Using a focused data stream, our technology captures and tags the ongoing conversations and then transforms the resulting data into stories. Our first story type is focused on how the candidates are trending and what topics are the drivers behind those trends. Linking the stream to events in the world, the primaries themselves, our engine can produce a daily report that captures a snapshot of where the candidates are and what issues brought them there.
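To make the tag-then-trend idea concrete, here is a minimal sketch of how tagged tweets could be reduced to the two facts such a story needs: which candidate rose the most since yesterday, and which topics drove that traffic. This is not our production pipeline. The function names, the `(day, text)` tweet format, and the naive substring matching are all assumptions made for illustration.

```python
from collections import Counter


def tag_tweets(tweets, candidates, topics):
    """Tag each (day, text) tweet with the candidates and topics it mentions.

    Matching is a naive case-insensitive substring test (an assumption).
    """
    tagged = []
    for day, text in tweets:
        lower = text.lower()
        tagged.append({
            "day": day,
            "candidates": [c for c in candidates if c.lower() in lower],
            "topics": [t for t in topics if t.lower() in lower],
        })
    return tagged


def daily_counts(tagged, day):
    """Count tweets mentioning each candidate on a given day."""
    counts = Counter()
    for tweet in tagged:
        if tweet["day"] == day:
            counts.update(tweet["candidates"])
    return counts


def top_riser(tagged, yesterday, today, candidates):
    """Return the candidate with the largest day-over-day gain in tweets."""
    prev = daily_counts(tagged, yesterday)
    curr = daily_counts(tagged, today)
    return max(candidates, key=lambda c: curr[c] - prev[c])


def top_topics(tagged, candidate, day, n=2):
    """Return the n topics most often co-mentioned with a candidate."""
    counts = Counter()
    for tweet in tagged:
        if tweet["day"] == day and candidate in tweet["candidates"]:
            counts.update(tweet["topics"])
    return [topic for topic, _ in counts.most_common(n)]
```

The output of functions like these is the semi-structured data asset; a separate narrative step would then decide what is worth saying and how to say it.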

While it is still in beta, we thought it might be nice to provide a peek at what is coming: how we are using an ongoing stream of tweets to generate stories that express the state of the world in a form that is ever so slightly more human.


Newt Gingrich received the largest increase in Tweets about him today. Twitter activity associated with the candidate has shot up since yesterday, with most users tweeting about taxes and character issues. Newt Gingrich has been consistently popular on Twitter, as he has been the top riser on the site for the last four days. Conversely, the number of tweets about Ron Paul has dropped in the past 24 hours. Another traffic loser was Rick Santorum, who has also seen tweets about him fall off a bit.

While the overall tone of the Gingrich tweets is positive, public opinion regarding the candidate and character issues is trending negatively. In particular, @MommaVickers says, "Someone needs to put The Blood Arm's 'Suspicious Character' to a photo montage of Newt Gingrich. #pimp".

On the other hand, tweeters with a long reach are on the upside with regard to Newt Gingrich's take on taxes. Tweeting about this issue, @elvisroy000 says, "Newt Gingrich Cut Taxes Balanced Budget, 1n 80s and 90s, Newt experienced Conservative with values".

Maine recently held its primary, but it isn't talking about Gingrich. Instead the focus is on Ron Paul and religious issues.

It is only the beginning, but we see this as the first step in wrangling the firehose and turning the stream into stories.