Only a pragmatic approach to NLG delivers reliable robot journalism

Every month we send hundreds of thousands of robot-written articles to news publishers in Europe and North America. Almost all of them are published directly onto sites and apps without going through any editorial checks, so predictability is key when texts are generated. To ensure this reliable language quality, we have developed our own flavour of Natural Language Generation: pragmatic NLG.

As the transformation to digital accelerates due to the current worldwide crisis, we’re quickly approaching a turning point in publishing. Publishers are looking for ways to leverage new technology in order to become truly digital, both in the way newsrooms work and in how they serve readers with their journalism. AI and NLG are key aspects of the new technology of publishing. AI, in particular, is now a topic at every virtual industry event and the focus of a plethora of projects, including a big international collaboration led by the LSE, exploring how it might be deployed to support the transformation into the future.

In contrast, the core of the work at United Robots is focused on the here and now, on delivering automated content consistently and reliably, based on structured and regularly published data sets. We’re not new — we have been doing this since 2015, with Swedish local media group Mittmedia our first publisher partner.

Our robots reflect the fact that we originated in journalism. So what does that mean? Fundamentally, it means that our focus is on language quality, variety and reliability. In order to achieve this, we’ve built our NLG technology on rules and algorithms, developed in collaboration with journalists and based around how they reason when they write.

We also leverage machine learning (ML) for a number of tasks. Machine learning is based on prediction and probability, using e.g. recurrent neural networks. In a language context, that means that as sequences of words are passed through the network, they form the basis for how the network calculates the probability of which words should come next as new text is generated. The network “teaches itself” as it grows. While this type of NLG works well for things like summarising texts or driving chatbots, at this point in its development it’s not ideal when what you’re aiming to do is deliver consistently accurate editorial texts. However, we take advantage of ML in our image selection process from e.g. Google StreetView, where we’ve taught the system to flag if the property pictured is obscured by e.g. a lorry. We also deploy ML for tooling when we develop text algorithms, as well as for spell checks. And no doubt we’ll develop new ML-based systems in the future.
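To make the idea of "calculating the probability of what words should come next" concrete, here is a minimal sketch. It deliberately swaps the recurrent neural network for a much simpler bigram count model, and the function names and the tiny corpus are hypothetical, chosen only for illustration. It is not a description of any production system.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count, for every word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts


def next_word_probabilities(counts, word):
    """Turn the follower counts for `word` into probabilities."""
    followers = counts.get(word.lower(), Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # word never seen with a follower
    return {w: n / total for w, n in followers.items()}


# Hypothetical toy corpus, for illustration only.
corpus = [
    "the team won the match",
    "the team lost the match",
    "the match ended",
]
model = train_bigrams(corpus)
print(next_word_probabilities(model, "the"))
```

Even in this toy version, the output depends entirely on what the model has seen before, which is why the most likely continuation is not guaranteed to be the factually correct one.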

Building for Variety and Reliability

Our core job is to build NLG robust enough to deliver reliable and accurate texts which can be distributed directly to publishers’ customers. That’s what we do for e.g. Aftonbladet, Zeitungsverlag Waiblingen and Bergens Tidende. This is a key USP of our content products: they save editorial hours. We can’t afford mistakes, which is why our NLG tech is based on rules and algorithms. This also means that on those rare occasions when an error does occur, we know why and how it happened and can correct it, which is extremely tricky to do with language generated through ML.

ML models strive for a single correct outcome: the most probable next words, based on the words that have come before. With our pragmatic NLG, the aim is instead to be able to put in e.g. the same three data points and generate many different texts, i.e. variation in the outcome. The beauty of pragmatic NLG is that it also allows us to create bespoke texts for each publisher, e.g. by incorporating editorial style guides into the robots. And of course, while we currently generate text in six different languages, we can relatively easily add more within our structured approach.
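As a rough illustration of how the same three data points can yield several different but equally accurate texts, here is a minimal rule-based sketch. The templates, field names and sample data are all hypothetical and far simpler than a real system. Returning the template index alongside the text shows one way an output could be traced back to the rule that produced it.

```python
import random

# Hypothetical sentence templates; each one uses the same three data points.
TEMPLATES = [
    "The house at {address} has been sold for {price:,} kronor to {buyer}.",
    "{buyer} has bought the house at {address} for {price:,} kronor.",
    "For {price:,} kronor, the house at {address} has a new owner: {buyer}.",
]


def generate(data, rng):
    """Pick one template and return (text, template_index) for traceability."""
    index = rng.randrange(len(TEMPLATES))
    return TEMPLATES[index].format(**data), index


def all_variants(data):
    """Return every text the rules can produce for the given data."""
    return [template.format(**data) for template in TEMPLATES]


# Hypothetical sample record, for illustration only.
sale = {"address": "Storgatan 1", "price": 2500000, "buyer": "Anna Berg"}

text, rule = generate(sale, random.Random())
print(f"rule {rule}: {text}")
```

Because every output comes from a known template filled with the input data, each variant contains exactly the facts that were put in, and a faulty sentence can be fixed by correcting a single rule.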

Pragmatic NLG has been, and continues to be, developed out of a desire to help publishers with their newsroom challenges right now. While it will evolve over time, it’s already solid enough to reliably and automatically generate and publish hundreds of thousands of texts to publishers every single month.
