Can computers quote like human journalists, and should they?

Quotations in journalistic texts are regarded as word-for-word recollections of what an interviewee has stated. However, there is very little research on actual quoting practices. This is why journalist and scholar Lauri Haapanen decided to focus on quoting in his PhD. In this blog post he reflects on how NLG systems could benefit from knowledge of how journalists actually quote.

When a reader enjoys a story in a magazine, they have no way of knowing how the interview between journalist and source was actually conducted. Even the quotations – widely considered to be verbatim repetitions of what was said in the interview – may be accurate, but they may just as well be heavily modified, or even partly fabricated.


“For journalists, and their editors, the most important thing is of course to produce a good piece of writing. This means they might be forced to make compromises, since the quotations must serve a purpose for the story,” Lauri Haapanen explains.

The Immersive Automation project focuses on news automation and algorithmically produced news. Since human-written journalistic texts often contain quotations, automated content should also include them to meet the traditional expectations of readers.

In the development process of news automation, it is realistic to expect human journalists and machines to collaborate.
“A text generator could write a story and a journalist could interview sources and add quotations in suitable places,” says Haapanen.

At a later stage, when text-generating algorithms become more sophisticated, Haapanen suggests that software developers also build in criteria for the selection, positioning, and modification of quotations.

This is where Haapanen’s research into journalistic quoting practices could be useful. In his dissertation he identified nine essential quoting strategies that journalists use when writing articles.


Based on empirical data, Haapanen found that when journalists extract selected stretches from the interview discourse, they aim at (1) constructing the persona of the interviewee, (2) disclaiming responsibility for the content, and/or (3) adding plausibility to the article.

To replicate this, machines should be able to mine these kinds of segments from the available source data.

When journalists then position the selected stretches into the emerging article, they aim at (4) constructing the narration and (5) pacing the structure. When journalists modify the linguistic form and meaning of the selected stretches, they aim at (6) standardising the linguistic form, although they occasionally (7) allow some vernacular aspects that serve a particular purpose in the storyline. Furthermore, journalists aim at (8) clarifying the original message and (9) sharpening the function of the quotation.
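As a crude first approximation, the selection criteria (1)–(3) could be operationalised as heuristic scores over candidate interview segments. The cue-word lists and scoring below are invented for illustration; they are not Haapanen’s findings or anything implemented in the Immersive Automation project.

```python
# Hypothetical heuristic: score an interview segment against the three
# selection aims — persona construction, responsibility disclaiming, and
# plausibility. Cue words are illustrative assumptions only.

CUES = {
    "persona":      ["I", "me", "my", "feel", "believe"],    # first-person colour
    "disclaiming":  ["claim", "accuse", "allege", "blame"],  # contested content
    "plausibility": ["percent", "year", "study", "data"],    # concrete support
}

def score_segment(segment):
    """Return a per-aim cue count for one interview segment."""
    words = segment.lower().replace(",", "").replace(".", "").split()
    return {aim: sum(words.count(cue.lower()) for cue in cues)
            for aim, cues in CUES.items()}

seg = "I believe the data show a ten percent rise this year."
print(score_segment(seg))
# → {'persona': 2, 'disclaiming': 0, 'plausibility': 3}
```

A real system would of course need far richer features than keyword counts, but even this sketch shows how a qualitative taxonomy can be turned into a rankable signal for quote candidates.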

Within the scope of the Immersive Automation project, we are looking at how these nine quoting practices can be incorporated into automated news generation.
“After all, computers must learn to ‘think’ like human journalists in the process of quoting,” Haapanen says.

Lauri Haapanen defends his thesis at the University of Helsinki on Saturday March 11. He also appeared on YLE’s radio program Julkinen sana on Wednesday March 8. He has written a blog post for The Media Industry Research Foundation of Finland, the advocacy organisation for the Finnish media industry, and appears in an article in Suomen Lehdistö.

NLG is an essential part of the Immersive Automation research project

NLG, or natural language generation, is a subfield of Artificial Intelligence and Computational Linguistics. Since NLG technology enables the automation of routine document creation, it is an essential part of the Immersive Automation project. Mark Granroth-Wilding is a research associate at the Department of Computer Science at the University of Helsinki, as well as one of the experts on the Immersive Automation (IA) team. He specialises in Artificial Intelligence, and in particular Natural Language Processing, and in this blog post he explains the basics of NLG.

“NLG consists of techniques to automatically produce human-intelligible language, most commonly starting from data in a database. It can be thought of as a process of turning a symbolic representation of data into human language,” Mark Granroth-Wilding explains.
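The idea of turning a symbolic record into human language can be sketched in a few lines. This is a minimal illustration only, assuming an invented sports-attendance record; real NLG systems add document planning, aggregation, and proper linguistic realisation.

```python
# Minimal data-to-text sketch: a symbolic record becomes an English sentence.
# Field names and figures are invented for illustration.

def realise(record):
    """Map a structured data record to a human-readable sentence."""
    direction = "rose" if record["change_pct"] >= 0 else "fell"
    return (f"{record['team']} attendance {direction} "
            f"{abs(record['change_pct'])}% to {record['visitors']:,} "
            f"in {record['month']}.")

record = {"team": "HJK Helsinki", "month": "May",
          "visitors": 8200, "change_pct": -3.5}
print(realise(record))
# → "HJK Helsinki attendance fell 3.5% to 8,200 in May."
```

Even at this toy scale, the split is visible: the data lives in a symbolic representation, and a realisation step decides wording (here, choosing "rose" versus "fell" from the sign of the change).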


The essential idea of the Immersive Automation research project is to create the means to produce news in ways humans cannot – for example, hundreds or thousands of articles at once. NLG provides the tools to produce language at such volume.

“You could of course just supply data to audiences in a raw format – without NLG – but we want to present information in an easier, more understandable format.”

In recent years, we have seen a massive growth in the use of statistical methods and machine learning, including in NLG. However, Granroth-Wilding points out that this has not yet been seen in many of the practical applications of NLG.

“This is what makes NLG a hot topic, and this is also the reason why we are looking into this in the IA-project.”


While some forms of news automation have been introduced into newsrooms around the world, the systems have so far been language-dependent and template-based. This means that the systems rely heavily on human contribution and focus mainly on languages spoken by large groups of people. One of the most widely used systems is Wordsmith, developed by Automated Insights in Durham, North Carolina. Associated Press, among others, uses the system.

Improving the state of the art

“Wordsmith would probably be the most prominent example of NLG in automated text production. However, what we want to do here is something even more sophisticated. There are currently no examples of a system capable of independently producing highly variable news texts.”

Currently, automatically produced news is also limited to areas with large amounts of numeric data, such as sports news and earnings reports, since numeric data is easy to combine with text templates. The purpose of the Immersive Automation project, however, is to take NLG and news automation even further.
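The template-based approach described above can be sketched directly: numeric fields from a report are slotted into a fixed sentence frame. The template, company name, and figures below are invented examples, not any vendor’s actual system.

```python
# Sketch of template-based news automation: numeric data slotted into a
# language-specific sentence template. All values are invented.

TEMPLATE = ("{company} reported revenue of EUR {revenue}m for {quarter}, "
            "{direction} {delta}% year-on-year.")

def earnings_sentence(data):
    """Fill the earnings template from a dict of numeric report fields."""
    direction = "up" if data["delta"] >= 0 else "down"
    return TEMPLATE.format(
        direction=direction,
        delta=abs(data["delta"]),
        **{k: data[k] for k in ("company", "revenue", "quarter")})

data = {"company": "Example Oyj", "revenue": 412.5,
        "quarter": "Q1 2017", "delta": 6.3}
print(earnings_sentence(data))
# → "Example Oyj reported revenue of EUR 412.5m for Q1 2017, up 6.3% year-on-year."
```

The limitation the text describes is visible here: the template is tied to one language and one story type, and every variation in wording has to be authored by a human in advance.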

“Our focus now is to work out how state-of-the-art statistical NLG methods can be incorporated into real journalistic processes. Working out how these techniques can be made intelligible to newsrooms, as well as reliable in accurately conveying their source data, is the big challenge that we’re undertaking in this project.”

News production becomes automatic – meta editors are coming

News production is changing as the routine parts of editorial work are being automated. VTT and the University of Helsinki will explore how interesting and high-quality news can be produced automatically, as well as what kind of new user experiences can be offered.

In order to serve the increasingly demanding audiences in multiple digital channels, media houses are trying to automate the most routine editorial work. This way, the editors can concentrate on writing more challenging special stories and giving their audiences opportunities to immerse themselves in increasingly personalized news experiences.

The University of Helsinki and VTT Technical Research Centre of Finland Ltd will research automatic news production where a personalised news experience is enabled by data and machine learning. Hyper locality and audience participation are the key elements here.

“Semi-automatic solutions will be common practice: the editor will finalise the automatically produced text and define templates for automatic news-generating programs. In the future, all editors will be, to some extent, meta editors,” believes VTT’s research professor Caj Södergård.

The degree of automation rises gradually

So far, automation in news production has been trialled by big players, such as the American press agency AP (Associated Press), for example in writing summaries of financial statements. In addition to financial news, sports news is already being produced automatically around the world.

“One can expect that producing other types of news can be automated up to a certain point, depending on the availability of data. More demanding journalism – such as leading articles and in-depth articles – will remain the task of human journalists,” states journalism researcher Carl-Gustav Lindén from the Swedish School of Social Science, part of the University of Helsinki.

“The University of Helsinki studies how data science can be applied to news production and its automation. We develop tools based on data mining and machine learning to help journalists streamline their work,” says Professor Hannu Toivonen from the Department of Computer Science at the University of Helsinki.

VTT Technical Research Centre of Finland Ltd studies how automatically produced content affects the audience and what promotes and prevents an immersive experience. VTT is also responsible for the demonstration of a news ecosystem and studies new ways to distribute content in cooperation with the technology companies participating in the project.

The main funder of the Immersive Automation project is Tekes, through its Media remake programme. Other funders are the Media Industry Research Foundation of Finland, The Swedish Cultural Foundation in Finland, Sanoma, Alma Media, Conmio, Keski-Pohjanmaan Kirjapaino, and KSF Media, as well as the research institutions.