Why we should create a markup language for journalists

1. What we need

As you know, we’re trying to keep articles alive for as long as possible at Le Temps, a Swiss newspaper. That’s why we developed Zombie, a tool that identifies evergreen articles and lets us know when we should republish them. But when we pull an article from our archives, do we need to update it? How much can we change? And how much time should we put into this?

Instead of asking these questions once the articles have been published, what if we could create articles that already contained sections that could adapt to readers’ expectations over time or other criteria? Here, I’m not referring to changes in substance but rather smaller language-related aspects that need to be modified to prevent the text from becoming outdated or irrelevant.

And what if there were a programming language for journalists designed specifically for this purpose?

2. What we already have

Nothing. If you look for a programming language that serves this purpose, you won’t find one. You’ll find a colleague who assures you that the New York Times has created one, but there’s no way you’ll be able to get your hands on it. The idea is not to turn journalists into IT developers – if that were the case, Python or PHP could provide the solution – or to simply create a markup language that merely verifies the layout. I’m looking for something in-between: a programming language that can modify or generate content. If your search was more fruitful than mine, then I’ll happily take any links in comments.

The aim of this micro-language would be to fill the void between articles that are generated entirely by machine (like the articles that were automatically generated based on the results of presidential polls in the USA, or the software that replaced the journalist in charge of local news) and those written and edited in the conventional way (since the 1980s, the writing process has not changed in any fundamental way, unlike the processes for collecting and distributing information). Most newsrooms have some staff members who have programming skills or are interested in learning (at Le Temps, I’d say it’s around 20% of staff). This micro-language would be easy to use on a daily basis, unlike real programming tools, which are better suited to long-running investigations that involve collecting and analyzing data or to real IT projects lasting several weeks.

3. Some examples

Dates

“The 51st edition of the Montreux Jazz Festival began yesterday”; “The 51st edition of the Montreux Jazz Festival was held last month”; “The 51st edition of the Montreux Jazz Festival ended six months ago”. Any web journalist knows how painful it is to update these opening statements. Most of the time, we simply sidestep the issue: we remove these turns of phrase to avoid having to update the article. As a result, we no longer have these little teasers that help readers find their bearings – instead the readers have to work things out on their own, which I’m sure they don’t like.

But what if we could make the text dynamic, so that it indicated the time that had passed between the date of an event and the moment the article is read? For example:

The 51st edition of the Montreux Jazz Festival ended [[display time since: July 16, 2017]]. It was a huge success.

The text between the square brackets would change as follows:

The 51st edition of the Montreux Jazz Festival ended yesterday.

The 51st edition of the Montreux Jazz Festival ended the day before yesterday.

The 51st edition of the Montreux Jazz Festival ended five days ago.

The 51st edition of the Montreux Jazz Festival ended one month ago.

The 51st edition of the Montreux Jazz Festival ended one year ago.
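The behaviour above could be sketched in Python roughly as follows. This is a minimal illustration rather than an existing tool: the function name `time_since`, the 30-day month and the 365-day year are all assumptions made for brevity.

```python
from datetime import date

# Hypothetical helper: spell out small numbers the way a copy editor would.
_WORDS = ["zero", "one", "two", "three", "four", "five", "six",
          "seven", "eight", "nine", "ten"]


def _count(n: int, unit: str) -> str:
    number = _WORDS[n] if n <= 10 else str(n)
    plural = "" if n == 1 else "s"
    return f"{number} {unit}{plural} ago"


def time_since(event: date, today: date) -> str:
    """Render the gap between an event and the reading date as a phrase."""
    days = (today - event).days
    if days < 0:
        raise ValueError("the event is still in the future")
    if days == 0:
        return "today"
    if days == 1:
        return "yesterday"
    if days == 2:
        return "the day before yesterday"
    if days < 30:
        return _count(days, "day")
    if days < 365:
        # Rough 30-day months, capped at 11 so we never say "12 months ago".
        return _count(min(days // 30, 11), "month")
    return _count(days // 365, "year")  # ignores leap years


festival_end = date(2017, 7, 16)
print(time_since(festival_end, date(2017, 7, 17)))  # yesterday
print(time_since(festival_end, date(2017, 7, 21)))  # five days ago
print(time_since(festival_end, date(2018, 7, 16)))  # one year ago
```

The reading date is passed in explicitly rather than taken from the system clock, which keeps the function easy to test and lets the same code run on the server or at render time.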

Conditions

There is a limit, though: it is hard to get a programming language to change verb tenses or turns of phrase without things becoming too technically complex. Yet that is something we constantly have to rework when editing digital content: “The Montreux Jazz Festival will start on July 2, 2017. We interviewed festival director Mathieu Jaton just before this year’s event kicked off.” The beginning of this text will need to be reworked if we decide that the article is still of value once this year’s festival has ended and we want it to remain appealing. A simpler approach is to write both versions in advance and let a condition choose between them.

The following block of text is displayed if the article is viewed before July 2, 2017:

[[display if date < July 2, 2017]] The Montreux Jazz Festival will start on July 2, 2017. We interviewed festival director Mathieu Jaton just before this year’s event kicked off [[end]]. - Jaton: “This is a transition year for the festival.”

And after that date:

[[display if date > July 2, 2017]] Mathieu Jaton, the director of the Montreux Jazz Festival, spoke to Le Temps just a few days before the start of the festival’s 51st edition[[end]]. - Jaton: “This is a transition year for the festival.”
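Date-based blocks like these could be resolved with a small regular-expression pass. The sketch below is only one possible approach: the function name `render_date_blocks` and the exact "Month D, YYYY" date format are assumptions, and it relies on English month names being recognised by `strptime` (the default in most setups).

```python
import operator
import re
from datetime import date, datetime

# Assumed syntax: [[display if date <op> Month D, YYYY]] ... [[end]]
BLOCK = re.compile(
    r"\[\[display if date (<=|>=|<|>) ([^\]]+?)\]\](.*?)\[\[end\]\]",
    re.DOTALL,
)
OPS = {"<": operator.lt, "<=": operator.le,
       ">": operator.gt, ">=": operator.ge}


def render_date_blocks(text: str, today: date) -> str:
    """Keep each block whose date condition holds on `today`; drop the others."""
    def keep_or_drop(match: re.Match) -> str:
        op, raw_date, body = match.groups()
        limit = datetime.strptime(raw_date.strip(), "%B %d, %Y").date()
        return body.strip() if OPS[op](today, limit) else ""
    return BLOCK.sub(keep_or_drop, text)


template = (
    "[[display if date < July 2, 2017]]"
    "The Montreux Jazz Festival will start on July 2, 2017.[[end]]"
    "[[display if date >= July 2, 2017]]"
    "The Montreux Jazz Festival started on July 2, 2017.[[end]]"
)
print(render_date_blocks(template, date(2017, 6, 20)))
print(render_date_blocks(template, date(2017, 9, 1)))
```

With a strict `<` and `>` pair, neither version would be shown on July 2 itself. The sketch therefore also accepts `<=` and `>=`, which is an addition to the syntax shown in the examples.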

Geographical proximity

Geographical proximity is a slightly cynical principle – but one that all journalist teams are familiar with. According to this principle, journalists have to speak about things close to home in order to try to appeal to the reader. The French call it “death close to home,” meaning that one dead person within ten miles of the reader has as much news value as ten dead people 1,000 miles away. Without getting into an ethical debate, we can use a bit of technology to make the reader feel closer to the text. With modern web browsers, it is easy to find out the reader’s location quite accurately. The text can then be adapted accordingly:

Hurricane Harvey struck Texas on August 25, 2017. It made landfall at Corpus Christi, some [[display distance between reader.location and “Corpus Christi”]] from [[reader.location]].

Here, “reader.location” stands for the reader’s location as obtained by geolocation.

Result:

Hurricane Harvey struck Texas on August 25, 2017. It made landfall at Corpus Christi, some 200 miles from Houston.
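Once both locations are known as coordinates, the distance itself is a standard calculation. Here is a minimal Python sketch using the haversine formula. The names `distance_miles`, `distance_between` and `PLACES` are hypothetical, and the coordinates are approximate values hard-coded for illustration; a real system would get them from the browser and a place-name database.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def distance_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


# Approximate coordinates (latitude, longitude), for illustration only.
PLACES = {
    "Corpus Christi": (27.80, -97.40),
    "Houston": (29.76, -95.37),
}


def distance_between(place_a, place_b):
    """Look up two named places and return the distance between them."""
    return distance_miles(*PLACES[place_a], *PLACES[place_b])


print(round(distance_between("Corpus Christi", "Houston")), "miles")
```

This gives the distance as the crow flies, which is shorter than the distance by road, so the figure will differ from one taken from a route planner.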

Currency

We often provide amounts in euros or dollars and sometimes their equivalent in Swiss francs, but never the other way around. Readers in France (around 30% of Le Temps’ online readership) will never get the Swiss-franc amount expressed in euros.

On Monday, the French subsidiary of Swiss banking group UBS completed its acquisition of Banque Leonardo France for [[USD 3.2 billion in reader.currency]]

For a reader in Switzerland, the text would be as follows:

On Monday, the French subsidiary of Swiss banking group UBS completed its acquisition of Banque Leonardo France for CHF 3 billion

And for a reader in France:

On Monday, the French subsidiary of Swiss banking group UBS completed its acquisition of Banque Leonardo France for EUR 2.7 billion

[Note for later: the calculation must be based on the exchange rate at the time of publication, even if this distorts the information if exchange rates change significantly down the line.]
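A conversion that respects this note could look like the Python sketch below. Everything in it is an assumption for illustration: the function names are hypothetical, and the rates are placeholder values rather than real quotes. The point is only that the rate table is stored with the article at publication time instead of being fetched when the article is read.

```python
# Placeholder rates frozen at publication time (illustrative, not real quotes).
RATES_AT_PUBLICATION = {
    ("USD", "CHF"): 0.96,
    ("USD", "EUR"): 0.85,
}


def convert(amount, source, target, rates=RATES_AT_PUBLICATION):
    """Convert using the rate table saved with the article."""
    if source == target:
        return amount
    return amount * rates[(source, target)]


def format_amount(amount, currency):
    """Write large sums the way a newspaper would: 'EUR 2.7 billion'."""
    if amount >= 1e9:
        return f"{currency} {amount / 1e9:.1f} billion"
    if amount >= 1e6:
        return f"{currency} {amount / 1e6:.1f} million"
    return f"{currency} {amount:,.2f}"


def in_reader_currency(amount, source, reader_currency):
    converted = convert(amount, source, reader_currency)
    return format_amount(converted, reader_currency)


for currency in ("USD", "CHF", "EUR"):
    print(in_reader_currency(3.2e9, "USD", currency))
```

A missing currency pair raises a `KeyError` here; in practice the text would more likely fall back to the original amount.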

Depending on the text, the author will also have to decide whether to display the amount both in the newspaper’s local currency and the reader’s currency (i.e., CHF 20, or EUR XX) or provide the amount only in the reader’s currency. To do this, a condition has to be added:

I paid CHF 5.50 [[display if reader.currency IS NOT CHF]]([[CHF 5.50 in reader.currency]]) [[end]] for my beer. Scandalous!

4. Advanced functions

Information about the reader

We could go even further by using a data management platform (DMP), a tool that collects data on visitors to a site (Tealium is used at Le Temps). Here, there’s a more obvious ethical dilemma: to what extent can you adapt a text to the reader’s (presumed) profile? It is possible to determine a reader’s gender, approximate age, etc. and adapt the text accordingly. A reader who lived through Reagan’s presidency, for instance, wouldn’t need any explanation of certain details of his presidency. On the other hand, older readers might need explanations when we talk about younger celebrities or technologies like Snapchat. So:

On July 26, 2017, Rihanna [[display if reader.age > 20]], the Barbados-born R&B singer who was the first artist to break the Beatles’ record when her songs spent a total of 60 weeks at the top of the US charts, [[end]] dressed in a rather unusual outfit as she visited the Palais de l’Elysée to try and persuade President Emmanuel Macron to finance her humanitarian fund for education.

[Note for later: take into account the reader’s age at the time of publishing rather than at the time of reading.]
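Conditions on reader variables, such as `reader.age > 20` or `reader.currency IS NOT CHF`, could be evaluated against a simple dictionary of known values. The following Python sketch assumes a tiny grammar of the form "name operator value". The function name `evaluate`, and the choice to treat a missing variable as "condition not met", are assumptions and not part of any existing tool.

```python
import operator
import re

OPS = {"<": operator.lt, ">": operator.gt,
       "=": operator.eq, "IS NOT": operator.ne}
CONDITION = re.compile(r"^\s*([\w.]+)\s*(IS NOT|<|>|=)\s*(.+?)\s*$")


def evaluate(condition: str, variables: dict) -> bool:
    """Evaluate 'name op value' against the reader's known variables."""
    match = CONDITION.match(condition)
    if match is None:
        raise ValueError(f"cannot parse condition: {condition!r}")
    name, op, raw = match.groups()
    if name not in variables:
        return False  # assumption: unknown reader data hides the passage
    actual = variables[name]
    # Compare numbers as numbers and everything else as text.
    expected = type(actual)(raw) if isinstance(actual, (int, float)) else raw
    return OPS[op](actual, expected)


# Per the note above, reader.age would hold the age at publication time.
profile = {"reader.age": 34, "reader.currency": "EUR"}
print(evaluate("reader.age > 20", profile))
print(evaluate("reader.currency IS NOT CHF", profile))
```

Hiding the optional passage when a variable is unknown is only one possible default; for explanatory asides it may be safer to show them instead.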

Variables added by the journalist

Once journalist teams have gotten used to the initial options offered by this programming language, we could enter phase two, which would involve making the programming slightly more complicated by adding variables. Let’s imagine a text about unemployment in various French-speaking countries. The author wants the text to be reader-centric: the text must be based on the reader’s country of residence, and it’s the journalist-programmer’s job to make sure that all the information refers to it – that’s the point of the article. This could be done by making the unemployment data variable:

Canada.unemployment.March.2017 = 6.7%
Switzerland.unemployment.March.2017 = 3.7%
France.unemployment.March.2017 = 9.8%
Belgium.unemployment.March.2017 = 6.9%

Canada.unemployment.March.2016 = 14%
Switzerland.unemployment.March.2016 = 2%
France.unemployment.March.2016 = 10%
Belgium.unemployment.March.2016 = 5.8%

We could then begin the article with data from the right country:

Unemployment in [[reader.location.country]] stood at [[reader.location.country.unemployment.March.2017]] in March 2017. It was [[difference between reader.location.country.unemployment.March.2017 and reader.location.country.unemployment.March.2016]] compared with the previous year.

This would result in:

Unemployment in France stood at 9.8% in March 2017. It was down 0.2 percentage points compared with the previous year.

We could take it even further if we added conditions such as country comparisons:

[[Display if reader.location.country.unemployment.March.2017 > *.unemployment.March.2017]] Unemployment in [[reader.location.country]] is the highest of any French-speaking country.[[end]]

[[Display if reader.location.country.unemployment.March.2017 < *.unemployment.March.2017]] Unemployment in [[reader.location.country]] is the lowest of any French-speaking country.[[end]]

For this to work – i.e., for journalists to actually use this relatively complicated function – the language needs to be quite flexible and easily adaptable to different ways of naming, defining and finding variables.

[Note for later: a standard text would have to be available for situations in which the reader is in an unknown country. For Le Temps, the easiest option would no doubt be for a Swiss-centric approach: any reader in an unknown country would be considered Swiss.]
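Pulling these pieces together, the unemployment example could be sketched in Python as follows. The figures are copied from the example above; the function name `unemployment_paragraph`, the dictionary layout and the `DEFAULT_COUNTRY` fallback are assumptions made for illustration.

```python
# Figures copied from the example above, in percent (March of each year).
UNEMPLOYMENT = {
    "Canada": {2016: 14.0, 2017: 6.7},
    "Switzerland": {2016: 2.0, 2017: 3.7},
    "France": {2016: 10.0, 2017: 9.8},
    "Belgium": {2016: 5.8, 2017: 6.9},
}
DEFAULT_COUNTRY = "Switzerland"  # fallback for readers in an unknown country


def unemployment_paragraph(reader_country):
    """Build the opening paragraph for the reader's country."""
    known = reader_country in UNEMPLOYMENT
    country = reader_country if known else DEFAULT_COUNTRY
    now = UNEMPLOYMENT[country][2017]
    before = UNEMPLOYMENT[country][2016]
    change = round(now - before, 1)  # rounding hides float noise
    if change == 0:
        trend = "unchanged"
    else:
        direction = "up" if change > 0 else "down"
        trend = f"{direction} {abs(change):.1f} percentage points"
    text = (f"Unemployment in {country} stood at {now}% in March 2017. "
            f"It was {trend} compared with the previous year.")
    # Country comparison: the "*" wildcard becomes a scan over all countries.
    rates_2017 = [data[2017] for data in UNEMPLOYMENT.values()]
    if now == max(rates_2017):
        text += (f" Unemployment in {country} is the highest "
                 "of any French-speaking country.")
    elif now == min(rates_2017):
        text += (f" Unemployment in {country} is the lowest "
                 "of any French-speaking country.")
    return text


print(unemployment_paragraph("France"))
print(unemployment_paragraph("Japan"))  # unknown country: Swiss text
```

The change is reported in percentage points because it is the difference between two rates, not a relative change.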

5. A technical starting point

After I chatted with several developers on Facebook, a consensus seems to have emerged about the proper starting point. The best thing to do would be to write a parser in Python, given that the idea is not to develop a comprehensive, full-fledged language. PHP could also be used to write a WordPress or Drupal module that would enable any blogger to add automated pieces of information to their text. Part of the code could also run in JavaScript, on the reader’s side. In addition, the tool should help the author write the text by offering auto-complete, help bubbles, etc.

Most of the functions and options that would be of interest are simplified versions of what standard programming languages already offer (e.g., Datediff(), now(), if, etc.). The variables are usually provided by a DMP or the browser and are accessible in JavaScript. The raw data would have to be used intelligently: the time between an event and the reading date should be expressed in days if it is very short, then in months and finally in years. The same is true for distances: they should be expressed in hundreds of feet if they are short, then in miles, and finally rounded to the nearest ten miles. And so on and so forth.
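For distances, that rule could be sketched in Python as below. The cut-off points (one mile and twenty miles) and the function name `humanize_distance` are assumptions; only the three tiers themselves come from the description above.

```python
FEET_PER_MILE = 5280


def humanize_distance(miles: float) -> str:
    """Pick a unit and a precision that suit the size of the distance."""
    if miles < 0:
        raise ValueError("distance cannot be negative")
    if miles < 1:
        # Short distances: hundreds of feet, never less than 100.
        feet = max(100, round(miles * FEET_PER_MILE / 100) * 100)
        return f"{feet:,} feet"
    if miles < 20:
        # Medium distances: whole miles.
        whole = round(miles)
        return f"{whole} mile" + ("" if whole == 1 else "s")
    # Long distances: rounded to the nearest ten miles.
    return f"{round(miles / 10) * 10:,} miles"


for value in (0.1, 7.4, 183, 1234):
    print(humanize_distance(value))
```

The same tiered approach would apply to amounts of money or to time spans: choose the unit first, then the rounding.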

6. The challenge of archiving evolving content

One of the issues that is a cause for debate in these types of projects is how to archive content that is not static. How can readers compare their reading of a text (e.g., in Facebook chats) if they don’t have access to the same text? And how can researchers quote a text that is always changing? I had a fascinating discussion with Professor Frédéric Kaplan, who heads the laboratory of digital humanities at EPFL, back in 2016, and it brought some answers to my questions. Although the exchange wasn’t directly about this project, it made me realize that these types of questions will keep coming up, even without algorithm-based texts. Here’s why:

1. Texts are now modified several times after they are initially put online, and this has been the case for several years now. Although I have not kept a precise record, I’d say that at Le Temps we change a text on average four or five times after its initial publication. The figure is much higher for an evergreen text, not to mention live content, which by definition is changing all the time.
2. Outside factors mean that the reading experience is different for everyone. It will depend on the browser, screen size, internet connection, whether the reader has an ad blocker, etc. As we don’t read the content in the same setting, our reading experience is not the same.
3. Certain aspects of the website on which the content is published will influence how it is presented online. What ads were on screen when reading? What related content was provided at the end of – or within – the article? Have changes in programming altered how the content is presented? Have new elements been added in the text (such as a subscription form for a newsletter)?

So what does Professor Kaplan think about all this? His view is that we should look at archiving in terms of a viewing experience and not in terms of the essence of the text, sort of like the way that you can play an old video game again but can’t archive part of it.

There is an ethical question in all this: should we tell the reader that part of the text was created using an algorithm? On the one hand, I think only experts are interested in this kind of question. But on the other hand, I think that kind of transparency is needed if we want to build readers’ trust in semi-automatic texts.

7. A non-exhaustive list of functions, variables and conditions

Here’s an initial list of options that could form part of this micro-language:

1. [display number of years | days | months since: 16 August 1983]
2. [display if date < 12 June 2017]
blah blah
[end display]
3. [display if reader.location = california]
blah blah
[end display]
4. [display distance between reader.place and los angeles]
5. [if device = smartphone | computer | tablet]
6. [if reader.age < 20]
7. [if gap between publication.date and reading.date > 20]
8. [display CHF 90 in reader.currency]

Reader variables
1. reader.subscriber
2. reader.gender
3. reader.location
4. reader.town
5. reader.currency
…provided by the browser, DMP or the user’s account info.

Technical variables
1. browser.system
2. browser.device

Declaring variables: John = “1” or John = man.

8. Your input

For the moment, I haven’t been able to turn my ideas into a real project – they still need to be developed further. There are almost certainly other ways of approaching the subject and other avenues to explore. That’s why I created this blog post: I’d like your help to get things moving, so please feel free to contribute.

My thanks to Pat Jayet, Frédéric Sidler and Stéphane Koch for their input.

One response to “Why we should create a markup language for journalists”

Julien Grange says:

October 10, 2017 at 1:52 pm

Interesting article. Strongly agree with the above as long as customisation stays within certain limits. Would be a shame to bring the Filter Bubble into the newspapers’ content itself. Good luck to your team!
