What the Robot Reporter learned from 10,000 news tweets

In January 2015 I put together a simple twitterbot called Robot Reporter. The idea was to see if I could keep track of breaking news by monitoring Twitter for the journalists who use it as a way to find images of newsworthy events.

It works a little like this. Something happens, and a user photographs it:

Fire at Resorts World Casino / Aqueduct! @ABC7NY pic.twitter.com/YWYaoMDQ0N
— J.Barnes (@JBar387) May 15, 2016

A journalist spots that tweet, and requests permission to use the image:

@JBar387 Hi, I’m with ABC News. Is this your photo? Can we use it on all platforms and partners with credit to you? Thanks!
— Alex Faul (@amfaul) May 15, 2016

The Robot Reporter picks it up, and retweets the original – on a good Twitter client, anyone following the robot can see the original tweet content and media:

https://t.co/THHcsW8kXR
— Robot Reporter (@theroboreporter) May 15, 2016

It looks for new items every couple of minutes. And, despite a few hitches along the way (see below), it’s proved pretty effective – breaking news stories from the English-speaking world very often appear in my feed before I’ve spotted them elsewhere (even if, as has happened a few times, they’re taking place just down the street in London).

I’d always meant to do a bit of an analysis on what kind of news was picked up, as I thought it would be interesting to see in which cases media organisations source images and video this way. To that end, as well as republishing each tweet, I stored a copy of its text in a database for future reference. It took a while to get around to (I work for a small business, and have a small person to look after at home), but having passed 10,000 tweets I finally managed to run some reports.

Here’s what I learned.

What the Robot Reported

I wanted to perform a rough analysis of what events were most commonly published by the robot. To that end, each tweet was broken down into words. @-names and URLs were discarded, along with punctuation, words of two characters or less (which are unlikely to be lexical) and a few words (and, the, are, etc) that I knew would crop up frequently while providing no insight. The incidence of all the other words was tallied in a (big) database table.

With 10,220 tweets analysed, and after the filtering noted above, the script processed 66,030 words of which 17,953 were unique.

Fire and Emergency

The biggest result was no real surprise. Even a quick glance at the robot’s feed on any given day tends to show at least one thing burning, and “fire” was by far the most common word reported (1,123 instances).

Also common were “police” (254) and “car” (253), “street” (179), “people” (172), “house” (156) and “accident” (120) – all items that probably related to the everyday crime and emergency incidents that make up most local news – but none of them came anywhere close to the interest in fires.

Crime and Mortality

When it comes to crime, there’s a depressingly high number of “shooting” (52) related words – far more than “stabbing” (12) – six “shooter” or “shooters” and seven instances of “lockdown”.

The last 16 months saw several bombings and “bomb”-related incidents (including the discovery of unexploded munitions, etc), resulting in 62 entries. There were also 70 “explosion”s – not all of which were acts of terrorism or crime.

Across the whole database, 20 entries listed people “killed”, and 28 “dead”. Either could be referring to one or multiple people. There were 3 instances of “murder”.

Peace and Love

There were 68 instances of “love”, 25 of “lovely”, 50 of “happy”, and 3 of “yay”. 69 people said “thanks”.

Weather Events

As you might have expected, the robot recorded a lot of interest in the weather. Snow leads the way at 159 (along with “snowing” (28) and a million variants – I like “snowzilla” and “snowmageddon”), followed by “storm” (119) and “hail” (97). There were 31 rainbows, but only one “magicalrainbow” and one “doublerainbow” (and one “rainbowbagel”). Oh, and while we’re on uncontrollable natural events: 13 tweets including “earthquake”, but 109 for “flooding” and 36 for “flood”.

If you total up all the variants of the most common weather event words you get:

Apparently rain isn’t really worth tweeting about (33 instances).

Waiting in Line

And people like to tweet while they wait: 129 entries were American-English speakers (presumably complaining) about a “line” or “lines”, while here on rainy-misery-island we recorded 15 instances of “queue” and one “verybritishqueue”.

Sarf-East

For my fellow SE-Londoners: 5 mentions of Lewisham, 2 of Eltham, 1 of Hither Green – and 13 of Catford (for comparison: Brooklyn got 16).

WTF

Lurking in the low-incidence words: balhamgiantfoot (see below), “mctrainface” (first name “trainy”) and probably many more. Oh, and 3 instances of “wtf” itself.

.@MPSWandsworth More intel for your investigation on the #BalhamGiantFoot pic.twitter.com/j5ZK8SzASe
— Matthew Page (@snapstamatt) April 25, 2016

View the Data

Obviously this is just a very brief overview of the data – there’s probably much more to be found. If you want to have a look, I’ve put a snapshot of the word incidence data, as it stood on 15 May 2015, here – if you find anything interesting in there, please let me know.

What Went Wrong?

For a project that started life on a whim, I’m pretty happy with how The Robot Reporter turned out – it’s proved interesting to watch. But that’s not to say that it hasn’t been without its problems. In particular:

Linking to Sources

The original bot posted the content of the source tweet (credited), rather than a link to it. This meant that the media showed more reliably in all clients, but also – after a couple of months – got my API key blocked by Twitter for a ToS violation. A dumb mistake, but easily fixed, and the current version is in compliance.

A Reporter with no Ethics

More importantly, I learned the hard way that the bot was only as good as the journalists it followed. When really horrible events occurred, some journalists would request publication rights for images that were, frankly, horrific – the kind of thing that, as a journalist, I’d assumed nobody would think of publishing. And, because the bot has no ability to make an editorial judgement of its own, it would republish them too.

The first time this happened, I just stopped the whole thing while I worked out what to do. The current software has a kind of safety switch built in – in the event of something terrible happening, I can suspend publication from a web interface. While this switch is active all news-y tweets are logged, but not published.

What Next?

As of today, the Robot Reporter has been keeping busy for 16 months, and has reported around 10,200 tweets. I’ve recently moved the database over to Heroku Postgres (boring tech section below, if you’re interested), allowing me to tweak it more easily, and updated the code for the first time in ages, so it should be good to keep running into 2017 at least.

Over the next 12 months, I think it’d be interesting to monitor the journalists working on Twitter as well as their sources. The bot will keep tweeting as it is, but I plan to capture the tweets requesting publication rights as well as their sources – in six months or so we’ll see if this has turned up anything interesting.

In the meantime, you can follow the robot here.

Postscript: Boring Tech Stuff

For anyone interested in how TRR works, here’s an overview. It’s one script, of about 100 lines of PHP, running on a Heroku web dyno (hobby class). It uses the TwitterOAuth library to search and publish to Twitter.

The main “report” script is called every few minutes via a CRON job on one of my servers. It looks for tweets with a certain pattern – asking for permission to reproduce a photo or video – and then filters them to include only tweets that are replies. If the tweet to which a particular request is replying contains media, it’s a candidate for publication.

Similarly, a new “index” script is triggered regularly, processing a maximum of ten tweets at a time into words – so the word indexing may lag behind a little in really busy periods, but soon catches up. I should really rewrite this as a worker dyno task.

A database table holds a history of tweets it has published (tweet id and text), and there’s also a record of the last tweet ID its search turned up, to avoid duplication. This was originally a MySQL database on Appfog v1, but it’s now in Heroku Postgres. The “killswitch”, which can stop publication if necessary, is a Heroku config variable.

Running the whole thing costs about £10 each month. I’m still not entirely sure why I’m doing it, other than that it’s kindof interesting.