Cold metal

There is a new Google enterprise to get searchable digitised newspaper archives online. A great idea. (I’ve already had loads of educational fun with the Times archive and the Victorian British press archive that went subscriber only, just when it had completely engrossed me.)

The Google blog page has a link to Google’s press archive search but there’s a warning that you won’t find everything indexed. They suggest some searches.

Not every search will trigger this new content, but you can start by trying queries like [Nixon space shuttle] or [Titanic located]. Stories we’ve scanned under this initiative will appear alongside already-digitized material from publications like the New York Times as well as from archive aggregators, and are marked “Google News Archive.”

This instantly arouses my vapourware bullshit detector. Hmm. Space shuttle. The Titanic. First man on the moon… Maybe they’ve just stuck together a few very standard searches and plan to add lots more information as it becomes popular… I feel impelled to test it a bit more rigorously.

I try a few off-the-wall searches. I pick the topics solely on the randomish basis that somebody’s mentioned the words to me in conversation today:

  • “Dolph Lundgren” – 4,370 articles
  • “Japanese swearword” – 279 articles
  • “linear algebra” – 3,520 articles
  • “Large Hadron Collider” – 3,370 articles
  • “Frozen vegetables” – 236,000 articles

Blimey. This actually works really well. I can’t claim to have clicked on more than a handful of links, but the ones I did click on were legit. It’s definitely not vapourware. It’s already damn good.

So, the big test, then. I’m going for my favourite indicator that a human twat-a-tron is at work. “Political correctness gone mad” gets 3,240 print archive hits.
Wait. I run it again, to see if the British press is represented. Just because I suspect that it must appear several times a day, so 3,240 seems a relatively small total. (It’s outnumbered by all the phrases above except “Japanese swearwords” and the consensus of press opinion seems to be that these don’t really exist.)

This time I get a mere 1,550 hits. Bloody inconsistent Google. Plus, the timeline is bizarre to say the least. It claims the first mention was between 1880 and 1559. The next was in 1782, then there’s one from 1805. … I think not. They are making these up. The 1958 ones look like a mistake as well.

Closer inspection reveals that the “dates” have leaked in from elsewhere in an article. Most examples are huddled around the last 8 years. In fact there’s barely an instance of political correctness gone mad until 1998. It’s only in the past couple of years that the full flowering of the phrase has taken off.

“The PC brigade” (h/t Alun) got 467. Ignoring the dating oddities, these are also clustered around the turn of the century, with a linguistic take-off from 2000.

These numbers are tiny. Ah ha. Google hasn’t archived the Daily Mail. 🙂 (No hits for “the Daily Mail is shit”, h/t Tom Donald)

Look, if they are only going to index serious newspapers, there is going to be no fun in this.

However, they must have archived a fair bit of newsprint crap, because “the Rapture” brings back a stunning 18,300 reports.

First mention is 0 AD 😀