7 worst blog scraper tricks

Hah, I lied. I was using one of the top x ways to get people to look at a post.

(There must be lots of people who think – “Ah, 5 secrets of reaching new customers? That’s an interesting number. Not too many for my poor little brain to take in. But big enough to have some meaningful content. And, secrets? Oooh, I love secrets! I love them so much that I will suspend the inner voice that is screaming out that, by definition, it can’t be a SECRET if it’s on the Internet.” Well, SEO experts seem to assume there are lots of these people. :-D)

In fact, there’s only one evil trick that I’m going to bitch about here.

Blog comment spam goes with the blog commenting territory. Akismet is pretty good at stopping most of it, at a cost of losing a few real comments. Otherwise, if we are too stupid to see the difference between a real person and an advert for some spurious Internet products embedded with links to online pharmaceutical sales, we don’t deserve to stop them. Maybe it’s some sort of Turing test game spammers are playing, examining our capacity to tell whether we are interacting with a human or a machine.

Standard comment spam now half-heartedly tries to convince you it’s from a human by saying things like “I loved reading about Whydontyoublog though I can’t say I understand all of it.” (I stupidly deleted a good few of these just before I decided to write this, so no direct quotes. My bad.) Obviously, if you hadn’t seen the same words in a few dozen spams, and if the “insert blog name here” bit made sense – unlike in our examples – you might let this through.
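For what it's worth, the "seen the same words in a few dozen spams" test is easy enough to sketch in code. This is only a rough illustration, not what Akismet or any real filter does: the function names, the `<blog>` placeholder and the `min_count` threshold are all made up for the example. It masks the blog name, flattens case and punctuation, and counts how often the same wording turns up.

```python
import re
from collections import Counter


def template_of(comment, blog_name):
    """Reduce a comment to a rough 'template' (hypothetical helper).

    The blog name is replaced by a placeholder, then case, punctuation
    and extra whitespace are flattened so near-identical spams collapse
    to the same string.
    """
    masked = re.sub(re.escape(blog_name), "<blog>", comment, flags=re.IGNORECASE)
    masked = re.sub(r"[^\w<>\s]", "", masked.lower())
    return " ".join(masked.split())


def repeated_templates(comments, blog_name, min_count=3):
    """Return the templates that occur at least min_count times."""
    counts = Counter(template_of(c, blog_name) for c in comments)
    return {t: n for t, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    sample = [
        "I loved reading about Whydontyoublog though I can't say I understand all of it",
        "i loved reading about whydontyoublog though I cant say I understand all of it!",
        "I loved reading about WhyDontYouBlog though I can't say I understand all of it.",
        "Nice post about cats",
    ]
    print(repeated_templates(sample, "Whydontyoublog"))
```

Anything that comes back from `repeated_templates` is wording that has shown up several times, which is a strong hint that no human typed it.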

Other ones are from scrapers. I really don’t care if scrapers take our text and present it in sometimes comically inappropriate situations.

The ones for “jokes” scrapers used to be entertaining in themselves. The blog would have a righteous rant about something and say something like “They must be joking” talking about Uncommon Descent or some such unfunny creationist nonsense. A Jokes scraper would spot the “joking” keyword and put the post on its Jokes sites. If there was anyone reading them who expected a punchline to these posts, they must still be waiting.

Indeed, we would sometimes deliberately use the J word just to see how unfunny an atheist rant had to be to avoid getting scraped. We never found a limit. The jokes scrapers just seem to have withered away.

However, these scrapers couldn’t do us any discernible harm. A few sites with no real content seemed just a waste of somebody else’s Internet bandwidth.

But their next-gen offspring really annoy me. These do the same thing but with a subtle difference. They attribute the post they’ve taken and stuck on their blog to somebody else. These sites say things like “romanianbride wrote an interesting post today” and follow it with a post from my/your blog.

You blink in disbelief a few times. You think “Wow, what an amazing coincidence, romanianbride also had an argument about atheism with her son Xavier’s teacher today and met her fundy neighbour Mr Roberts in the supermarket and had to have her cat CuddlySnowballIII dewormed. And she’s written about it in the exact same words as I did!” (Well, assuming that’s your blog content…)

I started out thinking this doesn’t matter, given that it’s not as if we want our noms de blog used anyway and that it’s unlikely that anyone would visit these scrapers except by mistake.

But it does matter a bit. It matters in terms of Google rank for a start. One of the things search engine spiders look for is “uniqueness of content.” If your blog has the same content as half a dozen scrapers – at least some of which will be phishing traps and/or pathways to online sales of probably fake pharmaceuticals, casinos or pr0n – how are the search engine spiders and spam detecting algorithms supposed to know that your site is legit? Your blog starts to look like just another scraper site. In fact, it doesn’t matter if Akismet bins the spam that announces this to you. It’s already damaged your reputation. Giving them the spurious “authority” of a link from your comments, if you use “follow”, is not going to hurt you any more.
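To give a feel for how "same content" might be measured, here is a minimal sketch that compares two texts by their overlapping word sequences (shingles) and a Jaccard score. I have no idea what the search engines actually do, so treat this purely as an illustration; the function names and the shingle size `k` are my own assumptions.

```python
def shingles(text, k=5):
    """Return the set of k-word sequences in text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        # Very short texts are treated as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=5):
    """Share of shingles two texts have in common, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    original = "one two three four five six seven eight nine ten"
    scraped = "romanianbride wrote " + original
    print(jaccard(original, scraped, k=3))
```

A scraped copy with a line of fake attribution stuck on the front still scores close to 1.0 against the original, which is exactly why the two pages are hard to tell apart by content alone.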

Looking for some subtle way to strike back, I first thought of posting the IPs of every one of them here. Bah, way too naive a solution. I followed up some of these IPs using an IP-to-map lookup program. They were pretty clearly spoofed or wireless drive-by linkages. Unless, that is, there really are people, universities and companies so dumb that they steal and misattribute other people’s content, but still operate their legitimate blogs and websites from those addresses. Not likely, is it? So, I won’t post the IPs of these people, who are most likely to be victims themselves.

That being my first thought, I don’t have a second. I will try to find a suitable response though, in my quarter-skilled way, and post it here if I ever come up with any reasonable suggestions.
If anyone else knows a solution, please let us know.

Mainly for Mana at Skepticum

Look what Google ads has added to a post with the title “Praying for rain and wet t-shirts” on Mana’s blog…
[Screenshot of Google ad]

Maybe your prayers have been answered, Mana/Black Sun/Billy and any other commenters who preferred the idea of an equally effective Dionysian alternative to the standard dull prayers for rain being offered by the governor.

I would have sent this as a comment but I can’t see any image getting past any working comments filter. I’ve cut out a bit of text and the post and left it at actual size so it stays legible.