Bloggery

Come back flatterspam, all is forgiven.

For the past few days, the blog has been getting gibberish comment-spam in oddly large numbers, almost at DDoS-attack levels. (OK, I exaggerate, but there were over 380 yesterday and 51 today.) Some of these comment spams are particularly weird, in that even the URLs are gibberish.

It’s not as if the random word generators have generated text in any known human language that could trick the unwary into clicking on a link to onlinefakemeds.com or whatever. The URLs themselves are also random letter collections, with names like Mr._Mxyzptlk but less meaningful.

Charitably assuming that spammers have not completely taken leave of their senses, I guess that these suidfiojdfolsrkl.com-style links go to redirects and do eventually take the unwary URL-clicker somewhere. (Obviously, I’m not going to try them out. I’m enough of a sucker for any worm or trojan anyway.)

But still, what is the point? It seems even less likely that people would click on a gibberish link in a mound of gibberish than that they would believe that a complete stranger in Africa would pay them ten percent for the assistance in transferring 64 million dollars.

A few more blog-related odds and ends, now I’m on the subject:

Apologies to anyone who expects to get email alerts about new posts here. This plug-in has just stopped working. We don’t know what happened so we have less than no idea how to change it.

The Atheist blogroll got broken so long ago, it’s almost a distant memory. Again apologies. We threw it away a few months after it got stuck permanently showing last August’s posts of about ten blogroll members. (Or something like that.)

Other things just randomly break anyway. For instance, there was a link to the Convention on Modern Liberty that only lasted a week or so.

Plus, this blog can load so slowly (even on my allegedly very fast connection) that it’s hard to see why anyone bothers waiting for it.

Except for all those visitors who are looking for Schwarzenegger, 5 fruit and veg, funny magic the gathering cards, Bodium castle, fairytale castles, fine art or morris dancing. These are the top search terms that consistently bring people here from Google. Every day.

Now, I am all for giving the public what they want, but there’s only so much that I have to say on any of these topics. So, most of these visitors must leave a little disappointed, to put it mildly.

This blog needs a serious “REDO FROM START.” It should happen soon…..

Cuil runnings

Cuil, Cuil ffs? Repress a shudder at the name. It’s a (relatively) new search engine. It’s good, although it’s had a bit of a critical drubbing. It’s much prettier than Google. Its results make a lot of sense. It’s not stuffed with sponsored links or spam links or dominated by top-ten-authority corporate results. So I think I like it, although I’ve only used it on a test basis.

I also really like Ubuntu. Of course, any Linux version is admirable, and Ubuntu is more admirable than most.

I am just going to have a pointless rant about the branding – calling things ethnic-sounding names to make perfectly good and worthy things sound just that bit more credible.

The Wikipedia entry doesn’t do much to dispel any impulse to sneer at the Cuil name:

The Irish ancestry of Anna Patterson’s husband Tom Costello sparked the name Cuil, which the company states is taken from a series of Celtic folklore stories involving a character called Finn McCuill. The company says that Cuil is Irish for knowledge and hazel.

That’s “Irish ancestry” in the sense of “American Irish”, then? (One Irish great-great grandparent and an Irish surname qualify any American as Irish. Although I remain to be convinced that Costello really counts, here….)

Wikipedia does some serious undercutting of the legitimacy of the Irish ethnic explanation for the brandname, from a standpoint of linguistics. Which feeds my instinctive prejudice against the word, the spelling and its supposed “cool” pronunciation.

I used to get riled every time I saw claims that Ubuntu was the “African word for” something, as if Africa didn’t have more languages than any other continent in the world.

Ubuntu is an African word meaning ‘Humanity to others’, or ‘I am what I am because of who we all are’. The Ubuntu distribution brings the spirit of Ubuntu to the software world. (from Ubuntu.com)

I have to turn my pedantry against myself. That said “An African word for” not “The African word for”. Maybe I have misjudged Ubuntu. I do a Cuil search for “ubuntu is african for.” The first page is a whole string of official Ubuntu links, none of which say it is the African word for anything. In fact, many of the definitions that turn up are reasonably precise: a Zulu word and a South African philosophy.

My bad. I must have imagined the “African word for” phrase, misremembering the blurb from the old distro I have somewhere.

But Google and Cuil do both unveil an apparent subgenre of geek humour based on the misremembered “Ubuntu is African for”:

Ubuntu is African for ‘Can’t configure Debian’. (typical link: Ubuntu forum post)

Indeed. ubuntu is african for ” I CANT CONFIGURE SLACKWARE”
(typical link: Another forum)

ubuntu is African for “time sucker”, right? (link: I-phone blog forum)

Ubuntu is African for “struggles to install mouses”. (from information rain)

Most off-the-wall is
Ubuntu is African for sharks with freaking laser beams on its head. (from animetro)

Am I beginning to see a pattern, here? I’ll have to try it.

Cuil is Irish for “excuse to use a disgustingly lame pun in a blog title”

(Sorry.)

Bodiam Castle? Google Is Your Friend…

I have been looking through the website logs to see just what it is that drives people to this site and, while lacking in raw comedy value (unlike some), it has been interesting.

Running a combination of Firestats, Feedburner and Google Analytics, it seems this blog is getting around 400 visits a day. Of these, around 80% are new (which shows just what a non-loyal readership we hold…) and of those around 70% come here from a search engine, nearly all from Google. For the numbers-fans, this translates to about 200 hits a day from Google searches. Given the insanely varied nature of topics here, you would be excused for thinking this was reflected in the search stats. Not so.
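For anyone who wants to check the sums, here is a back-of-the-envelope sketch in Python. The variable names are mine, and the inputs are just the rounded figures quoted above, so the output is approximate rather than anything read from the logs.

```python
# Rounded figures quoted in the post (approximations, not exact log data).
visits_per_day = 400
new_visitor_share = 0.80     # "around 80% are new"
search_share_of_new = 0.70   # "of those around 70% come here from a search engine"

new_visits = visits_per_day * new_visitor_share
search_visits = new_visits * search_share_of_new

print(round(new_visits))     # new visitors per day
print(round(search_visits))  # new visitors per day arriving via a search engine
```

The second figure comes out a little over 200, which squares with the "about 200 hits a day" once you allow for the few searches that don't come from Google.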

Of the top ten search terms used to come here, seven are image searches, and this accounts for about 90 of the incoming hits. Even stranger, of these over a third are all searching for images of Bodiam Castle.

Now, Bodiam Castle is a gorgeous, fourteenth-century fairytale castle in East Sussex, run by the National Trust, so I can understand why people are interested in it. In fact, I understand this well enough to have uploaded another photo!

Bodiam Castle

If you have come here searching for Bodiam Castle, I hope you like this, and you can even see more on Flickr. It has been a long time since I have been to Bodiam so please, forgive me for the photos being out of date now. If you have links to other pictures of this gorgeous castle, please let me know and I will be more than happy to link to them from here.

Back onto the search topic, there is the determination issue to consider now. Will my posting of a new Bodiam article increase the number of hits I get for this? Are people massively disappointed when the Mighty Google sends them here rather than elsewhere? Why don’t people use Yahoo to search for Bodiam?

The other common terms people use for an “images search” are:

  • Schwarzenegger
  • Nice Art
  • Fine Houses
  • Holy Wafer
  • Jesus Toast (around 5 people a day come here using that search term… MADNESS)
  • Future Castles

Now, some make more sense than others, but I can only guess at the disappointment people must feel when their searches lead them here.

For completeness, the most common search terms that bring people to this site are:

  • HDR How To (use Photomatix)
  • Cool Viking Names (well all of them)
  • Bad Journalist (again, all of them)
  • Firefox Memory Hog (it is)
  • Pipex Download Speeds (almost non-existent)
  • McCanns Blog (wrong place, I didn’t even know they had one)

One last point, a bit of an oddity is a search term Feedburner has identified leading some poor unfortunate here: “blog: I cannot read, feel distracted” – I have no idea what this blog has to offer this poor person.

Wikia search project

Internet search engines tend to be perfect examples of the proverb “To him that hath shall be given.” (I guess this is a Biblical quote. The “hath” suggests it anyway.)

Get a top ranking on Google and you can guarantee your site will get loads of hits. Which will up your ranking. Which will get you more hits. And so ad infinitum.

Which must be great if you are the website equivalent of Coca Cola. But is a bit of an obstacle when you are Joe Nobody’s Homemade Dandelion and Burdock Drink.

So it’s good that an open source Wikia Search project is slowly being brought into existence. The idea is that an open source search algorithm will inspire more confidence in the results. At the least, it will let website owners know what the goalposts are.

New Scientist of 12th June 2007 (Yes, I know, it obviously takes me a while to process information) described the Wikia search project as the project of a “rebellious group of software engineers” determined to topple Google.

Apparently, one of the biggest problems is the shortage of mountains of cash to set up global data centres to match those of Google and Microsoft. According to New Scientist, one possible solution is to use a grid computing model, along the lines of SETI@home, with the search processing distributed around the world on volunteers’ PCs.

Most of the stuff on the Wikia site at the moment is concerned with the project itself. There is an about page. It looks as if development has stalled a bit since the initial push in 2004, though. (Which suggests that New Scientist is even slower than me at processing information.)

Here’s an extract from Wikia Search on some of the ranking problems they intend to address:

Several other strategies to cheat or game the search engines are based on the fact that many search engines consider a hyperlink to a site to be a ‘vote’ for that site or measure of popularity. The use of hyperlinks as an indicator of website ‘quality’ led to link exchanges, link farms, bulletin board spam and other strategies to boost sites. Search engines responded by attempting to algorithmically evaluate the quality of each page, and discount links on sites or pages of little real value. While these algorithms to assess quality have neutralized millions of web pages, they have not (and cannot?) objectively determine the value and context of all the links on the web. The number of links to a page remains one of the biggest factors in how a page ranks in conventional search engines, and remains a prime area of interest for black-hat and grey-hat SEO.

Anything that can cut down the number of pointless spam sites that can clutter up the first few dozen pages of search results from standard search engines will be a big step forward.
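To make the "link as vote" idea from that extract concrete, here is a toy sketch in Python. It is emphatically not how Google or Wikia Search ranks anything: it just counts inbound links, and the page names and the `inbound_link_counts` helper are invented for illustration. It does show why a link farm is such a cheap way to game a naive count.

```python
from collections import Counter

def inbound_link_counts(links):
    """Count inbound links per page, treating each distinct link as one 'vote'."""
    votes = Counter()
    for source, targets in links.items():
        for target in set(targets):   # a page votes at most once per target
            if target != source:      # ignore self-links
                votes[target] += 1
    return votes

# Hypothetical toy web: every page name here is made up.
web = {
    "blog-a": ["homemade-drinks"],
    "blog-b": ["homemade-drinks", "big-cola"],
    "news": ["big-cola"],
}
honest = inbound_link_counts(web)

# A link farm: ten throwaway pages that all link to one spam site.
farm = {f"farm-{i}": ["spam-site"] for i in range(10)}
gamed = inbound_link_counts({**web, **farm})

print(honest.most_common())
print(gamed.most_common(1))
```

With ten junk pages the spam site outvotes everything in the toy web, which is roughly the problem the extract describes: the count is easy to inflate and says nothing about whether any of the links are worth having.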

I hope they solve the problems and this idea takes off. I’d volunteer my puny computing power and some of my bandwidth. Persuading ISPs not to do the choking-at-peak-times thing that they have started sneaking in through “Fair use” policies might be an obstacle though.

ShopWiki, DoubleClick

I was reading a post on Matt Mullenweg’s blog (PhotoMatt), titled “DoubleClick and Kevin Ryan”, which talks about Google having bought DoubleClick and about Kevin Ryan (co-founder) having moved on to a new start-up called ShopWiki. (I am not going to link to them though.)

Basically ShopWiki sends bots out to trawl the web and find products at the best price for you. You may think this is a wonderful thing, and it may well be. I am somewhat intrigued though as to why this site (notable for its abject lack of sellable items) has been getting hammered by the ShopWiki bot for most of the last two days (until it got the .htaccess treatment). As far as I can see, the bot ignored the robots.txt entry I put in for it (although my track record with this file is poor).
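Given that poor track record with robots.txt, one way to check what an entry actually says to a crawler is to run it through Python's standard `urllib.robotparser`. This is only a sketch: the rules below are an invented example, and "ExampleShopBot" is a placeholder user-agent, not the real name the ShopWiki bot announces.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; "ExampleShopBot" is a placeholder
# user-agent, not the real name of any crawler.
ROBOTS_TXT = """\
User-agent: ExampleShopBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The named bot is shut out of everything.
print(parser.can_fetch("ExampleShopBot", "/contact/"))
# Other bots fall through to the "*" rules.
print(parser.can_fetch("SomeOtherBot", "/contact/"))
print(parser.can_fetch("SomeOtherBot", "/private/page"))
```

Even a correct entry is only a polite request. A bot that chooses to ignore robots.txt has to be blocked at the server, which is what the .htaccess treatment does.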

I think the idea behind ShopWiki seems sound and I am sure it is a wonderful new idea. But I have to question the validity of the data it has collected, given the time and effort it spent looking round the contact pages here. In a spambot-like fashion, the ShopWiki bot seems to have concentrated on pages which made reference to emails and the like.

Time may modify my point of view, but for now I think of this as a Bad Shop.

Tagging the untagged

This blog has been going through some traumatic changes to its functionality.

It doesn’t look much different because most of the changes to its appearance were repellent in IE6 and earlier browsers, although they looked great in IE7, so it’s temporarily reverted to a look which it’s had for .. oh, I don’t know… all of about 6 weeks.

The main difference for visitors is that you can find much more by tags, as if the blog was trying to be a mini-Technorati. You can open the Tag Archive page and search on several tags. (These are even presented in a tag cloud.)

The big difference for us is that we can tag things by just clicking on them. Adding tags used to be like pulling teeth. It probably contributed to my blogs being unfeasibly long because I couldn’t bear to have to go through the tagging process again (like a graffiti artist with a sore arm?) So the outcome should be less blog words, more tag words. Or at least, more tag words.

However, we don’t have full tagging liftoff yet. The older posts either don’t have any tags or only have WordPress category tags. By older, I mean “up to January 2007”. So that’s nearly all of them. As the posts here go back over a year, it’s an arduous task to add tags and it’s getting done piecemeal. All the same, it should be possible to find most of what we have for most of the topics.

And by the way, why do people keep typing “none” into the search bit in the header? This is just bizarre. It’s not when people click on the search box without putting anything in, because that brings up a blank page.

Web traffic analysis=nonsense

What is it with search engines and web-traffic rankers?

This blog has done enough whining about Technorati’s randomness. It’s well overdue to say that it’s probably working far more consistently and reliably than most of the facilities that claim to find Internet resources. (On a note that shows how shamelessly susceptible to flattery we are at whydontyou.org.uk – others please take note – it puts this blog at under 60,000 in the blogosphere which is almost beyond its wildest dreams.)

As an experiment, look up your blog in a few search engines. See if you can find any points in common between them.

Here’s one of my favourites, in that I suspect they must actually use a randomiser to generate web traffic numbers and links. Pick a blog and look at it in Technorati’s blog directory.

Go to the traffic rank bit and click on it. You will find yourself in the realm of Alexa. This will probably show you that the traffic isn’t really counted because the blog isn’t in the top 100,000. The daily page views are shown as a percent of people using the whole Internet, i.e., if the site isn’t in the top 100,000 sites in the world, you won’t get any figures. (If you come in at a newbie 5,195,452 – as this blog does – you may wonder if you are even reading the blog yourself.)

100,000 sounds like a lot of sites. However, if you consider global players (like Google or Microsoft), then big online retailers (like Tescos and Dell), then news sites (CNN, BBC) and national government information sites, you can see it must be pretty difficult to get into the club.

Beneath this blank chart, you will see “Percent of Internet users who visit this site” with a fraction of a percent if it’s anything like this one. (Maybe you’re Microsoft, in which case I guess it will be higher. Will check shortly.)
Then “average number of pages visited” and “3 months average traffic rank” (risibly low) and average page views per visitor (1). (Do you suspect that’s hard-coded? 🙂)

But the next bit is what creases me up for its randomness. People who visit this site come from (in order of most visits):

United States 40.0% (fair enough, the blog’s in English. Most English-speaking Internet users are in the USA)
France 20.0%
India 20.0%
Costa Rica 10.0%
United Kingdom 10.0%

Whydontyou.org.uk traffic rank in other countries: (These seem to be the same countries to me)
Costa Rica 46,349
India 167,900
France 170,280
United States 658,841
United Kingdom 703,872

Come on…. To what do we owe this unprecedented popularity in Costa Rica? India? France? This is a UK-based blog. Most of the stuff we witter on about, apart from atheism and technology, relates to the UK.

It’s not that I don’t want to believe it. A Central American flavour to its posts would make this blog much more interesting. I just think the figures have been made up.

OK, let’s look at the sites that link here, according to Alexa. These are so out of date, that it’s obviously not been updated since the blog was a couple of months old. In fact, until I submitted a more recent image, Alexa had a screen shot of the blog that was well over a year old. (Yes, I know, that’s like saying “We don’t get enough spam here, please deluge us with as much as you can possibly manage”.) Maybe because of their age, the sites listed in some of these links are unrecognisable. In fact none of the blog links would be counted by Technorati, being over a year old, but then, it shows no links that Technorati counts (under 90 days.)

Let’s search for this blog on Google. Here, it’s weirder. There are few points of comparison between different Google results, if you repeat the search over a day or so. Maybe it’s just how Google treats blogs, but the post that comes up first is always the same one from a few months ago. Other posts can only be seen by asking for the similar results that were excluded the first time for being the same. Well, guess what, Google, every post is different. It’s a blog. Lots of the other Google results for the blog are bits of the RSS feed. I’d like to think that lots of people are devouring the RSS feed but, unfortunately, these tend to be link farms. In fact, lots of obscure references to the blog on link-farm sites turn up on Google, most being complete news to us. Real human-created references to the blog don’t turn up as often as they actually happen.

I could go on to the point where I was boring even myself.

None of this would matter if getting seen and indexed correctly wasn’t crucial to getting any visitors. I know that indexing engines and search engines are bombarded with spammers trying every trick there is to get high on the first results page. The search engines have algorithms that are supposed to penalise sites and blogs that don’t match their definition of legitimate: density of keywords, number of inbound links, and so on. I believe that not only are these not working, they are often acting in exact reverse to their intentions.

Content from blogs gets scraped and put into blag sites that exist just to spew out other people’s content. Google then decides the original source site has “duplicate” content and downranks it. How do you stop this without stopping legitimate blogs from commenting on your posts?
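I have no idea how Google actually decides that content is “duplicate”, but a common textbook way of spotting near-copies is to compare overlapping runs of words (“shingles”) using Jaccard similarity. The sketch below assumes that approach purely for illustration; the function names and sample sentences are made up.

```python
def shingles(text, size=3):
    """Return the set of overlapping word n-grams ('shingles') in text."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: shared items divided by all items."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Made-up sample texts: an original post, a scraped copy with junk appended,
# and an unrelated post.
original = "ubuntu is a zulu word and a south african philosophy"
scraped = "ubuntu is a zulu word and a south african philosophy click here"
unrelated = "bodiam castle is a fourteenth century castle in east sussex"

print(jaccard(shingles(original), shingles(scraped)))    # high: near-duplicate
print(jaccard(shingles(original), shingles(unrelated)))  # low: different text
```

A score like this only says that two pages look alike. It says nothing about which one was written first, which is exactly how the original can end up being treated as the copy.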

Keywords in the metatags don’t match the keywords in the text? Well, duh, normal human beings aren’t thinking only of page rank. So they put keywords in their metatags then write content, without remembering to keep changing the metatags. Only people obsessed with search engine rankings do that and, of course, a fair percentage of them aren’t just bloggers or normal website owners.

It’s not just a question of getting visitors. Anyone who wants to bring in revenue from their site or blog by displaying adverts gets judged by these bizarre standards. Some schemes base what they send you on your Alexa rating, others on Google’s well-nigh arbitrary page rank. If you’ve ever tried to have GoogleAds on a site, you’ll have seen how abstract the GoogleAds process is. In fact, visitors who think they’re helping you pay for the site, and so click a few times on your ads every time they visit, will get you disqualified. Ditto, your rivals……. (It seems as if you get automatically disqualified anyway, at the very point that you might actually receive any revenue.)

I know it must be well nigh impossible to filter the enormous volume of material in the Internet, especially in the face of the number of spammers there are. However, there must be better ways of doing it. I am always amazed when people find things here and comment or email us about them. How do they manage to find it?

So here is an unaccustomed prop for Technorati (unaccustomed for this blog, anyway, which has done its fair share of ranting about it). For all the irritating Technorati monster error messages and totally inconsistent service, Technorati remains the best performing indexing service that I’ve come across yet. The tags are really helpful when they work. You can still find an interesting read on someone’s first post. And Technorati isn’t yet totally under the sway of the giant players. The fabled Web 2.0 stuff really does still have something going for it.