Wikia search project

Internet search engines tend to be perfect examples of the proverb “To him that hath shall be given.” (I guess this is a Biblical quote. The “hath” suggests it anyway.)

Get a top ranking on Google and you can guarantee your site will get loads of hits. Which will up your ranking. Which will get you more hits. And so ad infinitum.

Which must be great if you are the website equivalent of Coca Cola. But is a bit of an obstacle when you are Joe Nobody’s Homemade Dandelion and Burdock Drink.

So it’s good that an open source Wikia Search project is slowly being brought into existence. The idea is that an open source search algorithm will inspire more confidence in the results. At the least, it will let website owners know what the goalposts are.

New Scientist of 12th June 2007 (Yes, I know, it obviously takes me a while to process information) described the Wikia search project as the project of a “rebellious group of software engineers” determined to topple Google.

Apparently, one of the biggest problems is the shortage of mountains of cash to set up global data centres to match those of Google and Microsoft. According to New Scientist, one possible solution is to use a grid computing model, along the lines of SETI, with the search processing distributed around the world on volunteers’ PCs.

Most of the stuff on the Wikia site at the moment is concerned with the project itself. There is an about page. It looks as if development has stalled a bit since the initial push in 2004, though. (Which suggests that New Scientist is even slower than me at processing information.)

Here’s an extract from Wikia Search on some of the ranking problems they intend to address:

Several other strategies to cheat or game the search engines are based on the fact that many search engines consider a hyperlink to a site to be a ‘vote’ for that site or measure of popularity. The use of hyperlinks as an indicator of website ‘quality’ led to link exchanges, link farms, bulletin board spam and other strategies to boost sites. Search engines responded by attempting to algorithmically evaluate the quality of each page, and discount links on sites or pages of little real value. While these algorithms to assess quality have neutralized millions of web pages, they have not (and cannot?) objectively determine the value and context of all the links on the web. The number of links to a page remains one of the biggest factors in how a page ranks in conventional search engines, and remains a prime area of interest for black-hat and grey-hat SEO.
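The “links as votes” idea in that extract can be sketched in a few lines. This is only a toy, PageRank-style calculation, not the algorithm of Google, Wikia or anyone else; the page names, the damping value of 0.85 and the 20-page link farm are all made up for illustration.

```python
# Toy link-vote ranking (PageRank-style power iteration).
# The graph, page names and damping value are illustrative assumptions.

def rank_pages(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Each outgoing link passes on an equal share of the page's score.
                share = damping * score[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page with no outgoing links spreads its vote evenly.
                share = damping * score[page] / n
                for p in pages:
                    new[p] += share
        score = new
    return score

honest_web = {
    "coca-cola": ["news"],
    "news": ["coca-cola"],
    "blog": ["coca-cola", "news"],
    "dandelion-and-burdock": [],
}

# The same web after a link farm of 20 throwaway pages "votes" for one site.
farmed_web = dict(honest_web)
for i in range(20):
    farmed_web[f"farm-{i}"] = ["dandelion-and-burdock"]

before = rank_pages(honest_web)
after = rank_pages(farmed_web)
print("top page before the farm:", max(before, key=before.get))
print("top page after the farm: ", max(after, key=after.get))
```

In this made-up web, the unlinked drinks site sits at the bottom until the farm pages start pointing at it. That is the kind of gaming the extract describes, and it is why plain link counting needs some way of discounting worthless links.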

Anything that can cut down the number of pointless spam sites that can clutter up the first few dozen pages of search results from standard search engines will be a big step forward.

I hope they solve the problems and this idea takes off. I’d volunteer my puny computing power and some of my bandwidth. Persuading ISPs not to do the choking-at-peak-times thing that they have started sneaking in through “Fair use” policies might be an obstacle though.

Good science

Today’s excellent Bad Science discusses some reasons why drug trials remain completely in the hands of Big Pharma. Ben Goldacre mentions the Cochrane Collaboration as an alternative.

If we ever had a scientist in charge of health, instead of tinkering with payments to big pharma, they would do one simple thing: move hell and high water to collect and collate the best and cheapest evidence on healthcare. First you would give huge amounts of money to the Cochrane Collaboration, which collects and collates data independently on all healthcare interventions (and is quietly one of the most subversive organisations ever to be created, because it blows the lid on false commercial claims

Wow. This is an amazing resource. They provide free “Plain Language” abstracts and summaries of the research. I tried a few. Go to the reviews page and choose a topic from the drop down list.

I was a bit constrained by having to pick things that I could understand the arguments about. Depression, anxiety and neurosis seemed a fair start, so I chose that. Nice comprehensible summary.

Active placebos versus antidepressants for depression
Tricyclic antidepressants are only slightly better than active placebos.
This review examined trials which compared antidepressants with ‘active’ placebos, that is placebos containing active substances which mimic side effects of antidepressants. Small differences were found in favour of antidepressants in terms of improvements in mood. This suggests that the effects of antidepressants may generally be overestimated and their placebo effects may be underestimated.

Drugs and alcohol?

Alcoholics Anonymous and other 12-step programmes for alcohol dependence
…The available experimental studies did not demonstrate the effectiveness of AA or other 12-step approaches in reducing alcohol use and achieving abstinence compared with other treatments, but there were some limitations with these studies.

Alternative AIDS treatments?

Herbal medicines for treating HIV infection and AIDS?
There is no compelling evidence to support the use of the herbal medicines identified in this review for treatment of HIV infection and AIDS.

(Although another link says exercise can be beneficial)

What an amazing resource. A busy healthcare worker can access the results of research from around the world in a few mouse clicks. A nosy member of the public, e.g. me, can find a better source of medical information than the daily scare stories/new miracle cures stories. It’s one of those things that makes you realise what the Internet can achieve.

There’s a word for this

Are you female? Do you spend your day talking about “accessorising”, “Pilates”, “size zero”, “superfoods”, “cellulite” and “kitten heels”? I thought not.

On the BBC website, there is a piece about “research” showing that women do not talk more than men but apparently have a larger vocabulary. This is “research” in the sense that it isn’t actually attributed but its results are amazingly specific.

Researchers in the US have laid waste to the long-held belief that women talk more than men. But the survey did find that female subjects get through an average of 16,215 words a day, compared with their male counterparts’ 15,669, a difference of 546

As a “lighthearted” talking point, the BBC lists some candidates for what these words might be (supposedly after consulting some women) and asks for other suggestions. There are a few phrases in the list that might be considered to be genuine specifically female concerns but most of the suggestions are what you’d get if you switched on a stereotyping machine and programmed it to reproduce the thoughts of a latter-day Bernard Manning.

I shouldn’t think I have to spell it out but I am going to anyway.

The general assumption behind most of this list is that women are complete airheads, mental sponges for Heat magazine. If the BBC had produced a similar list based on “jokey” “racial” characteristics, you would expect that the website would have been (rightly) shut down by now.

But, hey, it’s all light-hearted fun. It couldn’t possibly be part of the cultural construction of masculinity and femininity, could it?

Thanks and sorry

Infinite thanks to the people who’ve taken the trouble to comment on the downward spiral that is this blog’s theme. It has been a great help.

And an apology for the fact that the redesign is interfering with there actually being any readable content.

On a “while the cat’s away” basis I’ve done some theme hacks that will bring down the wrath of TW for being deprecated and/or non-compliant (negative margins, for instance. I haven’t cracked and used tables yet.)

I think that it works in IE6 at 100% and doesn’t degrade too badly when you shrink the window.

Haven’t even broached FF yet so it may get rebuilt in the next few minutes and I am pretty fearful of what will happen on mobiles.

Technical Problems

It seems Heather’s problems with the crappy service from Virgin Media are not the only issue hitting this blog at the moment. Looking at the stats, it seems an awful lot of people are reading this blog via its feed rather than actually visiting the URL.

There seem to be two main options for people getting the feed. It either comes straight from the blog’s feed (here) or from the feedburner feed. Recently, although the blog’s feed has been working fine and includes all the latest posts, the feedburner one has been showing problems.

When I tried to investigate this, it seems that trying to connect to the blog feed from feedburner returns a server timeout error. This is not apparent when you try to access the feed directly, but it also appears when you try to use feedvalidator to check the feed is valid. I find this really, really strange. Does anyone have any ideas?
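One guess, and it is only a guess: the server may respond differently (or just more slowly) to automated fetchers than it does to a browser. A rough way to test that is to request the same feed with different User-Agent strings and compare the timings. The sketch below uses only the Python standard library. The URL and the agent names in the example call are placeholders, not the real feed address or the real strings that feedburner or feedvalidator send.

```python
import time
import urllib.request
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """True if the text (str or bytes) parses as XML; far weaker than full feed validation."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

def time_fetch(url, user_agent, timeout=10):
    """Fetch url with the given User-Agent; return (seconds, body or None, error or None)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    started = time.monotonic()
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read()
        return time.monotonic() - started, body, None
    except OSError as error:  # URLError and socket timeouts are OSError subclasses
        return time.monotonic() - started, None, error

def compare_agents(url, user_agents, timeout=10):
    """Print how the same URL behaves for each User-Agent string."""
    for agent in user_agents:
        seconds, body, error = time_fetch(url, agent, timeout)
        if error is not None:
            print(f"{agent!r}: failed after {seconds:.1f}s ({error})")
        else:
            print(f"{agent!r}: {seconds:.1f}s, well-formed XML: {is_well_formed(body)}")

# Example call (placeholder URL and made-up agent names, not the real ones):
# compare_agents("https://example.org/feed/", ["Mozilla/5.0", "ExampleFeedFetcher/0.1"])
```

If one agent times out while the other gets the feed back quickly, the problem is more likely to be on the server or host side (a plugin or a filter) than in the feed itself.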

[tags]Technology, Feed, RSS, WhyDontYou, Why Dont You, Website, Site Admin, XML, Feedburner, Feedvalidator, Rants[/tags]

Site Admin

As mentioned previously, we have added some new plugins here and it seems they have broken the theme quite badly. This leaves us with two choices, and as the theme itself has become a bit annoying (second thoughts over the sponsored links, for example), it seems it is time to try out a new theme. The last time we tried this, there were problems in the old versions of IE so I would appreciate it if people can let us know what you think of the layout of the site now.

In addition a “struggle” is ongoing (offline) against the CSS files but when it is won (and by Toutatis it will be WON), we will be moving (again) to a bespoke theme. Thank you for your patience.

Themes and Upgrades

Well, it seems my hopes that the last theme I tried out would be the “be all and end all” theme for the blog were dashed against the rocks of reality.

It seems that something on the theme “Cleaker 2.1” was quite badly broken when viewed in IE6. This is a big shame because I really did like that theme. However, more than 35% of the hits this site gets are from IE6 (with almost another 5% coming from IE versions older than 6), so this is not a problem we can ignore.

Screenshot Showing the Site Theme

There is now a new theme (minor additional changes may have taken place) and the image you see here shows how it is expected to look. If you are seeing something radically different from this can you please let us know?

Although I am not as enamoured with this theme as I was with the previous one, it appears to work even in old versions of IE so it may be kept for a while.

This leads me on to another, important (to me) issue. If you are using IE 6 or older – UPGRADE! Please, for the love of Tim Berners-Lee get a more modern browser. I am loath to say IE7, but it is better than IE6. For the 0.4% of you who insist on using IE 4 or older, you really are missing out on a lot of what the internet has to offer. I mean, people talk about Web 2.0 and there are still around 5 people a day who come here using Web 0.1beta browsers…

Download Firefox, Opera, Mozilla, SeaMonkey or even (gasp) IE7. They are all free!

Well, at least I have got that off my chest.

(p.s. before any Apple / Linux / BSD etc people pipe up – Windows accounts for over 75% of the traffic to this site)

Tagging the untagged

This blog has been going through some traumatic changes to its functionality.

It doesn’t look much different because most of the changes to its appearance were repellent in IE6 and earlier browsers, although they looked great in IE7, so it’s temporarily reverted to a look which it’s had for .. oh, I don’t know… all of about 6 weeks.

The main difference for visitors is that you can find much more by tags, as if the blog was trying to be a mini-Technorati. You can open the Tag Archive page and search on several tags. (These are even presented in a tag cloud.)

The big difference for us is that we can tag things by just clicking on them. Adding tags used to be like pulling teeth. It probably contributed to my blogs being unfeasibly long because I couldn’t bear to have to go through the tagging process again (like a graffiti artist with a sore arm?) So the outcome should be less blog words, more tag words. Or at least, more tag words.

However, we don’t have full tagging liftoff yet. The older posts either don’t have any tags or only have WordPress category tags. By older, I mean “up to January 2007”. So that’s nearly all of them. As the posts here go back over a year, it’s an arduous task to add tags and it’s getting done piecemeal. All the same, it should be possible to find most of what we have for most of the topics.

And by the way, why do people keep typing “none” into the search bit in the header? This is just bizarre. It’s not when people click on the search box without putting anything in, because that brings up a blank page.

Web traffic analysis=nonsense

What is it with search engines and web-traffic rankers?

This blog has done enough whining about Technorati’s randomness. It’s well overdue to say that it’s probably working far more consistently and reliably than most of the facilities that claim to find Internet resources. (On a note that shows how shamelessly susceptible to flattery we are at whydontyou.org.uk – others please take note – it puts this blog at under 60,000 in the blogosphere which is almost beyond its wildest dreams.)

As an experiment, look up your blog in a few search engines. See if you can find any points in common between them.

Here’s one of my favourites, in that I suspect they must actually use a randomiser to generate web traffic numbers and links. Pick a blog and look at it in Technorati’s blog directory.

Go to the traffic rank bit and click on it. You will find yourself in the realm of Alexa. This will probably show you that the traffic isn’t really counted because the blog isn’t in the top 100,000. The daily page views are shown as a percent of people using the whole Internet, i.e., if the site isn’t in the top 100,000 sites in the world, you won’t get any figures. (If you come in at a newbie 5,195,452 – as this blog does – you may wonder if you are even reading the blog yourself)

100,000 sounds like a lot of sites. However, if you consider global players (like Google or Microsoft), then big online retailers (like Tescos and Dell), then news sites (CNN, BBC) and national government information sites, you can see it must be pretty difficult to get into the club.

Beneath this blank chart, you will see “Percent of Internet users who visit this site” with a fraction of a percent if it’s anything like this one. (Maybe you’re Microsoft, in which case I guess it will be higher. Will check shortly.)
Then “average number of pages visited” and “3 months average traffic rank” (risibly low) and average page views per visitor (1) (1 🙂 Do you suspect that’s hard-coded?)

But the next bit is what creases me up for its randomness. People who visit this site come from (in order of most visits):

United States 40.0% (fair enough, the blog’s in English. Most English-speaking Internet users are in the USA)
France 20.0%
India 20.0%
Costa Rica 10.0%
United Kingdom 10.0%

Whydontyou.org.uk traffic rank in other countries: (These seem to be the same countries to me)
Costa Rica 46,349
India 167,900
France 170,280
United States 658,841
United Kingdom 703,872

Come on…. To what do we owe this unprecedented popularity in Costa Rica? India? France? This is a UK-based blog. Most of the stuff we witter on about, apart from atheism and technology, relates to the UK.

It’s not that I don’t want to believe it. A central American flavour to its posts would make this blog much more interesting. I just think the figures have been made up.
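There is another possible reading of those percentages, offered as a guess rather than anything Alexa documents: every figure is a multiple of 10%, which is what you would see if the breakdown came from a very small sample. A quick sketch that finds the smallest number of visitors that could produce the quoted shares exactly (the function name is my own):

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def smallest_sample(percentages):
    """Smallest visitor count for which every percentage is a whole number of visitors."""
    shares = [Fraction(str(p)) / 100 for p in percentages]
    # Each share must equal k/n for a whole number k, so n must be a
    # multiple of every denominator: take their least common multiple.
    return reduce(lambda a, b: a * b // gcd(a, b), (s.denominator for s in shares), 1)

alexa_shares = [40.0, 20.0, 20.0, 10.0, 10.0]  # the country figures quoted above
print(smallest_sample(alexa_shares))  # → 10
```

So ten visitors would be enough: four from the United States, two each from France and India, one each from Costa Rica and the United Kingdom. That is only a lower bound, and it assumes the percentages are exact rather than rounded, but it would make one Costa Rican visitor look like 10% of the readership.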

OK, let’s look at the sites that link here, according to Alexa. These are so out of date that the list has obviously not been updated since the blog was a couple of months old. In fact, until I submitted a more recent image, Alexa had a screen shot of the blog that was well over a year old. (Yes, I know, that’s like saying “We don’t get enough spam here, please deluge us with as much as you can possibly manage”.) Maybe because of their age, the sites listed in some of these links are unrecognisable. In fact none of the blog links would be counted by Technorati, being over a year old, but then, it shows no links that Technorati counts (under 90 days.)

Let’s search for this blog on Google. Here, it’s weirder. There are few points of comparison between different Google results, if you repeat the search over a day or so. Maybe it’s just how Google treats blogs, but the post that comes up first is always the same one from a few months ago. Other posts can only be seen by asking for similar results, excluded the first time for being the same. Well, guess what Google, every post is different. It’s a blog. Lots of the other Google results for the blog are bits of the RSS feed. I’d like to think that lots of people are devouring the RSS feed, but, unfortunately, these tend to be link farms. In fact, lots of obscure references to the blog on linkfarm sites turn up on Google, most being complete news to us. Real human-created references to the blog don’t turn up as often as they actually happen.

I could go on to the point where I was boring even myself.

None of this would matter if getting seen and indexed correctly wasn’t crucial to getting any visitors. I know that indexing engines and search engines are bombarded with spammers trying every trick there is to get high on the first results page. The search engines have algorithms that are supposed to penalise sites and blogs that don’t match their definition of legitimate – density of keywords, number of inbound links, and so on. I believe that not only are these not working, they are often acting in exact reverse to their intentions.

Content from blogs gets scraped and put into blag sites that exist just to spew out other people’s content. Google then decides the original source site has “duplicate” content and downranks it. How do you stop this without stopping legitimate blogs from commenting on your posts?

Keywords in the metatags don’t match the keywords in the text? Well, duh, normal human beings aren’t thinking only of page rank. So they put keywords in their metatags then write content, without remembering to keep changing the metatags. Only people obsessed with search engine rankings do that and, of course, a fair percentage of them aren’t just bloggers or normal website owners.
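To show why this is awkward, here is a rough sketch of one textbook way of spotting near-duplicate text: compare the sets of overlapping word triples (“shingles”) in two pages. This is an assumption about the general technique, not a description of what Google actually does, and the sample sentences are invented.

```python
def shingles(text, size=3):
    """Set of overlapping word n-grams ("shingles") in the text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def similarity(a, b, size=3):
    """Jaccard similarity of the two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

original = "Anything that can cut down the number of pointless spam sites will be a big step forward"
scraped = "Anything that can cut down the number of pointless spam sites will be a big step forward"
comment = "I agree that fewer pointless spam sites would be a big step forward for search"

print("original vs scraped copy:", similarity(original, scraped))
print("original vs a comment on it:", round(similarity(original, comment), 2))
# The score is the same whichever text you treat as the "first" one.
print("symmetric:", similarity(original, scraped) == similarity(scraped, original))
```

A measure like this separates a wholesale copy from a blog that merely quotes a phrase or two, but it is symmetric. It says two pages match without saying which one came first, so something else (crawl dates, perhaps) has to decide who gets treated as the original.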

It’s not just a question of getting visitors. Anyone who wants to bring in revenue from their site or blog by displaying adverts gets judged by these bizarre standards. Some schemes base what they send you on your Alexa rating, which is itself derived from Google’s well-nigh arbitrary page rank. If you’ve ever tried to have GoogleAds on a site, you’ll see how abstract the GoogleAds process is. In fact, visitors who think they’re helping you pay for the site, and so click a few times on your ads every time they visit, will get you disqualified. Ditto, your rivals……. (It seems as if you get automatically disqualified anyway, at the very point that you might actually receive any revenue.)

I know it must be well nigh impossible to filter the enormous volume of material in the Internet, especially in the face of the number of spammers there are. However, there must be better ways of doing it. I am always amazed when people find things here and comment or email us about them. How do they manage to find it?

So here is an unaccustomed prop for Technorati (unaccustomed for this blog, anyway, which has done its fair share of ranting about it). For all the irritating Technorati monster error messages and totally inconsistent service, Technorati remains the best performing indexing service that I’ve come across yet. The tags are really helpful when they work. You can still find an interesting read on someone’s first post. And Technorati isn’t yet totally under the sway of the giant players. The fabled Web 2.0 stuff really does still have something going for it.

Compuskills Site Redesign

I don’t know if you have had the chance to read the Compuskills Web Design Blog recently, but as mentioned there we have been working on a fairly large overhaul of the Compuskills site. Most of the work has been to convert the previous pages into a database served CMS, with only minor cosmetic changes.

Everything has gone well; the new site should be live tonight, and we can get back to the business of ranting and raving here again 🙂

Site Traffic

Amazingly, even though this month is only just half over, this blog has generated 10,068 unique visits this year. Wow. To put this in perspective, while there has been a steady increase over the last six months of 2006, the highest monthly total was around 9,000 unique visits. If the current trend continues we should break the 20,000 mark before too long. Thank you to everyone who visits!