El Reg has an interesting article on Google and how its ranking system is, effectively, a black art. For a company which claims to “do no evil” it is bizarre how closed they keep their methods – surely shining the light of openness on how they work would be the “good” thing to do. While it might increase the risk of black hat SEO working, surely it would make it easier for everyone else as well. Why does the Google Search Ranking algorithm have to be a secret? Read more at: http://www.theregister.co.uk/2009/11/19/google_hand_of_god/
While I have been busy over the last few weeks I’ve been unable to spend time looking at the stats for my Flickr images – this is unusual because all of us at WhyDontYou towers are somewhat stats obsessed. However, I found time to catch up today and was a bit surprised.
Now there is a fairly consistent amount of views on the photos in my Flickr stream with predictable changes when I add new photos or put some effort into getting them more visibility. Unsurprisingly, almost all my visitors come from Flickr with a rare few being driven from various blogs (hardly ever here…shame on you all) or other sites. Today this consistency was there, with one exception – this image:
For some reason, this image has been getting more traffic than any other image over the last month, and almost all this traffic is coming from Google Images. Over the last month, 91% of the traffic to this image has come from Google Images (almost all using “Animals” as the search term) while a further 8% has come from Yahoo Images (even more bizarrely, this is normally from a search for “Boxing”…) with almost none coming from organic Flickr searches. I have tried both Google and Yahoo searches and I can’t make this image appear in the first 10 pages or so of either engine.
I find it a bit strange that enough people are searching for animal and / or boxing images to wade through pages of results before deciding to visit this one. I find it equally strange that, given the number of animal images I have on flickr, this effect seems isolated to this image. It has gone from relative obscurity (around 1000 views in total) to one of the most viewed images in my photostream – currently 5,556 views in just over a month. (Not that I am complaining but none of these people even leave comments!)
If anyone has any insight as to what may be causing this, I’d love to hear it.
Come back flatterspam, all is forgiven.
For the past few days, the blog has been getting gibberish comment-spam, in oddly large numbers, almost at DDOS attack levels. (OK, I exaggerate but there were over 380 yesterday, 51 today.) Some of these comment spams are particularly weird, in that even the URLs are gibberish.
It’s not as if the random word generators have generated text in any known human language, that could trick the unwary into clicking on a link to onlinefakemeds.com or whatever. The URLs themselves are also random letter collections, with names like Mr._Mxyzptlk but less meaningful.
Charitably assuming that spammers have not completely taken leave of their senses, I guess that these suidfiojdfolsrkl.com-style links go to redirects and do eventually take the unwary URL-clicker somewhere. (Obviously, I’m not going to try them out. I’m enough of a sucker for any worm or trojan anyway.)
But still, what is the point? It seems even less likely that people would click on a gibberish link in a mound of gibberish than that they would believe that a complete stranger in Africa would pay them ten percent for the assistance in transferring 64 million dollars.
A few more blog-related odds and ends, now I’m on the subject:
Apologies to anyone who expects to get email alerts about new posts here. This plug-in has just stopped working. We don’t know what happened so we have less than no idea how to change it.
The Atheist blogroll got broken so long ago, it’s almost a distant memory. Again apologies. We threw it away a few months after it got stuck permanently showing last August’s posts of about ten blogroll members. (Or something like that.)
Other things just randomly break anyway. For instance, there was a link to the Convention on Modern Liberty that only lasted a week or so.
Plus, this blog can load so slowly (even on my allegedly very fast connection) that it’s hard to see why anyone bothers waiting for it.
Except for all those visitors who are looking for Schwarzenegger, 5 fruit and veg, funny magic the gathering cards, Bodium castle, fairytale castles, fine art or morris dancing. These are the top search terms that consistently bring people here from Google. Every day.
Now, I am all for giving the public what they want, but there’s only so much that I have to say on any of these topics. So, most of these visitors must leave a little disappointed, to put it mildly.
This blog needs a serious “REDO FROM START.” It should happen soon…..
Everyone is scared about malware and hacking on the web. There is nothing wrong with this and there really is a genuine threat out there. People need to make sure that their browsing is as safe as possible. For most people, unless you are running a high volume internet banking transaction server, this can be simply done by getting a good anti-virus (AVG Free is cost effective) and a firewall (Windows’ own, Zone Alarm or one on your router).
Despite this a lot of online organisations feel the need to join in and help out. Most modern browsers have built in “phishing filters” and will try to alert you when you click on what they think is an untoward link. This is all well and good and there are only minimal privacy implications.
Equally, search engines are doing the same thing now. When you google a search term, you get links with any potentially harmful ones highlighted. Just in case you ignore google’s advice, they have a blocking page pretty much ensuring you can’t click through to malware from google. Again, this may seem all well and good but there are even more issues. For a start, it is down to google to decide what is, or isn’t malware. They may be correct 99% of the time, but what about the other 1%? It becomes the responsibility of the website owner to discover they have been flagged as “malware” by google and then jump through google’s hoops to clear their name. This is wrong.
More importantly, who is responsible when there is a problem with google? A sensible hacker could target google’s servers and create the illusion that certain companies are full of malware. It would take a brave person to ignore the warnings and keep going through to a site that is so heavily flagged on the search page.
Do you think this is unrealistic? Here are the results of a search I did today on www.google.co.uk – imaginatively I searched for “Google”:
The whole internet is infected with malware. Every link is flagged with the dire warning it may harm your computer. I am not alone in discovering this… (PCPlus simply suggests using another search engine for the afternoon, Neowin is more informative) Google isn’t hacked (this time), it’s just broken. The effect is the same though. Any attempt to search meets with this warning and Google’s intervention means you can’t ignore it and click on. Well done Google – you have borked searching… Amazing.
This is (IMHO of course) the problem with allowing web services to have more and more control over our daily lives. It is bad enough that the most popular search engine on the internet suffers a glitch like this, but imagine if you were using Google to host your remote office systems – an outage can be crippling. Cloud computing may be in vogue, but it is fundamentally a bad idea. You cannot delegate your responsibilities to unaccountable groups – you are responsible for making sure no malware gets on your PC, so why does google feel the need to intervene?
I decided to have a cup of coffee rather than randomly searching Google for a few minutes. For the good of the planet.
The Sunday Times reported that 2 Google searches have the same carbon footprint as boiling water for a cup of tea. (I am hoping the same applies to coffee but I’m erring on the side of caution by forsaking half a dozen notional searches.)
These statistics aren’t completely convincing, being generated, as they were, by a guy who’s set up a website to sell a clean conscience to websites.
People want websites they visit to be eco-friendly. CO2Stats helps you attract and retain those visitors.
CO2Stats is the only service that automatically calculates your website’s total energy consumption, helps to make it more energy efficient, and then purchases audited renewable energy from wind and solar farms to neutralize its carbon footprint – all for a flat, affordable monthly fee. (from co2stats)
The estimated carbon footprint of your search varies wildly between
[Wissner-Gross's] research indicates that viewing a simple web page generates about 0.02g of CO2 per second. This rises tenfold to about 0.2g of CO2 a second when viewing a website with complex images, animations or videos. (from the Sunday Times)
So, “stick to really dull webpages and don’t visit YouTube or sites that use Flash” sounds more immediately effective advice than buying spurious energy credits.
In any case, this turns out to be at the low-end of the carbon footprint estimates:
….. carbonfootprint.com, a British environmental consultancy, puts the CO2 emissions of a Google search at between 1g and 10g, depending on whether you have to start your PC or not. Simply running a PC generates between 40g and 80g per hour, he says. Chris Goodall, author of Ten Technologies to Save the Planet, estimates the carbon emissions of a Google search at 7g to 10g (assuming 15 minutes’ computer use).
Nicholas Carr, author of The Big Switch, Rewiring the World, has calculated that maintaining a character (known as an avatar) in the Second Life virtual reality game, requires 1,752 kilowatt hours of electricity per year. That is almost as much used by the average Brazilian.
Wait, if using a PC at all emits ~60g an hour, ie, 1g a minute, how can viewing a complex website generate 0.2g a second, ie, 12g a minute? These figures are hard to square with each other.
And that bit about “depending on whether you have to switch your PC on” is really confusing. (When I work out how to use my PC without switching it on, I’ll post the information here.)
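The quoted figures mix per-second and per-hour rates, so they need converting to a common unit before they can be compared. A rough sketch, using only the numbers quoted above from the Sunday Times article (the variable and function names are invented for illustration):

```python
# Put the quoted CO2 figures on the same footing: grams per minute.
SECONDS_PER_MINUTE = 60
MINUTES_PER_HOUR = 60

simple_page_g_per_s = 0.02    # quoted: viewing a simple web page
complex_page_g_per_s = 0.2    # quoted: viewing a complex web page
pc_g_per_hour = (40, 80)      # quoted: simply running a PC (low, high)


def per_second_to_per_minute(grams_per_second):
    return grams_per_second * SECONDS_PER_MINUTE


def per_hour_to_per_minute(grams_per_hour):
    return grams_per_hour / MINUTES_PER_HOUR


simple_g_per_min = per_second_to_per_minute(simple_page_g_per_s)
complex_g_per_min = per_second_to_per_minute(complex_page_g_per_s)
pc_low_g_per_min, pc_high_g_per_min = (
    per_hour_to_per_minute(g) for g in pc_g_per_hour
)

print(f"simple page:  {simple_g_per_min:.2f} g/min")
print(f"complex page: {complex_g_per_min:.2f} g/min")
print(f"running a PC: {pc_low_g_per_min:.2f} to {pc_high_g_per_min:.2f} g/min")
```

On these numbers a simple page sits inside the range quoted for just running a PC, while a complex page comes out several times higher than the top of that range.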
I am sure that computer use is mostly a waste of energy. I am sure that big powerful servers are even greedier than my PC.
However, I’m not convinced by the idea that you can buy your way out of responsibility for ecological damage. Paying to generate some less-polluting-energy doesn’t mean that the more-polluting-energy you used before suddenly disappears.
Congestion charges, aviation carbon taxes and so on. They all suggest that you won’t cause ecological damage if you can afford to pay for it. It’s like buying and selling medieval indulgences.
This would be great if the Earth was susceptible to bribery. I think these schemes are usually just ways for us to avoid taking any real steps to stop destroying the Earth. In some ways, they are worse than doing nothing, because they give us the illusion that we are taking serious steps to save the environment and that we can do this without any major inconveniences.
And they give the climate-change deniers some pretty obvious strawmen to direct their denying at. For example, here are some of the comments on the Times article:
When does this global warming hysteria end. It seems like all these die-hard environmentalists would like us all living in huts with no electricity, comforts, or heating. Especially considering this freezing winter (against all predictions), I’d like to see them go first.
Like a mouse climbing up the leg of an elephant with rape on its mind. Global warming at/isn’t going to happen
I call for a moratorium on publishing articles like this one. The amount of CO2 generated when my head starts to steam is much higher than a Google search. Multiply that by the millions of sane people who agree with me that GW is a crock and GW might actually come true.
(Replace the misused “sane people” with a more accurate “Americans” and you get the flavour of a lot of these comments. What is it about living in the USA that makes some people unable to see beyond their own carports?)
The calculations are ridiculous and blatantly misleading.
But no surprise, it appears that this will be another cold year and the “environmentalists” are running up and down in a total panic that they failed to fully socialize the world while for a few years was a bit warmer.
And why should we care how much energy Google uses…because of the myth of Global Warming that is being forced down our throats.
2007 was the warmest year on record, no wait, we were wrong about that, the warmest year was 1945. Artic sea ice will be gone soon, no wait, we were wrong about that
It looks as if even people who are too monumentally stupid to see that a cold year doesn’t in itself invalidate climate change are still bright enough to see that these figures are a bit bogus.
Why give them ammunition? The idea of a “carbon footprint” as an individual moral issue, susceptible to individual guilt and contrition is just mistaken. It’s obviously good to do whatever we can as individuals, but it’s a social and political issue, which needs serious social and political solutions.
(end opinionated rant.)
There is a new Google enterprise to get searchable digitised newspaper archives online. A great idea. (I’ve already had loads of educational fun with the Times archive and the Victorian British press archive that went subscriber only, just when it had completely engrossed me.)
The Google blog page has a link to Google’s press archive search but there’s a warning that you won’t find everything indexed. They suggest some searches.
Not every search will trigger this new content, but you can start by trying queries like [Nixon space shuttle] or [Titanic located]. Stories we’ve scanned under this initiative will appear alongside already-digitized material from publications like the New York Times as well as from archive aggregators, and are marked “Google News Archive.”
This instantly arouses my vapourware bullshit detector. Hmm. Space shuttle. The Titanic. First man on the moon… Maybe they’ve just stuck together a few very standard searches and plan to add lots more information as it becomes popular….. I feel impelled to test it a bit more rigorously.
I try a few off-the-wall searches. I pick the topics solely on the randomish basis that somebody’s mentioned the words to me in conversation today:
- “Dolph Lundgren” – 4,370 articles
- “Japanese swearword” – 279 articles
- “linear algebra” – 3,520 articles
- “Large Hadron Collider” – 3,370 articles.
- “Frozen vegetables” – 236,000 articles
Blimey. This actually works really well. I can’t claim to have clicked on more than a handful of links but the ones I did click on were legit. It’s definitely not vapourware. It’s already damn good.
So, the big test, then. I’m going for my favourite indicator that a human twat-a-tron is at work. “Political correctness gone mad” gets 3,240 print archive hits.
Wait. I run it again, to see if the British press is represented. Just because I suspect that it must appear several times a day, so 3,240 seems a relatively small total. (It’s outnumbered by all the phrases above except “Japanese swearwords” and the consensus of press opinion seems to be that these don’t really exist.)
This time I get a mere 1,550 hits. Bloody inconsistent Google. Plus, the timeline is bizarre to say the least. It claims the first mention was between 1880 and 1559. The next was in 1782, then there’s one from 1805. … I think not. They are making these up. The 1958 ones look like a mistake as well.
Closer inspection reveals that the “dates” have leaked in from elsewhere in an article. Most examples are huddled around the last 8 years. In fact there’s barely an instance of political correctness gone mad until 1998. It’s only in the past couple of years that the full flowering of the phrase has taken off.
“The PC brigade” (h/t Alun) got 467. Ignoring the dating oddities, these are also clustered around the turn of the century, with a linguistic take-off from 2000.
These numbers are tiny. Ah ha. Google hasn’t archived the Daily Mail. (No hits for “the Daily Mail is shit”, h/t Tom Donald)
Look, if they are only going to index serious newspapers, there is going to be no fun in this.
However, they must have archived a fair bit of newsprint crap, because “the Rapture” brings back a stunning 18,300 reports.
First mention is 0 AD
Cuil, Cuil ffs? Repress a shudder at the name. It’s a (relatively) new search engine. It’s good, although it’s had a bit of a critical drubbing. It’s much prettier than google. Its results make a lot of sense. It’s not stuffed with sponsored links or spam links or dominated by top-ten-authority corporate results. So I think I like it, although I’ve only used it on a test basis.
I also really like Ubuntu. Of course, any Linux version is admirable, and Ubuntu is more admirable than most.
I am just going to have a pointless rant about the branding – calling things ethnic-sounding names to make perfectly good and worthy things sound just that bit more credible.
The wikipedia entry doesn’t do much to dispel any impulse to sneer at the Cuil name:
The Irish ancestry of Anna Patterson’s husband Tom Costello sparked the name Cuil, which the company states is taken from a series of Celtic folklore stories involving a character called Finn McCuill. The company says that Cuil is Irish for knowledge and hazel.
That’s “Irish ancestry” in the sense of “American Irish”, then? (One Irish great-great grandparent and an Irish surname qualify any American as Irish. Although I remain to be convinced that Costello really counts, here….)
Wikipedia does some serious undercutting of the legitimacy of the Irish ethnic explanation for the brandname, from a standpoint of linguistics. Which feeds my instinctive prejudice against the word, the spelling and its supposed “cool” pronunciation.
I used to get riled every time I saw claims that Ubuntu was the “African word for” something, as if Africa didn’t have more languages than any other continent in the world.
Ubuntu is an African word meaning ‘Humanity to others’, or ‘I am what I am because of who we all are’. The Ubuntu distribution brings the spirit of Ubuntu to the software world. (from Ubuntu.com)
I have to turn my pedantry against myself. That said “An African word for” not “The African word for”. Maybe I have misjudged Ubuntu. I do a cuil search for “ubuntu is african for.” The first page is a whole string of official ubuntu links, none of which say it is the African word for anything. In fact, many of the definitions that turn up are reasonably precise, a Zulu word and a South African philosophy.
My bad. I must have imagined the “African word for” phrase, misremembering the blurb from the old distro I have somewhere.
But google and cuil do both unveil an apparent subgenre of geek humour based on the misremembered “Ubuntu is African for”
Ubuntu is African for ‘Can’t configure Debian’. (typical link: Ubuntu forum post)
Indeed. ubuntu is african for ” I CANT CONFIGURE SLACKWARE”
(typical link: Another forum)
ubuntu is African for “time sucker”, right? (link: I-phone blog forum)
Ubuntu is African for “struggles to install mouses”. (from information rain)
Most off-the-wall is
Ubuntu is African for sharks with freaking laser beams on its head. (from animetro)
Am I beginning to see a pattern, here? I’ll have to try it.
Cuil is Irish for “excuse to use a disgustingly lame pun in a blog title”
It is rare to read the free bus paper – the Metro – without seeing at least one letter with a rant about “political correctness gone mad.”
Experiment: Counting the number of readers’ letters containing the phrase and working out a daily average, maybe comparing the result to the occurrence of some other nonsense phrase like “air conditioning walnuts.”
However, that would be a bit too much of a time and consciousness commitment, so I took the easy way out and googled.
Amazingly, google could only find 681 occurrences. Impossible. Doh, I misspelled the word and missed the first “i” out. Which makes the 681 occurrences quite impressive. (A truly dedicated social researcher would try every possible misspelling. Sorry.)
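For what it’s worth, the dropped-letter misspellings at least are easy to list. A minimal sketch, covering only single missing letters (not swapped or doubled ones); the function name is invented for illustration and nothing here actually queries Google:

```python
def single_deletions(phrase):
    """Return every distinct variant of phrase with exactly one letter removed."""
    variants = set()
    for i, ch in enumerate(phrase):
        if ch.isalpha():  # only drop letters, leave the spaces alone
            variants.add(phrase[:i] + phrase[i + 1:])
    return sorted(variants)


phrase = "political correctness gone mad"
variants = single_deletions(phrase)

print(len(variants), "one-letter-missing variants to search for")
for variant in variants:
    print(f'"{variant}"')
```

The version with the first “i” missing turns up in the list, along with all the others a truly dedicated social researcher would have to type in.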
The correct tally is actually “about 61,000.” Even this seems a little on the low side, given the existence of the Daily Mail and the BBC’s Have your Say. I suspect I have been too specific to get a true picture of how often “Ranting Bigot” reaches for the conceptual green ink.
I put the phrase “political correctness gone mad” in quotes. This is an English usage. I’m not sure how thinking-constricted Americans say it. How do I make a direct translation of “gone mad” into US English, in which mad means “angry” rather than insane?
“political correctness run amok” gets 21,400. Quite a respectable tally but I don’t think I’m still getting the full flavour of it.
“political correctness run amuck” garners a further 4,230.
“political correctness gone insane” gets a modest 3,090.
“political correctness gone berserk” gets only 510, (plus one result for “political correctness gone bersek”, my misspelling again.)
Ok, I’m going for the big ones: The bald phrase “political correctness” gets about 5,060,000.
The phrase “politically correct” brings up 6,150,000 entries. There is some duplication here, though. Is anyone adding these up?
Oh Buggar. “air conditioning walnuts” – the control phrase in my experiment – brings up “about 1,240,000” google hits. I kid you not.
Undaunted, I have to conclude that this might just show that there is no nonsense phrase too ridiculous to bring up millions of google hits. (And, at least, “air conditioning walnuts” doesn’t have me snarling when it appears on a web page.)
Are there any other Firefox users who have Gmail (Google Mail) accounts? If so, please put me out of my misery. Does your copy of Firefox crash every single time you try and do something with your mailbox?
I am using Firefox 22.214.171.124, which as far as I can tell is the most up to date version. I have tried updating it and I have tried updating various other components on my computer. All to no avail.
Without fail, every time I go into Gmail the countdown to a crash begins. I can view all manner of other pages, have twenty tabs open and be downloading huge files. All fine. Try to click on a folder in Gmail and it is game over. I have sort of narrowed it down to something in the scripts on Gmail causing the crash but I am not totally sure (yet).
Recent examples: I tried to create a new filter… crash. I tried to view all starred mail… crash. I tried to view all emails with a given tag… crash. I tried to send an email… crash.
The only saving grace is I can read emails and, despite FF crashing on me it actually manages to send the emails. It is, in a nutshell, a nightmare. Fortunately Internet Explorer is perfectly functional with Gmail, but this makes it all the more annoying. During a given day, I wouldn’t have any reason to open IE if it wasn’t for bloody Gmail.
As far as I can tell, this is recent. I can’t remember when it began but it must be less than a month ago.
Is it just my computer? Am I alone with this madness? Do Firefox developers get to see the 30 – 40 error messages my machine sends out each day?
I am an atheist has links to the-end.com:
2008: God’s Final Witness
Unprecedented destruction will come in 2008, leading to America’s fall
(Oh shit, I blogged about this very site’s endtimes nonsense a few weeks ago. For free. :-()
Called to be a monk, nun, priest? Take Free Online test Now To see if God is calling you.
From vocationsplacement.org. Well, OK, but I think I know the answer already so I’ll skip the online test and check out the free holiday destinations. Lose interest when I see that the destinations are generally states that I don’t recognise by their initials. I can’t find Hawaii. Look, I’m not prepared to pretend to be Catholic and pray for a couple of weeks for Wisconsin. But thanks for the offer.
Sexed-up Atheism- Dawkins Pantheism adds reverence for Nature, Universe, Life
(Pantheism.net) Well, no great argument with these people, except that I have a fastidious revulsion at the use of the term “sexed-up” to mean “slightly more interesting”.
The Enlightenment of the Healy (me neither) has an advert for hidden-advent.org:
Desiring Lord appearing? Expecting Lord’s return? A pleasant surprise is awaiting you
Hidden advent? Is that a really obscure pre-Christmas calendar? No. It’s one of the most eye-burningly ugly sites you’ll ever see. It deals with The Work of the Lord’s Hidden Advent In China. However, the site is even less comprehensible than the title. I click on a link that says
Typical Cases of Leaders in Catholicism and Christianity in Mainland China who Resist Almighty God Being Punished
Not understanding the English, I have to click the link. I now understand even less than I did before.
It has a series of bizarre tables by province, e.g. Henan (65 Cases Selected). I pick a randomly numbered case:
Liu X from Dengzhou City, female, 48 years old, a believer from the Born Again denomination. In February 1999, someone preached God’s end-time work to her, but she didn’t accept it. In March, another person preached it to her again, but she said: “What you believe in is a false way and a cult. I just believe in Jesus. If I were to die, I would die under Jesus’ name.” Two months later, Liu X got uterus cancer, and she lost all her hair after chemotherapy.
In the autumn of 1999, the brothers and sisters preached God’s end-time salvation to her again, but she still resisted and condemned it. Right after that, her innards began to rot. She suffered unbearable pain and failed to respond to any medical treatment. ….. Her oath “rather die than believe” was fulfilled eventually.
They list “Two hundred cases selected from among tens of thousands of cases”. They all involve people dying a painful untimely death for not accepting the end-times idea. Who’d have thought there could be a religious group that compared unfavourably with the Phelps family?
Back in the real world, the advert still says “A pleasant surprise is awaiting you”. Loki forbid that they ever try to give a site visitor an unpleasant message….
I have been looking through the website logs to see just what it is that drives people to this site and, while lacking in raw comedy value (unlike some), it has been interesting.
Running a combination of Firestats, Feedburner and Google Analytics, it seems this blog is getting around 400 visits a day. Of these, around 80% are new (which shows just what a non-loyal readership we hold…) and of those around 70% come here from a search engine – nearly all from Google. For the numbers-fans, this translates to about 200 hits a day from Google searches. Given the insanely varied nature of topics here, you would be excused for thinking this was reflected in the search stats. Not so.
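For the numbers-fans who want to check the sums, here is a quick sketch of the arithmetic using the rounded figures above. The variable names are made up, and the percentages are the approximate ones quoted, not exact log values:

```python
# Rough daily figures quoted in the post.
visits_per_day = 400
new_visitor_share = 0.80      # "around 80% are new"
search_engine_share = 0.70    # "around 70% come here from a search engine"

new_visits = visits_per_day * new_visitor_share
search_visits = new_visits * search_engine_share

print(f"new visits per day: {new_visits:.0f}")
print(f"of which from search engines: {search_visits:.0f}")
```

That works out at 224 search-engine arrivals a day, which squares with “about 200” once the few non-Google searches are knocked off.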
Of the top ten search terms used to come here, seven are image searches, and this accounts for about 90 of the incoming hits. Even stranger, of these over a third are all searching for images of Bodiam Castle.
Now, Bodiam Castle is a gorgeous, fourteenth century fairytale castle in East Sussex, run by the National Trust, so I can understand why people are interested in it. In fact, I understand this well enough to have uploaded another photo!
If you have come here searching for Bodiam Castle, I hope you like this, and you can even see more on Flickr. It has been a long time since I have been to Bodiam so please, forgive me for the photos being out of date now. If you have links to other pictures of this gorgeous castle, please let me know and I will be more than happy to link to them from here.
Back onto the search topic, there is the determination issue to consider now. Will my posting of a new Bodiam article increase the number of hits I get for this? Are people massively disappointed when the Mighty Google sends them here rather than elsewhere? Why don’t people use Yahoo to search for Bodiam?
The other common terms people use for an “images search” are:
- Nice Art
- Fine Houses
- Holy Wafer
- Jesus Toast (around 5 people a day come here using that search term… MADNESS)
- Future Castles
Now, some make more sense than others, but I can only guess at the disappointment people must feel when their searches lead them here.
For completeness, the most common search terms that bring people to this site are:
- HDR How To (use Photomatix)
- Cool Viking Names (well all of them)
- Bad Journalist (again, all of them)
- Firefox Memory Hog (it is)
- Pipex Download Speeds (almost non-existent)
- McCanns Blog (wrong place, I didn’t even know they had one)
One last point, a bit of an oddity is a search term Feedburner has identified leading some poor unfortunate here: “blog: I cannot read, feel distracted” – I have no idea what this blog has to offer this poor person.
I decided I would see if good ol’boy Chuck Norris was back blogging about his doing the “cruel and unusual” thing and visiting US troops in Iraq. (Yes, some people are just really easily amused. I admit it.)
I lazily typed “worldnet daily” into the Google search bar, forgetting I was in Google Images rather than standard Google.
Showing that the singularity is here and that computer networks have achieved sentience, it said
Did you mean: worldnutdaily?
Internet search engines tend to be perfect examples of the proverb “To them that hath shall be given.” (I guess this is a Biblical quote. The “hath” suggests it anyway.)
Get a top ranking on Google and you can guarantee your site will get loads of hits. Which will up your ranking. Which will get you more hits. And so ad infinitum.
Which must be great if you are the website equivalent of Coca Cola. But is a bit of an obstacle when you are Joe Nobody’s Homemade Dandelion and Burdock Drink.
So it’s good that an open source Wikia Search project is slowly being brought into existence. The idea is that an open source search algorithm will inspire more confidence in the results. At the least, it will let website owners know what the goalposts are.
New Scientist of 12th June 2007 (Yes, I know, it obviously takes me a while to process information) described the Wikia search project as the project of a “rebellious group of software engineers” determined to topple Google.
Apparently, one of the biggest problems is the shortage of mountains of cash to set up global data centres to match those of Google and Microsoft. According to New Scientist, one possible solution is to use a grid computing model, along the lines of SETI, with the search processing distributed around the world on volunteers’ PCs.
Most of the stuff on the Wikia site at the moment is concerned with the project itself. There is an about page. It looks as if development has stalled a bit since the initial start push in 2004, though. (Which suggests that New Scientist is even slower than me at processing information.)
Here’s an extract from Wikia Search on some of the ranking problems they intend to address:
Several other strategies to cheat or game the search engines are based on the fact that many search engines consider a hyperlink to a site to be a ‘vote’ for that site or measure of popularity. The use of hyperlinks as an indicator of website ‘quality’ led to link exchanges, link farms, bulletin board spam and other strategies to boost sites. Search engines responded by attempting to algorithmically evaluate the quality of each page, and discount links on sites or pages of little real value. While these algorithms to assess quality have neutralized millions of web pages, they have not (and cannot?) objectively determine the value and context of all the links on the web. The number of links to a page remains one of the biggest factors in how a page ranks in conventional search engines, and remains a prime area of interest for black-hat and grey-hat SEO.
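To see why “links as votes” is so easy to game, here is a toy sketch that ranks pages purely by how many other pages link to them. It is a deliberate oversimplification, not how Google, Wikia Search or any real engine works, and all the page names are invented:

```python
from collections import Counter


def rank_by_inbound_links(links):
    """Rank pages by the number of distinct pages linking to them (one link = one 'vote')."""
    votes = Counter()
    for source, target in set(links):  # ignore repeated links from the same page
        if source != target:           # a page cannot vote for itself
            votes[target] += 1
    return votes.most_common()


# A handful of genuine links between made-up sites.
honest_links = [
    ("blog-a", "recipe-site"),
    ("blog-b", "recipe-site"),
    ("blog-c", "recipe-site"),
    ("blog-a", "spam-site"),
]

# A link farm: ten throwaway pages that exist only to link to the spam site.
link_farm = [(f"farm-page-{i}", "spam-site") for i in range(10)]

print(rank_by_inbound_links(honest_links))
print(rank_by_inbound_links(honest_links + link_farm))
```

With only the honest links, recipe-site comes top on 3 votes. Add the ten farm pages and spam-site jumps to the top on 11, which is the whole business model of a link farm.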
Anything that can cut down the number of pointless spam sites that can clutter up the first few dozen pages of search results from standard search engines will be a big step forward.
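For what it’s worth, here is a minimal sketch of the “a link is a vote” idea from the extract, and of why a link farm games it. It is a simplified PageRank-style iteration written for illustration only; the page names, the damping value and the tiny made-up web are all my assumptions, not anything Wikia or Google has published.

```python
def rank_pages(links, damping=0.85, iterations=50):
    """Score pages by treating every hyperlink as a vote (simplified PageRank-style sketch)."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    scores = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_scores = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly between the pages it links to.
                share = damping * scores[page] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # A page with no outgoing links spreads its vote over everyone.
                share = damping * scores[page] / n
                for other in pages:
                    new_scores[other] += share
        scores = new_scores
    return scores


# Hypothetical four-page web: nobody links to "spam".
honest_web = {
    "blog": ["news"],
    "news": ["blog", "shop"],
    "shop": ["news"],
    "spam": ["news"],
}

# The same web plus a link farm: five throwaway pages that all link to "spam".
farmed_web = dict(honest_web)
for i in range(5):
    farmed_web[f"farm{i}"] = ["spam"]

before = rank_pages(honest_web)
after = rank_pages(farmed_web)
print("spam score without farm:", round(before["spam"], 4))
print("spam score with farm:   ", round(after["spam"], 4))
```

In this toy web the spam page’s score goes up purely because five worthless pages point at it, which is the behaviour the quality-assessing algorithms in the extract are trying to discount.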
I hope they solve the problems and this idea takes off. I’d volunteer my puny computing power and some of my bandwidth. Persuading ISPs not to do the choking-at-peak-times thing that they have started sneaking in through “Fair use” policies might be an obstacle though.
I was reading a post on Matt Mullenweg’s blog (PhotoMatt), titled “DoubleClick and Kevin Ryan”, which talks about Google having bought DoubleClick and about Kevin Ryan (co-founder) having moved on to a new start-up called ShopWiki. (I am not going to link to them though.)
Basically ShopWiki sends bots out to trawl the web and find products at the best price for you. You may think this is a wonderful thing, and it may well be. I am somewhat intrigued though as to why this site (notable for its abject lack of sellable items) has been getting hammered by the ShopWiki bot for most of the last two days (until it got the .htaccess treatment). As far as I can see, the bot ignored the robots.txt entry I put in for it (although my track record with this file is poor).
I think the idea behind ShopWiki seems sound and I am sure it is a wonderful new idea. But I have to question the validity of the data it has collected, given the time and effort it spent looking round the contact pages here. In a spambot-like fashion, the ShopWiki bot seems to have concentrated on pages which made reference to emails and the like.
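Given my track record with that file, here is a rough way to check what a robots.txt entry actually says before blaming the bot. It uses Python’s standard `urllib.robotparser`. The user-agent string “ShopWiki” and the `/contact/` path are guesses for illustration; the real bot may announce itself under a different name, in which case the rule would simply never match.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt entry. The bot name "ShopWiki" is an assumption;
# check your server logs for the user-agent the bot really sends.
ROBOTS_TXT = """\
User-agent: ShopWiki
Disallow: /
"""


def is_allowed(robots_txt, user_agent, path):
    """Return True if the given robots.txt text lets user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


print("ShopWiki allowed on /contact/:", is_allowed(ROBOTS_TXT, "ShopWiki", "/contact/"))
print("Other bots allowed on /contact/:", is_allowed(ROBOTS_TXT, "SomeOtherBot", "/contact/"))
```

Even a correct entry is only a polite request. A bot that chooses to ignore robots.txt has to be blocked at the server, which is where the .htaccess treatment comes in.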
Time may modify my point of view, but for now I think of this as a Bad Shop.
What is it with search engines and web-traffic rankers?
This blog has done enough whining about Technorati’s randomness. It’s well overdue to say that it’s probably working far more consistently and reliably than most of the facilities that claim to find Internet resources. (On a note that shows how shamelessly susceptible to flattery we are at whydontyou.org.uk – others please take note – it puts this blog at under 60,000 in the blogosphere which is almost beyond its wildest dreams.)
As an experiment, look up your blog in a few search engines. See if you can find any points in common between them.
Here’s one of my favourites, in that I suspect they must actually use a randomiser to generate web traffic numbers and links. Pick a blog, look at it in Technorati’s blog directory.
Go to the traffic rank bit and click on it. You will find yourself in the realm of Alexa. This will probably show you that the traffic isn’t really counted because the blog isn’t in the top 100,000. The daily page views are shown as a percent of people using the whole Internet, i.e., if the site isn’t in the top 100,000 sites in the world, you won’t get any figures. (If you come in at a newbie 5,195,452 – as this blog does – you may wonder if you are even reading the blog yourself.)
100,000 sounds like a lot of sites. However, if you consider global players (like Google or Microsoft), then big online retailers (like Tescos and Dell), then news sites (CNN, BBC) and national government information sites, you can see it must be pretty difficult to get into the club.
Beneath this blank chart, you will see “Percent of Internet users who visit this site” with a fraction of a percent if it’s anything like this one. (Maybe you’re Microsoft, in which case I guess it will be higher. Will check shortly.)
Then “average number of pages visited” and “3 months average traffic rank” (risibly low) and “average page views per visitor” (1. Do you suspect that’s hard-coded?)
But the next bit is what creases me up for its randomness. People who visit this site come from (in order of most visits):
United States 40.0% (fair enough, the blog’s in English. Most English-speaking Internet users are in the USA)
Costa Rica 10.0%
United Kingdom 10.0%
Whydontyou.org.uk traffic rank in other countries: (These seem to be the same countries to me)
Costa Rica 46,349
United States 658,841
United Kingdom 703,872
Come on…. To what do we owe this unprecedented popularity in Costa Rica? India? France? This is a UK-based blog. Most of the stuff we witter on about, apart from atheism and technology, relates to the UK.
It’s not that I don’t want to believe it. A central American flavour to its posts would make this blog much more interesting. I just think the figures have been made up.
OK, let’s look at the sites that link here, according to Alexa. These are so out of date that it’s obviously not been updated since the blog was a couple of months old. In fact, until I submitted a more recent image, Alexa had a screen shot of the blog that was well over a year old. (Yes, I know, that’s like saying “We don’t get enough spam here, please deluge us with as much as you can possibly manage”.) Maybe because of their age, the sites listed in some of these links are unrecognisable. In fact none of the blog links would be counted by Technorati, being over a year old, but then, it shows no links that Technorati counts (under 90 days).
Let’s search for this blog on Google. Here, it’s weirder. There are few points of comparison between different Google results, if you repeat the search over a day or so. Maybe it’s just how Google treats blogs, but the post that comes up first is always the same one from a few months ago. Other posts can only be seen by asking for similar results, excluded the first time for being the same. Well, guess what Google, every post is different. It’s a blog. Lots of the other Google results for the blog are bits of the RSS feed. I’d like to think that lots of people are devouring the RSS feed, but, unfortunately, these tend to be link farms. In fact, lots of obscure references to the blog on link-farm sites turn up on Google, most being complete news to us. Real human-created references to the blog don’t turn up as often as they actually happen.
I could go on to the point where I was boring even myself.
None of this would matter if getting seen and indexed correctly wasn’t crucial to getting any visitors. I know that indexing engines and search engines are bombarded with spammers trying every trick there is to get high on the first results page. The search engines have algorithms that are supposed to penalise sites and blogs that don’t match their definition of legitimate – density of keywords, number of inbound links, and so on. I believe that not only are these not working, they are often acting in exact reverse to their intentions.
Content from blogs gets scraped and put into blag sites that exist just to spew out other people’s content. Google then decides the original source site has “duplicate” content and downranks it. How do you stop this without stopping legitimate blogs from commenting on your posts?
Keywords in the metatags don’t match the keywords in the text? Well, duh, normal human beings aren’t thinking only of page rank. So they put keywords in their metatags then write content, without remembering to keep changing the metatags. Only people obsessed with search engine rankings do that and, of course, a fair percentage of them aren’t just bloggers or normal website owners.
It’s not just a question of getting visitors. Anyone who wants to bring in revenue from their site or blog by displaying adverts gets judged by these bizarre standards. Some schemes base what they send you on your Alexa rating, which is itself derived from Google’s well-nigh arbitrary page rank. If you’ve ever tried to have GoogleAds on a site, you’ll see how abstract the GoogleAds process is. In fact, visitors who think they’re helping you pay for the site, and so click a few times on your ads every time they visit, will get you disqualified. Ditto your rivals… (It seems as if you get automatically disqualified anyway, at the very point that you might actually receive any revenue.)
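To make the duplicate-content problem concrete, here is a small sketch of one common textbook way to spot near-identical text: break each page into overlapping three-word “shingles” and compare the sets. I have no idea whether Google does anything like this; the function names and the sample sentences are mine, purely for illustration.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences ("shingles") in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(text_a, text_b):
    """Jaccard similarity of the two shingle sets: 0.0 is unrelated, 1.0 is identical."""
    set_a, set_b = shingles(text_a), shingles(text_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


# Made-up examples: an original post, a scraped copy, and a genuine response.
original = "the search engines have algorithms that are supposed to penalise spam sites"
scraped = "the search engines have algorithms that are supposed to penalise spam sites"
response = "I disagree with this post because search engines are getting better at spam"

print("original vs scraped copy:", similarity(original, scraped))
print("original vs real comment:", similarity(original, response))
```

The awkward part is that the score is symmetric. It says two pages match, but nothing in it says which one was written first, so a measure like this on its own cannot tell the source from the scraper.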
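Here is roughly what a metatag-versus-text check could look like, using Python’s standard `html.parser`. It is a crude sketch: the substring matching, the class and function names and the sample page are all assumptions of mine, and no search engine has told me this is how they do it.

```python
from html.parser import HTMLParser


class KeywordPage(HTMLParser):
    """Collect the keywords metatag and the visible text of a page."""

    def __init__(self):
        super().__init__()
        self.keywords = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            self.keywords = [k.strip().lower() for k in content.split(",") if k.strip()]

    def handle_data(self, data):
        self.text_parts.append(data.lower())


def keyword_match_ratio(html):
    """Fraction of metatag keywords that appear in the page text (None if no keywords)."""
    page = KeywordPage()
    page.feed(html)
    page.close()
    if not page.keywords:
        return None
    text = " ".join(page.text_parts)
    found = [keyword for keyword in page.keywords if keyword in text]
    return len(found) / len(page.keywords)


# Hypothetical page: the metatags were written once and never updated.
STALE_PAGE = """<html><head>
<meta name="keywords" content="atheism, technology, flickr">
<title>Search engine rant</title></head>
<body><p>Today's post is about search engines and Technorati.</p></body></html>"""

print("keyword match ratio:", keyword_match_ratio(STALE_PAGE))
```

A perfectly honest blog with stale metatags scores badly on a check like this, while anyone obsessed with rankings can trivially score perfectly, which is the reverse of what the check is for.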
I know it must be well nigh impossible to filter the enormous volume of material in the Internet, especially in the face of the number of spammers there are. However, there must be better ways of doing it. I am always amazed when people find things here and comment or email us about them. How do they manage to find it?
So here is an unaccustomed prop for Technorati (unaccustomed for this blog, anyway, which has done its fair share of ranting about it). For all the irritating Technorati monster error messages and totally inconsistent service, Technorati remains the best performing indexing service that I’ve come across yet. The tags are really helpful when they work. You can still find an interesting read on someone’s first post. And Technorati isn’t yet totally under the sway of the giant players. The fabled Web 2.0 stuff really does still have something going for it.