Friday, 09 May 2008

The Rarest Words, Part II

A few months ago, I wrote about a strange website that was showing up in my referer logs. It was called TheRarestWords.com, and it appeared to be keeping track of all the rare words that appeared on each website visited by its little bot, RarestParser. They made a page on their site for each of these lists of rare words (mine was http://therarestwords.com/mxrk.net), and for a while it looked like they were making a lot of progress. Then they went away.

I was skeptical of RarestWords because of the lack of information they provided about their project or who they were, and because the site featured a number of ads for “Search Engine Optimization,” a field which has a tendency to attract some pretty shady characters.

Now RarestWords is back, and they look busy.  Most of the old pages don’t work anymore, and they’re describing themselves a little differently.  Compare this to their old copy:

What’s this? If you look at just the main pages of all sites in .com/.net/.org zones - you’ll see more than 17 millions words. Some of them got to be telling some interesting stories and that’s what we’re going to find out. Come back later when the system is really working (right now it’s more of a bunch of raw science stuff).

And that’s from about the third revision of the text I’ve seen today. Pretty weird, huh? Try looking up their WHOIS data and you’ll get a big, fat zero. They’ve registered through a proxy registrar so you can’t see who they are. I tried to do a little snooping last time, but I came up empty-handed. Judging from the amount of traffic I’m getting from people looking up “rarest words,” it looks like they’re pretty active these days. Anybody know anything?

Reader Comments (5)

Hi, I'm the owner of TheRarestWords site. I can't imagine why there's so much attention from you to it, but still I'll try to answer the question.

The domain has hidden registrant info because I don't want people to know who I am. Why? I just don't want to be linked to my other projects. This one is just kind of my personal research into words, because it has nothing to do with my main work and it's a hobby (the "SEO" ads couldn't even pay for the server and traffic this site uses). There are stupid people around who like to harass anything new and I don't want to hear from them, or have them looking into my past or future projects.

Going "dark" as you say - that was my datacenter giving me a first HDD failure in January and then, the next time, installing an HDD that had already been used. Two major crashes and the database was gone. After that I had no time to work on this (it's a hobby, remember), so I abandoned it. Now it's back.

It started with the discovery of the word "django" - you can probably find out what it is on the Internet - a framework. As with many things, that was completely random. And I thought - what if I visit all the pages on the Internet, compile a huge database of keywords, and try to find upcoming trends in them? Maybe even write about them.

Well, that I did. The second parsing of main pages is not finished yet (it's going to be done in 2 days - walking around the whole Internet isn't that easy or fast). Meanwhile I thought it might be fun to give people a way to write something about a word. Like "django - a web framework". And that I did. :)

So right now people are having fun around the site - writing 100 letters about anything. Some are trying to spam, but most are writing pretty clever stuff. I write a lot of stuff there too and clean up what's stupid. There's also a lot of fighting with domain sponsors - those full-domain ads are giving me real pain.

I plan to reveal graphs in a few days, and an auto-categorizer site ("What 100 categories would your site fit into in DMOZ or Wikipedia?" - it's ready and, what's more interesting, works pretty darn good, just horribly slow, so the release date is unknown - right now it would take 1 server about 10 years to categorize all web sites on the Internet).

Using Amazon services is no "red flag" - it's just a way to get a lot of the computing power that is needed to crawl the internet - it's analysis of 100+ million domains, you know?

If you have any other questions that I haven't answered - ask away.

May 17, 2008 | Unregistered Commenter TheRarestWords

Specially for you - http://therarestwords.com/mxrk.net - parsed manually :) If you have time to laugh - take a look at Google's page. Too bad a lot of great crowd wisdom gets overwritten. But I'll do something about that.

May 17, 2008 | Unregistered Commenter TheRarestWords

Hey, thanks a lot, that's awesome. I think you've answered all my questions, and judging from the amount of keyword traffic I'm getting for RarestWords, you've probably answered a lot of other people's questions as well. Most people spot referrer traffic from you and wonder what Rarest Words is all about (that's why I started trying to figure it out).

The crowd wisdom component seems really interesting.

May 17, 2008 | Unregistered Commenter Mxrk

No problem. There are a lot of ideas about the project right now - but so little time and money to do that :) I'm going to open a blog and write about the project and its evolution soon (it's going to spread into the "mom-and-dad seo" area besides crowdsourcing and trendspotting). Hope to see your comments there. Just please no more "steal underpants" references :) I was drinking vodka (yes, I'm Russian) all day when I read that, trying to understand why I deserved that.

May 17, 2008 | Unregistered Commenter TheRarestWords

And the crowd wisdom part (which is an accidental development) really amuses me (there's no moderation on the site except for "zxcxczczxxz" stuff and blatant advertising):
Crowd wisdom: "finance": A male person who manges money. The female person is a finacee'.

May 17, 2008 | Unregistered Commenter TheRarestWords
