Monday, August 29, 2005

As I May Think...
An infrequently updated collection of comments on random subjects.

PubSub Beats Google...

By Bob Wyman on PubSub.com

I just stumbled across the wonderful GrabPERF service today. This service provides metrics on the response times of most of the more popular "search engines" (both retrospective and prospective). Just recently, they started tracking PubSub's response time, and I'm very proud to say that we're consistently rated as the fastest search engine in their rankings. For some fascinating data, I encourage you to check them out. A sample of their data, showing average response time in seconds and availability for the last 24 hours, appears below. There is much more data on their site. Every search engine operator should be studying these numbers daily!

Alias                                 Avg. Time (s)   Availability
PubSub - Search                       0.397           100.00%
Google - Search                       0.438            99.88%
MSN - Search                          0.735           100.00%
Yahoo - Search                        0.754           100.00%
Newsgator - Search                    0.829           100.00%
eBay - Search                         0.937           100.00%
Feedster - Search                     1.007           100.00%
BlogLines - Search                    1.249           100.00%
BestBuy.com - Search                  1.315           100.00%
Technorati - Search ("Karl Rove")     1.725            99.77%
Blogdigger - Search                   1.785            99.88%
Amazon - Search                       1.839            99.20%
Technorati - Search                   2.559            98.61%
IceRocket - Search                    3.976            98.72%
Blogpulse - Search                    4.993           100.00%

Of course, it should be recognized that the primary reason PubSub is consistently the fastest is that we implement a completely different technology from the others on the list. PubSub implements a "prospective," forward-looking search, while the others on the list are primarily providers of "retrospective," past-oriented search. PubSub stores user subscriptions (i.e., "queries") and then checks each new incoming document against every subscription the instant the document is discovered. The retrospective engines do the reverse: they store documents and then check queries against them as the queries arrive.
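
To make the distinction concrete, here is a minimal sketch in Python -- hypothetical code, not PubSub's actual implementation -- of the two models: a prospective engine that stores queries and matches each arriving document against them, and a retrospective engine that stores documents in an inverted index and matches each arriving query against it. Both treat a query as a set of terms that must all appear in a document.

    # Hypothetical sketch -- not PubSub's actual code -- contrasting the
    # two search models.

    class ProspectiveEngine:
        """Stores queries; matches each new document the instant it arrives."""

        def __init__(self):
            self.subscriptions = {}  # subscription id -> set of required terms

        def subscribe(self, sub_id, terms):
            self.subscriptions[sub_id] = {t.lower() for t in terms}

        def ingest(self, doc_id, text):
            # Check the incoming document against every stored subscription.
            words = set(text.lower().split())
            return [sub_id for sub_id, terms in self.subscriptions.items()
                    if terms <= words]

    class RetrospectiveEngine:
        """Stores documents; matches each query when it arrives."""

        def __init__(self):
            self.index = {}  # term -> set of doc ids (an inverted index)

        def ingest(self, doc_id, text):
            for word in set(text.lower().split()):
                self.index.setdefault(word, set()).add(doc_id)

        def search(self, terms):
            # Intersect the posting lists for all query terms.
            postings = [self.index.get(t.lower(), set()) for t in terms]
            return set.intersection(*postings) if postings else set()

The expensive inner loop runs once per document in the prospective engine, and once per query in the retrospective one -- which is why the two designs scale so differently as the query load grows.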

In the realm of blogs and syndication, the most common form of search is actually prospective search. Today, prospective search of blogs is usually implemented by loading a "query URL" into an aggregator and then having the aggregator repeatedly execute that query against some search system. Aggregators usually re-execute each query once an hour -- 24 times per day -- but some do it more or less frequently. This method is also used by a number of server-based systems that provide "Watch Lists" or "Alerts." It is often referred to as "repeated retrospective search" and is the simplest, yet most inefficient, way to provide prospective search. It works under light load but doesn't scale without tremendous hardware expense. While such brute-force methods can produce the effect of a prospective search if you have a massive hardware budget, things work much better when you use a true prospective engine -- like PubSub's.
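
For illustration, here is what "repeated retrospective search" amounts to in practice: a hypothetical poller (the URL and one-item-per-line response format are placeholders, not any real service's API) that re-executes the same query on a timer and reports only the items it hasn't seen before.

    import time
    import urllib.request

    def poll_query(url, interval_seconds=3600):
        """Repeated retrospective search: re-run the same query URL on a
        timer and report only previously unseen items. Assumes the server
        returns one result item per line; the URL is a placeholder."""
        seen = set()
        while True:
            with urllib.request.urlopen(url) as resp:
                items = resp.read().decode("utf-8", "replace").splitlines()
            for item in items:
                if item not in seen:
                    seen.add(item)
                    print("new item:", item)
            time.sleep(interval_seconds)  # the default gives 24 polls per day

    # poll_query("http://search.example.com/query?q=pubsub")  # placeholder

Note that the full query runs on every poll whether or not anything new exists -- multiply that by thousands of subscribers and the hardware cost becomes obvious.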

Because we use a prospective matching engine, we're essentially running every subscription query registered with us simultaneously and continuously throughout the day. In a prospective system, users' results are built up incrementally as documents arrive, rather than in response to ad hoc queries. (This is much like what happens in a desktop aggregator.) What this means is that the cost of "fulfilling" a subscription query is spread across the day rather than concentrated at the moment a query is received. Additionally, it means that we can totally decouple our "matching engine" and the process of new-document "ingestion" from the process of serving results. What you get is a system whose response time is largely insensitive to the volume of user requests, since request processing is trivially simple -- a single request isn't much more expensive than serving a static web page. Also, because of our modular, decoupled design, the speed with which we serve up results isn't affected by the amount of work that our ingestion and matching processes are doing. If matching load or ingestion load gets heavy, users never feel it. Similarly, if we have a burst of users fetching results, the matching and ingestion components don't feel it.
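
A minimal, self-contained sketch of that decoupling (again hypothetical, with an invented subscription): ingestion and matching run in their own thread and append matches to per-subscription result buckets, so serving a request is just reading a precomputed bucket.

    import time
    from collections import deque
    from queue import Queue
    from threading import Thread

    subscriptions = {"rss-news": {"rss"}}  # sub id -> required terms (invented)
    results = {}                           # sub id -> recent matched doc ids
    ingest_queue = Queue()

    def match_loop():
        # Ingestion + matching: runs independently of request serving, so
        # heavy feed traffic never slows down result delivery.
        while True:
            doc_id, text = ingest_queue.get()
            words = set(text.lower().split())
            for sub_id, terms in subscriptions.items():
                if terms <= words:
                    results.setdefault(sub_id, deque(maxlen=100)).append(doc_id)

    def serve(sub_id):
        # Request serving: trivially cheap -- no query runs at request
        # time, much like returning a static page.
        return list(results.get(sub_id, ()))

    Thread(target=match_loop, daemon=True).start()
    ingest_queue.put(("doc-1", "a new RSS feed was announced today"))
    time.sleep(0.1)              # give the matcher a moment to run
    print(serve("rss-news"))     # -> ['doc-1']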

Some have recently suggested that a "Grid Computer" might be necessary to handle high-speed, real-time matching of the almost 15 million feeds that we monitor and the average of almost 2 million blog entries that we process each day. But since PubSub still handles all of its matching on a single dual-processor Intel Pentium box, we're quite confident that going to the extreme of exotic and expensive hardware isn't what's needed. What we need is what we've got: a team of dedicated, experienced engineers and some really good algorithms.

Being the best at anything is hard and something to be proud of. I must say that I'm very proud to be working with the team we have at PubSub. It isn't easy to do the work it takes to say you've built the "fastest search engine" on the web -- but this team has done it!

Yes, I know I'm bragging! :-)

bob wyman

