25 May 2005

Why distribution requires standardisation

I’ve recently been reflecting on distributed computing again. Or more precisely, what it is that should be distributed, and how we can build effective services with it.

Years ago, I worked as part of a team building a distributed Smalltalk implementation for OTI (part of IBM). The lesson from this work was that ad hoc distributed computation is hard … very hard. The current GRID computing initiative is addressing this difficult problem in a generic fashion. I wish them luck. The difficulty lies, I believe, in the ad hoc nature of the systems which must be connected when carrying out generic computation.

Ad hoc distributed data is much easier to handle, so long as it’s text (often in some commonly readable format such as HTML). Google and other general purpose search engines are a great example of a service over distributed data. Perhaps the single biggest factor which makes it possible for them to work so comprehensively is that there is no additional complexity required from the individual computer systems which manage the data – as they already have web server software serving up the pages.

As soon as we wish to build more complex services over distributed data, or the data itself gets more complex to generate, we need additional computing resources involved. Instant messaging is a great example – the distributed data is people who wish to have ad hoc textual conversations in real time. To this end, people who wish to participate must install and operate a new bit of software – an instant messaging client such as Trillian or Yahoo Messenger.

Another example I was told about today is the Open Archives Initiative, supported by software from OCLC such as OAICat. Finding ways to efficiently harvest publication information from a wide variety of collection sources (especially large national libraries) means that you can’t just take the obvious approach of sucking down all the catalogue metadata, or you’ll be there for several days whenever three new publications are added. The OAICat software provides information about the catalogue on an incremental basis, answering requests such as “tell me about all the new publications in the last week”.
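To make the incremental idea concrete, here is a minimal sketch of how a harvester might phrase the "new publications in the last week" question as an OAI-PMH request. The `ListRecords` verb and the `from` and `metadataPrefix` arguments come from the OAI-PMH protocol; the endpoint URL and the `build_harvest_url` helper are my own assumptions for illustration, not part of OAICat.

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def build_harvest_url(base_url: str, since: date, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords request for records changed on or after `since`."""
    params = {
        "verb": "ListRecords",              # ask for full records
        "metadataPrefix": metadata_prefix,  # Dublin Core by default
        "from": since.isoformat(),          # only records changed since this date
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository endpoint, used only for illustration.
BASE_URL = "https://repository.example.org/oai"

one_week_ago = date.today() - timedelta(days=7)
print(build_harvest_url(BASE_URL, one_week_ago))
```

The harvester only has to remember when it last asked, so a week with three new publications costs one small request rather than a full download of the catalogue.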

In Sytadel, we call this filtering, and it is used everywhere when deploying web sites, as it allows us to provide links to restricted sets of information. For example, creating a link to all the news releases published in a particular year.
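As a rough illustration of the kind of filter I mean, the sketch below picks out the news releases from a single year. This is not Sytadel's actual code or API; the `NewsRelease` type, the `releases_in_year` function and the sample data are all made up for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class NewsRelease:
    title: str
    published: date


def releases_in_year(releases, year):
    """Return only the releases published in the given year, newest first."""
    matching = [r for r in releases if r.published.year == year]
    return sorted(matching, key=lambda r: r.published, reverse=True)


# Made-up sample data for illustration.
RELEASES = [
    NewsRelease("Annual report published", date(2004, 3, 1)),
    NewsRelease("New office opens", date(2005, 2, 14)),
    NewsRelease("Partnership announced", date(2005, 5, 20)),
]

for release in releases_in_year(RELEASES, 2005):
    print(release.published, release.title)
```

A link on a web site then only needs to carry the filter parameters (here, the year), and the system does the narrowing on the server.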

Both OAICat and Sytadel are examples of the more complex computing resources needed to make better use of rich data. You definitely don’t wish to trawl through all the news releases ever published by an organisation. This approach is all about providing custom data query interfaces.

Building a central service which draws together distributed collections of rich data relies on knowing what the computational systems do and how to interact with their data query interfaces. Thus it will rely on standardised software running on each of the distributed nodes, or on the distributed nodes responding to standardised query interfaces. Even in the case of general purpose text search engines, the distributed nodes (web sites) all respond to a standard query interface (HTTP). And without standardisation, you can't build interesting services over the top.
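The argument can be sketched in a few lines of code. Assume, purely for illustration, that every node agrees to answer one standard question, `changed_since`. The central service can then be written once and work across quite different systems. All the class and function names here are hypothetical.

```python
from datetime import date
from typing import Iterable, Protocol


class QueryableNode(Protocol):
    """The standard query interface every node is assumed to implement."""

    def changed_since(self, since: date) -> Iterable[str]: ...


class LibraryCatalogue:
    def __init__(self, records):
        self._records = records  # mapping of title -> date added

    def changed_since(self, since):
        return [title for title, added in self._records.items() if added >= since]


class NewsSite:
    def __init__(self, releases):
        self._releases = releases  # list of (date, headline) tuples

    def changed_since(self, since):
        return [headline for published, headline in self._releases if published >= since]


def harvest(nodes: Iterable[QueryableNode], since: date) -> list[str]:
    """Central service: ask every node the same question and merge the answers."""
    results = []
    for node in nodes:
        results.extend(node.changed_since(since))
    return results


# Two unrelated systems, made up for the example.
NODES = [
    LibraryCatalogue({"A new monograph": date(2005, 5, 20), "An old atlas": date(2004, 1, 1)}),
    NewsSite([(date(2005, 5, 22), "Partnership announced")]),
]

print(harvest(NODES, date(2005, 5, 18)))
```

The two node types store their data quite differently, but `harvest` never needs to know that. Take away the shared `changed_since` method and the central service would need custom code for every node it talks to.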
