25 May 2005

Does search need metadata (schemas)?

I recently read through a case study that James Robertson points to on metadata-based search and browsing functionality. I'm not going to talk about the browsing aspect of that study. I am, however, going to talk about the metadata-based search facility.

First up, let me state my opinion: I do not believe that metadata is a solution / aid / requirement for good search, either inside an enterprise or on the Web.

Why so? After all, governments around the world have recommended or mandated the use of metadata with the stated intention of improving discoverability.

A short digression about precision

When I talk about search, what I often really mean is precision - that is, finding relevant results within some number of resources that are retrieved.

In the information retrieval community (as exemplified by the people who organise and participate in TREC), precision is often measured with respect to a cutoff value such as 10 or 50. For example, precision @ 10 is a measure of how many results are relevant for your search in the first 10 results returned.

One of the reasons Google was so successful (there are several) was that they did, and still do, a great job of returning high precision @ 1. This was their "I'm feeling lucky" search: very often the first result was a relevant result. If you want to see how precision works in practice, try a search on your favourite search engine, and count how many of the top 10 results returned are relevant to you.

Average precision, as I use the term here, is a measure of how well a search engine performs over a number of queries at a particular cutoff. So if a search engine on average returns 4 relevant documents in its top 10, then its average precision @ 10 will be 0.4.
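To make the arithmetic concrete, here is a minimal Python sketch of precision at a cutoff and its average over several queries. The function names and the document ids are made up for illustration, and the relevance judgements are assumed to come from a human assessor; this is not how any particular search engine or TREC tool computes it.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant.

    If fewer than k results were returned, the missing slots count
    as non-relevant, so we still divide by k.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


def mean_precision_at_k(runs, k):
    """Average precision@k over (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        raise ValueError("need at least one query")
    return sum(precision_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)


# Hypothetical result list for one query: ten document ids in ranked order.
ranked = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
relevant = {"d2", "d5", "d7", "d9"}  # judged relevant by a human assessor

print(precision_at_k(ranked, relevant, 10))  # 0.4
print(precision_at_k(ranked, relevant, 1))   # 0.0, the top hit is not relevant
```

Note that precision @ 1 here is 0.0 even though precision @ 10 is 0.4, which is why the choice of cutoff matters so much for the "I'm feeling lucky" case.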

Why does all this stuff from information retrieval research matter?

Well, firstly, studies have shown that two humans agree about whether a document is relevant or not only 80% of the time. If human judges agree only 80% of the time (at best), then a search engine assessed against those judgements can only achieve the same (at best). Average precision is thus effectively limited to 0.8.

Secondly, in practice, for common kinds of general information queries, search engines typically have difficulty achieving much beyond 0.4 in their average precision, and for difficult queries, often as little as 0.2.

So what can be done to try and improve this?

Back to metadata

Several years ago, people decided that having metadata - information about a resource - would be really helpful for search engines trying to improve their ranking algorithms.

And it turns out that some kinds of metadata (in the loose sense of the term) are useful. Google for instance uses information about the links between documents to help score their relevance.

This kind of metadata however, is not the kind of metadata referred to in the case study. The latter kind is that captured in metadata schemas, such as AGLS, Dublin Core, or even just plain old Netscape metatags. It's usually embedded within a document (at least when that document is delivered through a web server). This is the type of metadata that is often recommended should be added to documents in an enterprise setting to help discoverability, and it's this kind of metadata that I'm going to refer to throughout the rest of this post.

Due to the presence of people who would like to influence their document rankings on search engines for particular queries, metadata embedded in a document is completely ignored by search engines on the Web.

Why? Because the metadata would contain all sorts of query terms that were just not relevant at all to the document itself. So for example, a porn site might include metadata subject terms about farming, to try and make their site appear serendipitously when people were searching for farming information.

But search engine designers got smarter and decided to just ignore all embedded textual metadata for the purpose of ranking. (A side note, Andrei Broder, once chief researcher for Alta Vista and now at IBM Almaden, described the tension between the interests of web site publishers and of search engine companies, as a constant war being waged against the spammers.)

And guess what, we've got fantastic search engines on the Web that can search 3 billion documents in a couple of seconds and give you a really good set of results. All without metadata at all.

Enterprise search and metadata

So how come a much simpler task, such as searching a few thousand documents, in an enterprise environment where there is no spamming going on, suddenly requires metadata to help out the search?

Answer: it doesn't.

What it does need is a good search engine (e.g. Panoptic) that takes advantage of all the other contextual information that exists in documents and web sites to help produce good results. There are a whole lot of very clever algorithms used in modern search engines to improve ranking of results. These algorithms use not just probabilistic scoring, but lots of other information about the hyperlinked documents from web sites.

In an enterprise setting, it may be that certain queries have known answers. For example, a search for leave form in an organisation with only one leave form, should return as its first result the link to the leave form. Enterprise search engines often provide a facility for mapping particular queries to known results to address this issue. Their ranking algorithms may do a good job, but identifying common queries which have known answers can help people searching for this information. However, this doesn't rely on metadata being present, other than as directed search engine mappings.
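As a rough illustration of such a query-to-known-result mapping, here is a small Python sketch. Everything in it is hypothetical: the `best_bets` table, the URLs, and the `ranked_search` callable standing in for whatever ranking the real engine does. It only shows the idea of pinning a known answer to the top without touching the documents themselves.

```python
def normalise(query):
    """Lower-case the query and collapse runs of whitespace."""
    return " ".join(query.lower().split())


def search_with_best_bets(query, best_bets, ranked_search):
    """Return the engine's ranked results, with any known answer pinned first.

    best_bets maps a normalised query string to a single known-good URL.
    ranked_search is any callable that takes a query and returns a ranked
    list of URLs (a stand-in for the real search engine).
    """
    results = ranked_search(query)
    pinned = best_bets.get(normalise(query))
    if pinned is None:
        return results
    # Put the known answer first and drop it from wherever it ranked.
    return [pinned] + [url for url in results if url != pinned]


# Hypothetical mapping maintained by whoever administers the search engine.
best_bets = {"leave form": "/hr/forms/leave.pdf"}


def fake_ranked_search(query):
    # Stand-in ranking: the leave form exists but is not ranked first.
    return ["/hr/leave-policy.html", "/hr/forms/leave.pdf", "/news/2005/05.html"]


print(search_with_best_bets("Leave  Form", best_bets, fake_ranked_search))
# ['/hr/forms/leave.pdf', '/hr/leave-policy.html', '/news/2005/05.html']
```

The mapping lives in the search engine's configuration, not in the documents, which is the point: nobody has to tag the leave form with anything.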

Even subject-specific thesauri and/or controlled vocabularies - common elements of metadata schemas - do not need to be applied to documents themselves. Again, enterprise search engines can use such thesauri at query time, either by expanding queries (e.g. you search for vitamin C and the search engine automatically puts ascorbic acid into the query for you), or by returning the results for vitamin C and asking if you would also like to search for ascorbic acid. (After all, you might have been looking for all documents that had vitamin C in them, so that you could change them to say ascorbic acid instead.)
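Query-time expansion of this kind can be sketched in a few lines of Python. The thesaurus below is a made-up, one-entry dictionary (using the vitamin C / ascorbic acid pairing), and `expand_query` is a hypothetical helper, not any product's API; a real engine would weight the expanded terms rather than simply issuing extra queries.

```python
import re


def expand_query(query, thesaurus):
    """Return the normalised query plus one variant per matching synonym.

    thesaurus maps a lower-case term or phrase to a list of equivalent
    phrases. Matching is on whole words, so "vitamin c" does not fire
    inside "vitamin capsules".
    """
    normalised = " ".join(query.lower().split())
    variants = [normalised]
    for term, synonyms in thesaurus.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, normalised):
            for synonym in synonyms:
                # A function replacement avoids backslash surprises in synonyms.
                variants.append(re.sub(pattern, lambda _m, s=synonym: s, normalised))
    return variants


# Hypothetical thesaurus held by the search engine, not by the documents.
thesaurus = {"vitamin c": ["ascorbic acid"]}

print(expand_query("Vitamin C dosage", thesaurus))
# ['vitamin c dosage', 'ascorbic acid dosage']
```

The engine can either run all the variants silently or run only the first and offer the rest as "did you also want" suggestions, which keeps the person who really did mean the literal phrase in control.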

Where now, metadata?

Don't get me wrong, I do think there is a role for metadata. It's great for record-keeping purposes. Say I want to find all articles authored by a certain person, or created in a particular year. In these circumstances, accurate metadata is essential.

Such search activities are measured in the information retrieval community by the recall metric (which counts how many relevant documents - of the entire set of relevant documents - have been retrieved when some number of documents overall have been retrieved).
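For comparison with precision, here is a minimal Python sketch of recall at a cutoff. The function name and document ids are illustrative only, and it assumes you already know the complete set of relevant documents, which in practice is the hard part.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("recall is undefined when nothing is relevant")
    retrieved = set(ranked_ids[:k])
    return len(retrieved & relevant) / len(relevant)


# Hypothetical query: four documents are relevant, two of them were retrieved.
ranked = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
relevant = {"d2", "d5", "d20", "d30"}

print(recall_at_k(ranked, relevant, 10))  # 0.5
```

For a record-keeping query such as "everything by this author", anything short of a recall of 1.0 means a missed record, which is why accuracy of the metadata matters so much more here than for ordinary ranked search.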

In a modern context however, assuming the existence of a content management system such as Sytadel, this activity is much better left to the CMS: firstly in assigning the metadata accurately and secondly in carrying out the retrieval activity (which is typically just a straightforward database query).
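To show what "a straightforward database query" might look like, here is a small Python sketch using the standard library's sqlite3 module. The `documents` table, its columns, and the sample rows are all invented for illustration; they are not the schema of Sytadel or any other CMS.

```python
import sqlite3


def find_documents(conn, author=None, year=None):
    """Return titles matching the given metadata fields exactly, sorted by title."""
    clauses, params = [], []
    if author is not None:
        clauses.append("author = ?")
        params.append(author)
    if year is not None:
        clauses.append("year = ?")
        params.append(year)
    where = (" WHERE " + " AND ".join(clauses)) if clauses else ""
    sql = "SELECT title FROM documents" + where + " ORDER BY title"
    return [row[0] for row in conn.execute(sql, params)]


# Hypothetical CMS store: metadata is recorded by the system when content is created.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (title TEXT, author TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?)",
    [
        ("Annual report", "A. Author", 2004),
        ("Leave policy", "B. Writer", 2005),
        ("Travel guidelines", "A. Author", 2005),
    ],
)

print(find_documents(conn, author="A. Author"))
# ['Annual report', 'Travel guidelines']
print(find_documents(conn, author="A. Author", year=2005))
# ['Travel guidelines']
```

Because the match is exact, every matching record comes back and nothing else does; there is no ranking involved at all.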

But in terms of improving your users' ability to search for documents in the way people expect to search these days (that is, by issuing two or three query terms to the search engine and getting back a relevant set of results in a couple of seconds), metadata is completely irrelevant (excuse the pun).


For more reading on this issue, read Cory Doctorow's delightful article Metacrap - Putting the torch to seven straw men of the meta-utopia.

For more reading on the fundamentals of search, see Tim Bray's series On Search. (Tim takes the broad view of metadata, not the narrow schema-based view I discuss here. I also think he ascribes too much weight to Google's PageRank value as a significant component of Google's result ranking algorithm, but that's another story.)

I'm indebted to David Hawking for discussions over several years on the subject of metadata and search. I'm also looking forward to an upcoming study from him and Justin Zobel that he mentioned to me yesterday, which sets about objectively measuring the effectiveness of metadata-based search versus non-metadata based search in an enterprise setting with extensive metadata.
