[ZOOM] Re: Scope of ZOOM (was Hello etc)
quinn at indexdata.dk
Wed Nov 7 22:06:02 CET 2001
>OK, I appreciate that this was written in jest, but it does make me
>think. We've come far enough with ZOOM that it's clearly a project of
>substance rather than a mere whimsical imagining. So given the
>current level of maturity, _does_ anyone have any idea whether it
>might be possible to get funding from _somewhere_ to continue this
>effort? It would be very nice ...
4-6 years ago, I'd say we might have a chance of putting together a
consortium and getting funding... today, I would really doubt it. At least,
we would need to bring in support for more sexy IR protocols to do it, if
we can *find* any more sexy IR protocol.
> > [ZOOM] shouldn't try to seamlessly hide multi-target searching,
> > because that is a job for smarter folks -- specifically folks closer
> > to the application and the data being exchanged.
>Interesting. If we end up agreeing with you here (which we might)
>then that militates against trying for something like Ian's
>BadgeringGreatAggregationOfLotsOfDifferentConnection type. Not sure
>yet how I feel about that.
I have been happy with the notion of your "manager" in the Perl layer
(except I prefer for it to be implicit and invisible and unmentioned to the
greatest degree possible). As much as I think a great big transparent
merging thing would be cool, I also shudder at the complexity... but hey,
maybe there's a way to cut down on that.
Ian, how much does your API layer do in terms of merging results from
different sources behind the scenes?
Here are a few of the problems associated with result set merging.
First of all, the result of a search is a result set handle, or X handles,
in the case of a multi-target search. You don't actually have any records
in hand until you start retrieving them. What's your result set count? If
your databases hold totally distinct sets of records, then the merging
process need not worry about duplicates and you can simply add the result
set counts from the different servers. That's not the general case,
however... a decent merging of result sets should look to remove duplicates
-- this absolutely holds in bibliographic types of systems, where you'll
frequently find entries for the same book/journal/whatever in different
places. In some applications, you don't care about the result set count
anyway, or you're happy updating it progressively while you're getting more
records.
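
To make the counting problem concrete, here is a rough Python sketch. Nothing in it is ZOOM API: the function and class names are made up for illustration, and the lower bound assumes that no single server returns internal duplicates. Before you have retrieved any records, all you can honestly report is a range; the exact merged count only firms up as records arrive and are keyed for duplicate detection.

```python
def merged_count_bounds(counts):
    """Return (lower, upper) bounds on the merged hit count.

    With unknown overlap between servers, the merged set is at least as
    large as the biggest single result set (assuming no server returns
    internal duplicates) and at most the sum of all of them.
    """
    if not counts:
        return (0, 0)
    return (max(counts), sum(counts))


class ProgressiveCount:
    """Track distinct records seen so far, using a caller-supplied key."""

    def __init__(self, key):
        self._key = key      # function mapping a record to its match key
        self._seen = set()

    def add(self, record):
        """Register one retrieved record; return the running distinct count."""
        self._seen.add(self._key(record))
        return len(self._seen)
```

If the databases really are disjoint, the upper bound is the answer and no retrieval is needed.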
Once you start getting records down from the server, you can start to merge
them. If they arrive nicely sorted by whatever field you use to merge them
by, then this is a nice, smooth process. If this sort field happens to be
the one your user wants results sorted by, then that's peachy, you'll have
a screenful of beautifully sorted and merged records ready in no time, right?
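
For what it's worth, that happy case really is only a few lines. The sketch below (Python, with a hypothetical `merge_sorted_streams` helper) assumes every server hands back records already sorted by the same key, and that the same key also identifies duplicates. Both are assumptions that rarely hold in practice.

```python
import heapq
from itertools import groupby


def merge_sorted_streams(streams, key):
    """Lazily merge already-sorted record streams, dropping duplicate keys.

    Assumes each stream is sorted by `key` and that equal keys mean
    "same record"; neither is guaranteed by real servers.
    """
    merged = heapq.merge(*streams, key=key)
    for _, group in groupby(merged, key=key):
        yield next(group)  # keep the first record seen for each key
```

Because both `heapq.merge` and `groupby` are lazy, the first screenful can be shown without fetching everything.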
But this is not the general case, or even a common case, or even one that I
have ever encountered. The normal case in a mixed group of servers is that
the default sort order is completely random from your point of view. Less
than half of the servers support the SORT facility, and those that do
probably don't support the same parameters, because the use of sorting is
completely unprofiled. Even if it were widely supported, you would have
issues like whether you exclude leading articles ("the", "a", etc.) from,
say, title fields, whether you put surnames before or after first names in
author fields, etc. Except in very controlled environments (basically those
where you have written all of the clients and servers yourself), it's very
hard to get everybody to sort the same way. Add to this that in libraries,
you're typically merging by a host of different fields, and not necessarily
those you want to sort by anyway. If the data model is complex and you have
a lot of fallible humans doing the data entry, your merging routine needs
heuristics to deal with variations in people's use of the data entry
guidelines. Consider again bibliographic systems where the same title may
exist in hardcover, paperback, as an audio tape, or a DVD of the Hollywood
production. Add to that different editions, revisions, etc., and it becomes
clear that more than a call to strcmp() is involved. Bibliographic merging
at its best (or worst) is a Zen art.
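
As a small illustration of why strcmp() is not enough, here is a Python sketch of the kind of key normalisation a merger ends up doing. The article list is English-only, the record fields (`title`, `author`, `format`) are invented for the example, and real bibliographic matching needs far more than this (editions, revisions, inverted titles and so on).

```python
import re

LEADING_ARTICLES = ("the", "a", "an")  # English only; an assumption


def title_sort_key(title):
    """Lower-case, strip punctuation, and drop one leading article."""
    words = re.sub(r"[^\w\s]", "", title.lower()).split()
    if len(words) > 1 and words[0] in LEADING_ARTICLES:
        words = words[1:]
    return " ".join(words)


def author_sort_key(name):
    """Normalise 'First Last' and 'Last, First' to 'last first'."""
    name = name.strip().lower()
    if "," in name:
        last, _, first = name.partition(",")
    else:
        first, _, last = name.rpartition(" ")
    return " ".join(part for part in (last.strip(), first.strip()) if part)


def match_key(record):
    """Key for duplicate detection; format keeps book and audio tape apart."""
    return (
        title_sort_key(record["title"]),
        author_sort_key(record["author"]),
        record.get("format", "").lower(),
    )
```

Whether the hardcover and the paperback should collapse into one entry is exactly the sort of policy decision that belongs to the application.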
So... assuming someone wants to build a user interface that shows a brief
list of records on the screen, it's not going to be of much use to retrieve
just half a dozen or so from each server. You can trick people with this
for a demonstration, but it's not going to work for something people will
use.
In principle, you're going to need to get *all* of the records that match
your search from each server, sort them, deduplicate them, count 'em, and
show them to the user. This works if you get a few tens of hits per server,
but otherwise, it's a job that never finishes. BookWhere, which is probably
the only commercial stand-alone Windows client that has hit it off,
essentially doesn't stop fetching records. It retrieves and merges,
retrieves and merges, all the time updating the nicely sorted display of
records on the screen. But this has drawbacks too. Performance tests I have
done have suggested that for many servers, retrieving records is at least
as expensive as executing a search, if not more so. Database servers that
construct retrieval records by pulling together data from an RDBMS have to
do a lot of legwork to retrieve a record. Also, a continuously updating
display can be done relatively prettily in a Windows interface, but it
requires trickery to get it to work well in a web application.
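
The retrieve-and-merge loop described above might look roughly like this in Python. This is only my guess at the general shape, not how BookWhere actually does it; `progressive_merge` and its parameters are hypothetical, and the servers are simulated as plain iterables of records.

```python
from bisect import insort
from itertools import islice


def progressive_merge(sources, key, batch_size=10):
    """Yield a growing sorted, deduplicated snapshot after each batch.

    `sources` are iterables of records standing in for per-server
    result sets; each pass takes one batch from every live source.
    """
    iterators = [iter(s) for s in sources]
    seen = set()
    display = []  # (key, record) pairs; keys are unique, so the tuple
                  # comparison never has to compare the records themselves
    while iterators:
        for it in list(iterators):
            batch = list(islice(it, batch_size))
            if not batch:
                iterators.remove(it)  # this source is exhausted
                continue
            for record in batch:
                k = key(record)
                if k in seen:
                    continue  # duplicate of something already shown
                seen.add(k)
                insort(display, (k, record))
            yield [rec for _, rec in display]
```

Each yielded snapshot is what the user interface would redraw, which is why this is cheap to show in a desktop client and awkward in a request/response web page.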
*Maybe* we can get around some of these issues by adding in hooks where you
can inherit/override classes to put in your own heuristics, but it's more
than just sorting and matching -- the whole strategy for how you do the
retrieval efficiently should be adapted to the kind of merging/matching you
need to do.
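
Just to show what I mean by hooks, a Python sketch follows. The class and method names are entirely hypothetical and not part of any ZOOM binding; the point is only that matching, sorting, tie-breaking and even the fetch batch size would all have to be overridable together.

```python
class MergePolicy:
    """Hypothetical hook points an application could override."""

    batch_size = 10  # how many records to fetch per server per round

    def match_key(self, record):
        """Records with equal match keys are treated as duplicates."""
        return record["id"]

    def sort_key(self, record):
        """Display order; defaults to the match key."""
        return self.match_key(record)

    def prefer(self, old, new):
        """Pick which of two duplicate records to keep."""
        return old


class TitlePolicy(MergePolicy):
    """Example override: deduplicate on a normalised title."""

    batch_size = 50

    def match_key(self, record):
        return record["title"].strip().lower()


def merge(records, policy):
    """Deduplicate and sort records according to the given policy."""
    chosen = {}
    for record in records:
        k = policy.match_key(record)
        chosen[k] = policy.prefer(chosen[k], record) if k in chosen else record
    return sorted(chosen.values(), key=policy.sort_key)
```

Even this toy version shows the coupling: change the match key and the sensible batch size and retrieval order change with it.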
In Z39.50 implementor's terms, I'm middle-aged if not quite a senior
citizen, and the last thing I want to do is discourage innovation. I just
want to get across that it is more intricate than it appears the first time
you consider it. In our projects, we mostly just ignore it and display
results separately (although we obviously do the Z39.50 network stuff in
parallel), but we're beginning to have a hard time convincing customers
that this is a good thing. Many client systems cheat and just display
results at random or round-robin from different servers... but I would
claim that those systems won't be popular with users in the long run.
I really think we're biting off more than we can chew if we try to put
result set merging into the guts of ZOOM -- it'll bring more confusion than
help. You can have a HeapingGreatSackofHeterogeneousConnectionsAndStuff,
but it should provide access to individual result sets.
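
In code, the shape I have in mind is something like the following Python sketch. `ConnectionGroup` and `FakeConnection` are invented names for illustration, not proposed ZOOM classes, and the sequential loop stands in for what would really be parallel network I/O.

```python
class ConnectionGroup:
    """Hypothetical aggregate: fans a query out, keeps results separate."""

    def __init__(self, connections):
        # Mapping of target name -> any object with a .search(query) method.
        self._connections = dict(connections)

    def search(self, query):
        """Return one result set per target, keyed by target name.

        No merging, sorting, or deduplication happens here; that is
        left to the application.
        """
        return {
            name: conn.search(query)
            for name, conn in self._connections.items()
        }


class FakeConnection:
    """Stand-in for a real connection, for demonstration only."""

    def __init__(self, records):
        self._records = records

    def search(self, query):
        return [r for r in self._records if query.lower() in r.lower()]
```

The application gets every individual result set back and is free to merge them, or not, with whatever heuristics suit its data.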