[Ex-plain] forwarded message from Mike Taylor

Mike Taylor mike at seatbooker.net
Thu Oct 24 12:19:57 CEST 2002


> Date: Wed, 23 Oct 2002 18:54:43 +0100 (BST)
> From: Robert Sanderson <azaroth at liverpool.ac.uk>
> 
> Apart from that, I don't really understand what the rest of the
> document is saying apart from that you can generate trivial ZeeRex
> documents if given a Z URL.

Well, that's pretty much it.

> The real problem is getting the list of Z URLs :) From there we can
> spider them with IndexData's ZeeRex constructor, or my own.

This reminds me that about six months ago, Alan more or less promised
to contribute a document for the ZeeRex website in which he would
describe the harvesting algorithm.  (Remember that, Alan?  :-) 
Any progress?  Is it OK if I start to gently prod you?

I don't have (to hand) a copy of Alan's message offering to do it, and
it's possible that it's a figment of my imagination.  But I do have a
copy of _my_ mail _to_ Alan, which I'm including below in the hope
that it's some use.

> Latest versions of Cheshire now ship with automatic ZeeRex
> construction from the configuration files for the database. ZeeRex
> will be the service description record for the DNER in the UK,
> meaning that there'll be LOTS of records available.

MOST EXCELLENT news!

 _/|_	 _______________________________________________________________
/o ) \/  Mike Taylor   <mike at miketaylor.org.uk>   www.miketaylor.org.uk
)_v__/\  "In theory, there is no difference between theory and
	 practice.  But in practice there is" -- Jan L. A. van de
	 Snepscheut.


----------------------------- old message ------------------------------
>From mike Thu Apr 25 12:17:31 +0100 2002
From: Mike Taylor <mike at tecc.co.uk>
To: ajk at mds.rmit.edu.au
CC: ex-plain at indexdata.dk
In-reply-to: <20020425203326.E16642 at io.mds.rmit.edu.au> (message from Alan
	Kent on Thu, 25 Apr 2002 20:33:26 +1000)
Subject: Re: [Ex-plain] Harvesting algorithm
References: <20020425142918.A15121 at io.mds.rmit.edu.au> <200204250952.KAA12605 at -f> <20020425203326.E16642 at io.mds.rmit.edu.au>

> Date: Thu, 25 Apr 2002 20:33:26 +1000
> From: Alan Kent <ajk at mds.rmit.edu.au>
>
> So, a harvester should:
> * Collect ZeeRex records from other databases
> * If the authoritative flag is set,
> 	Set the flag to false
> 	Add the z39.50r: URL into the aggregatedFrom element
> * I should always set/update the dateAggregated element(?)
> * Change the 'id' attribute to something locally unique(?)
> * Save new record in my local database
> * (If record is for IR-ZeeRex-1/IR-Explain---1 then add to list of
>   databases to harvest from)

Sounds perfect so far.
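
The per-record steps in the list above can be sketched roughly as
follows.  This is only an illustration, not an agreed implementation:
the element names (authoritative, aggregatedFrom, dateAggregated, id)
come from the list, but the dict layout, the "database" and "url" keys,
the "local-N" ID scheme and the function name are all assumptions made
for the example.

```python
import itertools
from datetime import date

# Database names that mark a record as describing another ZeeRex server.
EXPLAIN_DB_NAMES = {"IR-ZeeRex-1", "IR-Explain---1"}

# Hypothetical source of locally unique IDs; a real harvester would
# presumably use whatever its own database provides.
_local_ids = itertools.count(1)


def absorb_record(record, source_url, local_db, harvest_queue, today=None):
    """Apply the harvesting steps to one record fetched from source_url."""
    rec = dict(record)  # work on a copy, leave the caller's record alone
    if rec.get("authoritative"):
        # Our copy is not the authority: clear the flag, note the origin.
        rec["authoritative"] = False
        rec["aggregatedFrom"] = source_url
    rec["dateAggregated"] = (today or date.today()).isoformat()
    rec["id"] = f"local-{next(_local_ids)}"
    local_db[rec["id"]] = rec
    # Records describing an Explain database point at more places to crawl.
    if rec.get("database") in EXPLAIN_DB_NAMES and rec.get("url"):
        harvest_queue.append(rec["url"])
    return rec
```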
> 
> * For all records in my local database I also need to
>    - Query original source of record to see if still present
>        If not present, delete from local database

No, you only need to do that with records that you haven't already
refreshed from one of the servers that you've just crawled.  So before
you start crawling, you mark each record in your database as "not yet
refreshed", then each time you refresh one you clear that bit, and in
your post-crawl phase you only need to go back to source for those
records that still have the stale bit set.
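
A minimal sketch of that stale-bit bookkeeping, assuming the local
database is a dict of record dicts.  The function name, the "stale"
key and the two callables (crawl, which yields every record seen
during the crawl, and still_at_source, which asks a record's original
source whether it still exists) are hypothetical names invented for
the example.

```python
def refresh_with_stale_bits(local_db, crawl, still_at_source):
    """Refresh local_db from a crawl; return the keys that were deleted."""
    # Before crawling: mark every record as not yet refreshed.
    for rec in local_db.values():
        rec["stale"] = True

    # During the crawl: each record we see again (or for the first
    # time) is stored with the bit cleared.
    for key, fresh in crawl():
        local_db[key] = {**fresh, "stale": False}

    # Post-crawl: go back to source only for records still marked stale.
    removed = []
    for key in [k for k, r in local_db.items() if r["stale"]]:
        if still_at_source(local_db[key]):
            local_db[key]["stale"] = False
        else:
            del local_db[key]
            removed.append(key)
    return removed
```

The point of the design is that still_at_source is called once per
record the crawl missed, not once per record in the database.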

> I guess really, since I need to check the authoritative source for
> deletions, maybe I always try to hunt down the authoritative source
> for all records (this includes the records describing
> IR-ZeeRex-1 databases).

I believe that approach would scale badly if we eventually get to the
kind of many-server network that we're all hoping for.

> [...] try to work out if two records I get back are for the same
> database (there is no permanent unique key, so this actually seems
> hard - I have to rely on host names not being changed to IP
> addresses etc).

Indeed.  Hadn't spotted this, and of course the same problem applies
for the more usual case of updating an existing record.  Suppose your
DB knows about z39.50s://explain.z3950.org:210/foo.  Then you find a
new record for it.  How do you know to replace the old one?

I _think_ it's legit to assume that the server-name/port/db
combination makes a unique key.  All that goes wrong on this basis is
that if a database moves (to a different host, port or DB name), the
crawler stats will list that as one old DB gone away and one new one
arrived, rather than recognising that old and new are the same; but
you'll end up with the same record in your IR-Explain---1 database, so
no real harm is done.
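
As a rough illustration of using server-name/port/db as the key, here
is a sketch that derives it from a Z39.50 URL with Python's standard
urllib.parse.  The function name db_key is hypothetical.  Lower-casing
the host and falling back to port 210 when the URL gives none are
assumptions of this example; the database name is left exactly as
written.

```python
from urllib.parse import urlsplit

DEFAULT_PORT = 210  # assumed when the URL does not name a port


def db_key(z_url):
    """Return a (host, port, database) tuple identifying a database."""
    parts = urlsplit(z_url)
    host = (parts.hostname or "").lower()
    port = parts.port or DEFAULT_PORT
    database = parts.path.lstrip("/")
    return (host, port, database)


# Replacing an old record then becomes a plain keyed update:
index = {}
index[db_key("z39.50s://explain.z3950.org:210/foo")] = "old record"
index[db_key("z39.50s://explain.z3950.org/foo")] = "new record"
```

Both URLs map to the same key, so the index ends up holding one entry.
A host name replaced by its IP address would still produce a different
key, which is the "one gone, one arrived" case described above.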

I think this is preferable to using the originalSource address (which
includes the record's unique ID in that original source DB, of course)
as a unique key: that approach would mean that if I stop having Index
Data host my application-DB's ZeeRex record, and get Rob to host it
instead, it appears to have changed.  That's wrong.

> Or, don't tell me, there is a complete algorithm already hosted
> on the web site that I did not know of...

No, but if you want to write yours up in bare-bones HTML, I'll gladly
put it up there, having glued on the appropriate headers, footers,
etc.

BTW, don't forget the site's search facility at
	http://explain.z3950.org/search.html

 _/|_	 _______________________________________________________________
/o ) \/  Mike Taylor   <mike at miketaylor.org.uk>   www.miketaylor.org.uk
)_v__/\  "Wait here Audrey: this is between me and the vegetable" --
	 Rick Moranis, "Little Shop of Horrors"





