[Zebralist] indexing of full text

Eric Lease Morgan emorgan at nd.edu
Fri Jan 19 03:30:02 CET 2007

Readers of this list may be interested in a technique I have started
using to index full text.

   1. I first collected sets of plain text
      documents -- mostly books.

   2. I then created really simple HTML
      documents by wrapping the plain text
      in pre elements and supplementing
      the whole thing with rudimentary
      metadata [1]

   3. Next, I surrounded the same plain
      text with an RDF wrapper complete
      with the same rudimentary metadata [2]

   4. I then used an Alvis filter to
      index my RDF, extracting things like
      title, author, and tags, as well as
      all the full text.

   5. I then configured zebrasrv to be an SRU
      server.

   6. I wrote a really simple SRU client, and
      searches from the client only return
      author/title combinations hyperlinked
      to the .htm file(s).
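
Steps 2 and 3 above might look something like the following Python sketch. It is only an illustration, not the scripts actually used: the function names, the choice of Dublin Core elements, the use of dc:description to carry the full text, and the sample URL are all assumptions made for the example.

```python
from html import escape
import xml.etree.ElementTree as ET

# Standard RDF and Dublin Core namespaces; choosing Dublin Core for the
# "rudimentary metadata" is an assumption, not taken from the original files.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("rdf", RDF)
ET.register_namespace("dc", DC)


def wrap_html(title, author, tags, text):
    """Step 2: wrap plain text in a pre element plus simple metadata."""
    return (
        "<html>\n<head>\n"
        f"<title>{escape(title)}</title>\n"
        f'<meta name="author" content="{escape(author)}">\n'
        f'<meta name="keywords" content="{escape(", ".join(tags))}">\n'
        "</head>\n<body>\n"
        f"<pre>{escape(text)}</pre>\n"
        "</body>\n</html>\n"
    )


def wrap_rdf(title, author, tags, text, url):
    """Step 3: wrap the same text and metadata in an RDF description."""
    root = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": url})
    ET.SubElement(desc, f"{{{DC}}}title").text = title
    ET.SubElement(desc, f"{{{DC}}}creator").text = author
    for tag in tags:
        ET.SubElement(desc, f"{{{DC}}}subject").text = tag
    # Hypothetical choice: carry the full text in dc:description.
    ET.SubElement(desc, f"{{{DC}}}description").text = text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    body = "All men by nature desire to know."
    # The URL below is a placeholder, not a real document location.
    print(wrap_html("Metaphysics", "Aristotle", ["philosophy"], body))
    print(wrap_rdf("Metaphysics", "Aristotle", ["philosophy"], body,
                   "http://example.org/metaphysics.htm"))
```

Both functions take the same metadata, so the .htm file and the RDF record describe the same document, and the rdf:about value can point at the .htm file that the search results link to.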

Using this technique I have full text as well as fielded indexing and  
searching via SRU. Returning only a pointer to the .htm document  
allows readers to get the full text without "thinking" about whether  
or not the document in question is relevant. (All the metadata is
displayed in the .htm file.) See:


(Fielded and explicitly articulated phrase searches return results  
much faster. Examples include title=flowers, author=plato, "Benjamin  
Jowett", or "metaphysical declarations".)
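
For anybody who wants to try queries like these against their own zebrasrv, here is a rough Python sketch of how a client could turn them into SRU searchRetrieve requests. The base URL, the SRU version, and the function name are assumptions for illustration; the index names (title, author) are simply the ones used in the examples above and have to match whatever the server's CQL configuration defines.

```python
from urllib.parse import urlencode

# Hypothetical address of a zebrasrv instance acting as an SRU server.
BASE_URL = "http://localhost:9999/etexts"


def sru_url(cql, maximum_records=10, base_url=BASE_URL):
    """Build an SRU searchRetrieve URL for a CQL query string."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",  # assumed; use whatever version the server speaks
        "query": cql,
        "maximumRecords": maximum_records,
    }
    # urlencode percent-encodes the query, including quotes and "=".
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    for cql in ["title=flowers", "author=plato", '"Benjamin Jowett"']:
        print(sru_url(cql))
```

The client then only has to fetch the URL, pull the author and title out of each returned record, and print them as links to the corresponding .htm files.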

So far I have only indexed ~600 files. After I play with selected  
texts from Project Gutenberg I will be using the same technique  
against ~14,000 files. When I get that far I will start thinking  
about displaying snippets of the texts in the search results, as well  
as other enhancements such as Did You Mean, sorting, etc.


[1] http://infomotions.com/etexts/philosophy/400BC-301BC/aristotle- 
[2] http://infomotions.com/etexts/philosophy/400BC-301BC/aristotle- 

Eric Lease Morgan
University Libraries of Notre Dame

(574) 631-8604
