1 <chapter id="architecture">
2 <!-- $Id: architecture.xml,v 1.16 2006-11-30 10:29:23 adam Exp $ -->
3 <title>Overview of Zebra Architecture</title>
5 <section id="architecture-representation">
6 <title>Local Representation</title>
9 As mentioned earlier, Zebra places few restrictions on the type of
10 data that you can index and manage. Generally, whatever the form of
11 the data, it is parsed by an input filter specific to that format, and
12 turned into an internal structure that Zebra knows how to handle. This
13 process takes place whenever the record is accessed - for indexing and
18 The RecordType parameter in the <literal>zebra.cfg</literal> file, or
19 the <literal>-t</literal> option to the indexer, tells Zebra how to
20 process input records.
21 Two basic types of processing are available - raw text and structured
22 data. Raw text is just that, and it is selected by providing the
23 argument <emphasis>text</emphasis> to Zebra. Structured records are
24 all handled internally using the basic mechanisms described in the
26 Zebra can read structured records in many different formats.
28 How this is done is governed by additional parameters after the
29 "grs" keyword, separated by "." characters.
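<para>As a rough illustration of the dot-separated convention described above, the following Python sketch splits a record type string into its filter name and any additional parameters. The function name <literal>parse_record_type</literal>, the layout of the returned dictionary and the sample value <literal>grs.marc.usmarc</literal> are assumptions made for this example; it is not Zebra's own parser.</para>

```python
def parse_record_type(spec: str) -> dict:
    """Split a record type such as 'grs.marc.usmarc' on '.' characters.

    Hypothetical helper for illustration only; Zebra's real parsing
    lives in its C sources and may differ in detail.
    """
    parts = spec.split(".")
    if parts[0] == "text":
        # Raw text: no structured processing is requested.
        return {"kind": "raw text", "filter": "text", "parameters": parts[1:]}
    if parts[0] == "grs":
        # Structured data: the second component names the input format.
        if len(parts) == 1 or not parts[1]:
            raise ValueError("a grs record type needs an input format")
        return {
            "kind": "structured",
            "filter": "grs." + parts[1],
            "parameters": parts[2:],
        }
    # Anything else is passed through as an unknown filter name.
    return {"kind": "other", "filter": parts[0], "parameters": parts[1:]}


if __name__ == "__main__":
    print(parse_record_type("text"))
    print(parse_record_type("grs.marc.usmarc"))
```

<para>The same string could be given either as the RecordType parameter or as the argument of the <literal>-t</literal> option; the sketch only shows how the components relate to each other.</para>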
34 <section id="architecture-maincomponents">
35 <title>Main Components</title>
37 The Zebra system is designed to support a wide range of data management
38 applications. The system can be configured to handle virtually any
39 kind of structured data. Each record in the system is associated with
40 a <emphasis>record schema</emphasis> which lends context to the data
41 elements of the record.
42 Any number of record schemas can coexist in the system.
43 Although it may be wise to use only a single schema within
44 one database, the system imposes no such restriction.
47 The Zebra indexer and information retrieval server consists of the
48 following main applications: the <command>zebraidx</command>
49 indexing maintenance utility, and the <command>zebrasrv</command>
50 information query and retrieval server. Both use some of the
51 same main components, which are presented here.
54 The virtual Debian package <literal>idzebra-2.0</literal>
55 installs all the necessary packages to start
56 working with Zebra - including utility programs, development libraries,
57 documentation and modules.
60 <section id="componentcore">
61 <title>Core Zebra Libraries Containing Common Functionality</title>
63 The core Zebra module is the meat of the <command>zebraidx</command>
64 indexing maintenance utility, and the <command>zebrasrv</command>
65 information query and retrieval server binaries. In short, the core
66 libraries are responsible for
69 <term>Dynamic Loading</term>
71 <para>of external filter modules, in case the application is
72 not compiled statically. These filter modules define indexing,
73 search and retrieval capabilities of the various input formats.
78 <term>Index Maintenance</term>
80 <para> Zebra maintains Term Dictionaries and ISAM index
81 entries in inverted index structures kept on disk. These are
82 optimized for fast insert, update and delete, as well as good
88 <term>Search Evaluation</term>
90 <para>by execution of search requests expressed in PQF/RPN
91 data structures, which are handed over from
92 the YAZ server frontend API. Search evaluation includes
93 construction of hit lists according to boolean combinations
94 of simpler searches. Fast performance is achieved by careful
95 use of index structures, and by evaluating specific index hit
96 lists in the correct order.
101 <term>Ranking and Sorting</term>
104 components call resorting/re-ranking algorithms on the hit
105 sets. These might also be pre-sorted not only using the
106 assigned document IDs, but also using assigned static rank
112 <term>Record Presentation</term>
114 <para>returns - possibly ranked - result sets, hit
115 counts, and similar internal data to the YAZ server backend API
116 for shipping to the client. Each individual filter module
117 implements its own specific presentation formats.
124 The Debian package <literal>libidzebra-2.0</literal>
125 contains all run-time libraries for Zebra; the
126 documentation in PDF and HTML is found in
127 <literal>idzebra-2.0-doc</literal>, and
128 <literal>idzebra-2.0-common</literal>
129 includes common essential Zebra configuration files.
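<para>The boolean combination of hit lists mentioned under Search Evaluation can be pictured with a small Python sketch. It assumes that each simple search yields a list of document IDs, and it handles the shortest list first so that intermediate results stay small. The function names are hypothetical, and the sketch works on in-memory sets; Zebra itself evaluates on-disk ISAM structures, which this example does not attempt to model.</para>

```python
def and_hits(*hit_lists):
    """Intersect hit lists, starting from the shortest one."""
    if not hit_lists:
        return []
    ordered = sorted(hit_lists, key=len)
    result = set(ordered[0])
    for hits in ordered[1:]:
        result.intersection_update(hits)
        if not result:
            break  # nothing can match any more, stop early
    return sorted(result)


def or_hits(*hit_lists):
    """Union of all hit lists, returned in document ID order."""
    return sorted(set().union(*hit_lists))


def not_hits(left, right):
    """Hits of the left operand that do not occur in the right operand."""
    excluded = set(right)
    return [doc_id for doc_id in left if doc_id not in excluded]


if __name__ == "__main__":
    title = [1, 3, 5, 8]    # illustrative hit list for one simple search
    author = [3, 5, 9]      # illustrative hit list for another
    year = [5]
    print(and_hits(title, author, year))
    print(or_hits(title, author))
    print(not_hits(title, author))
```

<para>The three helpers correspond loosely to the PQF operators <literal>@and</literal>, <literal>@or</literal> and <literal>@not</literal>.</para>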
134 <section id="componentindexer">
135 <title>Zebra Indexer</title>
137 The <command>zebraidx</command>
138 indexing maintenance utility
139 loads external filter modules used for indexing data records of
140 different types, and creates, updates and drops databases and
141 indexes according to the rules defined in the filter modules.
144 The Debian package <literal>idzebra-2.0-utils</literal> contains
145 the <command>zebraidx</command> utility.
149 <section id="componentsearcher">
150 <title>Zebra Searcher/Retriever</title>
152 This is the executable which runs the Z39.50/SRU/SRW server and
153 glues together the core libraries and the filter modules into one
154 great Information Retrieval server application.
157 The Debian package <literal>idzebra-2.0-utils</literal> contains
158 the <command>zebrasrv</command> utility.
162 <section id="componentyazserver">
163 <title>YAZ Server Frontend</title>
165 The YAZ server frontend is
166 a full-fledged stateful Z39.50 server taking client
167 connections, and forwarding search and scan requests to the
171 In addition to Z39.50 requests, the YAZ server frontend acts
172 as an HTTP server, honoring
173 <ulink url="&url.srw;">SRU SOAP</ulink>
175 <ulink url="&url.sru;">SRU REST</ulink>
176 requests. Moreover, it can
178 <ulink url="&url.cql;">CQL</ulink>
180 <ulink url="&url.yaz.pqf;">PQF</ulink>
182 correctly configured.
185 <ulink url="&url.yaz;">YAZ</ulink>
187 toolkit that allows you to develop software using the
188 ANSI Z39.50/ISO23950 standard for information retrieval.
189 It is packaged in the Debian packages
190 <literal>yaz</literal> and <literal>libyaz</literal>.
194 <section id="componentmodules">
195 <title>Record Models and Filter Modules</title>
197 The hard work of knowing <emphasis>what</emphasis> to index,
198 <emphasis>how</emphasis> to do it, and <emphasis>which</emphasis>
199 part of the records to send in a search/retrieve response is
201 various filter modules. It is their responsibility to define the
202 exact indexing and record display filtering rules.
205 The virtual Debian package
206 <literal>libidzebra-2.0-modules</literal> installs all base filter
211 <section id="componentmodulestext">
212 <title>TEXT Record Model and Filter Module</title>
214 Plain ASCII text filter. TODO: add information here.
218 <section id="componentmodulesgrs">
219 <title>GRS Record Model and Filter Modules</title>
221 The GRS filter modules described in
222 <xref linkend="grs"/>
223 are all based on the Z39.50 specifications, and it is absolutely
224 mandatory to have the reference pages on BIB-1 attribute sets at
225 hand when configuring GRS filters. The GRS filters come in
226 different flavors, and a short introduction is needed here.
227 GRS filters of various kinds have also been called ABS filters due
228 to the <filename>*.abs</filename> configuration file suffix.
231 The <emphasis>grs.marc</emphasis> and
232 <emphasis>grs.marcxml</emphasis> filters are suited to parse and
233 index binary and XML versions of traditional library MARC records
234 based on the ISO2709 standard. The Debian package for both
236 <literal>libidzebra-2.0-mod-grs-marc</literal>.
239 GRS TCL scriptable filters for extensive user configuration come
240 in two flavors: a regular expression filter
241 <emphasis>grs.regx</emphasis> using TCL regular expressions, and
242 a general scriptable TCL filter called
243 <emphasis>grs.tcl</emphasis>.
244 Both are included in the
245 <literal>libidzebra-2.0-mod-grs-regx</literal> Debian package.
248 A general purpose SGML filter is called
249 <emphasis>grs.sgml</emphasis>. This filter is not yet packaged,
250 but is planned for the
251 <literal>libidzebra-2.0-mod-grs-sgml</literal> Debian package.
255 <literal>libidzebra-2.0-mod-grs-xml</literal> includes the
256 <emphasis>grs.xml</emphasis> filter which uses <ulink
257 url="&url.expat;">Expat</ulink> to
258 parse records in XML and turn them into IDZebra's internal GRS node
259 trees. See also the Alvis XML/XSLT filter described in
264 <section id="componentmodulesalvis">
265 <title>ALVIS Record Model and Filter Module</title>
267 The Alvis filter for XML files is an XSLT based input
269 It indexes element and attribute content of any conceivable XML format
270 using full XPath support, a feature which the standard Zebra
271 GRS SGML and XML filters lacked. The indexed documents are
272 parsed into a standard XML DOM tree, which restricts record size
273 according to availability of memory.
277 uses XSLT display stylesheets, which let
278 the Zebra DB administrator associate multiple, different views on
279 the same XML document type. These views are chosen on-the-fly in
283 In addition, the Alvis filter configuration is not bound to the
284 arcane BIB-1 Z39.50 library catalogue indexing traditions and
285 folklore, and is therefore easier to understand.
288 Finally, the Alvis filter allows for static ranking at index
289 time, and for sorting hit lists according to predefined
290 static ranks. This imposes no overhead at all: both
291 search and indexing still perform
292 <emphasis>O(1)</emphasis> irrespective of document
293 collection size. This feature resembles Google's pre-ranking using
294 its PageRank algorithm.
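<para>One way to picture static ranking at index time is the following Python sketch. It assumes that every document receives a numeric static rank when it is indexed and that postings are stored in rank order, so that a search can return the best hits by reading a prefix of the list without sorting at query time. The names <literal>build_index</literal> and <literal>search</literal>, the in-memory dictionary and the convention that a lower rank value sorts first are all assumptions of this example, not a description of how the Alvis filter is implemented.</para>

```python
def build_index(documents):
    """Build a term index whose postings are ordered by static rank.

    documents: iterable of (doc_id, static_rank, terms) tuples.
    Assumption for this sketch: a lower static_rank value sorts first.
    """
    index = {}
    for doc_id, static_rank, terms in documents:
        for term in set(terms):
            index.setdefault(term, []).append((static_rank, doc_id))
    for postings in index.values():
        postings.sort()  # done once, at index time
    return index


def search(index, term, limit=10):
    """Return up to `limit` document IDs; no sorting happens here."""
    return [doc_id for _rank, doc_id in index.get(term, [])[:limit]]


if __name__ == "__main__":
    docs = [
        (1, 30, ["zebra", "index"]),
        (2, 10, ["zebra"]),
        (3, 20, ["index", "zebra"]),
    ]
    idx = build_index(docs)
    print(search(idx, "zebra", limit=2))
    print(search(idx, "index"))
```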
297 Details on the experimental Alvis XSLT filter are found in
298 <xref linkend="record-model-alvisxslt"/>.
301 The Debian package <literal>libidzebra-2.0-mod-alvis</literal>
302 contains the Alvis filter module.
307 <section id="componentmodulessafari">
308 <title>SAFARI Record Model and Filter Module</title>
310 SAFARI filter module TODO: add information here.
320 <section id="architecture-workflow">
321 <title>Indexing and Retrieval Workflow</title>
324 Records pass through three different states during processing in the
334 When records are accessed by the system, they are represented
335 in their local, or native format. This might be SGML or HTML files,
336 News or Mail archives, or MARC records. If the system doesn't already
337 know how to read the type of data you need to store, you can set up an
338 input filter by preparing conversion rules based on regular
339 expressions and possibly augmented by a flexible scripting language
341 The input filter produces as output an internal representation,
349 When records are processed by the system, they are represented
350 in a tree structure constructed from tagged data elements hanging off a
351 root node. The tagged elements may contain data or yet more tagged
352 elements in a recursive structure. The system performs various
353 actions on this tree structure (indexing, element selection, schema
361 Before records are transmitted to the client, they are first
362 converted from the internal structure to a form suitable for exchange
363 over the network - according to the Z39.50 standard.
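<para>The internal tree-structured representation described above, with tagged elements that hold either data or further tagged elements, can be sketched in Python as follows. The class name <literal>Node</literal>, the helper <literal>walk</literal>, the slash-separated paths and the sample record are hypothetical and serve only to show the recursive shape; Zebra's internal node trees are C structures with more detail than is shown here.</para>

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Node:
    """A tagged element holding data, child elements, or both."""
    tag: str
    data: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def walk(node: Node, path: Tuple[str, ...] = ()) -> Iterator[Tuple[str, str]]:
    """Yield (tag path, data) pairs for every data-bearing element."""
    here = path + (node.tag,)
    if node.data is not None:
        yield "/".join(here), node.data
    for child in node.children:
        yield from walk(child, here)


if __name__ == "__main__":
    record = Node("record", children=[
        Node("title", "An example title"),
        Node("author", children=[Node("name", "An example name")]),
    ])
    for tag_path, value in walk(record):
        print(tag_path, "=", value)
```

<para>A traversal of this kind is the sort of operation that indexing or element selection would perform on the tree.</para>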
372 <section id="special-retrieval">
373 <title>Retrieval of Zebra internal record data</title>
375 Starting with <literal>Zebra</literal> version 2.0.5, it is
376 possible to use a special element set which has the prefix
377 <literal>zebra::</literal>.
380 Using this element will, regardless of record type, return
381 Zebra's internal index structure/data for a record.
382 In particular, the regular record filters are not invoked when
384 This can in some cases make the retrieval faster than regular
385 retrieval operations (for MARC, XML etc).
387 <table id="special-retrieval-types">
388 <title>Special Retrieval Elements</title>
392 <entry>Element Set</entry>
393 <entry>Description</entry>
394 <entry>Syntax</entry>
399 <entry><literal>zebra::meta::sysno</literal></entry>
400 <entry>Get Zebra record system ID</entry>
401 <entry>XML and SUTRS</entry>
404 <entry><literal>zebra::data</literal></entry>
405 <entry>Get raw record</entry>
409 <entry><literal>zebra::meta</literal></entry>
410 <entry>Get Zebra record internal metadata</entry>
411 <entry>XML and SUTRS</entry>
414 <entry><literal>zebra::index</literal></entry>
415 <entry>Get all indexed keys for record</entry>
416 <entry>XML and SUTRS</entry>
420 <literal>zebra::index::</literal><replaceable>f</replaceable>
423 Get indexed keys for field <replaceable>f</replaceable> for record
425 <entry>XML and SUTRS</entry>
429 <literal>zebra::index::</literal><replaceable>f</replaceable>:<replaceable>t</replaceable>
432 Get indexed keys for field <replaceable>f</replaceable>
433 and type <replaceable>t</replaceable> for record
435 <entry>XML and SUTRS</entry>
441 For example, to fetch the raw binary record data stored in the
442 Zebra internal storage, or on the filesystem, the following
443 commands can be issued:
445 Z> f @attr 1=title my
447 Z> elements zebra::data
457 <literal>zebra::data</literal> element set name is
458 defined for any record syntax, but will always fetch
459 the raw record data in exactly the original form. No record syntax
460 specific transformations will be applied to the raw record data.
463 Also, Zebra internal metadata about the record can be accessed:
465 Z> f @attr 1=title my
467 Z> elements zebra::meta::sysno
470 displays in <literal>XML</literal> record syntax only the internal
471 record system number, whereas
473 Z> f @attr 1=title my
475 Z> elements zebra::meta
478 displays all available metadata on the record. These include system
479 number, database name, indexed filename, filter used for indexing,
480 score and static ranking information, and finally the byte size of the record.
483 Sometimes, it is very hard to figure out what exactly has been
484 indexed how and in which indexes. Using the indexing stylesheet of
485 the Alvis filter, one can at least see which portion of the record
486 went into which index, but no similar aid exists for the
487 other indexing filters.
491 <literal>zebra::index</literal> element set names are provided to
492 access information on per record indexed fields. For example, the
495 Z> f @attr 1=title my
497 Z> elements zebra::index
500 will display all indexed tokens from all indexed fields of the
501 first record, and it will display in <literal>SUTRS</literal>
502 record syntax, whereas
504 Z> f @attr 1=title my
506 Z> elements zebra::index::title
508 Z> elements zebra::index::title:p
511 displays in <literal>XML</literal> record syntax only the content
512 of the Zebra string index <literal>title</literal>, or
513 even only the type <literal>p</literal> phrase indexed part of it.
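<para>The naming pattern of the <literal>zebra::index</literal> element sets shown above can be summarised with a small Python helper. The function name <literal>index_element_set</literal> is hypothetical; the sketch only composes the strings listed in the table of special retrieval elements and does not talk to a server.</para>

```python
def index_element_set(field=None, index_type=None):
    """Compose a zebra::index element set name.

    field:      optional index name, e.g. "title"
    index_type: optional index type, e.g. "p" for phrase; needs a field
    """
    name = "zebra::index"
    if field is None:
        if index_type is not None:
            raise ValueError("an index type requires a field name")
        return name
    name += "::" + field
    if index_type is not None:
        name += ":" + index_type
    return name


if __name__ == "__main__":
    print(index_element_set())
    print(index_element_set("title"))
    print(index_element_set("title", "p"))
```

<para>The resulting string is what would follow the <literal>elements</literal> command in the client sessions shown in this section.</para>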
517 Trying to access numeric <literal>Bib-1</literal> use
518 attributes or trying to access non-existent Zebra internal string
519 access points will result in a Diagnostic 25: Specified element set
520 name not valid for specified database.
527 <!-- Keep this comment at the end of the file
532 sgml-minimize-attributes:nil
533 sgml-always-quote-attributes:t
536 sgml-parent-document: "zebra.xml"
537 sgml-local-catalogs: nil
538 sgml-namecase-general:t