<chapter id="architecture">
<!-- $Id: architecture.xml,v 1.1 2006-01-18 14:00:54 marc Exp $ -->
<title>Overview of Zebra Architecture</title>
<sect1 id="local-representation">
<title>Local Representation</title>
As mentioned earlier, Zebra places few restrictions on the type of
data that you can index and manage. Generally, whatever the form of
the data, it is parsed by an input filter specific to that format, and
turned into an internal structure that Zebra knows how to handle. This
process takes place whenever the record is accessed - for indexing and
The RecordType parameter in the <literal>zebra.cfg</literal> file, or
the <literal>-t</literal> option to the indexer, tells Zebra how to
process input records.
Two basic types of processing are available: raw text and structured
data. Raw text is just that, and it is selected by providing the
argument <emphasis>text</emphasis> to Zebra. Structured records are
all handled internally using the basic mechanisms described in the
Zebra can read structured records in many different formats.
How this is done is governed by additional parameters after the
"grs" keyword, separated by "." characters.
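As a rough illustration of how such a dotted specification decomposes, the following Python sketch splits a record type string into its keyword and the parameters that follow it. The helper name split_record_type is hypothetical and not part of Zebra; the sample values "grs.xml" and "text" are taken from this chapter.

```python
from __future__ import annotations


def split_record_type(spec: str) -> tuple[str, list[str]]:
    """Split a record type such as 'grs.xml' into (keyword, parameters).

    Hypothetical helper for illustration only; Zebra does its own parsing.
    """
    keyword, *params = spec.split(".")
    return keyword, params


# Structured record type: keyword "grs" followed by one parameter.
print(split_record_type("grs.xml"))
# Raw text: keyword only, no further parameters.
print(split_record_type("text"))
```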
<title>Indexing and Retrieval Workflow</title>
Records pass through three different states during processing in the
When records are accessed by the system, they are represented
in their local, or native, format. This might be SGML or HTML files,
News or Mail archives, or MARC records. If the system doesn't already
know how to read the type of data you need to store, you can set up an
input filter by preparing conversion rules based on regular
expressions and possibly augmented by a flexible scripting language
The input filter produces as output an internal representation,
When records are processed by the system, they are represented
as a tree structure, built from tagged data elements hanging off a
root node. The tagged elements may contain data or yet more tagged
elements in a recursive structure. The system performs various
actions on this tree structure (indexing, element selection, schema
Before records are transmitted to the client, they are
converted from the internal structure to a form suitable for exchange
over the network, according to the Z39.50 standard.
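The internal tree described above can be pictured with a small Python sketch. This is only a toy model under stated assumptions: the Node class, the select function and the sample record are invented for illustration and do not mirror Zebra's actual C data structures.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One tagged element holding data, child elements, or both (toy model)."""
    tag: str
    data: str | None = None
    children: list[Node] = field(default_factory=list)


def select(node: Node, tag: str) -> list[str]:
    """Element selection: collect the data of every element named `tag`."""
    found = [node.data] if node.tag == tag and node.data is not None else []
    for child in node.children:
        found.extend(select(child, tag))  # recurse into nested elements
    return found


# Made-up sample record: tagged elements hanging off a root node.
record = Node("root", children=[
    Node("title", data="Carnivorous plants"),
    Node("author", children=[Node("name", data="A. Writer")]),
])

print(select(record, "name"))
```

The same recursive walk could serve indexing or schema mapping, since each of those actions visits the tagged elements in turn.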
<sect1 id="maincomponents">
<title>Main Components</title>
The Zebra system is designed to support a wide range of data management
applications. The system can be configured to handle virtually any
kind of structured data. Each record in the system is associated with
a <emphasis>record schema</emphasis> which lends context to the data
elements of the record.
Any number of record schemas can coexist in the system.
Although it may be wise to use only a single schema within
one database, the system imposes no such restriction.
The Zebra indexer and information retrieval server consists of the
following main applications: the <literal>zebraidx</literal>
indexing maintenance utility, and the <literal>zebrasrv</literal>
information query and retrieval server. Both use some of the
same main components, which are presented here.
This virtual package installs all the necessary packages to start
working with IDZebra, including utility programs, development libraries,
documentation and modules.
<literal>idzebra1.4</literal>
<sect2 id="componentcore">
<title>Core Zebra Module Containing Common Functionality</title>
- loads external filter modules used for presenting
the records in a search response.
- executes search requests in PQF/RPN, which are handed over from
the YAZ server frontend API
- calls resorting/reranking algorithms on the hit sets
- returns (possibly ranked) result sets, hit
numbers, and similar internal data to the YAZ server backend API.
This package contains all run-time libraries for IDZebra.
<literal>libidzebra1.4</literal>
This package includes documentation for IDZebra in PDF and HTML.
<literal>idzebra1.4-doc</literal>
This package includes common essential IDZebra configuration files.
<literal>idzebra1.4-common</literal>
<sect2 id="componentindexer">
<title>Zebra Indexer</title>
the core Zebra indexer which
- loads external filter modules used for indexing data records of
- creates, updates and drops databases and indexes
This package contains IDZebra utilities such as the zebraidx indexer
utility and the zebrasrv server.
<literal>idzebra1.4-utils</literal>
<sect2 id="componentsearcher">
<title>Zebra Searcher/Retriever</title>
the core Zebra searcher/retriever which
This package contains IDZebra utilities such as the zebraidx indexer
utility and the zebrasrv server, and their associated man pages.
<literal>idzebra1.4-utils</literal>
<sect2 id="componentyazserver">
<title>YAZ Server Frontend</title>
The YAZ server frontend is
a full-fledged stateful Z39.50 server taking client
connections, and forwarding search and scan requests to the
In addition to Z39.50 requests, the YAZ server frontend acts
as an HTTP server, honouring
SRW SOAP requests and SRU REST requests. Moreover, it can
translate incoming CQL queries to PQF/RPN queries, if
correctly configured.
YAZ is a toolkit that allows you to develop software using the
ANSI Z39.50/ISO 23950 standard for information retrieval.
<literal>libyazthread.so</literal>
<literal>libyaz.so</literal>
<literal>libyaz</literal>
<sect2 id="componentmodules">
<title>Record Models and Filter Modules</title>
all filter modules which do indexing and record display filtering:
This virtual package contains all base IDZebra filter modules. EMPTY ???
<literal>libidzebra1.4-modules</literal>
<sect3 id="componentmodulestext">
<title>TEXT Record Model and Filter Module</title>
Plain ASCII text filter
<literal>text module missing as deb file</literal>
<sect3 id="componentmodulesgrs">
<title>GRS Record Model and Filter Modules</title>
Chapter <xref linkend="record-model"/>
- grs.danbib GRS filters of various kinds (*.abs files)
IDZebra filter grs.danbib (DBC DanBib records)
This package includes the grs.danbib filter, which parses DanBib records.
DanBib is the Danish Union Catalogue hosted by DBC
(Danish Bibliographic Centre).
<literal>libidzebra1.4-mod-grs-danbib</literal>
This package includes the grs.marc and grs.marcxml filters that allow
IDZebra to read MARC records based on ISO2709.
<literal>libidzebra1.4-mod-grs-marc</literal>
- grs.tcl GRS TCL scriptable filter
This package includes the grs.regx and grs.tcl filters.
<literal>libidzebra1.4-mod-grs-regx</literal>
<literal>libidzebra1.4-mod-grs-sgml not packaged yet ??</literal>
This package includes the grs.xml filter, which uses Expat to
parse records in XML and turn them into IDZebra's internal grs node.
<literal>libidzebra1.4-mod-grs-xml</literal>
<sect3 id="componentmodulesalvis">
<title>ALVIS Record Model and Filter Module</title>
- alvis Experimental Alvis XSLT filter
<literal>mod-alvis.so</literal>
<literal>libidzebra1.4-mod-alvis</literal>
<sect3 id="componentmodulessafari">
<title>SAFARI Record Model and Filter Module</title>
<literal>safari module missing as deb file</literal>
<sect2 id="componentconfig">
<title>Configuration Files</title>
- yazserver XML-based config file
- core Zebra ASCII-based config files
- filter module config files in many flavours
- CQL-to-PQF ASCII-based config file
<sect1 id="cqltopqf">
<title>Server Side CQL To PQF Conversion</title>
The cql2pqf.txt yaz-client config file, which is also used in the
yaz-server CQL-to-PQF process, is used to drive
org.z3950.zing.cql.CQLNode's toPQF() back-end and the YAZ CQL-to-PQF
converter. This specifies the interpretation of various CQL
indexes, relations, etc. in terms of Type-1 query attributes.
This configuration file generates queries using BIB-1 attributes.
See http://www.loc.gov/z3950/agency/zing/cql/dc-indexes.html
for the Maintenance Agency's work-in-progress mapping of Dublin Core
indexes to Attribute Architecture (util, XD and BIB-2)
a) CQL set prefixes are specified using the correct CQL/SRW/U
prefixes for the required index sets, or user-invented prefixes for
special index sets. An index set in CQL is, roughly speaking, equivalent to a
namespace specifier in XML.
b) The default index set to be used if none is explicitly mentioned
c) Index mapping definitions of the form
index.cql.all = 1=text
which means that the index "all" from the set "cql" is mapped to the
bib-1 RPN query "@attr 1=text" (where "text" is some existing index
in zebra, see indexing stylesheet)
d) Relation mapping from CQL relations to bib-1 RPN "@attr 2= " stuff
e) Relation modifier mapping from CQL relations to bib-1 RPN "@attr
f) Position attributes
g) Structure attributes
h) Truncation attributes
http://www.indexdata.com/yaz/doc/tools.tkl#tools.cql.map for config
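To make the index mapping in item c) concrete, here is a minimal Python sketch that reads "index.<set>.<name> = <attributes>" lines and prefixes a search term with the corresponding "@attr" clauses. It is not the YAZ converter: the function names load_index_map and to_pqf are hypothetical, and relations, modifiers, position, structure and truncation attributes are ignored.

```python
from __future__ import annotations


def load_index_map(config_text: str) -> dict[str, str]:
    """Map '<set>.<name>' to its attribute list, e.g. {'cql.all': '1=text'}."""
    mapping: dict[str, str] = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line.startswith("index."):
            continue  # this sketch only handles index mapping lines
        key, _, attrs = line.partition("=")  # split at the first '=' only
        mapping[key.strip()[len("index."):]] = attrs.strip()
    return mapping


def to_pqf(mapping: dict[str, str], qualified_index: str, term: str) -> str:
    """Build '@attr <a1> @attr <a2> ... <term>' for one index and one term."""
    attrs = " ".join(f"@attr {a}" for a in mapping[qualified_index].split())
    return f"{attrs} {term}"


config = "index.cql.all = 1=text\n"
index_map = load_index_map(config)
print(to_pqf(index_map, "cql.all", "plant"))  # prints: @attr 1=text plant
```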
<title>Static and Dynamic Ranking</title>
Internally, Zebra uses inverted indexes to look up term occurrences
in documents. Multiple queries from different indexes can be
combined by the binary boolean operations AND, OR and NOT (which
is in fact a binary AND NOT operation). To ensure fast query
execution, all indexes have to be sorted in the same order.
The indexes are normally sorted according to document ID in
ascending order, and any query which does not invoke a special
re-ranking function will therefore retrieve the result set in document ID
directive in the main core Zebra config file, the internal document
keys used for ordering are augmented by a preceding integer, which
contains the static rank of a given document, and the index lists
- first by ascending static rank
- then by ascending document ID.
This implies that the default rank "0" is the best rank at the
beginning of the list, and "max int" is the worst static rank.
The "alvis" and the experimental "xslt" filters provide a
directive to fetch static rank information out of the indexed XML
records, thus making _all_ hit sets ordered by ascending static
rank, and for those documents which have the same static rank, ordered
by ascending document ID.
If one wants to do a little fiddling with the static rank order,
one has to invoke additional re-ranking/re-ordering using dynamic
reranking or score functions. These functions return positive
integer scores, where the _highest_ score is best, which means that the
hit sets will be sorted according to _descending_ scores (in contrast
to the index lists, which are sorted according to _ascending_ rank
number and document ID)
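The two orderings described above can be contrasted in a few lines of Python. The hit list and the score table are invented sample data, and the sketch only mimics the sort orders; it says nothing about how Zebra stores its keys.

```python
# Made-up hits as (static_rank, document_id); a lower static rank is better.
hits = [(675, 12), (0, 40), (675, 3), (10, 7)]

# Index-list order: ascending static rank first, then ascending document ID.
index_order = sorted(hits)

# Dynamic re-ranking: each document gets a positive integer score and the
# highest score is best, so hits are sorted by descending score.
# The scores below are arbitrary illustration values keyed by document ID.
scores = {12: 5, 40: 1, 3: 9, 7: 2}
dynamic_order = sorted(hits, key=lambda hit: scores[hit[1]], reverse=True)

print(index_order)    # [(0, 40), (10, 7), (675, 3), (675, 12)]
print(dynamic_order)  # [(675, 3), (675, 12), (10, 7), (0, 40)]
```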
Those are defined in the zebra C source files
"rank-1" : zebra/index/rank1.c
default TF/IDF like zebra dynamic ranking
"rank-static" : zebra/index/rankstatic.c
do-nothing dummy static ranking (this is just to prove
that the static rank can be used in dynamic ranking functions)
"zvrank" : zebra/index/zvrank.c
many different dynamic TF/IDF ranking functions
They are enabled in the zebra config file by a directive like:
Notice that "rank-1" and "zvrank" do not use the static rank
information in the list keys, and will produce the same ordering
with or without static ranking enabled.
The dummy "rank-static" reranking/scoring function returns just
score = max int - staticrank
in order to preserve the ordering of hit sets with and without its
Obviously, one wants to make a new ranking function, which combines
static and dynamic ranking, which is left as an exercise for the
reader .. (Wray, this is yours ...)
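The "score = max int - staticrank" formula quoted above can be checked with a short Python sketch: sorting by descending score gives back the ascending static rank order. MAX_INT is an assumed stand-in for whatever "max int" is in the C source, and the rank values are sample data.

```python
MAX_INT = 2**31 - 1  # assumption: placeholder for the "max int" of the C code


def rank_static_score(static_rank: int) -> int:
    """Score as described for the dummy 'rank-static' function."""
    return MAX_INT - static_rank


static_ranks = [675, 0, 10]  # sample static ranks; 0 is the best rank

# Dynamic ranking sorts by descending score ...
by_score = sorted(static_ranks, key=rank_static_score, reverse=True)

# ... which is the same order as ascending static rank.
print(by_score)  # [0, 10, 675]
```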
yazserver frontend config file
Setup of listening ports, and virtual zebra servers.
Note path to server-side CQL-to-PQF config file, and to
SRW explain config section.
The <directory> path is relative to the directory where zebra.init is placed
and is started up. The other paths are relative to <directory>,
which in this case is the same.
see: http://www.indexdata.com/yaz/doc/server.vhosts.tkl
c) Main "alvis" XSLT filter config file:
cat db/filter_alvis_conf.xml
<?xml version="1.0" encoding="UTF8"?>
<schema name="alvis" stylesheet="db/alvis2alvis.xsl" />
<schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
stylesheet="db/alvis2index.xsl" />
<schema name="dc" stylesheet="db/alvis2dc.xsl" />
<schema name="dc-short" stylesheet="db/alvis2dc_short.xsl" />
<schema name="snippet" snippet="25" stylesheet="db/alvis2snippet.xsl" />
<schema name="help" stylesheet="db/alvis2help.xsl" />
the paths are relative to the directory where zebra.init is placed
The split level decides where the SAX parser shall split the
collections of records into individual records, which then are
loaded into DOM, and have the indexing XSLT stylesheet applied.
The indexing stylesheet is found by its identifier.
All the other stylesheets are for presentation after search.
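As a sketch of how the indexing stylesheet is told apart from the presentation stylesheets, the Python fragment below parses a cut-down copy of the schema lines shown above. The wrapping <schemaInfo> root element is an assumption made only so the sample is well-formed, and split_stylesheets is a hypothetical helper, not Zebra code.

```python
import xml.etree.ElementTree as ET

# Cut-down sample; <schemaInfo> is an assumed root element name.
CONFIG = """
<schemaInfo>
  <schema name="alvis" stylesheet="db/alvis2alvis.xsl" />
  <schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
          stylesheet="db/alvis2index.xsl" />
  <schema name="dc" stylesheet="db/alvis2dc.xsl" />
</schemaInfo>
"""

INDEX_ID = "http://indexdata.dk/zebra/xslt/1"


def split_stylesheets(config_xml):
    """Return (indexing stylesheet, {schema name: presentation stylesheet})."""
    root = ET.fromstring(config_xml)
    indexing, presentation = None, {}
    for schema in root.iter("schema"):
        if schema.get("identifier") == INDEX_ID:
            indexing = schema.get("stylesheet")  # found by its identifier
        else:
            presentation[schema.get("name")] = schema.get("stylesheet")
    return indexing, presentation


indexing, presentation = split_stylesheets(CONFIG)
print(indexing)      # db/alvis2index.xsl
print(presentation)  # {'alvis': 'db/alvis2alvis.xsl', 'dc': 'db/alvis2dc.xsl'}
```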
- in data/ a short sample of harvested carnivorous plants
ZEBRA_INDEX_DIRS=data/carnivor_20050118_2200_short-346.xml
- in root also one single data record - nice for testing the xslt
xsltproc db/alvis2index.xsl carni*.xml
- in db/ a cql2pqf.txt yaz-client config file
which is also used in the yaz-server CQL-to-PQF process
see: http://www.indexdata.com/yaz/doc/tools.tkl#tools.cql.map
- in db/ an indexing XSLT stylesheet. This is a PULL-type XSLT thing,
as it constructs the new XML structure by pulling data out of the
respective elements/attributes of the old structure.
Notice the special zebra namespace, and the special elements in this
namespace which indicate to the zebra indexer what to do.
<z:record id="67ht7" rank="675" type="update">
indicates that the record with the given id and static rank has to be updated.
<z:index name="title" type="w">
encloses all the text/XML which shall be indexed in the index named
"title" and of index type "w" (see file default.idx in your zebra
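To show what the indexer is being told, this Python sketch reads the id, rank and type attributes from a z:record element and lists any z:index children. The namespace URI is assumed to be the same as the identifier in the filter config file, the sample title text is invented, and summarize is a hypothetical helper rather than part of Zebra.

```python
import xml.etree.ElementTree as ET

# Assumption: the zebra namespace URI equals the identifier used in the
# filter config file shown earlier.
Z_NS = "http://indexdata.dk/zebra/xslt/1"

# Made-up output of an indexing stylesheet for one record.
SAMPLE = f"""
<z:record xmlns:z="{Z_NS}" id="67ht7" rank="675" type="update">
  <z:index name="title" type="w">Carnivorous plants</z:index>
</z:record>
"""


def summarize(xml_text):
    """Return (id, static rank, action, [(index name, index type, text)])."""
    record = ET.fromstring(xml_text)
    entries = [
        (idx.get("name"), idx.get("type"), (idx.text or "").strip())
        for idx in record.iter(f"{{{Z_NS}}}index")
    ]
    return record.get("id"), int(record.get("rank")), record.get("type"), entries


print(summarize(SAMPLE))
```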
search like this (using client-side CQL-to-PQF conversion):
yaz-client -q db/cql2pqf.txt localhost:9999
> f text=(plant and soil)
search like this (using server-side CQL-to-PQF conversion):
(the only difference is "querytype cql" instead of
"querytype cql2rpn" and the call without specifying a local
yaz-client localhost:9999
> f text=(plant and soil)
NEW: static relevance ranking - see examples in alvis2index.xsl
> f text = /relevant (plant and soil)
> f title = /relevant a
Surf into http://localhost:9999
firefox http://localhost:9999
gives you an explain record. Unfortunately, the data found in the
CQL-to-PQF text file must be added by hand into the explain
section of the yazserver.xml file. Too bad, but this is all extremely
new alpha stuff, and a lot of work has yet to be done ..
Searching via SRU: surf into the URL (lines broken here - concat on
- see number of hits:
http://localhost:9999/?version=1.1&operation=searchRetrieve
&query=text=(plant%20and%20soil)
- fetch records 5-6 in DC format
http://localhost:9999/?version=1.1&operation=searchRetrieve
&query=text=(plant%20and%20soil)
&startRecord=5&maximumRecords=2&recordSchema=dc
- even search using PQF queries using the extended verb "x-pquery",
which is special to YAZ/Zebra
http://localhost:9999/?version=1.1&operation=searchRetrieve
&x-pquery=@attr%201=text%20@and%20plant%20soil
More info: read the fine manuals at http://www.loc.gov/z3950/agency/zing/srw/
read the fine manual at
http://www.loc.gov/z3950/agency/zing/srw/
and so on. The list of available indexes is found in db/cql2pqf.txt
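Instead of concatenating the broken lines by hand, the SRU URLs above can be assembled programmatically. The following Python sketch uses only the standard library; sru_url is a hypothetical helper, and the host, port and parameter values are the ones from the examples above. It percent-encodes more characters than the hand-written URLs do (for example "=" and the parentheses), which a server decodes to the same query.

```python
from urllib.parse import quote, urlencode

BASE = "http://localhost:9999/"


def sru_url(**params):
    """Build a searchRetrieve URL; extra parameters are appended in order."""
    query = {"version": "1.1", "operation": "searchRetrieve", **params}
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return BASE + "?" + urlencode(query, quote_via=quote)


# Number of hits for a CQL query.
hits = sru_url(query="text=(plant and soil)")

# Two records starting at record 5, in DC format.
page = sru_url(query="text=(plant and soil)",
               startRecord=5, maximumRecords=2, recordSchema="dc")

# PQF query via the YAZ/Zebra specific "x-pquery" parameter.
pqf = sru_url(**{"x-pquery": "@attr 1=text @and plant soil"})

for url in (hits, page, pqf):
    print(url)
```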
7) How do you add to the index attributes of any other type than "w"?
I mean, in the context of making CQL queries. Let's say I want a date
attribute in there, so that one could do date > 20050101 in CQL.
Currently for example 'date-modified' is of type 'w'.
The 2-seconds-of-thought solution:
<z:index name="date-modified" type="d">
select="acquisition/acquisitionData/modifiedDate"/>
But here's the catch...doesn't the use of the 'd' type require
structure type 'date' (@attr 4=5) in PQF? But then...how does that
reflect in the CQL->RPN/PQF mapping - does it really work if I just
change the type of an element in alvis2index.xsl? I would think not...?
f @attr 4=5 @attr 1=date-modified 20050713
f @attr 4=5 @attr 1=date-modified 20050713
f date-modified=20050713
f date-modified=20050713
Search ERROR 121 4 1+0 RPN: @attrset Bib-1 @attr 5=100 @attr 6=1 @attr 3=3 @attr 4=1 @attr 2=3 @attr "1=date-modified" 20050713
f date-modified eq 20050713
Search OK 23 3 1+0 RPN: @attrset Bib-1 @attr 5=100 @attr 6=1 @attr 3=3 @attr 4=5 @attr 2=3 @attr "1=date-modified" 20050713
E) EXTENDED SERVICE LIVE UPDATES
The extended services are not enabled by default in zebra, due to the
fact that they modify the system.
In order to allow anybody to update, use
Or, even better, allow updates only for a particular admin user. For
user 'admin', you could use:
And in passwordfile, specify users and passwords ..
We can now start a yaz-client admin session and create a database:
$ yaz-client localhost:9999 -u admin/secret
Authentication set to Open (admin/secret)
Connection accepted by v3 target.
Name : Zebra Information Server/GFS/YAZ
Version: Zebra 1.4.0/1.63/2.1.9
Options: search present delSet triggerResourceCtrl scan sort
extendedServices namedResultSets
Got extended services response
Now Default was created. We can now insert an XML file (esdd0006.grs
from example/gils/records) and index it:
Z> update insert 1 esdd0006.grs
Got extended services response
The 3rd parameter (1 here) is the opaque record id from Ext update.
It is a record ID that _we_ assign to the record in question. If we do not
assign one, the usual rules for match apply (recordId: from zebra.cfg).
Actually, we should have a way to specify "no opaque record id" for
yaz-client's update command. We'll fix that.
Received SearchResponse.
Search was a success.
Number of hits: 1, setno 1
SearchResult-1: term=utah cnt=1
Let's delete the beast:
No last record (update ignored)
Z> update delete 1 esdd0006.grs
Got extended services response
Received SearchResponse.
Search was a success.
Number of hits: 0, setno 2
SearchResult-1: term=utah cnt=0
If shadow register is enabled, you must run the adm-commit command in
order to write your changes.
<!-- Keep this comment at the end of the file
sgml-minimize-attributes:nil
sgml-always-quote-attributes:t
sgml-parent-document: "zebra.xml"
sgml-local-catalogs: nil
sgml-namecase-general:t