<chapter id="architecture">
<!-- $Id: architecture.xml,v 1.4 2006-02-15 12:08:48 marc Exp $ -->
<title>Overview of Zebra Architecture</title>
<sect1 id="architecture-representation">
<title>Local Representation</title>
As mentioned earlier, Zebra places few restrictions on the type of
data that you can index and manage. Generally, whatever the form of
the data, it is parsed by an input filter specific to that format and
turned into an internal structure that Zebra knows how to handle. This
process takes place whenever the record is accessed - for indexing and
retrieval.
The RecordType parameter in the <literal>zebra.cfg</literal> file, or
the <literal>-t</literal> option to the indexer, tells Zebra how to
process input records.
Two basic types of processing are available - raw text and structured
data. Raw text is just that, and it is selected by providing the
argument <emphasis>text</emphasis> to Zebra. Structured records are
all handled internally using the basic mechanisms described in the
following sections.
Zebra can read structured records in many different formats.
How this is done is governed by additional parameters after the
"grs" keyword, separated by "." characters.
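For example, the record type is typically selected in zebra.cfg, or with -t at indexing time. A minimal sketch (the paths and subtypes shown are illustrative):

```
# zebra.cfg - choose an input filter (sketch)
recordType: grs.sgml

# or, equivalently, on the command line:
#   zebraidx -t grs.xml update records/
```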
<sect1 id="architecture-workflow">
<title>Indexing and Retrieval Workflow</title>
Records pass through three different states during processing in the
system.
When records are accessed by the system, they are represented
in their local, or native, format. This might be SGML or HTML files,
News or Mail archives, or MARC records. If the system doesn't already
know how to read the type of data you need to store, you can set up an
input filter by preparing conversion rules based on regular
expressions, possibly augmented by a flexible scripting language
(Tcl).
The input filter produces as output an internal representation, a
tree structure.
When records are processed by the system, they are represented
as a tree structure, constructed from tagged data elements hanging off a
root node. The tagged elements may contain data or yet more tagged
elements in a recursive structure. The system performs various
actions on this tree structure (indexing, element selection, schema
mapping, etc.).
Before transmitting records to the client, they are first
converted from the internal structure to a form suitable for exchange
over the network, according to the Z39.50 standard.
<sect1 id="architecture-maincomponents">
<title>Main Components</title>
The Zebra system is designed to support a wide range of data management
applications. The system can be configured to handle virtually any
kind of structured data. Each record in the system is associated with
a <emphasis>record schema</emphasis> which lends context to the data
elements of the record.
Any number of record schemas can coexist in the system.
Although it may be wise to use only a single schema within
one database, the system poses no such restrictions.
The Zebra indexer and information retrieval server consist of the
following main applications: the <literal>zebraidx</literal>
indexing maintenance utility, and the <literal>zebrasrv</literal>
information query and retrieval server. Both use some of the
same main components, which are presented here.
This virtual package installs all the necessary packages to start
working with Zebra - including utility programs, development libraries,
documentation and modules.
<literal>idzebra1.4</literal>
<sect2 id="componentcore">
<title>Core Zebra Module Containing Common Functionality</title>
- loads external filter modules used for presenting
the records in a search response.
- executes search requests in PQF/RPN, which are handed over from
the YAZ server frontend API
- calls resorting/reranking algorithms on the hit sets
- returns (possibly ranked) result sets, hit
counts, and similar internal data to the YAZ server backend API.
This package contains all run-time libraries for Zebra.
<literal>libidzebra1.4</literal>
This package includes documentation for Zebra in PDF and HTML.
<literal>idzebra1.4-doc</literal>
This package includes common essential Zebra configuration files.
<literal>idzebra1.4-common</literal>
<sect2 id="componentindexer">
<title>Zebra Indexer</title>
The core Zebra indexer, which
- loads external filter modules used for indexing data records of
different types
- creates, updates and drops databases and indexes
This package contains Zebra utilities such as the zebraidx indexer
utility and the zebrasrv server.
<literal>idzebra1.4-utils</literal>
<sect2 id="componentsearcher">
<title>Zebra Searcher/Retriever</title>
The core Zebra searcher/retriever, which
This package contains Zebra utilities such as the zebraidx indexer
utility and the zebrasrv server, and their associated man pages.
<literal>idzebra1.4-utils</literal>
<sect2 id="componentyazserver">
<title>YAZ Server Frontend</title>
The YAZ server frontend is
a full-fledged stateful Z39.50 server taking client
connections, and forwarding search and scan requests to the
Zebra core.
In addition to Z39.50 requests, the YAZ server frontend acts
as an HTTP server, honouring
<ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink> SOAP requests, and <ulink url="http://www.loc.gov/standards/sru/">SRU</ulink> REST requests. Moreover, it can
translate incoming <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> queries to PQF/RPN queries, if
correctly configured.
YAZ is a toolkit that allows you to develop software using the
ANSI Z39.50/ISO23950 standard for information retrieval.
<ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink>/<ulink url="http://www.loc.gov/standards/sru/">SRU</ulink>
<literal>libyazthread.so</literal>
<literal>libyaz.so</literal>
<literal>libyaz</literal>
<sect2 id="componentmodules">
<title>Record Models and Filter Modules</title>
All filter modules which do indexing and record display filtering:
This virtual package contains all base IDZebra filter modules.
<literal>libidzebra1.4-modules</literal>
<sect3 id="componentmodulestext">
<title>TEXT Record Model and Filter Module</title>
Plain ASCII text filter.
<literal>text module missing as deb file</literal>
<sect3 id="componentmodulesgrs">
<title>GRS Record Model and Filter Modules</title>
<xref linkend="record-model-grs"/>
- grs.danbib GRS filters of various kinds (*.abs files)
IDZebra filter grs.danbib (DBC DanBib records)
This package includes the grs.danbib filter, which parses DanBib records.
DanBib is the Danish Union Catalogue hosted by DBC
(Danish Bibliographic Centre).
<literal>libidzebra1.4-mod-grs-danbib</literal>
This package includes the grs.marc and grs.marcxml filters that allow
IDZebra to read MARC records based on ISO2709.
<literal>libidzebra1.4-mod-grs-marc</literal>
- grs.tcl GRS TCL scriptable filter
This package includes the grs.regx and grs.tcl filters.
<literal>libidzebra1.4-mod-grs-regx</literal>
<literal>libidzebra1.4-mod-grs-sgml not packaged yet?</literal>
This package includes the grs.xml filter which uses <ulink url="http://expat.sourceforge.net/">Expat</ulink> to
parse records in XML and turn them into IDZebra's internal grs nodes.
<literal>libidzebra1.4-mod-grs-xml</literal>
<sect3 id="componentmodulesalvis">
<title>ALVIS Record Model and Filter Module</title>
<xref linkend="record-model-alvisxslt"/>
- alvis Experimental Alvis XSLT filter
<literal>mod-alvis.so</literal>
<literal>libidzebra1.4-mod-alvis</literal>
<sect3 id="componentmodulessafari">
<title>SAFARI Record Model and Filter Module</title>
<literal>safari module missing as deb file</literal>
<sect2 id="componentconfig">
<title>Configuration Files</title>
- yazserver XML based config file
- core Zebra ASCII based config files
- filter module config files in many flavours
- <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> to PQF ASCII based config file
<sect1 id="cqltopqf">
<title>Server Side <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> To PQF Conversion</title>
The cql2pqf.txt yaz-client config file, which is also used in the
yaz-server <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>-to-PQF process, is used to drive
org.z3950.zing.cql.<ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>Node's toPQF() back-end and the YAZ <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>-to-PQF
converter. This specifies the interpretation of various <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>
indexes, relations, etc. in terms of Type-1 query attributes.
This configuration file generates queries using BIB-1 attributes.
See http://www.loc.gov/z3950/agency/zing/cql/dc-indexes.html
for the Maintenance Agency's work-in-progress mapping of Dublin Core
indexes to the Attribute Architecture (util, XD and BIB-2).
a) <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> set prefixes are specified using the correct <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>/<ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink>/U
prefixes for the required index sets, or user-invented prefixes for
special index sets. An index set in <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> is, roughly speaking, equivalent to a
namespace specifier in XML.
b) The default index set to be used if none is explicitly mentioned.
c) Index mapping definitions of the form
index.cql.all = 1=text
which means that the index "all" from the set "cql" is mapped to the
BIB-1 RPN query "@attr 1=text" (where "text" is some existing index
in Zebra; see the indexing stylesheet).
d) Relation mapping from <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> relations to BIB-1 RPN "@attr 2=" attributes.
e) Relation modifier mapping from <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> relations to BIB-1 RPN "@attr
2=" attributes.
f) Position attributes.
g) Structure attributes.
h) Truncation attributes.
See http://www.indexdata.com/yaz/doc/tools.tkl#tools.cql.map for
configuration details.
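A small cql2pqf.txt sketch illustrating points a) through h); the set URIs, index names and attribute values here are illustrative examples, not a complete shipped file:

```
# a) set prefixes
set.cql = http://www.loc.gov/zing/cql/context-sets/cql/v1.1/
set.dc = http://www.loc.gov/zing/cql/dc-indexes/v1.0/
# b) default index set
set = dc
# c) index mappings
index.cql.all = 1=text
index.dc.title = 1=4
# d) relation mappings
relation.eq = 2=3
relation.< = 2=1
# f), g), h) position, structure and truncation attributes
position.any = 3=3
structure.* = 4=1
truncation.right = 5=1
```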
<title>Static and Dynamic Ranking</title>
Internally, Zebra uses inverted indexes to look up term occurrences
in documents. Multiple queries from different indexes can be
combined by the binary boolean operations AND, OR and/or NOT (which
is in fact a binary AND-NOT operation). To ensure fast query
execution, all indexes have to be sorted in the same order.
The indexes are normally sorted by document ID in
ascending order, and any query which does not invoke a special
re-ranking function will therefore retrieve the result set in document ID
order. When using the 'staticrank'
directive in the main core Zebra config file, the internal document
keys used for ordering are augmented by a preceding integer, which
contains the static rank of a given document, and the index lists
are then sorted
- first by ascending static rank
- then by ascending document ID.
This implies that the default rank "0" is the best rank at the
beginning of the list, and "max int" is the worst static rank.
The "alvis" and the experimental "xslt" filters provide a
directive to fetch static rank information out of the indexed XML
records, thus making all hit sets ordered by ascending static
rank and, for those documents which have the same static rank, ordered
by ascending document ID.
If one wants to fiddle with the static rank order,
one has to invoke additional re-ranking/re-ordering using dynamic
reranking or score functions. These functions return positive
integer scores, where the highest score is best, which means that the
hit sets will be sorted according to descending scores (contrary
to the index lists, which are sorted according to ascending rank
number and document ID).
These are defined in the Zebra C source files:
"rank-1" : zebra/index/rank1.c
default TF/IDF-like Zebra dynamic ranking
"rank-static" : zebra/index/rankstatic.c
do-nothing dummy static ranking (this is just to prove
that the static rank can be used in dynamic ranking functions)
"zvrank" : zebra/index/zvrank.c
many different dynamic TF/IDF ranking functions
They are enabled in the Zebra config file by a directive like:
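A minimal zebra.cfg sketch of such a directive, assuming the standard rank directive and one of the function names listed above:

```
# zebra.cfg - select the dynamic ranking algorithm (sketch)
rank: rank-1
```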
Notice that "rank-1" and "zvrank" do not use the static rank
information in the list keys, and will produce the same ordering
with or without static ranking enabled.
The dummy "rank-static" reranking/scoring function just returns
score = max int - staticrank
in order to preserve the ordering of hit sets with and without its use.
Obviously, one wants to make a new ranking function which combines
static and dynamic ranking; this is left as an exercise for the
reader.
yazserver frontend config file
Setup of listening ports, and virtual Zebra servers.
Note the path to the server-side <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>-to-PQF config file, and to the
<ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink> explain config section.
The <directory> path is relative to the directory where zebra.init is placed
and started. The other paths are relative to <directory>,
which in this case is the same.
See: http://www.indexdata.com/yaz/doc/server.vhosts.tkl
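A yazserver.xml sketch along these lines; the port, paths and database name are illustrative:

```xml
<yazgfs>
  <!-- listening port -->
  <listen id="public">tcp:@:9999</listen>
  <!-- one virtual Zebra server -->
  <server id="server1" listenref="public">
    <directory>.</directory>
    <config>zebra.cfg</config>
    <!-- server-side CQL-to-PQF mapping file -->
    <cql2rpn>cql2pqf.txt</cql2rpn>
    <!-- SRW explain section -->
    <explain xmlns="http://explain.z3950.org/dtd/2.0/">
      <serverInfo>
        <host>localhost</host>
        <port>9999</port>
        <database>Default</database>
      </serverInfo>
    </explain>
  </server>
</yazgfs>
```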
Search like this (using client-side <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>-to-PQF conversion):
yaz-client -q db/cql2pqf.txt localhost:9999
> f text=(plant and soil)
Search like this (using server-side <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>-to-PQF conversion)
(the only difference is "querytype cql" instead of
"querytype cql2rpn", and the call does not specify a local
conversion file):
yaz-client localhost:9999
> f text=(plant and soil)
NEW: static relevance ranking - see examples in alvis2index.xsl
> f text = /relevant (plant and soil)
> f title = /relevant a
<ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink>/U searching
Surf to http://localhost:9999
firefox http://localhost:9999
This gives you an explain record. Unfortunately, the data found in the
<ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>-to-PQF text file must be hand-crafted into the explain
section of the yazserver.xml file. Too bad, but this is all very new
alpha code, and a lot of work has yet to be done.
Searching via <ulink url="http://www.loc.gov/standards/sru/">SRU</ulink>: surf to the URL (lines broken here - concatenate
onto one line):
- see the number of hits:
http://localhost:9999/?version=1.1&operation=searchRetrieve
&query=text=(plant%20and%20soil)
- fetch two records starting at record 5, in DC format:
http://localhost:9999/?version=1.1&operation=searchRetrieve
&query=text=(plant%20and%20soil)
&startRecord=5&maximumRecords=2&recordSchema=dc
- even search using PQF queries, via the extended verb "x-pquery",
which is special to YAZ/Zebra:
http://localhost:9999/?version=1.1&operation=searchRetrieve
&x-pquery=@attr%201=text%20@and%20plant%20soil
More info: read the fine manuals at http://www.loc.gov/z3950/agency/zing/srw/
Search via <ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink>:
read the fine manual at
http://www.loc.gov/z3950/agency/zing/srw/
and so on. The list of available indexes is found in db/cql2pqf.txt
7) How do you add index attributes of any type other than "w"?
I mean, in the context of making <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink> queries. Let's say I want a date
attribute in there, so that one could do date > 20050101 in <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>.
Currently, for example, 'date-modified' is of type 'w'.
The two-seconds-of-thought solution:
<z:index name="date-modified" type="d">
<xsl:value-of
select="acquisition/acquisitionData/modifiedDate"/>
</z:index>
But here's the catch... doesn't the use of the 'd' type require
structure type 'date' (@attr 4=5) in PQF? But then... how does that
reflect in the <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>->RPN/PQF mapping - does it really work if I just
change the type of an element in alvis2index.sl? I would think not...?
f @attr 4=5 @attr 1=date-modified 20050713
f @attr 4=5 @attr 1=date-modified 20050713
f date-modified=20050713
f date-modified=20050713
Search ERROR 121 4 1+0 RPN: @attrset Bib-1 @attr 5=100 @attr 6=1 @attr 3=3 @attr 4=1 @attr 2=3 @attr "1=date-modified" 20050713
f date-modified eq 20050713
Search OK 23 3 1+0 RPN: @attrset Bib-1 @attr 5=100 @attr 6=1 @attr 3=3 @attr 4=5 @attr 2=3 @attr "1=date-modified" 20050713
E) EXTENDED SERVICE LIVE UPDATES
The extended services are not enabled by default in Zebra, because
they modify the system.
In order to allow anybody to update, use
Or, even better, allow updates only for a particular admin user. For
user 'admin', you could use:
And in the password file, specify users and passwords.
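A minimal zebra.cfg sketch of such a setup, assuming the standard perm and passwd directives (the file name 'passwordfile' is illustrative):

```
# zebra.cfg - permissions (sketch)

# allow anonymous users read/write access (anybody may update)
perm.anonymous: rw

# or, better, restrict updates to the user 'admin',
# authenticated against a plain-text password file
perm.admin: rw
passwd: passwordfile
```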
We can now start a yaz-client admin session and create a database:
$ yaz-client localhost:9999 -u admin/secret
Authentication set to Open (admin/secret)
Connection accepted by v3 target.
Name : Zebra Information Server/GFS/YAZ
Version: Zebra 1.4.0/1.63/2.1.9
Options: search present delSet triggerResourceCtrl scan sort
extendedServices namedResultSets
Got extended services response
Now Default was created. We can now insert an XML file (esdd0006.grs
from example/gils/records) and index it:
Z> update insert 1 esdd0006.grs
Got extended services response
The third parameter - 1 here - is the opaque record ID from the extended update.
It is a record ID that _we_ assign to the record in question. If we do not
assign one, the usual matching rules apply (recordId: from zebra.cfg).
Actually, we should have a way to specify "no opaque record id" for
yaz-client's update command. We'll fix that.
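When no opaque ID is assigned, matching is governed by the recordId directive in zebra.cfg. A sketch; the attribute-set/field pair shown is just an example:

```
# zebra.cfg - match incoming records on an indexed field
# instead of an externally assigned record ID (sketch)
recordId: (bib1,Identifier-standard)
```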
Received SearchResponse.
Search was a success.
Number of hits: 1, setno 1
SearchResult-1: term=utah cnt=1
Let's delete the beast:
No last record (update ignored)
Z> update delete 1 esdd0006.grs
Got extended services response
Received SearchResponse.
Search was a success.
Number of hits: 0, setno 2
SearchResult-1: term=utah cnt=0
If the shadow register is enabled, you must run the adm-commit command in
order to write your changes.
<!-- Keep this comment at the end of the file
sgml-minimize-attributes:nil
sgml-always-quote-attributes:t
sgml-parent-document: "zebra.xml"
sgml-local-catalogs: nil
sgml-namecase-general:t