<chapter id="record-model-domxml">
<!-- $Id: recordmodel-domxml.xml,v 1.1 2007-02-20 14:28:31 marc Exp $ -->
<title>&dom; &xml; Record Model and Filter Module</title>
The record model described in this chapter applies to the fundamental,
record type <literal>dom</literal>, introduced in
<xref linkend="componentmodulesdom"/>. The &dom; &xml; record model
is experimental, and its inner workings might change in future
releases of the &zebra; Information Server.
<section id="record-model-domxml-filter">
<title>&dom; Record Filter</title>
The &dom; &xml; filter uses a standard &dom; &xml; structure as
internal data model, and can therefore parse, index, and display
any &xml; document type. It is well suited to work on
standardized &xml;-based formats such as Dublin Core, MODS, METS,
MARCXML, OAI-PMH, and RSS, and performs equally well on any other
non-standard &xml; format.
A parser for binary &marc; records based on the ISO2709 library
standard is provided; it transforms these to the internal
&marcxml; &dom; representation. Other binary document parsers
are planned to follow.
<section id="record-model-domxml-architecture">
<title>&dom; &xml; filter architecture</title>
The internal &dom; &xml; representation can be fed into four
different pipelines, each consisting of arbitrarily many successive
&xslt; transformations.
<table id="record-model-domxml-architecture-table" frame="top">
<title>&dom; &xml; filter pipelines overview</title>
<entry>Description</entry>
<entry><literal>input</literal></entry>
<entry>input parsing and initial
transformations to common &xml; format</entry>
<entry>raw &xml; record buffers, &xml; streams and
binary &marc; buffers</entry>
<entry>single &dom; &xml; documents suitable for indexing and
internal storage</entry>
<entry><literal>extract</literal></entry>
<entry>indexing term extraction
transformations</entry>
<entry>common single &dom; &xml; format</entry>
<entry>&zebra; internal indexing &dom; &xml; document</entry>
<entry><literal>store</literal></entry>
<entry>transformations before internal document
storage</entry>
<entry>common single &dom; &xml; format</entry>
<entry>&zebra; internal storage &dom; &xml; document</entry>
<entry><literal>retrieve</literal></entry>
<entry>document retrieval transformations from storage to output
syntax and format</entry>
<entry>&zebra; internal storage &dom; &xml; document</entry>
<entry>requested output syntax and format</entry>
The &dom; &xml; filter pipelines use &xslt; (and, if supported on
your platform, even &exslt;). This brings full &xpath;
support to the indexing, storage, and display rules of not only
&xml; documents, but also binary &marc; records.
<section id="record-model-domxml-pipeline">
<title>&dom; &xml; filter pipeline configuration</title>
The experimental, loadable &dom; &xml;/&xslt; filter module
<literal>mod-dom.so</literal> is packaged in the Debian GNU/Linux package
<literal>libidzebra2.0-mod-dom</literal>.
It is invoked by the <filename>zebra.cfg</filename> configuration statement
recordtype.xml: dom.db/filter_dom_conf.xml
In this example, the filter is applied to all data files with suffix
<filename>*.xml</filename>, and the
&dom; &xslt; filter configuration file is found at the
path <filename>db/filter_dom_conf.xml</filename>.
<para>The &dom; &xslt; filter configuration file must be
valid &xml;. It might look like this (this example is
used for indexing and display of &oai; harvested records):
<?xml version="1.0" encoding="UTF-8"?>
<schema name="identity" stylesheet="xsl/identity.xsl" />
<schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
        stylesheet="xsl/oai2index.xsl" />
<schema name="dc" stylesheet="xsl/oai2dc.xsl" />
<!-- use split level 2 when indexing whole OAI Record lists -->
<split level="2"/>
All named stylesheets defined inside
<literal>schema</literal> element tags
are for presentation after search, including
the indexing stylesheet (which is a great debugging help). The
names defined in the <literal>name</literal> attributes must be
unique; these are the literal <literal>schema</literal> or
<literal>element set</literal> names used in
<ulink url="http://www.loc.gov/standards/sru/srw/">&srw;</ulink>,
<ulink url="&url.sru;">&sru;</ulink> and
&z3950; protocol queries.
The paths in the <literal>stylesheet</literal> attributes
are relative to &zebra;'s working directory, or absolute to file
The <literal>&lt;split level="2"/&gt;</literal> element decides where the
&xml; Reader shall split the
collections of records into individual records, which are then
loaded into &dom;, and have the indexing &xslt; stylesheet applied.
There must be exactly one indexing &xslt; stylesheet, which is
defined by the magic attribute
<literal>identifier="http://indexdata.dk/zebra/xslt/1"</literal>.
<section id="record-model-domxml-internal">
<title>&dom; Internal Record Representation</title>
<para>When indexing, an &xml; Reader is invoked to split the input
files into suitable record &xml; pieces. Each record piece is then
transformed to an &xml; &dom; structure, which is essentially the
record model. Only &xslt; transformations can be applied during
indexing, search, and retrieval. Consequently, output formats are
restricted to whatever &xslt; can deliver from the record &xml;
structure, be it other &xml; formats, HTML, or plain text. In case
you have <literal>libxslt1</literal> running with &exslt; support,
you can use this functionality inside the &dom;
filter configuration &xslt; stylesheets.
<section id="record-model-domxml-canonical">
<title>&dom; Canonical Indexing Format</title>
<para>The output of the indexing &xslt; stylesheets must contain
certain elements in the magic
<literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>
namespace. The output of the &xslt; indexing transformation is then
parsed using &dom; methods, and the contained instructions are
performed on the <emphasis>magic elements and their
For example, the output of the command
xsltproc xsl/oai2index.xsl one-record.xml
might look like this:
<?xml version="1.0" encoding="UTF-8"?>
<z:record xmlns:z="http://indexdata.dk/zebra/xslt/1"
          z:id="oai:JTRS:CP-3290---Volume-I"
          z:rank="47896"
          z:type="update">
  <z:index name="oai_identifier" type="0">
           oai:JTRS:CP-3290---Volume-I</z:index>
  <z:index name="oai_datestamp" type="0">2004-07-09</z:index>
  <z:index name="oai_setspec" type="0">jtrs</z:index>
  <z:index name="dc_all" type="w">
    <z:index name="dc_title" type="w">Proceedings of the 4th
       International Conference and Exhibition:
       World Congress on Superconductivity - Volume I</z:index>
    <z:index name="dc_creator" type="w">Kumar Krishen and *Calvin
       Burnham, Editors</z:index>
  </z:index>
</z:record>
<para>This means the following: From the original &xml; file
<literal>one-record.xml</literal> (or from the &xml; record &dom; of the
same form coming from a split input file), the indexing
stylesheet produces an indexing &xml; record, which is defined by
the <literal>record</literal> element in the magic namespace
<literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>.
&zebra; uses the content of
<literal>z:id="oai:JTRS:CP-3290---Volume-I"</literal> as internal
record ID, and, in case static ranking is set, the content of
<literal>z:rank="47896"</literal> as static rank. Following the
discussion in <xref linkend="administration-ranking"/>
we see that this record is internally ordered
lexicographically according to the value of the string
<literal>oai:JTRS:CP-3290---Volume-I47896</literal>.
The type of action performed during indexing is defined by
<literal>z:type="update"</literal>, with recognized values
<literal>insert</literal>, <literal>update</literal>, and
<literal>delete</literal>.
<para>In this example, the following literal indexes are constructed:
where the indexing type is defined in the
<literal>type</literal> attribute
(any value from the standard configuration
file <filename>default.idx</filename> will do). Finally, any
<literal>text()</literal> node content recursively contained
inside the <literal>index</literal> will be filtered through the
appropriate charmap for character normalization, and will be
inserted in the index.
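<para>
As a minimal sketch of how the <literal>type</literal> attribute selects
the normalization, the same text could be emitted twice with different
index types. The index name <literal>dc_title_phrase</literal> is a
hypothetical name chosen for this illustration, and the sketch assumes
that the types <literal>w</literal> (word) and <literal>p</literal>
(phrase) are defined in your <filename>default.idx</filename>.
<screen><![CDATA[
<!-- word index: the text should be split into individually searchable words -->
<z:index name="dc_title" type="w">World Congress on Superconductivity</z:index>

<!-- phrase index (hypothetical name): the text should be kept as one complete term -->
<z:index name="dc_title_phrase" type="p">World Congress on Superconductivity</z:index>
]]></screen>
</para>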
Specific to this example, we see that the single word
<literal>oai:JTRS:CP-3290---Volume-I</literal> will be inserted literally,
byte for byte and without any form of character normalization,
into the index named <literal>oai_identifier</literal>, the text
<literal>Kumar Krishen and *Calvin Burnham, Editors</literal>
will be inserted using the <literal>w</literal> character
normalization defined in <filename>default.idx</filename> into
the index <literal>dc_creator</literal> (that is, after character
normalization the index will keep the individual words
<literal>kumar</literal>, <literal>krishen</literal>,
<literal>and</literal>, <literal>calvin</literal>,
<literal>burnham</literal>, and <literal>editors</literal>), and
finally both the texts
<literal>Proceedings of the 4th International Conference and Exhibition:
World Congress on Superconductivity - Volume I</literal>
and
<literal>Kumar Krishen and *Calvin Burnham, Editors</literal>
will be inserted into the index <literal>dc_all</literal> using
the same character normalization map <literal>w</literal>.
Finally, this example configuration can be queried using &pqf;
queries, either transported by &z3950; (here using a yaz-client)
Z> open localhost:9999
Z> f @attr 1=dc_creator Kumar
Z> scan @attr 1=dc_creator adam
Z> f @attr 1=dc_title @attr 4=2 "proceeding congress superconductivity"
Z> scan @attr 1=dc_title abc
extensions <literal>x-pquery</literal> and
<literal>x-pScanClause</literal> to
http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=%40attr+1%3Ddc_creator+%40attr+4%3D6+%22the
http://localhost:9999/?version=1.1&operation=scan&x-pScanClause=@attr+1=dc_date+@attr+4=2+a
See <xref linkend="zebrasrv-sru"/> for more information on &sru;/&srw;
configuration, and <xref linkend="gfs-config"/> or the &yaz;
<ulink url="&url.yaz.cql;">&cql; section</ulink>
for details on the &yaz; frontend server.
Notice that there are no <filename>*.abs</filename>,
<filename>*.est</filename>, <filename>*.map</filename>, or other &grs1;
filter configuration files involved in this process, and that the
literal index names are used during search and retrieval.
<section id="record-model-domxml-conf">
<title>&dom; Record Model Configuration</title>
<section id="record-model-domxml-index">
<title>&dom; Indexing Configuration</title>
As mentioned above, there can be only one indexing
stylesheet, and configuring the indexing process is synonymous
with writing an &xslt; stylesheet which produces &xml; output containing the
magic elements discussed in
<xref linkend="record-model-domxml-internal"/>.
Obviously, there are millions of different ways to accomplish this
task, and some comments and code snippets are in order to lead
our padawans on the right track to the good side of the force.
Stylesheets can be written in the <emphasis>pull</emphasis> or
the <emphasis>push</emphasis> style: <emphasis>pull</emphasis>
means that the output &xml; structure is taken as the starting point of
the internal structure of the &xslt; stylesheet, and portions of
the input &xml; are <emphasis>pulled</emphasis> out and inserted
into the right spots of the output &xml; structure. On the other
hand, <emphasis>push</emphasis> &xslt; stylesheets recursively
call their template definitions, a process which is driven
by the input &xml; structure, and wake up to produce some output &xml;
whenever some special conditions in the input are
met. The <emphasis>pull</emphasis> type is well suited for input
&xml; with strong and well-defined structure and semantics, like the
following &oai; indexing example, whereas the
<emphasis>push</emphasis> type might be the only possible way to
sort out deeply recursive input &xml; formats.
A <emphasis>pull</emphasis> stylesheet example used to index
&oai; harvested records could use some of the following template
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:z="http://indexdata.dk/zebra/xslt/1"
                xmlns:oai="http://www.openarchives.org/OAI/2.0/"
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                version="1.0">

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- disable all default text node output -->
  <xsl:template match="text()"/>

  <!-- match on oai xml record root -->
  <xsl:template match="/">
    <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}"
              z:type="update">
      <!-- you might want to use z:rank="{some XSLT function here}" -->
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- OAI indexing templates -->
  <xsl:template match="oai:record/oai:header/oai:identifier">
    <z:index name="oai_identifier" type="0">
      <xsl:value-of select="."/>
    </z:index>
  </xsl:template>

  <!-- DC specific indexing templates -->
  <xsl:template match="oai:record/oai:metadata/oai_dc:dc/dc:title">
    <z:index name="dc_title" type="w">
      <xsl:value-of select="."/>
    </z:index>
  </xsl:template>

</xsl:stylesheet>
that the names and types of the indexes can be defined in the
indexing &xslt; stylesheet <emphasis>dynamically according to
content in the original &xml; records</emphasis>, which offers
opportunities for great power and wizardry as well as grand
The following excerpt of a <emphasis>push</emphasis> stylesheet
<emphasis>might</emphasis>
be a good idea, depending on how strictly you control the &xml;
input format (through rigorous checking against well-defined and
tight RelaxNG or &xml; Schemas, for example):
<xsl:template name="element-name-indexes">
  <z:index name="{name()}" type="w">
    <xsl:value-of select="'1'"/>
  </z:index>
</xsl:template>
This template creates indexes which have the name of the working
node of any input &xml; file, and assigns a '1' to the index.
<literal>find @attr 1=xyz 1</literal>
finds all files which contain at least one
<literal>xyz</literal> &xml; element. In case you cannot control
which element names the input files contain, you are asking for
disaster and bad karma by using this technique.
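<para>
A named template like the one above produces nothing until it is called.
The following is a minimal sketch, not taken from the &zebra; sources,
of one possible way to wire it into a complete
<emphasis>push</emphasis> stylesheet: a catch-all template visits every
element, calls <literal>element-name-indexes</literal>, and then
recurses into the children. The sketch assumes that every element of
the input should produce an index, and it leaves out the
<literal>z:id</literal> attribute, which a real stylesheet would
normally compute from the input record.
<screen><![CDATA[
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:z="http://indexdata.dk/zebra/xslt/1"
                version="1.0">

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- disable all default text node output -->
  <xsl:template match="text()"/>

  <!-- wrap the output of the whole record in one z:record element -->
  <xsl:template match="/">
    <z:record z:type="update">
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- push style: the input structure drives which templates fire -->
  <xsl:template match="*">
    <xsl:call-template name="element-name-indexes"/>
    <xsl:apply-templates/>
  </xsl:template>

  <!-- one index per element name, each holding the term '1' -->
  <xsl:template name="element-name-indexes">
    <z:index name="{name()}" type="w">
      <xsl:value-of select="'1'"/>
    </z:index>
  </xsl:template>

</xsl:stylesheet>
]]></screen>
Running such a stylesheet through <literal>xsltproc</literal> on a
single record is a quick way to inspect which index names it would
create before any data is indexed.
</para>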
One variation on the theme <emphasis>dynamically created
indexes</emphasis> will definitely be unwise:
<!-- match on oai xml record root -->
<xsl:template match="/">
  <z:record z:type="update">

    <!-- create dynamic index name from input content -->
    <xsl:variable name="dynamic_content">
      <xsl:value-of select="oai:record/oai:header/oai:identifier"/>
    </xsl:variable>

    <!-- create zillions of indexes with unknown names -->
    <z:index name="{$dynamic_content}" type="w">
      <xsl:value-of select="oai:record/oai:metadata/oai_dc:dc"/>
    </z:index>
  </z:record>
</xsl:template>
Don't be tempted to cross
the line to the dark side of the force, padawan; this leads
to suffering and pain, and universal
disintegration of your project schedule.
<section id="record-model-domxml-elementset">
<title>&dom; Exchange Formats</title>
An exchange format can be anything which can be the outcome of an
&xslt; transformation, as long as the stylesheet is registered in
the main &dom; &xslt; filter configuration file, see
<xref linkend="record-model-domxml-filter"/>.
In principle, anything that can be expressed in &xml;, HTML, or
plain text can be the output of a <literal>schema</literal> or
<literal>element set</literal> directive during search, as long as
the information comes from the
<emphasis>original input record &xml; &dom; tree</emphasis>
(and not the transformed and <emphasis>indexed</emphasis> &xml;!).
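<para>
As an illustration of a non-&xml; exchange format, the following minimal
sketch returns only the title of an &oai; Dublin Core record as plain
text. The file name <filename>xsl/oai2text.xsl</filename> and the
schema name <literal>title-text</literal> are hypothetical; the sketch
assumes the stylesheet has been registered with a line such as
<literal>&lt;schema name="title-text" stylesheet="xsl/oai2text.xsl"/&gt;</literal>
in the filter configuration file.
<screen><![CDATA[
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:oai="http://www.openarchives.org/OAI/2.0/"
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                version="1.0">

  <!-- plain text output instead of XML -->
  <xsl:output method="text" encoding="UTF-8"/>

  <!-- pull the first title out of the original input record -->
  <xsl:template match="/">
    <xsl:value-of
       select="normalize-space(oai:record/oai:metadata/oai_dc:dc/dc:title)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
]]></screen>
</para>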
In addition, internal administrative information from the &zebra;
indexer can be accessed during record retrieval. The following
example is a summary of the possibilities:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:z="http://indexdata.dk/zebra/xslt/1"
  <!-- register internal zebra parameters -->
  <xsl:param name="id" select="''"/>
  <xsl:param name="filename" select="''"/>
  <xsl:param name="score" select="''"/>
  <xsl:param name="schema" select="''"/>
  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>
  <!-- use them for display of internal information -->
  <xsl:template match="/">
      <id><xsl:value-of select="$id"/></id>
      <filename><xsl:value-of select="$filename"/></filename>
      <score><xsl:value-of select="$score"/></score>
      <schema><xsl:value-of select="$schema"/></schema>
<section id="record-model-domxml-example">
<title>&dom; Filter &oai; Indexing Example</title>
The source code tarball contains a working &dom; filter example in
the directory <filename>examples/dom-oai/</filename>, which
should get you started.
More example data can be harvested from any &oai; compliant server,
see details at the &oai;
<ulink url="http://www.openarchives.org/">
http://www.openarchives.org/</ulink> web site, and the community
<ulink url="http://www.openarchives.org/community/index.html">
http://www.openarchives.org/community/index.html</ulink>.
<ulink url="http://www.oaforum.org/tutorial/">
http://www.oaforum.org/tutorial/</ulink>.
c) Main "dom" &xslt; filter config file:
cat db/filter_dom_conf.xml
<?xml version="1.0" encoding="UTF-8"?>
<schema name="dom" stylesheet="db/dom2dom.xsl" />
<schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
        stylesheet="db/dom2index.xsl" />
<schema name="dc" stylesheet="db/dom2dc.xsl" />
<schema name="dc-short" stylesheet="db/dom2dc_short.xsl" />
<schema name="snippet" snippet="25" stylesheet="db/dom2snippet.xsl" />
<schema name="help" stylesheet="db/dom2help.xsl" />
The paths are relative to the directory where zebra.init is placed.
The split level decides where the SAX parser shall split the
collections of records into individual records, which are then
loaded into &dom;, and have the indexing &xslt; stylesheet applied.
The indexing stylesheet is found by its identifier.
All the other stylesheets are for presentation after search.
- in data/ a short sample of harvested carnivorous plants
ZEBRA_INDEX_DIRS=data/carnivor_20050118_2200_short-346.xml
- in root also one single data record - nice for testing the xslt
xsltproc db/dom2index.xsl carni*.xml
- in db/ a cql2pqf.txt yaz-client config file
which is also used in the yaz-server <ulink url="&url.cql;">&cql;</ulink>-to-&pqf; process
see: http://www.indexdata.com/yaz/doc/tools.tkl#tools.cql.map
- in db/ an indexing &xslt; stylesheet. This is a PULL-type &xslt; stylesheet,
as it constructs the new &xml; structure by pulling data out of the
respective elements/attributes of the old structure.
Notice the special zebra namespace, and the special elements in this
namespace which indicate to the zebra indexer what to do.
<z:record z:id="67ht7" z:rank="675" z:type="update">
indicates that the record with the given id and static rank has to be updated.
<z:index name="title" type="w">
encloses all the text/&xml; which shall be indexed in the index named
"title" and of index type "w" (see file default.idx in your zebra
<!-- Keep this comment at the end of the file
sgml-minimize-attributes:nil
sgml-always-quote-attributes:t
sgml-parent-document: "zebra.xml"
sgml-local-catalogs: nil
sgml-namecase-general:t