<chapter id="record-model-alvisxslt">
<!-- $Id: recordmodel-alvisxslt.xml,v 1.15 2007-02-02 11:10:08 marc Exp $ -->
<title>ALVIS &xml; Record Model and Filter Module</title>
The record model described in this chapter applies to the fundamental
record type <literal>alvis</literal>, introduced in
<xref linkend="componentmodulesalvis"/>. The ALVIS &xml; record model
is experimental, and its inner workings might change in future
releases of the &zebra; Information Server.
<para>This filter has been developed under the
<ulink url="http://www.alvis.info/">ALVIS</ulink> project funded by
the European Community under the "Information Society Technologies"
<section id="record-model-alvisxslt-filter">
<title>ALVIS Record Filter</title>
The experimental, loadable Alvis &xml;/&xslt; filter module
<literal>mod-alvis.so</literal> is packaged in the Debian GNU/Linux package
<literal>libidzebra1.4-mod-alvis</literal>.
It is invoked by the <filename>zebra.cfg</filename> configuration statement
recordtype.xml: alvis.db/filter_alvis_conf.xml
This example applies the filter to all data files with the suffix
<filename>*.xml</filename>, where the
Alvis &xslt; filter configuration file is found at the
path <filename>db/filter_alvis_conf.xml</filename>.
<para>The Alvis &xslt; filter configuration file must be
valid &xml;. It might look like this (this example is
used for indexing and displaying &oai; harvested records):
<?xml version="1.0" encoding="UTF-8"?>
<schema name="identity" stylesheet="xsl/identity.xsl" />
<schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
        stylesheet="xsl/oai2index.xsl" />
<schema name="dc" stylesheet="xsl/oai2dc.xsl" />
<!-- use split level 2 when indexing whole OAI Record lists -->
<split level="2"/>
All named stylesheets defined inside
<literal>schema</literal> element tags
are available for presentation after search, including
the indexing stylesheet (which is a great debugging help). The
names defined in the <literal>name</literal> attributes must be
unique; they are the literal <literal>schema</literal> or
<literal>element set</literal> names used in
<ulink url="http://www.loc.gov/standards/sru/srw/">&srw;</ulink>,
<ulink url="&url.sru;">&sru;</ulink> and
&z3950; protocol queries.
The paths in the <literal>stylesheet</literal> attributes
are relative to &zebra;'s working directory, or absolute to file
The <literal>&lt;split level="2"/&gt;</literal> element decides where the
&xml; Reader splits the
collections of records into individual records, which are then
loaded into &dom; and have the indexing &xslt; stylesheet applied.
There must be exactly one indexing &xslt; stylesheet, which is
defined by the magic attribute
<literal>identifier="http://indexdata.dk/zebra/xslt/1"</literal>.
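<para>
The contents of <filename>xsl/identity.xsl</filename> are not shown in
this chapter. As a minimal sketch, assuming that this stylesheet is
only meant to return the stored record unchanged, the standard &xslt;
identity transformation would be sufficient:
</para>
<screen><![CDATA[
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:output method="xml" version="1.0" encoding="UTF-8"/>

  <!-- copy every node and attribute of the record as it is -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
]]></screen>
<para>
With such a stylesheet registered, a client asking for the schema or
element set name <literal>identity</literal> would receive the record
in its original form.
</para>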
<section id="record-model-alvisxslt-internal">
<title>ALVIS Internal Record Representation</title>
<para>When indexing, an &xml; Reader is invoked to split the input
files into suitable record &xml; pieces. Each record piece is then
transformed to an &xml; &dom; structure, which is essentially the
record model. Only &xslt; transformations can be applied during
indexing, search, and retrieval. Consequently, output formats are
restricted to whatever &xslt; can deliver from the record &xml;
structure, be it other &xml; formats, HTML, or plain text. If
you have <literal>libxslt1</literal> running with E&xslt; support,
you can use this functionality inside the Alvis
filter configuration &xslt; stylesheets.
<section id="record-model-alvisxslt-canonical">
<title>ALVIS Canonical Indexing Format</title>
<para>The output of the indexing &xslt; stylesheets must contain
certain elements in the magic
<literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>
namespace. The output of the &xslt; indexing transformation is then
parsed using &dom; methods, and the contained instructions are
performed on the <emphasis>magic elements and their
For example, the output of the command
xsltproc xsl/oai2index.xsl one-record.xml
might look like this:
<?xml version="1.0" encoding="UTF-8"?>
<z:record xmlns:z="http://indexdata.dk/zebra/xslt/1"
          z:id="oai:JTRS:CP-3290---Volume-I"
<z:index name="oai_identifier" type="0">
oai:JTRS:CP-3290---Volume-I</z:index>
<z:index name="oai_datestamp" type="0">2004-07-09</z:index>
<z:index name="oai_setspec" type="0">jtrs</z:index>
<z:index name="dc_all" type="w">
<z:index name="dc_title" type="w">Proceedings of the 4th
International Conference and Exhibition:
World Congress on Superconductivity - Volume I</z:index>
<z:index name="dc_creator" type="w">Kumar Krishen and *Calvin
Burnham, Editors</z:index>
<para>This means the following: From the original &xml; file
<literal>one-record.xml</literal> (or from the &xml; record &dom; of the
same form coming from a split input file), the indexing
stylesheet produces an indexing &xml; record, which is defined by
the <literal>record</literal> element in the magic namespace
<literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>.
&zebra; uses the content of
<literal>z:id="oai:JTRS:CP-3290---Volume-I"</literal> as the internal
record ID and, if static ranking is enabled, the content of
<literal>z:rank="47896"</literal> as the static rank. Following the
discussion in <xref linkend="administration-ranking"/>
we see that this record is internally ordered
lexicographically according to the value of the string
<literal>oai:JTRS:CP-3290---Volume-I47896</literal>.
The type of action performed during indexing is defined by
<literal>z:type="update"</literal>, with recognized values
<literal>insert</literal>, <literal>update</literal>, and
<literal>delete</literal>.
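<para>
The value of <literal>z:type</literal> does not have to be a constant
in the indexing stylesheet. The following fragment is only a sketch:
it assumes &oai; input in which deleted records carry
<literal>status="deleted"</literal> on their header element, and that
the <literal>oai</literal> and <literal>z</literal> namespace prefixes
are declared on the enclosing stylesheet. Whether &zebra; needs any
<literal>z:index</literal> elements inside a record to be deleted is
not covered here and should be verified before relying on it.
</para>
<screen><![CDATA[
<xsl:template match="/">
  <!-- choose the action from the OAI header status (assumed input) -->
  <xsl:variable name="action">
    <xsl:choose>
      <xsl:when test="oai:record/oai:header/@status = 'deleted'">delete</xsl:when>
      <xsl:otherwise>update</xsl:otherwise>
    </xsl:choose>
  </xsl:variable>

  <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}"
            z:type="{$action}">
    <xsl:apply-templates/>
  </z:record>
</xsl:template>
]]></screen>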
<para>In this example, the following literal indexes are constructed:
where the indexing type is defined in the
<literal>type</literal> attribute
(any value from the standard configuration
file <filename>default.idx</filename> will do). Finally, any
<literal>text()</literal> node content recursively contained
inside the <literal>index</literal> will be filtered through the
appropriate charmap for character normalization, and will be
inserted in the index.
Specific to this example, we see that the single word
<literal>oai:JTRS:CP-3290---Volume-I</literal> will be inserted
literally, byte for byte without any form of character normalization,
into the index named <literal>oai_identifier</literal>,
that the text
<literal>Kumar Krishen and *Calvin Burnham, Editors</literal>
will be inserted using the <literal>w</literal> character
normalization defined in <filename>default.idx</filename> into
the index <literal>dc_creator</literal> (that is, after character
normalization the index will keep the individual words
<literal>kumar</literal>, <literal>krishen</literal>,
<literal>and</literal>, <literal>calvin</literal>,
<literal>burnham</literal>, and <literal>editors</literal>), and
finally that both the texts
<literal>Proceedings of the 4th International Conference and Exhibition:
World Congress on Superconductivity - Volume I</literal>
and
<literal>Kumar Krishen and *Calvin Burnham, Editors</literal>
will be inserted into the index <literal>dc_all</literal> using
the same character normalization map <literal>w</literal>.
Finally, this example configuration can be queried using &pqf;
queries, either transported by &z3950; (here using a yaz-client)
Z> open localhost:9999
Z> f @attr 1=dc_creator Kumar
Z> scan @attr 1=dc_creator adam
Z> f @attr 1=dc_title @attr 4=2 "proceeding congress superconductivity"
Z> scan @attr 1=dc_title abc
extensions <literal>x-pquery</literal> and
<literal>x-pScanClause</literal> to
http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=%40attr+1%3Ddc_creator+%40attr+4%3D6+%22the
http://localhost:9999/?version=1.1&operation=scan&x-pScanClause=@attr+1=dc_date+@attr+4=2+a
See <xref linkend="zebrasrv-sru"/> for more information on &sru;/&srw;
configuration, and <xref linkend="gfs-config"/> or the &yaz;
<ulink url="&url.yaz.cql;">&cql; section</ulink>
for the details of the &yaz; frontend server.
Notice that there are no <filename>*.abs</filename>,
<filename>*.est</filename>, <filename>*.map</filename>, or other &grs1;
filter configuration files involved in this process, and that the
literal index names are used during search and retrieval.
<section id="record-model-alvisxslt-conf">
<title>ALVIS Record Model Configuration</title>
<section id="record-model-alvisxslt-index">
<title>ALVIS Indexing Configuration</title>
As mentioned above, there can be only one indexing
stylesheet, and configuring the indexing process amounts to
writing an &xslt; stylesheet which produces &xml; output containing the
magic elements discussed in
<xref linkend="record-model-alvisxslt-internal"/>.
Obviously, there are millions of different ways to accomplish this
task, and some comments and code snippets are in order to lead
our padawans on the right track to the good side of the force.
Stylesheets can be written in the <emphasis>pull</emphasis> or
the <emphasis>push</emphasis> style: <emphasis>pull</emphasis>
means that the output &xml; structure is taken as the starting point of
the internal structure of the &xslt; stylesheet, and portions of
the input &xml; are <emphasis>pulled</emphasis> out and inserted
into the right spots of the output &xml; structure.
<emphasis>Push</emphasis> &xslt; stylesheets, on the other hand,
call their template definitions recursively, a process which is driven
by the input &xml; structure, and are triggered to produce some output &xml;
whenever some special conditions in the input are
met. The <emphasis>pull</emphasis> type is well suited for input
&xml; with strong and well-defined structure and semantics, like the
following &oai; indexing example, whereas the
<emphasis>push</emphasis> type might be the only possible way to
sort out deeply recursive input &xml; formats.
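<para>
To make the contrast concrete, here is a minimal sketch of a
<emphasis>push</emphasis> style indexing stylesheet. The input format
is hypothetical: it is assumed that the root element carries an
<literal>id</literal> attribute and that <literal>title</literal>
elements may occur at any nesting depth. The index name
<literal>title</literal> is chosen for illustration only.
</para>
<screen><![CDATA[
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:z="http://indexdata.dk/zebra/xslt/1"
                version="1.0">

  <!-- disable all default text node output -->
  <xsl:template match="text()"/>

  <!-- the document root only opens the record; the input drives the rest -->
  <xsl:template match="/">
    <z:record z:id="{normalize-space(/*/@id)}" z:type="update">
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- fires for every title element, however deeply it is nested -->
  <xsl:template match="title">
    <z:index name="title" type="w">
      <xsl:value-of select="."/>
    </z:index>
  </xsl:template>

</xsl:stylesheet>
]]></screen>
<para>
No template names the path to a <literal>title</literal> element; the
built-in &xslt; template rules walk through all other elements, and
output is produced only where a template matches.
</para>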
A <emphasis>pull</emphasis> stylesheet example used to index
&oai; harvested records could use some of the following template
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:z="http://indexdata.dk/zebra/xslt/1"
                xmlns:oai="http://www.openarchives.org/OAI/2.0/"
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>
  <!-- disable all default text node output -->
  <xsl:template match="text()"/>
  <!-- match on oai xml record root -->
  <xsl:template match="/">
    <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}"
      <!-- you might want to use z:rank="{some XSLT function here}" -->
      <xsl:apply-templates/>
  <!-- OAI indexing templates -->
  <xsl:template match="oai:record/oai:header/oai:identifier">
    <z:index name="oai_identifier" type="0">
      <xsl:value-of select="."/>
  <!-- DC specific indexing templates -->
  <xsl:template match="oai:record/oai:metadata/oai_dc:dc/dc:title">
    <z:index name="dc_title" type="w">
      <xsl:value-of select="."/>
that the names and types of the indexes can be defined in the
indexing &xslt; stylesheet <emphasis>dynamically according to
content in the original &xml; records</emphasis>, which offers
opportunities for great power and wizardry as well as grande
The following excerpt of a <emphasis>push</emphasis> stylesheet
<emphasis>might</emphasis>
be a good idea if you have strict control of the &xml;
input format (due to rigorous checking against well-defined and
tight RelaxNG or &xml; Schemas, for example):
<xsl:template name="element-name-indexes">
  <z:index name="{name()}" type="w">
    <xsl:value-of select="'1'"/>
This template creates indexes which have the name of the working
node of any input &xml; file, and assigns a '1' to the index.
The query
<literal>find @attr 1=xyz 1</literal>
finds all files which contain at least one
<literal>xyz</literal> &xml; element. If you cannot control
which element names the input files contain, you might ask for
disaster and bad karma using this technique.
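<para>
A named template produces output only when it is called. As a sketch,
assuming that every element of the input should be visited, the
<literal>element-name-indexes</literal> template could be driven as
follows. Since <literal>xsl:call-template</literal> does not change
the current node, <literal>name()</literal> evaluates to the name of
the element being visited.
</para>
<screen><![CDATA[
<!-- visit every element, emit an index named after it, then recurse -->
<xsl:template match="*">
  <xsl:call-template name="element-name-indexes"/>
  <xsl:apply-templates select="*"/>
</xsl:template>
]]></screen>
<para>
Note that <literal>name()</literal> returns the qualified name, so a
prefixed input element such as <literal>dc:title</literal> would yield
an index name containing a colon. Using <literal>local-name()</literal>
instead is one way to avoid this; which characters &zebra; accepts in
index names is not covered here.
</para>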
One variation on the theme <emphasis>dynamically created
indexes</emphasis> will definitely be unwise:
<!-- match on oai xml record root -->
<xsl:template match="/">
  <z:record z:type="update">
    <!-- create dynamic index name from input content -->
    <xsl:variable name="dynamic_content">
      <xsl:value-of select="oai:record/oai:header/oai:identifier"/>
    <!-- create zillions of indexes with unknown names -->
    <z:index name="{$dynamic_content}" type="w">
      <xsl:value-of select="oai:record/oai:metadata/oai_dc:dc"/>
Don't be tempted to cross
the line to the dark side of the force, padawan; this leads
to suffering and pain, and universal
disintegration of your project schedule.
<section id="record-model-alvisxslt-elementset">
<title>ALVIS Exchange Formats</title>
An exchange format can be anything which can be the outcome of an
&xslt; transformation, as long as the stylesheet is registered in
the main Alvis &xslt; filter configuration file, see
<xref linkend="record-model-alvisxslt-filter"/>.
In principle anything that can be expressed in &xml;, HTML, or
plain text can be the output of a <literal>schema</literal> or
<literal>element set</literal> directive during search, as long as
the information comes from the
<emphasis>original input record &xml; &dom; tree</emphasis>
(and not the transformed and <emphasis>indexed</emphasis> &xml;).
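<para>
The presentation stylesheets themselves are not listed in this
chapter. As a minimal sketch, assuming that a Dublin Core exchange
format such as the one registered as <filename>xsl/oai2dc.xsl</filename>
only has to return the Dublin Core part of the stored &oai; record, it
could look like this:
</para>
<screen><![CDATA[
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:oai="http://www.openarchives.org/OAI/2.0/"
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                version="1.0">

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- return only the Dublin Core part of the original input record -->
  <xsl:template match="/">
    <xsl:copy-of select="oai:record/oai:metadata/oai_dc:dc"/>
  </xsl:template>

</xsl:stylesheet>
]]></screen>
<para>
The stylesheet reads from the original record tree, not from the
output of the indexing stylesheet, in line with the restriction
stated above.
</para>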
In addition, internal administrative information from the &zebra;
indexer can be accessed during record retrieval. The following
example is a summary of the possibilities:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:z="http://indexdata.dk/zebra/xslt/1"
  <!-- register internal zebra parameters -->
  <xsl:param name="id" select="''"/>
  <xsl:param name="filename" select="''"/>
  <xsl:param name="score" select="''"/>
  <xsl:param name="schema" select="''"/>
  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>
  <!-- use them for display of internal information -->
  <xsl:template match="/">
      <id><xsl:value-of select="$id"/></id>
      <filename><xsl:value-of select="$filename"/></filename>
      <score><xsl:value-of select="$score"/></score>
      <schema><xsl:value-of select="$schema"/></schema>
<section id="record-model-alvisxslt-example">
<title>ALVIS Filter &oai; Indexing Example</title>
The source code tarball contains a working Alvis filter example in
the directory <filename>examples/alvis-oai/</filename>, which
should get you started.
More example data can be harvested from any &oai;-compliant server,
see details at the &oai;
<ulink url="http://www.openarchives.org/">
http://www.openarchives.org/</ulink> web site, and the community
<ulink url="http://www.openarchives.org/community/index.html">
http://www.openarchives.org/community/index.html</ulink>.
<ulink url="http://www.oaforum.org/tutorial/">
http://www.oaforum.org/tutorial/</ulink>.
c) Main "alvis" &xslt; filter config file:
cat db/filter_alvis_conf.xml
<?xml version="1.0" encoding="UTF-8"?>
<schema name="alvis" stylesheet="db/alvis2alvis.xsl" />
<schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
        stylesheet="db/alvis2index.xsl" />
<schema name="dc" stylesheet="db/alvis2dc.xsl" />
<schema name="dc-short" stylesheet="db/alvis2dc_short.xsl" />
<schema name="snippet" snippet="25" stylesheet="db/alvis2snippet.xsl" />
<schema name="help" stylesheet="db/alvis2help.xsl" />
The paths are relative to the directory where zebra.init is placed.
The split level decides where the SAX parser splits the
collections of records into individual records, which are then
loaded into &dom; and have the indexing &xslt; stylesheet applied.
The indexing stylesheet is found by its identifier.
All the other stylesheets are for presentation after search.
- in data/ a short sample of harvested carnivorous plants
  ZEBRA_INDEX_DIRS=data/carnivor_20050118_2200_short-346.xml
- in root also one single data record - nice for testing the xslt
  xsltproc db/alvis2index.xsl carni*.xml
- in db/ a cql2pqf.txt yaz-client config file
  which is also used in the yaz-server <ulink url="&url.cql;">&cql;</ulink>-to-&pqf; process
  see: http://www.indexdata.com/yaz/doc/tools.tkl#tools.cql.map
- in db/ an indexing &xslt; stylesheet. This is a pull-type &xslt; stylesheet,
  as it constructs the new &xml; structure by pulling data out of the
  respective elements/attributes of the old structure.
  Notice the special zebra namespace, and the special elements in this
  namespace which indicate to the zebra indexer what to do.
<z:record z:id="67ht7" z:rank="675" z:type="update">
indicates that the record with the given id and static rank is to be updated.
<z:index name="title" type="w">
encloses all the text/&xml; which shall be indexed in the index named
"title" and of index type "w" (see file default.idx in your zebra
<!-- Keep this comment at the end of the file
sgml-minimize-attributes:nil
sgml-always-quote-attributes:t
sgml-parent-document: "zebra.xml"
sgml-local-catalogs: nil
sgml-namecase-general:t