<chapter id="record-model-alvisxslt">
 <!-- $Id: recordmodel-alvisxslt.xml,v 1.2 2006-02-15 14:57:48 marc Exp $ -->
 <title>ALVIS XML Record Model and Filter Module</title>
 The record model described in this chapter applies to the fundamental
 record type <literal>alvis</literal>, introduced in
 <xref linkend="componentmodulesalvis"/>. The ALVIS XML record model
 is experimental, and its inner workings might change in future
 releases of the Zebra Information Server.
 <para>This filter has been developed under the
 <ulink url="http://www.alvis.info/">ALVIS</ulink> project funded by
 the European Community under the "Information Society Technologies"
 Programme (2002-2006).
 <sect1 id="record-model-alvisxslt-filter">
  <title>ALVIS Record Filter</title>
  The experimental, loadable ALVIS XML/XSLT filter module
  <literal>mod-alvis.so</literal> is packaged in the Debian GNU/Linux package
  <literal>libidzebra1.4-mod-alvis</literal>.
  It is invoked by the Zebra configuration statement
  recordtype.xml: alvis.db/filter_alvis_conf.xml
  on all data files with suffix <literal>.xml</literal>, where the
  <literal>alvis</literal> XSLT filter configuration file is found at the
  path <literal>db/filter_alvis_conf.xml</literal>.
  <para>The <literal>alvis</literal> XSLT filter config file must be
  valid XML. It might look like this (used for indexing and display
  of OAI harvested records):
  <?xml version="1.0" encoding="UTF-8"?>
    <schema name="identity" stylesheet="xsl/identity.xsl" />
    <schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
            stylesheet="xsl/oai2index.xsl" />
    <schema name="dc" stylesheet="xsl/oai2dc.xsl" />
    <!-- use split level 2 when indexing whole OAI Record lists -->
    <split level="2"/>
  All named stylesheets defined inside
  <literal>schema</literal> elements
  are available for presentation after search, including
  the indexing stylesheet (which is a great debugging help). The
  names defined in the <literal>name</literal> attributes must be
  unique; they are the literal <literal>schema</literal> or
  <literal>element set</literal> names used in
  <ulink url="http://www.loc.gov/standards/sru/srw/">SRW</ulink>,
  <ulink url="http://www.loc.gov/standards/sru/">SRU</ulink> and
  Z39.50 protocol queries.
  The paths in the <literal>stylesheet</literal> attributes
  are either relative to Zebra's working directory or absolute file
  system paths.
  The <literal><split level="2"/></literal> element determines where the
  XML Reader splits
  collections of records into individual records, which are then
  loaded into DOM and have the indexing XSLT stylesheet applied.
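  <para>
   The following sketch illustrates one plausible reading of the split
   level for an OAI-PMH record list. It assumes that depth is counted
   from 0 at the document element, as the libxml2 XML Reader reports
   it, so that level 2 would select each <literal>record</literal>
   element. The identifiers shown are hypothetical placeholders, not
   harvested data.
  </para>
  <screen><![CDATA[
<?xml version="1.0" encoding="UTF-8"?>
<!-- depth 0 (assumed): the document element -->
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <!-- depth 1 (assumed): the list wrapper -->
  <ListRecords>
    <!-- depth 2 (assumed): with split level 2, each record element
         would be handed to the indexing stylesheet on its own -->
    <record>
      <header>
        <identifier>oai:example:1</identifier>
      </header>
    </record>
    <record>
      <header>
        <identifier>oai:example:2</identifier>
      </header>
    </record>
  </ListRecords>
</OAI-PMH>
]]></screen>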
  There must be exactly one indexing XSLT stylesheet, which is
  defined by the magic attribute
  <literal>identifier="http://indexdata.dk/zebra/xslt/1"</literal>.
  <sect2 id="record-model-alvisxslt-internal">
   <title>ALVIS Internal Record Representation</title>
   <para>When indexing, an XML Reader is invoked to split the input
   files into suitable record XML pieces. Each record piece is then
   transformed to an XML DOM structure, which is essentially the
   record model. Only XSLT transformations can be applied during
   indexing, search and retrieval. Consequently, output formats are
   restricted to whatever XSLT can deliver from the record XML
   structure, be it other XML formats, HTML, or plain text. If
   <literal>libxslt1</literal> is running with EXSLT support,
   you can use this functionality inside the <literal>alvis</literal>
   filter configuration XSLT stylesheets.
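   <para>
    As a hedged illustration of plain-text output, a retrieval
    stylesheet might be sketched as follows. It assumes that the
    records carry Dublin Core titles in the
    <literal>http://purl.org/dc/elements/1.1/</literal> namespace.
    The file name <literal>xsl/oai2text.xsl</literal>, and whatever
    schema name it would be registered under in the filter
    configuration, are hypothetical.
   </para>
   <screen><![CDATA[
<?xml version="1.0" encoding="UTF-8"?>
<!-- hypothetical file: xsl/oai2text.xsl -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:dc="http://purl.org/dc/elements/1.1/">

  <!-- emit plain text instead of XML -->
  <xsl:output method="text" encoding="UTF-8"/>

  <!-- write one line per Dublin Core title found in the record -->
  <xsl:template match="/">
    <xsl:for-each select="//dc:title">
      <xsl:value-of select="normalize-space(.)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
]]></screen>
   <para>
    Such a stylesheet can be tried outside Zebra by running
    <literal>xsltproc</literal> on a single record file.
   </para>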
  <sect2 id="record-model-alvisxslt-canonical">
   <title>ALVIS Canonical Indexing Format</title>
   <para>The output of the indexing XSLT stylesheets must contain
   certain elements in the magic
   <literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>
   namespace. The output of the XSLT indexing transformation is then
   parsed using DOM methods, and the contained instructions are
   performed on the <emphasis>magic elements and their
   For example, the output of the command
   xsltproc xsl/oai2index.xsl one-record.xml
   might look like this:
   <?xml version="1.0" encoding="UTF-8"?>
   <z:record xmlns:z="http://indexdata.dk/zebra/xslt/1"
             z:id="oai:JTRS:CP-3290---Volume-I"
             z:rank="47896"
             z:type="update">
     <z:index name="oai:identifier" type="0">
              oai:JTRS:CP-3290---Volume-I</z:index>
     <z:index name="oai:datestamp" type="0">2004-07-09</z:index>
     <z:index name="oai:setspec" type="0">jtrs</z:index>
     <z:index name="dc:all" type="w">
       <z:index name="dc:title" type="w">Proceedings of the 4th
             International Conference and Exhibition:
             World Congress on Superconductivity - Volume I</z:index>
       <z:index name="dc:creator" type="w">Kumar Krishen and *Calvin
             Burnham, Editors</z:index>
     </z:index>
   </z:record>
   <para>This means the following: From the original XML file
   <literal>one-record.xml</literal> (or from the XML record DOM of the
   same form coming from a split input file), the indexing
   stylesheet produces an indexing XML record, which is defined by
   the <literal>record</literal> element in the magic namespace
   <literal>xmlns:z="http://indexdata.dk/zebra/xslt/1"</literal>.
   Zebra uses the content of
   <literal>z:id="oai:JTRS:CP-3290---Volume-I"</literal> as internal
   record ID, and, if static ranking is set, the content of
   <literal>z:rank="47896"</literal> as static rank. Following the
   discussion in XXX we see that this record is internally ordered
   lexicographically according to the value of the string
   <literal>oai:JTRS:CP-3290---Volume-I47896</literal>.
   The type of action performed during indexing is defined by
   <literal>z:type="update"</literal>, with recognized values
   <literal>insert</literal>, <literal>update</literal>, and
   <literal>delete</literal>.
   <para>Then the following literal indexes are constructed:
   where the indexing type is defined in the
   <literal>type</literal> attribute (any value from the standard config
   file <literal>default.idx</literal> will do). Finally, any
   <literal>text()</literal> node content recursively contained
   inside the <literal>index</literal> element will be filtered through the
   appropriate charmap for character normalization, and will be
   inserted in the index.
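   <para>
    A minimal sketch of an indexing stylesheet that could produce
    output of this shape is given below. It is not the
    <literal>xsl/oai2index.xsl</literal> shipped with Zebra. It assumes
    OAI-PMH input with Dublin Core metadata in the standard OAI and
    Dublin Core namespaces, and it leaves out static ranking.
   </para>
   <screen><![CDATA[
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1"
    xmlns:oai="http://www.openarchives.org/OAI/2.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    exclude-result-prefixes="oai dc">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <!-- suppress the built-in rule that copies stray text nodes -->
  <xsl:template match="text()"/>

  <!-- build one indexing record per OAI record (assumed input shape) -->
  <xsl:template match="oai:record">
    <z:record z:id="{normalize-space(oai:header/oai:identifier)}"
              z:type="update">
      <z:index name="oai:identifier" type="0">
        <xsl:value-of select="oai:header/oai:identifier"/>
      </z:index>
      <z:index name="oai:datestamp" type="0">
        <xsl:value-of select="oai:header/oai:datestamp"/>
      </z:index>
      <!-- pull each title out of the metadata into a word index -->
      <xsl:for-each select="oai:metadata//dc:title">
        <z:index name="dc:title" type="w">
          <xsl:value-of select="."/>
        </z:index>
      </xsl:for-each>
    </z:record>
  </xsl:template>

</xsl:stylesheet>
]]></screen>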
   Notice that there are no <literal>.abs</literal>,
   <literal>.est</literal>, <literal>.map</literal>, or other GRS-1
   filter configuration files involved in this process. Notice also
   that the names and types of the indexes can be defined in the
   indexing XSLT stylesheet <emphasis>dynamically according to
   content in the original XML records</emphasis>, which offers
   opportunities for both great power and great disaster.
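   <para>
    Dynamic index naming can be sketched with an XSLT attribute value
    template, as below. This is an illustration under the same
    assumptions about OAI-PMH and Dublin Core input, not a recommended
    configuration: every Dublin Core element present in the data
    yields an index named after it, so unexpected element names in
    the input may lead to unexpected index names.
   </para>
   <screen><![CDATA[
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://indexdata.dk/zebra/xslt/1"
    xmlns:oai="http://www.openarchives.org/OAI/2.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    exclude-result-prefixes="oai dc">

  <!-- suppress the built-in rule that copies stray text nodes -->
  <xsl:template match="text()"/>

  <xsl:template match="oai:record">
    <z:record z:id="{normalize-space(oai:header/oai:identifier)}"
              z:type="update">
      <!-- the index name is computed from the element name,
           for example dc:title or dc:creator -->
      <xsl:for-each select="oai:metadata//dc:*">
        <z:index name="dc:{local-name()}" type="w">
          <xsl:value-of select="."/>
        </z:index>
      </xsl:for-each>
    </z:record>
  </xsl:template>

</xsl:stylesheet>
]]></screen>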
 <sect1 id="record-model-alvisxslt-conf">
  <title>ALVIS Record Model Configuration</title>
  <sect2 id="record-model-alvisxslt-index">
   <title>ALVIS Indexing Configuration</title>
  <sect2 id="record-model-alvisxslt-elementset">
   <title>ALVIS Exchange Formats</title>
c) Main "alvis" XSLT filter config file:
  cat db/filter_alvis_conf.xml
  <?xml version="1.0" encoding="UTF-8"?>
    <schema name="alvis" stylesheet="db/alvis2alvis.xsl" />
    <schema name="index" identifier="http://indexdata.dk/zebra/xslt/1"
            stylesheet="db/alvis2index.xsl" />
    <schema name="dc" stylesheet="db/alvis2dc.xsl" />
    <schema name="dc-short" stylesheet="db/alvis2dc_short.xsl" />
    <schema name="snippet" snippet="25" stylesheet="db/alvis2snippet.xsl" />
    <schema name="help" stylesheet="db/alvis2help.xsl" />
  the paths are relative to the directory where zebra.init is placed
  The split level decides where the SAX parser shall split the
  collections of records into individual records, which then are
  loaded into DOM, and have the indexing XSLT stylesheet applied.
  The indexing stylesheet is found by its identifier.
  All the other stylesheets are for presentation after search.
- in data/ a short sample of harvested carnivorous plants
  ZEBRA_INDEX_DIRS=data/carnivor_20050118_2200_short-346.xml
- in root also one single data record - nice for testing the xslt
  xsltproc db/alvis2index.xsl carni*.xml
- in db/ a cql2pqf.txt yaz-client config file
  which is also used in the yaz-server <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>-to-PQF process
  see: http://www.indexdata.com/yaz/doc/tools.tkl#tools.cql.map
- in db/ an indexing XSLT stylesheet. This is a PULL-type XSLT stylesheet,
  as it constructs the new XML structure by pulling data out of the
  respective elements/attributes of the old structure.
  Notice the special Zebra namespace, and the special elements in this
  namespace which indicate to the Zebra indexer what to do.
  <z:record id="67ht7" rank="675" type="update">
  indicates that the record with the given id and static rank has to be updated.
  <z:index name="title" type="w">
  encloses all the text/XML which shall be indexed in the index named
  "title" and of index type "w" (see file default.idx in your zebra
 <!-- Keep this comment at the end of the file
 sgml-minimize-attributes:nil
 sgml-always-quote-attributes:t
 sgml-parent-document: "zebra.xml"
 sgml-local-catalogs: nil
 sgml-namecase-general:t