doc/recordmodel.xml

   1 <chapter id="record-model">
   2  <!-- $Id: recordmodel.xml,v 1.1 2002-04-09 13:26:26 adam Exp $ -->
   3  <title>The Record Model</title>
   4
   5  <para>
   6   The Zebra system is designed to support a wide range of data management
   7   applications. The system can be configured to handle virtually any
   8   kind of structured data. Each record in the system is associated with
   9   a <emphasis>record schema</emphasis> which lends context to the data
  10   elements of the record.
  11   Any number of record schemas can coexist in the system.
  12   Although it may be wise to use only a single schema within
  13   one database, the system poses no such restrictions.
  14  </para>
  15
  16  <para>
  17   The record model described in this chapter applies to the fundamental,
  18   structured
  19   record type <literal>grs</literal> as introduced in
  20   <xref linkend="record-types"/>.
  21  </para>
  22
  23  <para>
  24   Records pass through three different states during processing in the
  25   system.
  26  </para>
  27
  28  <para>
  29
  30   <itemizedlist>
  31    <listitem>
  32
  33     <para>
  34      When records are accessed by the system, they are represented
  35      in their local, or native format. This might be SGML or HTML files,
  36      News or Mail archives, or MARC records. If the system doesn't already
  37      know how to read the type of data you need to store, you can set up an
  38      input filter by preparing conversion rules based on regular
  39      expressions and possibly augmented by a flexible scripting language
  40      (Tcl).
  41      The input filter produces an internal representation as its output.
  42
  43     </para>
  44    </listitem>
  45    <listitem>
  46
  47     <para>
  48      When records are processed by the system, they are represented
  49      in a tree-structure, constructed by tagged data elements hanging off a
  50      root node. The tagged elements may contain data or yet more tagged
  51      elements in a recursive structure. The system performs various
  52      actions on this tree structure (indexing, element selection, schema
  53      mapping, etc.).
  54
  55     </para>
  56    </listitem>
  57    <listitem>
  58
  59     <para>
  60      Before records are transmitted to the client, they are
  61      converted from the internal structure to a form suitable for exchange
  62      over the network, according to the Z39.50 standard.
  63     </para>
  64    </listitem>
  65
  66   </itemizedlist>
  67
  68  </para>
  69
  70  <sect1 id="local-representation">
  71   <title>Local Representation</title>
  72
  73   <para>
  74    As mentioned earlier, Zebra places few restrictions on the type of
  75    data that you can index and manage. Generally, whatever the form of
  76    the data, it is parsed by an input filter specific to that format, and
  77    turned into an internal structure that Zebra knows how to handle. This
  78    process takes place whenever the record is accessed - for indexing and
  79    retrieval.
  80   </para>
  81
  82   <para>
  83    The RecordType parameter in the <literal>zebra.cfg</literal> file, or
  84    the <literal>-t</literal> option to the indexer, tells Zebra how to
  85    process input records.
  86    Two basic types of processing are available - raw text and structured
  87    data. Raw text is just that, and it is selected by providing the
  88    argument <emphasis>text</emphasis> to Zebra. Structured records are
  89    all handled internally using the basic mechanisms described in the
  90    subsequent sections.
  91    Zebra can read structured records in many different formats.
  92    How this is done is governed by additional parameters after the
  93    "grs" keyword, separated by "." characters.
  94   </para>
  95
  96   <para>
  97    Four basic subtypes of the <emphasis>grs</emphasis> type are
  98    currently available:
  99   </para>
 100
 101   <para>
 102    <variablelist>
 103     <varlistentry>
 104      <term>grs.sgml</term>
 105      <listitem>
 106       <para>
 107        This is the canonical input format &mdash;
 108        described below. It is a simple SGML-like syntax.
 109       </para>
 110      </listitem>
 111     </varlistentry>
 112     <varlistentry>
 113      <term>grs.regx.<emphasis>filter</emphasis></term>
 114      <listitem>
 115       <para>
 116        This enables a user-supplied input
 117        filter. The mechanisms of these filters are described below.
 118       </para>
 119      </listitem>
 120     </varlistentry>
 121     <varlistentry>
 122      <term>grs.tcl.<emphasis>filter</emphasis></term>
 123      <listitem>
 124       <para>
 125        Similar to grs.regx but using Tcl for rules.
 126       </para>
 127      </listitem>
 128     </varlistentry>
 129     <varlistentry>
 130      <term>grs.marc.<emphasis>abstract syntax</emphasis></term>
 131      <listitem>
 132       <para>
 133        This allows Zebra to read
 134        records in the ISO2709 (MARC) encoding standard. In this case, the
 135        last parameter <emphasis>abstract syntax</emphasis> names the
 136        <literal>.abs</literal> file (see below)
 137        which describes the specific MARC structure of the input record as
 138        well as the indexing rules.
 139       </para>
 140      </listitem>
 141     </varlistentry>
 142    </variablelist>
 143   </para>
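     <para>
      As a rough sketch, the record type might be selected in
      <literal>zebra.cfg</literal> along the following lines. The
      <literal>recordType</literal> spelling and the
      <literal>setting: value</literal> layout are assumptions about the
      configuration syntax, and <literal>myfilter</literal> and
      <literal>usmarc</literal> are placeholder names for a filter file and
      an abstract syntax. Only one setting would normally be active.
     </para>

     <para>

      <screen>
       # Canonical, SGML-like input (assumed setting name)
       recordType: grs.sgml

       # Alternatives, shown commented out:
       # recordType: grs.regx.myfilter    (user-supplied regx filter, placeholder name)
       # recordType: grs.marc.usmarc      (MARC input, placeholder abstract syntax)
       # recordType: text                 (raw text)
      </screen>

     </para>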
 144
 145   <sect2>
 146    <title>Canonical Input Format</title>
 147
 148    <para>
 149     Although input data can take any form, it is sometimes useful to
 150     describe the record processing capabilities of the system in terms of
 151     a single, canonical input format that gives access to the full
 152     spectrum of structure and flexibility in the system. In Zebra, this
 153     canonical format is an "SGML-like" syntax.
 154    </para>
 155
 156    <para>
 157     To use the canonical format specify <literal>grs.sgml</literal> as
 158     the record type.
 159    </para>
 160
 161    <para>
 162     Consider a record describing an information resource (such a record is
 163     sometimes known as a <emphasis>locator record</emphasis>).
 164     It might contain a field describing the distributor of the
 165     information resource, which might in turn be partitioned into
 166     various fields providing details about the distributor, like this:
 167    </para>
 168
 169    <para>
 170
 171     <screen>
 172      &#60;Distributor&#62;
 173        &#60;Name&#62; USGS/WRD &#60;/Name&#62;
 174        &#60;Organization&#62; USGS/WRD &#60;/Organization&#62;
 175        &#60;Street-Address&#62;
 176          U.S. GEOLOGICAL SURVEY, 505 MARQUETTE, NW
 177        &#60;/Street-Address&#62;
 178        &#60;City&#62; ALBUQUERQUE &#60;/City&#62;
 179        &#60;State&#62; NM &#60;/State&#62;
 180        &#60;Zip-Code&#62; 87102 &#60;/Zip-Code&#62;
 181        &#60;Country&#62; USA &#60;/Country&#62;
 182        &#60;Telephone&#62; (505) 766-5560 &#60;/Telephone&#62;
 183      &#60;/Distributor&#62;
 184     </screen>
 185
 186    </para>
 187
 188    <note>
 189    <para>
 190     The indentation in the example above only illustrates how Zebra
 191      interprets the markup. The indentation, in itself, has no
 192      significance to the parser for the canonical input format, which
 193      discards superfluous whitespace.
 194     </para>
 195    </note>
 196    <para>
 197     The keywords surrounded by &lt;...&gt; are
 198     <emphasis>tags</emphasis>, while the sections of text
 199     in between are the <emphasis>data elements</emphasis>.
 200     A data element is characterized by its location in the tree
 201     that is made up by the nested elements.
 202     Each element is terminated by a closing tag - beginning
 203     with <literal>&#60;/</literal>, and containing the same symbolic
 204     tag-name as the corresponding opening tag.
 205     The general closing tag - <literal>&#60;/&#62;</literal> -
 206     terminates the element started by the last opening tag. The
 207     structuring of elements is significant.
 208     The element <emphasis>Telephone</emphasis>,
 209     for instance, may be indexed and presented to the client differently,
 210     depending on whether it appears inside the
 211     <emphasis>Distributor</emphasis> element, or some other,
 212     structured data element such as a <emphasis>Supplier</emphasis> element.
 213    </para>
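      <para>
       For illustration, the general closing tag can stand in for the named
       closing tags of the innermost elements. The abridged fragment below is
       a sketch only, and should be interpreted in the same way as the
       corresponding lines of the record shown earlier:
      </para>

      <para>

       <screen>
        &#60;Distributor&#62;
          &#60;Name&#62; USGS/WRD &#60;/&#62;
          &#60;City&#62; ALBUQUERQUE &#60;/&#62;
          &#60;State&#62; NM &#60;/&#62;
        &#60;/Distributor&#62;
       </screen>

      </para>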
 214
 215    <sect3>
 216     <title>Record Root</title>
 217
 218     <para>
 219      The first tag in a record describes the root node of the tree that
 220      makes up the total record. In the canonical input format, the root tag
 221      should contain the name of the schema that lends context to the
 222      elements of the record
 223      (see <xref linkend="internal-representation"/>).
 224       Zebra does not validate the contents of a record against the
 225       Z39.50 profile; it merely attempts to match up elements of a local
 226       representation with the given schema.
 227       The following GILS record contains only a single element, which,
 228       strictly speaking, makes it an illegal GILS record, since the GILS
 229       profile includes several mandatory elements:
 230     </para>
 231
 232     <para>
 233
 234      <screen>
 235       &#60;gils&#62;
 236       &#60;title&#62;Zen and the Art of Motorcycle Maintenance&#60;/title&#62;
 237       &#60;/gils&#62;
 238      </screen>
 239
 240     </para>
 241
 242    </sect3>
 243
 244    <sect3>
 245     <title>Variants</title>
 246
 247     <para>
 248      Zebra allows you to provide individual data elements in a number of
 249      <emphasis>variant forms</emphasis>. Examples of variant forms are
 250      textual data elements which might appear in different languages, and
 251      images which may appear in different formats or layouts.
 252      The variant system in Zebra is essentially a representation of
 253      the variant mechanism of Z39.50-1995.
 254     </para>
 255
 256     <para>
 257      The following is an example of a title element which occurs in two
 258      different languages.
 259     </para>
 260
 261     <para>
 262
 263      <screen>
 264       &#60;title&#62;
 265       &#60;var lang lang "eng"&#62;
 266       Zen and the Art of Motorcycle Maintenance&#60;/&#62;
 267       &#60;var lang lang "dan"&#62;
 268       Zen og Kunsten at Vedligeholde en Motorcykel&#60;/&#62;
 269       &#60;/title&#62;
 270      </screen>
 271
 272     </para>
 273
 274     <para>
 275      The syntax of the <emphasis>variant element</emphasis> is
 276      <literal>&lt;var class type value&gt;</literal>.
 277      The available values for the <emphasis>class</emphasis> and
 278      <emphasis>type</emphasis> fields are given by the variant set
 279      that is associated with the current schema
 280      (see <xref linkend="variant-set"/>).
 281     </para>
 282
 283     <para>
 284      Variant elements are terminated by the general end-tag &#60;/&#62;, by
 285      the variant end-tag &#60;/var&#62;, by the appearance of another variant
 286      tag with the same <emphasis>class</emphasis> and
 287      <emphasis>value</emphasis> settings, or by the
 288      appearance of another, normal tag. In other words, the end-tags for
 289      the variants used in the example above could have been omitted.
 290     </para>
 291
 292     <para>
 293      Variant elements can be nested, as in the following element:
 294     </para>
 295
 296     <para>
 297
 298      <screen>
 299       &#60;title&#62;
 300       &#60;var lang lang "eng"&#62;&#60;var body iana "text/plain"&#62;
 301       Zen and the Art of Motorcycle Maintenance
 302       &#60;/title&#62;
 303      </screen>
 304
 305     </para>
 306
 307     <para>
 308      This element associates two variant components with the variant
 309      list for the title element.
 310     </para>
 311
 312     <para>
 313      Given the nesting rules described above, we could write
 314     </para>
 315
 316     <para>
 317
 318      <screen>
 319       &#60;title&#62;
 320       &#60;var body iana "text/plain"&#62;
 321       &#60;var lang lang "eng"&#62;
 322       Zen and the Art of Motorcycle Maintenance
 323       &#60;var lang lang "dan"&#62;
 324       Zen og Kunsten at Vedligeholde en Motorcykel
 325       &#60;/title&#62;
 326      </screen>
 327
 328     </para>
 329
 330     <para>
 331      The title element above comes in two variants. Both have the IANA body
 332      type "text/plain", but one is in English, and the other in
 333      Danish. The client, using the element selection mechanism of Z39.50,
 334      can retrieve information about the available variant forms of data
 335      elements, or it can select specific variants based on the requirements
 336      of the end-user.
 337     </para>
 338
 339    </sect3>
 340
 341   </sect2>
 342
 343   <sect2>
 344    <title>Input Filters</title>
 345
 346    <para>
 347     In order to handle general input formats, Zebra allows the
 348     operator to define filters which read individual records in their
 349     native format and produce an internal representation that the system
 350     can work with.
 351    </para>
 352
 353    <para>
 354     Input filters are ASCII files, generally with the suffix
 355     <literal>.flt</literal>.
 356     The system looks for the files in the directories given in the
 357     <emphasis>profilePath</emphasis> setting in the
 358     <literal>zebra.cfg</literal> file.
 359     The record type for the filter is
 360     <literal>grs.regx.</literal><emphasis>filter-filename</emphasis>
 361     (fundamental type <literal>grs</literal>, file read
 362     type <literal>regx</literal>, argument
 363     <emphasis>filter-filename</emphasis>).
 364    </para>
 365
 366    <para>
 367     Generally, an input filter consists of a sequence of rules, where each
 368     rule consists of a sequence of expressions, followed by an action. The
 369     expressions are evaluated against the contents of the input record,
 370     and the actions normally contribute to the generation of an internal
 371     representation of the record.
 372    </para>
 373
 374    <para>
 375     An expression can be any of the following:
 376    </para>
 377
 378    <para>
 379     <variablelist>
 380
 381      <varlistentry>
 382       <term>INIT</term>
 383       <listitem>
 384        <para>
 385         The action associated with this expression is evaluated
 386         exactly once in the lifetime of the application, before any records
 387         are read. It can be used in conjunction with an action that
 388         initializes tables or other resources that are used in the processing
 389         of input records.
 390        </para>
 391       </listitem>
 392      </varlistentry>
 393      <varlistentry>
 394       <term>BEGIN</term>
 395       <listitem>
 396        <para>
 397         Matches the beginning of the record. It can be used to
 398         initialize variables, etc. Typically, the
 399         <emphasis>BEGIN</emphasis> rule is also used
 400         to establish the root node of the record.
 401        </para>
 402       </listitem>
 403      </varlistentry>
 404      <varlistentry>
 405       <term>END</term>
 406       <listitem>
 407        <para>
 408         Matches the end of the record - when all of the contents
 409         of the record have been processed.
 410        </para>
 411       </listitem>
 412      </varlistentry>
 413      <varlistentry>
 414       <term>/pattern/</term>
 415       <listitem>
 416        <para>
 417         Matches a string of characters from the input record.
 418        </para>
 419       </listitem>
 420      </varlistentry>
 421      <varlistentry>
 422       <term>BODY</term>
 423       <listitem>
 424        <para>
 425         This keyword may only be used between two patterns.
 426         It matches everything between (not including) those patterns.
 427        </para>
 428       </listitem>
 429      </varlistentry>
 430      <varlistentry>
 431       <term>FINISH</term>
 432       <listitem>
 433        <para>
 434         The action associated with this expression is evaluated
 435         once, before the application terminates. It can be used to release
 436         system resources - typically ones allocated in the
 437         <emphasis>INIT</emphasis> step.
 438        </para>
 439       </listitem>
 440      </varlistentry>
 441     </variablelist>
 442    </para>
 443
 444    <para>
 445     An action is surrounded by curly braces (&lcub;...&rcub;), and
 446     consists of a sequence of statements. Statements may be separated
 447     by newlines or semicolons (;).
 448     Within actions, the strings that matched the expressions
 449     immediately preceding the action can be referred to as
 450     &dollar;0, &dollar;1, &dollar;2, etc.
 451    </para>
 452
 453    <para>
 454     The available statements are:
 455    </para>
 456
 457    <para>
 458     <variablelist>
 459
 460      <varlistentry>
 461       <term>begin <emphasis>type &lsqb;parameter ... &rsqb;</emphasis></term>
 462       <listitem>
 463        <para>
 464         Begin a new
 465         data element. The type is one of the following:
 466         <variablelist>
 467
 468          <varlistentry>
 469           <term>record</term>
 470           <listitem>
 471            <para>
 472             Begin a new record. The following parameter should be the
 473             name of the schema that describes the structure of the record, e.g.
 474             <literal>gils</literal> or <literal>wais</literal> (see below).
 475             The <literal>begin record</literal> call should precede
 476             any other use of the <emphasis>begin</emphasis> statement.
 477            </para>
 478           </listitem>
 479          </varlistentry>
 480          <varlistentry>
 481           <term>element</term>
 482           <listitem>
 483            <para>
 484             Begin a new tagged element. The parameter is the
 485             name of the tag. If the tag is not matched anywhere in the tagsets
 486             referenced by the current schema, it is treated as a local string
 487             tag.
 488            </para>
 489           </listitem>
 490          </varlistentry>
 491          <varlistentry>
 492           <term>variant</term>
 493           <listitem>
 494            <para>
 495             Begin a new node in a variant tree. The parameters are
 496             <emphasis>class type value</emphasis>.
 497            </para>
 498           </listitem>
 499          </varlistentry>
 500         </variablelist>
 501        </para>
 502       </listitem>
 503      </varlistentry>
 504      <varlistentry>
 505       <term>data</term>
 506       <listitem>
 507        <para>
 508         Create a data element. The concatenated arguments make
 509         up the value of the data element.
 510         The option <literal>-text</literal> signals that
 511         the layout (whitespace) of the data should be retained for
 512         transmission.
 513         The option <literal>-element</literal>
 514         <emphasis>tag</emphasis> wraps the data up in
 515         the <emphasis>tag</emphasis>.
 516         The use of the <literal>-element</literal> option is equivalent to
 517         preceding the command with a <emphasis>begin
 518          element</emphasis> command, and following
 519         it with the <emphasis>end</emphasis> command.
 520        </para>
 521       </listitem>
 522      </varlistentry>
 523      <varlistentry>
 524       <term>end <emphasis>&lsqb;type&rsqb;</emphasis></term>
 525       <listitem>
 526        <para>
 527         Close a tagged element. If no parameter is given,
 528         the last element on the stack is terminated.
 529         The first parameter, if any, is a type name, similar
 530         to the <emphasis>begin</emphasis> statement.
 531         For the <emphasis>element</emphasis> type, a tag
 532         name can be provided to terminate a specific tag.
 533        </para>
 534       </listitem>
 535      </varlistentry>
 536     </variablelist>
 537    </para>
 538
 539    <para>
 540     The following input filter reads a Usenet news file, producing a
 541     record in the WAIS schema. Note that the body of a news posting is
 542     separated from the list of headers by a blank line (or rather a
 543     sequence of two newline characters).
 544    </para>
 545
 546    <para>
 547
 548     <screen>
 549      BEGIN                { begin record wais }
 550
 551      /^From:/ BODY /$/    { data -element name $1 }
 552      /^Subject:/ BODY /$/ { data -element title $1 }
 553      /^Date:/ BODY /$/    { data -element lastModified $1 }
 554      /\n\n/ BODY END      {
 555          begin element bodyOfDisplay
 556          begin variant body iana "text/plain"
 557          data -text $1
 558          end record
 559        }
 560     </screen>
 561
 562    </para>
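      <para>
       Because the <literal>-element</literal> option is described as
       shorthand for a <emphasis>begin element</emphasis> and
       <emphasis>end</emphasis> pair, the Subject rule above could presumably
       be written out in full. The following is a sketch based on the
       statement descriptions in this section, not a tested filter:
      </para>

      <para>

       <screen>
        /^Subject:/ BODY /$/ {
            begin element title
            data $1
            end element title
          }
       </screen>

      </para>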
 563
 564    <para>
 565     If Zebra is compiled with support for Tcl (Tool Command Language)
 566     enabled, the statements described above are supplemented with a complete
 567     scripting environment, including control structures (conditional
 568     expressions and loop constructs), and powerful string manipulation
 569     mechanisms for modifying the elements of a record. Tcl is a popular
 570     scripting environment, with several tutorials available both online
 571     and in hardcopy.
 572    </para>
 573
 574   </sect2>
 575
 576  </sect1>
 577
 578  <sect1 id="internal-representation">
 579   <title>Internal Representation</title>
 580
 581   <para>
 582    When records are manipulated by the system, they're represented in a
 583    tree-structure, with data elements at the leaf nodes, and tags or
 584    variant components at the non-leaf nodes. The root-node identifies the
 585    schema that lends context to the tagging and structuring of the
 586    record. Imagine a simple record, consisting of a 'title' element and
 587    an 'author' element:
 588   </para>
 589
 590   <para>
 591
 592    <screen>
 593     ROOT
 594       TITLE     "Zen and the Art of Motorcycle Maintenance"
 595       AUTHOR    "Robert Pirsig"
 596    </screen>
 597
 598   </para>
 599
 600   <para>
 601    A slightly more complex record would have the author element consist
 602    of two elements, a surname and a first name:
 603   </para>
 604
 605   <para>
 606
 607    <screen>
 608     ROOT
 609       TITLE     "Zen and the Art of Motorcycle Maintenance"
 610       AUTHOR
 611         FIRST-NAME "Robert"
 612         SURNAME    "Pirsig"
 613    </screen>
 614
 615   </para>
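     <para>
      In the canonical input format described earlier, such a record might be
      written roughly as follows. The root tag <literal>book</literal> is a
      hypothetical schema name and the tag names are purely illustrative; a
      real schema would define its own.
     </para>

     <para>

      <screen>
       &#60;book&#62;
         &#60;title&#62;Zen and the Art of Motorcycle Maintenance&#60;/title&#62;
         &#60;author&#62;
           &#60;first-name&#62;Robert&#60;/first-name&#62;
           &#60;surname&#62;Pirsig&#60;/surname&#62;
         &#60;/author&#62;
       &#60;/book&#62;
      </screen>

     </para>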
 616
 617   <para>
 618    The root of the record will refer to the record schema that describes
 619    the structuring of this particular record. The schema defines the
 620    element tags (TITLE, FIRST-NAME, etc.) that may occur in the record, as
 621    well as the structuring (SURNAME should appear below AUTHOR, etc.). In
 622    addition, the schema establishes element set names that are used by
 623    the client to request a subset of the elements of a given record. The
 624    schema may also establish rules for converting the record to a
 625    different schema, by stating, for each element, a mapping to a
 626    different tag path.
 627   </para>
 628
 629   <sect2>
 630    <title>Tagged Elements</title>
 631
 632    <para>
 633     A data element is characterized by its tag, and its position in the
 634     structure of the record. For instance, while the tag "telephone
 635     number" may be used in different places in a record, we may need to
 636     distinguish between these occurrences, both for searching and
 637     presentation purposes. For example, while the phone numbers for the
 638     "customer" and the "service provider" are both
 639     instances of the same type of resource (a telephone number), it
 640     is essential that they be kept separate. The record schema provides
 641     the structure of the record, and names each data element (defined by
 642     the sequence of tags - the tag path - by which the element can be
 643     reached from the root of the record).
 644    </para>
 645
 646   </sect2>
 647
 648   <sect2>
 649    <title>Variants</title>
 650
 651    <para>
 652     The children of a tag node may be either more tag nodes, a data node
 653     (possibly accompanied by tag nodes),
 654     or a tree of variant nodes. The children of variant nodes are either
 655     more variant nodes or a data node (possibly accompanied by more
 656     variant nodes). Each leaf node, which is normally a
 657     data node, corresponds to a <emphasis>variant form</emphasis> of the
 658     tagged element identified by the tag which parents the variant tree.
 659     The following title element occurs in two different languages:
 660    </para>
 661
 662    <para>
 663
 664     <screen>
 665      TITLE
 666        VARIANT LANG=ENG  "War and Peace"
 667        VARIANT LANG=DAN  "Krig og Fred"
 668     </screen>
 669
 670    </para>
 671
 672    <para>
 673     Which of the two elements is transmitted to the client by the server
 674     depends on the specifications provided by the client, if any.
 675    </para>
 676
 677    <para>
 678     In practice, each variant node is associated with a triple of class,
 679     type, value, corresponding to the variant mechanism of Z39.50.
 680    </para>
 681
 682   </sect2>
 683
 684   <sect2>
 685    <title>Data Elements</title>
 686
 687    <para>
 688     Data nodes have no children (they are always leaf nodes in the record
 689     tree).
 690    </para>
 691
 692    <note>
 693     <para>
 694      Documentation needs extension here about types of nodes - numerical,
 695      textual, etc., plus the various types of inclusion notes.
 696     </para>
 697    </note>
 698
 699   </sect2>
 700
 701  </sect1>
 702
 703  <sect1 id="data-model">
 704   <title>Configuring Your Data Model</title>
 705
 706   <para>
 707    The following sections describe the configuration files that govern
 708    the internal management of data records. The system searches for the files
 709    in the directories specified by the <emphasis>profilePath</emphasis>
 710    setting in the <literal>zebra.cfg</literal> file.
 711   </para>
 712
 713   <sect2>
 714    <title>The Abstract Syntax</title>
 715
 716    <para>
 717     The abstract syntax definition (also known as an Abstract Record
 718     Structure, or ARS) is the focal point of the
 719     record schema description. For a given schema, the ABS file may state any
 720     or all of the following:
 721    </para>
 722
 723    <para>
 724
 725     <itemizedlist>
 726      <listitem>
 727
 728       <para>
 729        The object identifier of the Z39.50 schema associated
 730        with the ARS, so that it can be referred to by the client.
 731       </para>
 732      </listitem>
 733
 734      <listitem>
 735       <para>
 736        The attribute set (which can possibly be a compound of multiple
 737        sets) which applies in the profile. This is used when indexing and
 738        searching the records belonging to the given profile.
 739       </para>
 740      </listitem>
 741
 742      <listitem>
 743       <para>
 744        The Tag set (again, this can consist of several different sets).
 745        This is used when reading the records from a file, to recognize the
 746        different tags, and when transmitting the record to the client -
 747        mapping the tags to their numerical representation, if they are
 748        known.
 749       </para>
 750      </listitem>
 751
 752      <listitem>
 753       <para>
 754        The variant set which is used in the profile. This provides a
 755        vocabulary for specifying the <emphasis>forms</emphasis> of data that appear inside
 756        the records.
 757       </para>
 758      </listitem>
 759
 760      <listitem>
 761       <para>
 762        Element set names, which are a shorthand way for the client to
 763        ask for a subset of the data elements contained in a record. Element
 764        set names, in the retrieval module, are mapped to <emphasis>element
 765         specifications</emphasis>, which contain information equivalent to the
 766        <emphasis>Espec-1</emphasis> syntax of Z39.50.
 767       </para>
 768      </listitem>
 769
 770      <listitem>
 771       <para>
 772        Map tables, which may specify mappings to
 773        <emphasis>other</emphasis> database profiles, if desired.
 774       </para>
 775      </listitem>
 776
 777      <listitem>
 778       <para>
 779        Possibly, a set of rules describing the mapping of elements to a
 780        MARC representation.
 781
 782       </para>
 783      </listitem>
 784
 785      <listitem>
 786       <para>
 787        A list of element descriptions (this is the actual ARS of the
 788        schema, in Z39.50 terms), which lists the ways in which the various
 789        tags can be used and organized hierarchically.
 790       </para>
 791      </listitem>
 792
 793     </itemizedlist>
 794
 795    </para>
 796
 797    <para>
 798     Several of the entries above simply refer to other files, which
 799     describe the given objects.
 800    </para>
 801
 802   </sect2>
 803
 804   <sect2>
 805    <title>The Configuration Files</title>
 806
 807    <para>
 808     This section describes the syntax and use of the various tables which
 809     are used by the retrieval module.
 810    </para>
 811
 812    <para>
 813     The number of different file types may appear daunting at first, but
 814     each type corresponds fairly clearly to a single aspect of the Z39.50
 815     retrieval facilities. Further, the average database administrator,
 816     who is simply reusing an existing profile for which tables already
 817     exist, shouldn't have to worry too much about the contents of these tables.
 818    </para>
 819
 820    <para>
 821     Generally, the files are simple ASCII files, which can be maintained
 822     using any text editor. Blank lines and lines beginning with a hash character (&num;) are
 823     ignored. Any characters on a line that follow a (&num;) are also ignored.
 824     All other lines contain <emphasis>directives</emphasis>, which provide
 825     some setting or value to the system.
 826     Generally, settings are characterized by a single
 827     keyword, identifying the setting, followed by a number of parameters.
 828     Some settings are repeatable (r), while others may occur only once in a
 829     file. Some settings are optional (o), while others again are
 830     mandatory (m).
 831    </para>
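      <para>
       A hedged illustration of these conventions follows. The directive
       keywords are taken from the abstract syntax file type described in the
       next section; the parameter values and file names are placeholders,
       not a verified profile.
      </para>

      <para>

       <screen>
        # A comment line; it is ignored, as are blank lines

        name gils          # characters after the hash character are ignored
        attset gils.att    # keyword first, then its parameter (placeholder file name)
        tagset gils.tag    # placeholder file name
       </screen>

      </para>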
 832
 833   </sect2>
 834
 835   <sect2>
 836    <title>The Abstract Syntax (.abs) Files</title>
 837
 838    <para>
 839     The name of this file type is slightly misleading in Z39.50 terms,
 840     since, apart from the actual abstract syntax of the profile, it also
 841     includes most of the other definitions that go into a database
 842     profile.
 843    </para>
 844
 845    <para>
 846     When a record in the canonical, SGML-like format is read from a file
 847     or from the database, the first tag of the file should reference the
 848     profile that governs the layout of the record. If the first tag of the
 849     record is, say, <literal>&lt;gils&gt;</literal>, the system will look
 850     for the profile definition in the file <literal>gils.abs</literal>.
 851     Profile definitions are cached, so they only have to be read once
 852     during the lifespan of the current process.
 853    </para>
 854
 855    <para>
 856     When writing your own input filters, the
 857     <literal>begin record</literal> statement
 858     introduces the profile, and should always be called first thing when
 859     introducing a new record.
 860    </para>
 861
 862    <para>
 863     The file may contain the following directives:
 864    </para>
 865
 866    <para>
 867     <variablelist>
 868
 869      <varlistentry>
 870       <term>name <emphasis>symbolic-name</emphasis></term>
 871       <listitem>
 872        <para>
 873         (m) This provides a shorthand name or
 874         description for the profile. Mostly useful for diagnostic purposes.
 875        </para>
 876       </listitem>
 877      </varlistentry>
 878      <varlistentry>
 879       <term>reference <emphasis>OID-name</emphasis></term>
 880       <listitem>
 881        <para>
 882         (m) The reference name of the OID for the profile.
 883         The reference names can be found in the <emphasis>util</emphasis>
 884         module of <emphasis>YAZ</emphasis>.
 885        </para>
 886       </listitem>
 887      </varlistentry>
 888      <varlistentry>
 889       <term>attset <emphasis>filename</emphasis></term>
 890       <listitem>
 891        <para>
 892         (m) The attribute set that is used for
 893         indexing and searching records belonging to this profile.
 894        </para>
 895       </listitem>
 896      </varlistentry>
 897      <varlistentry>
 898       <term>tagset <emphasis>filename</emphasis></term>
 899       <listitem>
 900        <para>
 901         (o) The tag set (if any) that describes
 902         the fields of the records.
 903        </para>
 904       </listitem>
 905      </varlistentry>
 906      <varlistentry>
 907       <term>varset <emphasis>filename</emphasis></term>
 908       <listitem>
 909        <para>
 910         (o) The variant set used in the profile.
 911        </para>
 912       </listitem>
 913      </varlistentry>
 914      <varlistentry>
 915       <term>maptab <emphasis>filename</emphasis></term>
 916       <listitem>
 917        <para>
 918         (o,r) This points to a
 919         conversion table that might be used if the client asks for the record
 920         in a different schema from the native one.
 921        </para>
 922       </listitem></varlistentry>
 923      <varlistentry>
 924       <term>marc <emphasis>filename</emphasis></term>
 925       <listitem>
 926        <para>
 927         (o) Points to a file containing parameters
 928         for representing the record contents in the ISO2709 syntax. Read the
 929         description of the MARC representation facility below.
 930        </para>
 931       </listitem></varlistentry>
 932      <varlistentry>
 933       <term>esetname <emphasis>name filename</emphasis></term>
 934       <listitem>
 935        <para>
 936         (o,r) Associates the
 937         given element set name with an element selection file. If an (@) is
 938         given in place of the filename, this corresponds to a null mapping for
 939         the given element set name.
 940        </para>
 941       </listitem></varlistentry>
 942      <varlistentry>
 943       <term>any <emphasis>tags</emphasis></term>
 944       <listitem>
 945        <para>
 946         (o) This directive specifies a list of attributes
 947         which should be appended to the attribute list given for each
 948         element. The effect is to make every single element in the abstract
 949         syntax searchable by way of the given attributes. This directive
 950         provides an efficient way of supporting free-text searching across all
 951         elements. However, it does increase the size of the index
 952         significantly. The attributes can be qualified with a structure, as in
 953         the <emphasis>elm</emphasis> directive below.
 954        </para>
 955       </listitem></varlistentry>
 956      <varlistentry>
 957       <term>elm <emphasis>path name attributes</emphasis></term>
 958       <listitem>
 959        <para>
 960         (o,r) Adds an element to the abstract record syntax of the schema.
 961         The <emphasis>path</emphasis> follows the
 962         syntax suggested by the Z39.50 document: a sequence
 963         of tags separated by slashes (/). Each tag is given as a
 964         comma-separated pair of tag type and tag value enclosed in parentheses.
 965         The <emphasis>name</emphasis> is the name of the element, and
 966         the <emphasis>attributes</emphasis>
 967         specifies which attributes to use when indexing the element in a
 968         comma-separated list.
 969         A ! in place of the attribute name is equivalent to
 970         specifying an attribute name identical to the element name.
 971         A - in place of the attribute name
 972         specifies that no indexing is to take place for the given element.
 973         The attributes can be qualified with <emphasis>field
 974          types</emphasis> to specify which
 975         character set should govern the indexing procedure for that field.
 976         The same data element may be indexed into several different
 977         fields, using different character set definitions.
 978         See <xref linkend="field-structure-and-character-sets"/>.
 979          The default field type is "w" for <emphasis>word</emphasis>.
 980        </para>
 981       </listitem></varlistentry>
 982     </variablelist>
 983    </para>
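
        <para>
         To make the <emphasis>elm</emphasis> parameters concrete, the
         following lines are taken from the GILS excerpt below and annotated
         with trailing comments. The comments are ours, and they assume the
         reading of <literal>!</literal>, <literal>-</literal> and the field
         type prefixes given in the description above.
        </para>

        <para>
         <screen>
          # path                  name             attributes
          elm (2,1)               title            w:!,p:!   # attribute "title", indexed under field types w and p
          elm (2,6)               abstract         Abstract  # attribute "Abstract", default field type w
          elm (4,51)              purpose          !         # attribute named like the element: "purpose"
          elm (4,52)              originator       -         # not indexed
          elm (4,70)/(4,90)/(2,7) distributorName  !         # a nested element, three tags deep
         </screen>
        </para>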
 984
 985    <note>
 986    <para>
 987      The mechanism for controlling indexing is not adequate for
 988      complex databases, and will probably be moved into a separate
 989      configuration table eventually.
 990     </para>
 991    </note>
 992
 993    <para>
 994     The following is an excerpt from the abstract syntax file for the GILS
 995     profile.
 996    </para>
 997
 998    <para>
 999
1000     <screen>
1001      name gils
1002      reference GILS-schema
1003      attset gils.att
1004      tagset gils.tag
1005      varset var1.var
1006
1007      maptab gils-usmarc.map
1008
1009      # Element set names
1010
1011      esetname VARIANT gils-variant.est  # for WAIS-compliance
1012      esetname B gils-b.est
1013      esetname G gils-g.est
1014      esetname F @
1015
1016      elm (1,10)              rank                        -
1017      elm (1,12)              url                         -
1018      elm (1,14)              localControlNumber     Local-number
1019      elm (1,16)              dateOfLastModification Date/time-last-modified
1020      elm (2,1)               title                       w:!,p:!
1021      elm (4,1)               controlIdentifier      Identifier-standard
1022      elm (2,6)               abstract               Abstract
1023      elm (4,51)              purpose                     !
1024      elm (4,52)              originator                  -
1025      elm (4,53)              accessConstraints           !
1026      elm (4,54)              useConstraints              !
1027      elm (4,70)              availability                -
1028      elm (4,70)/(4,90)       distributor                 -
1029      elm (4,70)/(4,90)/(2,7) distributorName             !
1030      elm (4,70)/(4,90)/(2,10) distributorOrganization    !
1031      elm (4,70)/(4,90)/(4,2) distributorStreetAddress    !
1032      elm (4,70)/(4,90)/(4,3) distributorCity             !
1033     </screen>
1034
1035    </para>
1036
1037   </sect2>
1038
1039   <sect2 id="attset-files">
1040    <title>The Attribute Set (.att) Files</title>
1041
1042    <para>
1043     This file type describes the <emphasis>Use</emphasis> elements of
1044     an attribute set.
1045     It contains the following directives.
1046    </para>
1047
1048    <para>
1049     <variablelist>
1050      <varlistentry>
1051       <term>name <emphasis>symbolic-name</emphasis></term>
1052       <listitem>
1053        <para>
1054         (m) This provides a shorthand name or
1055         description for the attribute set.
1056         Mostly useful for diagnostic purposes.
1057        </para>
1058       </listitem></varlistentry>
1059      <varlistentry>
1060       <term>reference <emphasis>OID-name</emphasis></term>
1061       <listitem>
1062        <para>
1063         (m) The reference name of the OID for
1064         the attribute set.
1065         The reference names can be found in the <emphasis>util</emphasis>
1066         module of <emphasis>YAZ</emphasis>.
1067        </para>
1068       </listitem></varlistentry>
1069      <varlistentry>
1070       <term>include <emphasis>filename</emphasis></term>
1071       <listitem>
1072        <para>
1073         (o,r) This directive is used to
1074         include another attribute set as a part of the current one. This is
1075         used when a new attribute set is defined as an extension to another
1076         set. For instance, many new attribute sets are defined as extensions
1077         to the <emphasis>bib-1</emphasis> set.
1078         This is an important feature of the retrieval
1079         system of Z39.50, as it ensures the highest possible level of
1080         interoperability: those access points of your database which are
1081         derived from the external set (say, bib-1) can be used even by clients
1082         who are unaware of the new set.
1083        </para>
1084       </listitem></varlistentry>
1085      <varlistentry>
1086       <term>att
1087        <emphasis>att-value att-name &lsqb;local-value&rsqb;</emphasis></term>
1088       <listitem>
1089        <para>
1090         (o,r) This
1091         repeatable directive introduces a new attribute to the set. The
1092         attribute value is stored in the index (unless a
1093         <emphasis>local-value</emphasis> is
1094         given, in which case this is stored). The name is used to refer to the
1095         attribute from the <emphasis>abstract syntax</emphasis>.
1096        </para>
1097       </listitem></varlistentry>
1098     </variablelist>
1099    </para>
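
        <para>
         As a sketch of the optional <emphasis>local-value</emphasis>
         parameter, compare the two lines below. The first comes from the
         GILS attribute set; the local value 3003 in the second is purely
         hypothetical and only illustrates the position of the parameter.
        </para>

        <para>
         <screen>
          att 2001   distributorName          # the value 2001 is stored in the index
          att 2003   purpose          3003    # hypothetical: 3003 is stored instead of 2003
         </screen>
        </para>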
1100
1101    <para>
1102     This is an excerpt from the GILS attribute set definition.
1103     Notice how the file describing the <emphasis>bib-1</emphasis>
1104     attribute set is referenced.
1105    </para>
1106
1107    <para>
1108
1109     <screen>
1110      name gils
1111      reference GILS-attset
1112      include bib1.att
1113
1114      att 2001           distributorName
1115      att 2002           indextermsControlled
1116      att 2003           purpose
1117      att 2004           accessConstraints
1118      att 2005           useConstraints
1119     </screen>
1120
1121    </para>
1122
1123   </sect2>
1124
1125   <sect2>
1126    <title>The Tag Set (.tag) Files</title>
1127
1128    <para>
1129     This file type defines the tagset of the profile, possibly by
1130     referencing other tag sets (most tag sets, for instance, will include
1131     tagsetG and tagsetM from the Z39.50 specification). The file may
1132     contain the following directives.
1133    </para>
1134
1135    <para>
1136     <variablelist>
1137
1138      <varlistentry>
1139       <term>name <emphasis>symbolic-name</emphasis></term>
1140       <listitem>
1141        <para>
1142         (m) This provides a shorthand name or
1143         description for the tag set. Mostly useful for diagnostic purposes.
1144        </para>
1145       </listitem></varlistentry>
1146      <varlistentry>
1147       <term>reference <emphasis>OID-name</emphasis></term>
1148       <listitem>
1149        <para>
1150         (o) The reference name of the OID for the tag set.
1151         The reference names can be found in the <emphasis>util</emphasis>
1152         module of <emphasis>YAZ</emphasis>.
1153         The directive is optional, since not all tag sets
1154         are registered outside of their schema.
1155        </para>
1156       </listitem></varlistentry>
1157      <varlistentry>
1158       <term>type <emphasis>integer</emphasis></term>
1159       <listitem>
1160        <para>
1161         (m) The type number of the tagset within the schema
1162         profile (note: this specification really should belong to the .abs
1163         file. This will be fixed in a future release).
1164        </para>
1165       </listitem></varlistentry>
1166      <varlistentry>
1167       <term>include <emphasis>filename</emphasis></term>
1168       <listitem>
1169        <para>
1170         (o,r) This directive is used
1171         to include the definitions of other tag sets into the current one.
1172        </para>
1173       </listitem></varlistentry>
1174      <varlistentry>
1175       <term>tag <emphasis>number names type</emphasis></term>
1176       <listitem>
1177        <para>
1178         (o,r) Introduces a new tag to the set.
1179         The <emphasis>number</emphasis> is the tag number as used
1180         in the protocol (there is currently no mechanism for
1181         specifying string tags, but this would be quick
1182         work to add).
1183         The <emphasis>names</emphasis> parameter is a list of names
1184         by which the tag should be recognized in the input file format.
1185         The names should be separated by slashes (/).
1186         The <emphasis>type</emphasis> is the recommended datatype of
1187         the tag.
1188         It should be one of the following:
1189
1190         <itemizedlist>
1191          <listitem>
1192           <para>
1193            structured
1194           </para>
1195          </listitem>
1196
1197          <listitem>
1198           <para>
1199            string
1200           </para>
1201          </listitem>
1202
1203          <listitem>
1204           <para>
1205            numeric
1206           </para>
1207          </listitem>
1208
1209          <listitem>
1210           <para>
1211            bool
1212           </para>
1213          </listitem>
1214
1215          <listitem>
1216           <para>
1217            oid
1218           </para>
1219          </listitem>
1220
1221          <listitem>
1222           <para>
1223            generalizedtime
1224           </para>
1225          </listitem>
1226
1227          <listitem>
1228           <para>
1229            intunit
1230           </para>
1231          </listitem>
1232
1233          <listitem>
1234           <para>
1235            int
1236           </para>
1237          </listitem>
1238
1239          <listitem>
1240           <para>
1241            octetstring
1242           </para>
1243          </listitem>
1244
1245          <listitem>
1246           <para>
1247            null
1248           </para>
1249          </listitem>
1250
1251         </itemizedlist>
1252
1253        </para>
1254       </listitem></varlistentry>
1255     </variablelist>
1256    </para>
1257
1258    <para>
1259     The following is an excerpt from the TagsetG definition file.
1260    </para>
1261
1262    <para>
1263     <screen>
1264      name tagsetg
1265      reference TagsetG
1266      type 2
1267
1268      tag        1       title           string
1269      tag        2       author          string
1270      tag        3       publicationPlace string
1271      tag        4       publicationDate string
1272      tag        5       documentId      string
1273      tag        6       abstract        string
1274      tag        7       name            string
1275      tag        8       date            generalizedtime
1276      tag        9       bodyOfDisplay   string
1277      tag        10      organization    string
1278     </screen>
1279    </para>
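
        <para>
         A profile-specific tag set could combine the
         <emphasis>include</emphasis> directive with its own tags, roughly as
         sketched here. The set name, the included file names and the
         alternative name <literal>controlId</literal> are assumptions made
         for illustration. The tag numbers follow the type 4 tags used in the
         GILS abstract syntax excerpt earlier in this chapter.
        </para>

        <para>
         <screen>
          name gils-local            # hypothetical name
          type 4

          include tagsetg.tag        # assumed file names
          include tagsetm.tag

          tag        1       controlIdentifier/controlId  string      # two names for one tag
          tag        70      availability                 structured
          tag        90      distributor                  structured
         </screen>
        </para>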
1280
1281   </sect2>
1282
1283   <sect2 id="variant-set">
1284    <title>The Variant Set (.var) Files</title>
1285
1286    <para>
1287     The variant set file is a straightforward representation of the
1288     variant set definitions associated with the protocol. At present, only
1289     the <emphasis>Variant-1</emphasis> set is known.
1290    </para>
1291
1292    <para>
1293     These are the directives allowed in the file.
1294    </para>
1295
1296    <para>
1297     <variablelist>
1298
1299      <varlistentry>
1300       <term>name <emphasis>symbolic-name</emphasis></term>
1301       <listitem>
1302        <para>
1303         (m) This provides a shorthand name or
1304         description for the variant set. Mostly useful for diagnostic purposes.
1305        </para>
1306       </listitem></varlistentry>
1307      <varlistentry>
1308       <term>reference <emphasis>OID-name</emphasis></term>
1309       <listitem>
1310        <para>
1311         (o) The reference name of the OID for
1312         the variant set, if one is required. The reference names can be found
1313         in the <emphasis>util</emphasis> module of <emphasis>YAZ</emphasis>.
1314        </para>
1315       </listitem></varlistentry>
1316      <varlistentry>
1317       <term>class <emphasis>integer class-name</emphasis></term>
1318       <listitem>
1319        <para>
1320         (m,r) Introduces a new
1321         class to the variant set.
1322        </para>
1323       </listitem></varlistentry>
1324      <varlistentry>
1325       <term>type <emphasis>integer type-name datatype</emphasis></term>
1326       <listitem>
1327        <para>
1328         (m,r) Adds a
1329         new type to the current class (the one introduced by the most recent
1330         <emphasis>class</emphasis> directive).
1331         The type names belong to the same name space as the one used
1332         in the tag set definition file.
1333        </para>
1334       </listitem></varlistentry>
1335     </variablelist>
1336    </para>
1337
1338    <para>
1339     The following is an excerpt from the file describing the variant set
1340     <emphasis>Variant-1</emphasis>.
1341    </para>
1342
1343    <para>
1344
1345     <screen>
1346      name variant-1
1347      reference Variant-1
1348
1349      class 1 variantId
1350
1351      type       1       variantId               octetstring
1352
1353      class 2 body
1354
1355      type       1       iana                    string
1356      type       2       z39.50                  string
1357      type       3       other                   string
1358     </screen>
1359
1360    </para>
1361
1362   </sect2>
1363
1364   <sect2>
1365    <title>The Element Set (.est) Files</title>
1366
1367    <para>
1368     The element set specification files describe a selection of a subset
1369     of the elements of a database record. The element selection mechanism
1370     is equivalent to the one supplied by the <emphasis>Espec-1</emphasis>
1371     syntax of the Z39.50 specification.
1372     In fact, the internal representation of an element set
1373     specification is identical to the <emphasis>Espec-1</emphasis> structure,
1374     and we'll refer you to the description of that structure for most of
1375     the detailed semantics of the directives below.
1376    </para>
1377
1378    <note>
1379     <para>
1380      Not all of the Espec-1 functionality has been implemented yet.
1381      The fields that are mentioned below all work as expected, unless
1382      otherwise noted.
1383     </para>
1384    </note>
1385
1386    <para>
1387     The directives available in the element set file are as follows:
1388    </para>
1389
1390    <para>
1391     <variablelist>
1392      <varlistentry>
1393       <term>defaultVariantSetId <emphasis>OID-name</emphasis></term>
1394       <listitem>
1395        <para>
1396         (o) If variants are used in
1397         the following, this should provide the name of the variantset used
1398         (it's not currently possible to specify a different set in the
1399         individual variant request). In almost all cases (certainly all
1400         profiles known to us), the name
1401         <literal>Variant-1</literal> should be given here.
1402        </para>
1403       </listitem></varlistentry>
1404      <varlistentry>
1405       <term>defaultVariantRequest <emphasis>variant-request</emphasis></term>
1406       <listitem>
1407        <para>
1408         (o) This directive
1409         provides a default variant request for
1410         use when the individual element requests (see below) do not contain a
1411         variant request. Variant requests consist of a blank-separated list of
1412         variant components. A variant component is a comma-separated,
1413         parenthesized triple of variant class, type, and value (the two former
1414         values being represented as integers). The value can currently only be
1415         entered as a string (this will change to depend on the definition of
1416         the variant in question). The special value (@) is interpreted as a
1417         null value, however.
1418        </para>
1419       </listitem></varlistentry>
1420      <varlistentry>
1421       <term>simpleElement
1422        <emphasis>path &lsqb;'variant' variant-request&rsqb;</emphasis></term>
1423       <listitem>
1424        <para>
1425         (o,r) This corresponds to a simple element request
1426         in <emphasis>Espec-1</emphasis>.
1427         The path consists of a sequence of tag-selectors, where each of
1428         these can consist of either:
1429        </para>
1430
1431        <para>
1432         <itemizedlist>
1433          <listitem>
1434           <para>
1435            A simple tag, consisting of a comma-separated type-value pair in
1436            parentheses, possibly followed by a colon (:) followed by an
1437            occurrences-specification (see below). The tag-value can be a number
1438            or a string. If the first character is an apostrophe ('), this
1439            forces the value to be interpreted as a string, even if it
1440            appears to be numerical.
1441           </para>
1442          </listitem>
1443
1444          <listitem>
1445           <para>
1446            A WildThing, represented as a question mark (?), possibly
1447            followed by a colon (:) followed by an occurrences
1448            specification (see below).
1449           </para>
1450          </listitem>
1451
1452          <listitem>
1453           <para>
1454            A WildPath, represented as an asterisk (*). Note that the last
1455            element of the path should not be a WildPath (WildPaths do not
1456            work in this version).
1457           </para>
1458          </listitem>
1459
1460         </itemizedlist>
1461
1462        </para>
1463
1464        <para>
1465         The occurrences-specification can be either the string
1466         <literal>all</literal>, the string <literal>last</literal>, or
1467         an explicit value-range. The value-range is represented as
1468         an integer (the starting point), possibly followed by a
1469         plus (+) and a second integer (the number of elements, default
1470         being one).
1471        </para>
1472
1473        <para>
1474         The variant-request has the same syntax as the defaultVariantRequest
1475         above. Note that it may sometimes be useful to give an empty variant
1476         request, simply to disable the default for a specific set of fields
1477         (we aren't certain if this is proper <emphasis>Espec-1</emphasis>,
1478         but it works in this implementation).
1479        </para>
1480       </listitem></varlistentry>
1481     </variablelist>
1482    </para>
1483
1484    <para>
1485     The following is an example of an element specification belonging to
1486     the GILS profile.
1487    </para>
1488
1489    <para>
1490
1491     <screen>
1492      simpleelement (1,10)
1493      simpleelement (1,12)
1494      simpleelement (2,1)
1495      simpleelement (1,14)
1496      simpleelement (4,1)
1497      simpleelement (4,52)
1498     </screen>
1499
1500    </para>
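
        <para>
         The remaining directives and the occurrences-specification can be
         sketched along the same lines. This fragment is illustrative only:
         it assumes that tag-selectors are separated by slashes as in the
         <emphasis>elm</emphasis> paths, and the variant values
         <literal>text/plain</literal> and <literal>text/html</literal> are
         hypothetical. Class 2, type 1 refers to the
         <literal>iana</literal> type of the <literal>body</literal> class in
         the Variant-1 excerpt above.
        </para>

        <para>
         <screen>
          defaultVariantSetId Variant-1
          defaultVariantRequest (2,1,text/plain)

          simpleelement (2,1)                          # uses the default variant request
          simpleelement (4,70)/(4,90):all/(2,7)        # every occurrence of (4,90)
          simpleelement (4,70)/?:1+2                   # WildThing, starting point 1, two elements
          simpleelement (2,6) variant (2,1,text/html)  # overrides the default variant request
         </screen>
        </para>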
1501
1502   </sect2>
1503
1504   <sect2 id="schema-mapping">
1505    <title>The Schema Mapping (.map) Files</title>
1506
1507    <para>
1508     Sometimes, the client might want to receive a database record in
1509     a schema that differs from the native schema of the record. For
1510     instance, a client might only know how to process WAIS records, while
1511     the database record is represented in a more specific schema, such as
1512     GILS. In this module, a mapping of data to one of the MARC formats is
1513     also thought of as a schema mapping (mapping the elements of the
1514     record into fields consistent with the given MARC specification, prior
1515     to actually converting the data to the ISO2709 format). This use of the
1516     object identifier for USMARC as a schema identifier represents an
1517     overloading of the OID which might not be entirely proper. However,
1518     it represents the dual role of schema and record syntax which
1519     is assumed by the MARC family in Z39.50.
1520    </para>
1521
1522    <para>
1523     <emphasis>NOTE: The schema-mapping functions are so far limited to a
1524      straightforward mapping of elements. This should be extended with
1525      mechanisms for conversions of the element contents, and conditional
1526      mappings of elements based on the record contents.</emphasis>
1527    </para>
1528
1529    <para>
1530     These are the directives of the schema mapping file format:
1531    </para>
1532
1533    <para>
1534     <variablelist>
1535
1536      <varlistentry>
1537       <term>targetName <emphasis>name</emphasis></term>
1538       <listitem>
1539        <para>
1540         (m) A symbolic name for the target schema
1541         of the table. Useful mostly for diagnostic purposes.
1542        </para>
1543       </listitem></varlistentry>
1544      <varlistentry>
1545       <term>targetRef <emphasis>OID-name</emphasis></term>
1546       <listitem>
1547        <para>
1548         (m) An OID name for the target schema.
1549         This is used, for instance, by a server receiving a request to present
1550         a record in a different schema from the native one.
1551         The name, again, is found in the <emphasis>oid</emphasis>
1552         module of <emphasis>YAZ</emphasis>.
1553        </para>
1554       </listitem></varlistentry>
1555      <varlistentry>
1556       <term>map <emphasis>element-name target-path</emphasis></term>
1557       <listitem>
1558        <para>
1559         (o,r) Adds
1560         an element mapping rule to the table.
1561        </para>
1562       </listitem></varlistentry>
1563     </variablelist>
1564    </para>
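
        <para>
         A mapping table built from these three directives might look like
         the sketch below, which maps a hypothetical native profile onto GILS.
         The element names on the left are invented for illustration, and we
         assume here that the <emphasis>target-path</emphasis> is written in
         the same parenthesized form as the <emphasis>elm</emphasis> paths.
         The target tags are those of the GILS abstract syntax excerpt.
        </para>

        <para>
         <screen>
          targetName gils
          targetRef  GILS-schema

          # element-name   target-path
          map headline     (2,1)                     # becomes the GILS title
          map summary      (2,6)                     # becomes the GILS abstract
          map contact      (4,70)/(4,90)/(2,7)       # becomes distributorName
         </screen>
        </para>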
1565
1566   </sect2>
1567
1568   <sect2>
1569    <title>The MARC (ISO2709) Representation (.mar) Files</title>
1570
1571    <para>
1572     This file provides rules for representing a record in the ISO2709
1573     format. The rules pertain mostly to the values of the constant-length
1574     header of the record.
1575    </para>
1576
1577    <para>
1578     <emphasis>NOTE: This will be described better. We're in the process of
1579      re-evaluating and most likely changing the way that MARC records are
1580      handled by the system.</emphasis>
1581    </para>
1582
1583   </sect2>
1584
1585   <sect2 id="field-structure-and-character-sets">
1586    <title>Field Structure and Character Sets
1587    </title>
1588
1589    <para>
1590     In order to provide a flexible approach to national character set
1591     handling, Zebra allows the administrator to set up the
1592     system to handle any 8-bit character set &mdash; including sets that
1593     require multi-octet diacritics or other multi-octet characters. The
1594     definition of a character set includes a specification of the
1595     permissible values, their sort order (this affects the display in the
1596     SCAN function), and relationships between upper- and lowercase
1597     characters. Finally, the definition includes the specification of
1598     space characters for the set.
1599    </para>
1600
1601    <para>
1602     The operator can define different character sets for different fields,
1603     typical examples being standard text fields, numerical fields, and
1604     special-purpose fields such as WWW-style linkages (URx).
1605    </para>
1606
1607    <para>
1608     The field types, and hence character sets, are associated with data
1609     elements by the .abs files (see above).
1610     The file <literal>default.idx</literal>
1611     provides the association between field type codes (as used in the .abs
1612     files) and the character map files (with the .chr suffix). The format
1613     of the .idx file is as follows:
1614    </para>
1615
1616    <para>
1617     <variablelist>
1618
1619      <varlistentry>
1620       <term>index <emphasis>field type code</emphasis></term>
1621       <listitem>
1622        <para>
1623         This directive introduces a new search index code.
1624         The argument is a one-character code to be used in the
1625         .abs files to select this particular index type. An index, roughly,
1626         corresponds to a particular structure attribute during search. Refer
1627         to <xref linkend="search"/>.
1628        </para>
1629       </listitem></varlistentry>
1630      <varlistentry>
1631       <term>sort <emphasis>field type code</emphasis></term>
1632       <listitem>
1633        <para>
1634         This directive introduces a
1635         sort index. The argument is a one-character code to be used in the
1636         .abs file to select this particular index type. The corresponding
1637         use attribute must be used in the sort request to refer to this
1638         particular sort index. The corresponding character map (see below)
1639         is used in the sort process.
1640        </para>
1641       </listitem></varlistentry>
1642      <varlistentry>
1643       <term>completeness <emphasis>boolean</emphasis></term>
1644       <listitem>
1645        <para>
1646         This directive enables or disables complete field indexing.
1647         The value of the <emphasis>boolean</emphasis> should be 0
1648         (disable) or 1. If completeness is enabled, the index entry will
1649         contain the complete contents of the field (up to a limit), with words
1650         (non-space characters) separated by single space characters
1651         (normalized to " " on display). When completeness is
1652         disabled, each word is indexed as a separate entry. Complete subfield
1653         indexing is most useful for fields which are typically browsed (e.g.
1654         titles, authors, or subjects), or instances where a match on a
1655         complete subfield is essential (e.g. exact title searching). For fields
1656         where completeness is disabled, the search engine will interpret a
1657         search containing space characters as a word proximity search.
1658        </para>
1659       </listitem></varlistentry>
1660      <varlistentry>
1661       <term>charmap <emphasis>filename</emphasis></term>
1662       <listitem>
1663        <para>
1664         This is the filename of the character
1665         map to be used for this field type.
1666        </para>
1667       </listitem></varlistentry>
1668     </variablelist>
1669    </para>
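
        <para>
         A minimal sketch of such a file, using only the directives described
         above, could read as follows. The character map file names and the
         type codes <literal>n</literal> and <literal>s</literal> are
         assumptions, as is the reading of <literal>p</literal> as a
         complete-field index. We also assume that
         <literal>completeness</literal> and <literal>charmap</literal> apply
         to the most recent <literal>index</literal> or
         <literal>sort</literal> directive.
        </para>

        <para>
         <screen>
          # Word index: each word becomes a separate entry
          index w
          completeness 0
          charmap string.chr

          # Complete-field index, suited to browsing and exact matching
          index p
          completeness 1
          charmap string.chr

          # Numeric fields, with their own character map
          index n
          completeness 0
          charmap numeric.chr

          # Sort index
          sort s
          completeness 1
          charmap string.chr
         </screen>
        </para>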
1670
1671    <para>
1672     The contents of the character map files are structured as follows:
1673    </para>
1674
1675    <para>
1676     <variablelist>
1677
1678      <varlistentry>
1679       <term>lowercase <emphasis>value-set</emphasis></term>
1680       <listitem>
1681        <para>
1682         This directive introduces the basic value set of the field type.
1683         The format is an ordered list (without spaces) of the
1684         characters which may occur in "words" of the given type.
1685         The order of the entries in the list determines the
1686         sort order of the index. In addition to single characters, the
1687         following combinations are legal:
1688        </para>
1689
1690        <para>
1691
1692         <itemizedlist>
1693          <listitem>
1694           <para>
1695            Backslashes may be used to introduce three-digit octal, or
1696            two-digit hex (preceded by <literal>x</literal>),
1697            representations of single characters.
1698            In addition, the combinations
1699            \\, \r, \n, \t, and \s (space; remember that real
1700            space characters may not occur in the value definition)
1701            are recognised, with their usual interpretation.
1702           </para>
1703          </listitem>
1704
1705          <listitem>
1706           <para>
1707            Curly braces &lcub;&rcub; may be used to enclose ranges of single
1708            characters (possibly using the escape convention described in the
1709            preceding point), e.g. &lcub;a-z&rcub; to introduce the
1710            standard range of lowercase ASCII letters.
1711            Note that the interpretation of such a range depends on
1712            the concrete representation in your local, physical character set.
1713           </para>
1714          </listitem>
1715
1716          <listitem>
1717           <para>
1718            Parentheses () may be used to enclose multi-byte characters,
1719            e.g. diacritics or special national combinations (such as Spanish
1720            "ll"). When found in the input stream (or a search term),
1721            these characters are viewed and sorted as a single character, with a
1722            sorting value depending on the position of the group in the value
1723            statement.
1724           </para>
1725          </listitem>
1726
1727         </itemizedlist>
1728
1729        </para>
1730       </listitem></varlistentry>
1731      <varlistentry>
1732       <term>uppercase <emphasis>value-set</emphasis></term>
1733       <listitem>
1734        <para>
1735         This directive introduces the
1736         upper-case equivalents of the value set (if any). The number and
1737         order of the entries in the list should be the same as in the
1738         <literal>lowercase</literal> directive.
1739        </para>
1740       </listitem></varlistentry>
1741      <varlistentry>
1742       <term>space <emphasis>value-set</emphasis></term>
1743       <listitem>
1744        <para>
1745         This directive introduces the characters
1746         which separate words in the input stream. Depending on the
1747         completeness mode of the field in question, these characters either
1748         terminate an index entry, or delimit individual "words" in
1749         the input stream. The order of the elements is not significant &mdash;
1750         otherwise the representation is the same as for the
1751         <literal>uppercase</literal> and <literal>lowercase</literal>
1752         directives.
1753        </para>
1754       </listitem></varlistentry>
1755      <varlistentry>
1756       <term>map <emphasis>value-set</emphasis>
1757        <emphasis>target</emphasis></term>
1758       <listitem>
1759        <para>
1760         This directive introduces a
1761         mapping from each of the members of the value-set on the left to
1762         the character on the right. The character on the right must occur in
1763         the value set (the <literal>lowercase</literal> directive) of
1764         the character set, but
1765         it may be a parenthesis-enclosed multi-octet character. This directive
1766         may be used to map diacritics to their base characters, or to map
1767         HTML-style character-representations to their natural form, etc.
1768        </para>
1769       </listitem></varlistentry>
1770     </variablelist>
1771    </para>
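
        <para>
         As an illustration of the directives described above, a minimal
         character map for plain ASCII word fields might look like the
         following sketch. The contents are hypothetical and are not copied
         from any file distributed with Zebra. Digits are listed before
         letters and therefore sort first; control characters, the space
         character and a few punctuation marks act as word separators; the
         HTML-style sequence <literal>&amp;auml;</literal> is folded to
         <literal>a</literal>; and the octet <literal>\xE9</literal>
         (e-acute, assuming ISO 8859-1 input) is folded to
         <literal>e</literal>.
        </para>

        <para>
         <screen>
lowercase &lcub;0-9&rcub;&lcub;a-z&rcub;
uppercase &lcub;0-9&rcub;&lcub;A-Z&rcub;
space &lcub;\001-\040&rcub;.,:!?
map (&amp;auml;) a
map \xE9 e
         </screen>
        </para>

        <para>
         Note that the <literal>uppercase</literal> line has the same number
         and order of entries as the <literal>lowercase</literal> line, and
         that both <literal>map</literal> targets occur in the
         <literal>lowercase</literal> value set.
        </para>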
1772
1773   </sect2>
1774
1775  </sect1>
1776
1777  <sect1 id="formats">
1778   <title>Exchange Formats</title>
1779
1780   <para>
1781    Converting records from the internal structure to an exchange format
1782    is largely an automatic process. Currently, the following exchange
1783    formats are supported:
1784   </para>
1785
1786   <para>
1787    <itemizedlist>
1788     <listitem>
1789      <para>
1790       GRS-1. The internal representation is based on GRS-1/XML, so the
1791       conversion here is straightforward. The system will create
1792       applied variant and supported variant lists as required, if a record
1793       contains variant information.
1794      </para>
1795     </listitem>
1796
1797     <listitem>
1798      <para>
1799       XML. The internal representation is based on GRS-1/XML so
1800       the mapping is trivial. Note that XML schemas, processing
1801       instructions and comments are not part of the internal representation
1802       and therefore will never be part of a generated XML record.
1803       Future versions of Zebra will support these.
1804      </para>
1805     </listitem>
1806
1807     <listitem>
1808      <para>
1809       SUTRS. Again, the mapping is fairly straightforward. Indentation
1810       is used to show the hierarchical structure of the record. All
1811       "GRS" type records support both the GRS-1 and SUTRS
1812       representations.
1813      </para>
1814     </listitem>
1815
1816     <listitem>
1817      <para>
1818       ISO2709-based formats (USMARC, etc.). Only records with a
1819       two-level structure (corresponding to fields and subfields) can be
1820       directly mapped to ISO2709. For records with a different structuring
1821       (e.g., GILS), the representation in a structure like USMARC involves a
1822       schema-mapping (see <xref linkend="schema-mapping"/>), to an
1823       "implied" USMARC schema (implied,
1824       because there is no formal schema which specifies the use of the
1825       USMARC fields outside of ISO2709). The resultant, two-level record is
1826       then mapped directly from the internal representation to ISO2709. See
1827       the GILS schema definition files for a detailed example of this
1828       approach.
1829      </para>
1830     </listitem>
1831
1832     <listitem>
1833      <para>
1834       Explain. This representation is only available for records
1835       belonging to the Explain schema.
1836      </para>
1837     </listitem>
1838
1839     <listitem>
1840      <para>
1841       Summary. This ASN.1-based structure is only available for records
1842       belonging to the Summary schema, or to schemas which provide a mapping
1843       to this schema (see the description of the schema mapping facility
1844       above).
1845      </para>
1846     </listitem>
1847
1848     <listitem>
1849      <para>
1850       SOIF. Support for this syntax is experimental, and is currently
1851       keyed to a private Index Data OID (1.2.840.10003.5.1000.81.2). All
1852       abstract syntaxes can be mapped to the SOIF format, although nested
1853       elements are represented by concatenation of the tag names at each
1854       level.
1855      </para>
1856     </listitem>
1857
1858    </itemizedlist>
1859   </para>
1860  </sect1>
1861
1862 </chapter>
1863  <!-- Keep this comment at the end of the file
1864  Local variables:
1865  mode: sgml
1866  sgml-omittag:t
1867  sgml-shorttag:t
1868  sgml-minimize-attributes:nil
1869  sgml-always-quote-attributes:t
1870  sgml-indent-step:1
1871  sgml-indent-data:t
1872  sgml-parent-document: "zebra.xml"
1873  sgml-local-catalogs: nil
1874  sgml-namecase-general:t
1875  End:
1876  -->