1 <chapter id="querymodel">
2 <!-- $Id: querymodel.xml,v 1.1 2006-06-13 09:27:01 marc Exp $ -->
3 <title>Query Model</title>
5 <sect1 id="querymodel-overview">
6 <title>Query Model Overview</title>
Zebra was designed as a networked Information Retrieval engine adhering
to the international standards
<ulink url="http://www.loc.gov/z3950/agency/">Z39.50</ulink> and
<ulink url="http://www.loc.gov/standards/sru/">SRU</ulink>,
and it implements the query model defined there.
Unfortunately, the Z39.50 query model defines only a binary-encoded
representation, which is used as transport packaging in
the Z39.50 protocol layer. This representation is not human-readable,
nor does it define any convenient way to specify queries.
Therefore, Index Data has defined a textual representation, the
<literal>Prefix Query Format</literal>, or
<literal>PQF</literal> for short, which has since been adopted by other
parties developing Z39.50 software. It is also often referred to as
<literal>Prefix Query Notation</literal>, or
<literal>PQN</literal> for short, and is thoroughly explained in
26 <xref linkend="querymodel-pqf"/>.
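<para>
As a small illustration of the prefix notation, the operator is written
before its operands. The following sketch assumes a database in which the
<literal>bib-1</literal> Use attributes 4 (title) and 1003 (author) are
indexed; the search terms are placeholders:
</para>
<screen>
@and @attr 1=4 computer @attr 1=1003 knuth
</screen>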
30 In addition, Zebra can be configured to understand and map the
31 <literal>Common Query Language</literal>
32 (<ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>)
to PQF. For an introduction to the mapping to the internal query
representation, see <xref linkend="querymodel-cql-to-pqf"/>.
39 <sect1 id="querymodel-pqf">
40 <title>Prefix Query Format structure and syntax</title>
The <ulink url="http://indexdata.dk/yaz/doc/tools.tkl#PQF">PQF
grammar</ulink> is documented in the YAZ manual.
During search, this textual PQF representation
is always mapped to the equivalent Zebra internal
53 <sect2 id="querymodel-exp1">
54 <title>Explain Attribute Set</title>
56 The attribute-set <literal>exp-1</literal> is defined for
57 searching an Explain <literal>IR-Explain-1</literal> database.
58 It consists of a single <literal>Use (type 1)</literal> attribute.
61 In addition, the non-Use
62 <literal>bib-1</literal> attributes, that is, the types
63 <literal>Relation</literal>, <literal>Position</literal>,
64 <literal>Structure</literal>, <literal>Truncation</literal>,
65 and <literal>Completeness</literal> are imported from
the <literal>bib-1</literal> attribute set, and may be used
67 within any explain query.
70 <sect3 id="querymodel-exp1-use">
71 <title>Use Attributes (type = 1)</title>
The following Explain search attributes are supported:
<literal>ExplainCategory</literal> (@attr 1=1),
<literal>DatabaseName</literal> (@attr 1=3),
<literal>DateAdded</literal> (@attr 1=9),
<literal>DateChanged</literal> (@attr 1=10).
80 A search in the use attribute <literal>ExplainCategory</literal>
81 supports only these predefined values:
82 <literal>CategoryList</literal>, <literal>TargetInfo</literal>,
83 <literal>DatabaseInfo</literal>, <literal>AttributeDetails</literal>.
86 See <filename>tab/explain.att</filename> and the
92 <title>Explain searches with yaz-client</title>
94 List supported categories to find out which explain commands are
Z> f @attr exp1 1=1 categorylist
105 Get target info, that is, investigate which databases exist at
106 this server endpoint:
Z> f @attr exp1 1=1 targetinfo
List all supported databases. The number of hits
is the number of databases found, which most commonly are
the <literal>Default</literal> and the
<literal>IR-Explain-1</literal> databases.
127 Z> f @attr exp1 1=1 databaseinfo
134 Get database info record for database <literal>Default</literal>.
Z> f @and @attr exp1 1=1 databaseinfo @attr exp1 1=3 Default
Identical query with explicitly specified attribute set:
Z> f @attrset exp1 @and @attr 1=1 databaseinfo @attr 1=3 Default
147 Get attribute details record for database
148 <literal>Default</literal>.
149 This query is very useful to study the internal Zebra indexes.
150 If records have been indexed using the <literal>alvis</literal>
151 XSLT filter, the string representation names of the known indexes can be
Z> f @and @attr exp1 1=1 attributedetails @attr exp1 1=3 Default
Identical query with explicitly specified attribute set:
Z> f @attrset exp1 @and @attr 1=1 attributedetails @attr 1=3 Default
167 <sect2 id="querymodel-bib1">
168 <title>Bib1 Attribute Set</title>
170 Something about querying to be written ..
173 Most of the information contained in this section is an excerpt of
174 the <literal>ATTRIBUTE SET BIB-1 (Z39.50-1995)
175 SEMANTICS</literal>, found at <ulink
176 url="http://www.loc.gov/z3950/agency/bib1.html">The BIB-1
177 Attribute Set Semantics</ulink> from 1995, also in an updated
<ulink url="http://www.loc.gov/z3950/agency/defns/bib1.html">Bib-1
180 Attribute Set</ulink>
181 version from 2003. Index Data is not the copyright holder of this
186 <sect3 id="querymodel-bib1-use">
187 <title>Use Attributes (type = 1)</title>
190 <sect3 id="querymodel-bib1-relation">
191 <title>Relation Attributes (type = 2)</title>
196 <sect3 id="querymodel-bib1-position">
197 <title>Position Attributes (type = 3)</title>
200 <sect3 id="querymodel-bib1-structure">
201 <title>Structure Attributes (type = 4)</title>
204 <sect3 id="querymodel-bib1-truncation">
205 <title>Truncation Attributes (type = 5)</title>
208 <sect3 id="querymodel-bib1-completeness">
209 <title>Completeness Attributes (type = 6)</title>
212 <sect3 id="querymodel-bib1-sorting">
<title>Zebra Extension Sorting Attributes (type = 7)</title>
<sect3 id="querymodel-bib1-estimation">
<title>Zebra Extension Search Estimation Attributes (type = 8)</title>
<sect3 id="querymodel-bib1-weight">
<title>Zebra Extension Weight Attributes (type = 9)</title>
226 <sect2 id="querymodel-bib1-mapping">
227 <title>Mapping from Bib1 Attributes to Zebra internal
228 register indexes</title>
233 <emphasis>Use</emphasis> attributes are interpreted according to the
234 attribute sets which have been loaded in the
235 <literal>zebra.cfg</literal> file, and are matched against specific
236 fields as specified in the <literal>.abs</literal> file which
237 describes the profile of the records which have been loaded.
238 If no Use attribute is provided, a default of Bib-1 Any is assumed.
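<para>
As a sketch of this default, and assuming that <literal>bib-1</literal>
is the attribute set in effect, the following two yaz-client searches
should be treated alike, since Use attribute 1016 is Bib-1 Any
(the term <literal>water</literal> is just a placeholder):
</para>
<screen>
Z> f water
Z> f @attr 1=1016 water
</screen>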
242 If a <emphasis>Structure</emphasis> attribute of
243 <emphasis>Phrase</emphasis> is used in conjunction with a
244 <emphasis>Completeness</emphasis> attribute of
245 <emphasis>Complete (Sub)field</emphasis>, the term is matched
246 against the contents of the phrase (long word) register, if one
247 exists for the given <emphasis>Use</emphasis> attribute.
A phrase register is created for those fields in the
<literal>.abs</literal> file that contain a
<literal>p</literal>-specifier.
251 <!-- ### whatever the hell _that_ is -->
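<para>
A query of this kind might look as follows. This is only a sketch: it
assumes that the title field (Use attribute 4) carries a
<literal>p</literal>-specifier in the <literal>.abs</literal> file, and
it uses the <literal>bib-1</literal> values 1 (Phrase) for Structure and
3 (Complete Field) for Completeness:
</para>
<screen>
@attr 1=4 @attr 4=1 @attr 6=3 "information retrieval"
</screen>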
If <emphasis>Structure</emphasis>=<emphasis>Phrase</emphasis> is
used in conjunction with <emphasis>Incomplete Field</emphasis> (the
default value for <emphasis>Completeness</emphasis>), the
258 search is directed against the normal word registers, but if the term
259 contains multiple words, the term will only match if all of the words
260 are found immediately adjacent, and in the given order.
261 The word search is performed on those fields that are indexed as
262 type <literal>w</literal> in the <literal>.abs</literal> file.
266 If the <emphasis>Structure</emphasis> attribute is
267 <emphasis>Word List</emphasis>,
268 <emphasis>Free-form Text</emphasis>, or
269 <emphasis>Document Text</emphasis>, the term is treated as a
270 natural-language, relevance-ranked query.
271 This search type uses the word register, i.e. those fields
272 that are indexed as type <literal>w</literal> in the
273 <literal>.abs</literal> file.
277 If the <emphasis>Structure</emphasis> attribute is
<emphasis>Numeric String</emphasis>, the term is treated as an integer.
279 The search is performed on those fields that are indexed
280 as type <literal>n</literal> in the <literal>.abs</literal> file.
284 If the <emphasis>Structure</emphasis> attribute is
<emphasis>URx</emphasis>, the term is treated as a URX (URL) entity.
286 The search is performed on those fields that are indexed as type
287 <literal>u</literal> in the <literal>.abs</literal> file.
291 If the <emphasis>Structure</emphasis> attribute is
<emphasis>Local Number</emphasis>, the term is treated as a
native Zebra Record Identifier.
297 If the <emphasis>Relation</emphasis> attribute is
298 <emphasis>Equals</emphasis> (default), the term is matched
299 in a normal fashion (modulo truncation and processing of
300 individual words, if required).
301 If <emphasis>Relation</emphasis> is <emphasis>Less Than</emphasis>,
302 <emphasis>Less Than or Equal</emphasis>,
<emphasis>Greater Than</emphasis>, or <emphasis>Greater Than or
Equal</emphasis>, the term is assumed to be numerical, and a
305 standard regular expression is constructed to match the given
307 If <emphasis>Relation</emphasis> is <emphasis>Relevance</emphasis>,
308 the standard natural-language query processor is invoked.
312 For the <emphasis>Truncation</emphasis> attribute,
313 <emphasis>No Truncation</emphasis> is the default.
314 <emphasis>Left Truncation</emphasis> is not supported.
315 <emphasis>Process # in search term</emphasis> is supported, as is
316 <emphasis>Regxp-1</emphasis>.
<emphasis>Regxp-2</emphasis> enables fault-tolerant (fuzzy)
search. By default, a single error (deletion, insertion, or
replacement) is accepted when terms are matched against the register
324 <sect2 id="querymodel-regular">
325 <title>Regular expressions</title>
328 Each term in a query is interpreted as a regular expression if
329 the truncation value is either <emphasis>Regxp-1</emphasis> (102)
330 or <emphasis>Regxp-2</emphasis> (103).
331 Both query types follow the same syntax with the operands:
338 Matches the character <emphasis>x</emphasis>.
346 Matches any character.
351 <term><literal>[</literal>..<literal>]</literal></term>
354 Matches the set of characters specified;
355 such as <literal>[abc]</literal> or <literal>[a-c]</literal>.
367 Matches <emphasis>x</emphasis> zero or more times. Priority: high.
375 Matches <emphasis>x</emphasis> one or more times. Priority: high.
Matches <emphasis>x</emphasis> zero or one time. Priority: high.
391 Matches <emphasis>x</emphasis>, then <emphasis>y</emphasis>.
400 Matches either <emphasis>x</emphasis> or <emphasis>y</emphasis>.
406 The order of evaluation may be changed by using parentheses.
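<para>
A few sketches of these operators in use, assuming a title index
(Use attribute 4) and Regxp-1 truncation (102); the terms are
placeholders. The first query makes a character optional, the second
uses alternation inside parentheses, and the third uses a character set:
</para>
<screen>
@attr 1=4 @attr 5=102 "colou?r"
@attr 1=4 @attr 5=102 "gr(a|e)y"
@attr 1=4 @attr 5=102 "organi[sz]ation"
</screen>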
410 If the first character of the <emphasis>Regxp-2</emphasis> query
411 is a plus character (<literal>+</literal>) it marks the
412 beginning of a section with non-standard specifiers.
413 The next plus character marks the end of the section.
Currently Zebra supports only one specifier, the error tolerance,
which consists of one digit.
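<para>
Going by the description above, a Regxp-2 (103) query that accepts up to
two errors could be written like this. It is a sketch only; the title
index (Use attribute 4) and the term are assumptions:
</para>
<screen>
@attr 1=4 @attr 5=103 "+2+retrieval"
</screen>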
Since the plus operator is normally a suffix operator, the addition to
420 the query syntax doesn't violate the syntax for standard regular
426 <sect2 id="querymodel-examples">
427 <title>Query examples</title>
430 Phrase search for <emphasis>information retrieval</emphasis> in
433 @attr 1=4 "information retrieval"
438 Ranked search for the same thing:
440 @attr 1=4 @attr 2=102 "Information retrieval"
445 Phrase search with a regular expression:
447 @attr 1=4 @attr 5=102 "informat.* retrieval"
452 Ranked search with a regular expression:
454 @attr 1=4 @attr 5=102 @attr 2=102 "informat.* retrieval"
459 In the GILS schema (<literal>gils.abs</literal>), the
460 west-bounding-coordinate is indexed as type <literal>n</literal>,
461 and is therefore searched by specifying
462 <emphasis>structure</emphasis>=<emphasis>Numeric String</emphasis>.
463 To match all those records with west-bounding-coordinate greater
464 than -114 we use the following query:
466 @attr 4=109 @attr 2=5 @attr gils 1=2038 -114
472 <!-- see in util/zebramap.c
475 if (completeness_value == 2 || completeness_value == 3)
481 *sort_flag =(sort_relation_value > 0) ? 1 : 0;
482 *search_type = "phrase";
483 strcpy(rank_type, "void");
484 if (relation_value == 102)
486 if (weight_value == -1)
488 sprintf(rank_type, "rank,w=%d,u=%d", weight_value, use_value);
490 if (relation_value == 103)
492 *search_type = "always";
500 switch (structure_value)
502 case 6: /* word list */
503 *search_type = "and-list";
505 case 105: /* free-form-text */
506 *search_type = "or-list";
508 case 106: /* document-text */
509 *search_type = "or-list";
514 case 108: /* string */
515 *search_type = "phrase";
517 case 107: /* local-number */
518 *search_type = "local";
521 case 109: /* numeric string */
523 *search_type = "numeric";
527 *search_type = "phrase";
531 *search_type = "phrase";
535 *search_type = "phrase";
539 *search_type = "phrase";
550 The RecordType parameter in the <literal>zebra.cfg</literal> file, or
551 the <literal>-t</literal> option to the indexer tells Zebra how to
552 process input records.
Two basic types of processing are available: raw text and structured
data. Raw text is just that, and it is selected by providing the
555 argument <emphasis>text</emphasis> to Zebra. Structured records are
556 all handled internally using the basic mechanisms described in the
558 Zebra can read structured records in many different formats.
564 <sect1 id="querymodel-cql-to-pqf">
565 <title>Server Side CQL to PQF Query Translation</title>
<literal>&lt;cql2rpn&gt;l2rpn.txt&lt;/cql2rpn&gt;</literal>
570 Hosts option, one can configure
571 the YAZ Frontend CQL-to-PQF
572 converter, specifying the interpretation of various
573 <ulink url="http://www.loc.gov/standards/sru/cql/">CQL</ulink>
574 indexes, relations, etc. in terms of Type-1 query attributes.
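<para>
For orientation, a fragment of such a mapping file might look like the
sketch below. It is modelled on the sample mapping file distributed with
YAZ; the particular index and attribute choices are assumptions and
should be checked against the YAZ manual:
</para>
<screen>
# context set prefixes used by the index.* lines below
set.cql = info:srw/cql-context-set/1/cql-v1.1
set.dc = info:srw/cql-context-set/1/dc-v1.1

# CQL index to Use attribute
index.cql.serverChoice = 1=1016
index.dc.title = 1=4

# relations, relation modifiers and structure
relation.eq = 2=3
relationModifier.relevant = 2=102
structure.* = 4=1
</screen>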
575 <!-- The yaz-client config file -->
578 For example, using server-side CQL-to-PQF conversion, one might
query a Zebra server like this:
582 yaz-client localhost:9999
584 Z> find text=(plant and soil)
587 and - if properly configured - even static relevance ranking can
588 be performed using CQL query syntax:
591 Z> find text = /relevant (plant and soil)
597 By the way, the same configuration can be used to
598 search using client-side CQL-to-PQF conversion:
599 (the only difference is <literal>querytype cql2rpn</literal>
601 <literal>querytype cql</literal>, and the call specifying a local
605 yaz-client -q local/cql2pqf.txt localhost:9999
607 Z> find text=(plant and soil)
Exhaustive information can be found in the
section "Specification of CQL to RPN mappings" of the YAZ manual,
<ulink url="http://www.indexdata.dk/yaz/doc/tools.tkl#tools.cql.map">
http://www.indexdata.dk/yaz/doc/tools.tkl#tools.cql.map</ulink>,
and is therefore not repeated here.
622 <ulink url="http://www.loc.gov/z3950/agency/zing/cql/dc-indexes.html">
623 http://www.loc.gov/z3950/agency/zing/cql/dc-indexes.html</ulink>
624 for the Maintenance Agency's work-in-progress mapping of Dublin Core
625 indexes to Attribute Architecture (util, XD and BIB-2)
634 <sect1 id="architecture-querylanguage">
635 <title>Query Languages</title>
639 http://www.loc.gov/z3950/agency/document.html
641 PQF and BIB-1 stuff to be explained
642 <ulink url="http://www.loc.gov/z3950/agency/defns/bib1.html">
643 http://www.loc.gov/z3950/agency/defns/bib1.html</ulink>
645 <ulink url="http://www.loc.gov/z3950/agency/bib1.html">
646 http://www.loc.gov/z3950/agency/bib1.html</ulink>
648 http://www.loc.gov/z3950/agency/markup/13.html
654 These attribute types are recognized regardless of attribute set. Some are recognized for search, others for scan.
The embedded sort is a way to specify sorting within a query, thus removing the need to send a separate Sort Request. It is faster and does not require clients that support the Sort Facility.
The value of attribute type 7 is 1 for ascending or 2 for descending. The attributes-plus-term (APT) node is separate from the rest of the query and must be @or'ed with it. The term associated with the APT is the sort level: 0 for the primary sort key, 1 for the secondary sort key, and so on. Examples:
671 Search for water, sort by title (ascending):
673 @or @attr 1=1016 water @attr 7=1 @attr 1=4 0
675 Search for water, sort by title ascending, then date descending:
677 @or @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 @attr 7=2 @attr 1=30 1
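A further sketch along the same lines, assuming Use attribute 30 (date) is set up for sorting in the target database. Search for water, sort by date only (descending):
@or @attr 1=1016 water @attr 7=2 @attr 1=30 0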
The Term Set feature allows a search to store the hitting terms in a "pseudo" result set; it thus combines a search (as usual) with a scan-like facility. It requires a client that can handle named result sets, since the search generates two result sets. The value of attribute 8 is the name of a result set (string). The terms in the term set are returned as SUTRS records.
Search for u in title, right-truncated. Store the result in a result set named uset.
685 @attr 5=1 @attr 1=4 @attr 8=uset u
The model has one serious flaw: we don't know the size of the term set.
Rank weight is a way to pass a value to a ranking algorithm, so that one APT has one value while another has a different one.
Search for utah in title with weight 30, as well as in any field (the default Use attribute) with weight 20.
695 @attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 utah
Newer Zebra versions normally estimate the hit count for every APT (leaf) in the query tree. These hit counts are returned as part of the searchResult-1 facility.
By setting a limit for the APT, we can make Zebra switch to approximate hit counts once a certain hit count limit is reached. A value of zero means exact hit count.
We are interested in the exact hit count for a, but for b we allow estimates for 1000 and higher.
705 @and a @attr 9=1000 b
707 This facility clashes with rank weight! Fortunately this is a Zebra 1.4 thing so we can change this without upsetting anybody!
711 Zebra supports the searchResult-1 facility.
If attribute 10 is given, it specifies a subqueryId value that is returned as part of the search result. It is a way for a client to name an APT part of a query.
718 8 Result set narrow 1.3
723 If attribute 8 is given for scan, the value is the name of a result set. Each hit count in scan is @and'ed with the result set given.
The approx attribute (as for search) is a way to enable approximate hit counts for scan. However, it does not appear to work at the moment.
730 AdamDickmeiss - 19 Dec 2005
737 <!-- Keep this comment at the end of the file
742 sgml-minimize-attributes:nil
743 sgml-always-quote-attributes:t
746 sgml-parent-document: "zebra.xml"
747 sgml-local-catalogs: nil
748 sgml-namecase-general:t