<!-- $Id: book.xml,v 1.8 2006-04-20 09:29:35 mike Exp $ -->
<title>Metaproxy - User's Guide and Reference</title>
<firstname>Mike</firstname><surname>Taylor</surname>
<firstname>Adam</firstname><surname>Dickmeiss</surname>
<holder>Index Data</holder>
Metaproxy is a universal router, proxy and encapsulated
metasearcher for information retrieval protocols. It accepts,
processes, interprets and redirects requests from IR clients using
standard protocols such as ANSI/NISO Z39.50 (and in the future SRU
and SRW), as well as functioning as a limited
HTTP server. Metaproxy is configured by an XML file which
specifies how the software should function in terms of routes that
the request packets can take through the proxy, each step on a
route being an instantiation of a filter. Filters come in many
types, one for each operation: accepting Z39.50 packets, logging,
query transformation, multiplexing, etc. Further filter-types can
be added as loadable modules to extend Metaproxy functionality,

The terms under which Metaproxy will be distributed have yet to be
established, but it will not necessarily be open source; so users
should not at this stage redistribute the code without explicit
written permission from the copyright holders, Index Data ApS.
<chapter id="introduction">
<title>Introduction</title>
<ulink url="http://indexdata.dk/metaproxy/">Metaproxy</ulink>
is a standalone program that acts as a universal router, proxy and
encapsulated metasearcher for information retrieval protocols such
as Z39.50, and in the future SRU and SRW. To clients, it acts as a
server of these
protocols: it can be searched, records can be retrieved from it,
etc. To servers, it acts as a client: it searches in them,
retrieves records from them, etc. It satisfies its clients'
requests by transforming them, multiplexing them, forwarding them
on to zero or more servers, merging the results, transforming
them, and delivering them back to the client. In addition, it
acts as a simple HTTP server; support for further protocols can be
added in a modular fashion, through the creation of new filters.

Cold bananas, fish, pyjamas,
Mutton, beef and trout!
- attributed to Cole Porter.

Metaproxy is a more capable alternative to
<ulink url="http://indexdata.dk/yazproxy/">YAZ Proxy</ulink>,
being more powerful, flexible, configurable and extensible. Among
its many advantages over the older, more pedestrian work are
support for multiplexing (encapsulated metasearching), routing by
database name, authentication and authorisation and serving local
files via HTTP. Equally significant, its modular architecture
facilitates the creation of pluggable modules implementing further
<chapter id="licence">
<title>The Metaproxy Licence</title>
<emphasis role="strong">
No decision has yet been made on the terms under which
Metaproxy will be distributed.
</emphasis>
It is possible that, unlike
other Index Data products, Metaproxy may not be released under a
free-software licence such as the GNU GPL. Until a decision is
made and publicly announced, and unless it has been
delivered to you under other specific terms, please treat Metaproxy as
though it were proprietary software.
The code should not be redistributed without explicit
written permission from the copyright holders, Index Data ApS.
<chapter id="architecture">
<title>The Metaproxy Architecture</title>
The Metaproxy architecture is based on three concepts:
the <emphasis>package</emphasis>,
the <emphasis>route</emphasis>
and the <emphasis>filter</emphasis>.

<term>Packages</term>
A package is a request or response, encoded in some protocol,
issued by a client, making its way through Metaproxy, sent to or
received from a server, or sent back to the client.

The core of a package is the protocol unit - for example, a
Z39.50 Init Request or Search Response, or an SRU searchRetrieve
URL or Explain Response. In addition to this core, a package
also carries some extra information added and used by Metaproxy

In general, packages are doctored as they pass through
Metaproxy. For example, when the proxy performs authentication
and authorisation on a Z39.50 Init request, it removes the
authentication credentials from the package so that they are not
passed on to the back-end server; and when search-response
packages are obtained from multiple servers, they are merged
into a single unified package that makes its way back to the
client.

Packages make their way through routes, which can be thought of
as programs that operate on the package data-type. Each
incoming package initially makes its way through a default
route, but may be switched to a different route based on various
considerations. Routes are made up of sequences of filters (see
below).

Filters provide the individual instructions within a route, and
effect the necessary transformations on packages. A particular
configuration of Metaproxy is essentially a set of filters,
described by configuration details and arranged in order in one
or more routes. There are many kinds of filter - about a dozen
at the time of writing, with more appearing all the time - each
performing a specific function and configured by different

The word ``filter'' is sometimes used rather loosely, in two
different ways: it may be used to mean a particular
<emphasis>type</emphasis> of filter, as when we speak of ``the
auth_simple filter'' or ``the multi filter''; or it may be used
to mean a specific <emphasis>instance</emphasis> of a filter
within a Metaproxy configuration. For example, a single
configuration will often contain multiple instances of the
<literal>z3950_client</literal> filter. In
operational terms, each of these is a separate filter. In practice,
context always makes it clear which sense of the word ``filter''
is intended.

Extensibility of Metaproxy is primarily through the creation of
plugins that provide new filters. The filter API is small and
conceptually simple, but there are many details to master. See
<link linkend="extensions">extensions</link>.

Since packages are created and handled by the system itself, and
routes are conceptually simple, most of the remainder of this
document concentrates on filters. A brief overview of the
filter types follows, along with some thoughts on possible future
directions.
<chapter id="filters">
<title>Filters</title>
<title>Introductory notes</title>
It's useful to think of Metaproxy as an interpreter providing a small
number of primitives and operations, but operating on a very
complex data type, namely the ``package''.

A package represents a Z39.50 or SRU/W request (whether for Init,
Search, Scan, etc.) together with information about where it came
from. Packages are created by front-end filters such as
<literal>frontend_net</literal> (see below), which reads them from
the network; other front-end filters are possible. They then pass
along a route consisting of a sequence of filters, each of which
transforms the package and may also have side-effects such as
generating logging. Eventually, the route will yield a response,
which is sent back to the origin.

There are many kinds of filter: some are defined statically
as part of Metaproxy, and others may be provided by third parties
and dynamically loaded. They all conform to the same simple API
of essentially two methods: <function>configure()</function> is
called at startup time, and is passed a DOM tree representing that
part of the configuration file that pertains to this filter
instance: it is expected to walk that tree extracting relevant
information; and <function>process()</function> is called every
time the filter has to process a package.

While all filters provide the same API, there are different modes
of functionality. Some filters are sources: they create
packages
(<literal>frontend_net</literal>);
others are sinks: they consume packages and return a result
(<literal>z3950_client</literal>,
<literal>backend_test</literal>,
<literal>http_file</literal>);
the others are true filters, that read, process and pass on the
packages they are fed
(<literal>auth_simple</literal>,
<literal>log</literal>,
<literal>multi</literal>,
<literal>query_rewrite</literal>,
<literal>session_shared</literal>,
<literal>template</literal>,
<literal>virt_db</literal>).
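
As a rough sketch of how these three roles fit together, the route
below chains a source, a true filter and a sink, written in the
configuration format described in the Configuration chapter. It uses
only filter types and configuration elements that appear elsewhere in
this document; the route ID <literal>start</literal> and the log
message text <literal>example</literal> are arbitrary choices made
for this illustration, not values taken from a shipped configuration.
<screen><![CDATA[
<route id="start">
  <!-- source: accepts connections and creates packages -->
  <filter type="frontend_net">
    <port>@:9000</port>
  </filter>
  <!-- true filter: logs each package and passes it on -->
  <filter type="log">
    <message>example</message>
  </filter>
  <!-- sink: consumes the package and returns a dummy response -->
  <filter type="backend_test"/>
</route>
]]></screen>
Ending the route with <literal>backend_test</literal> rather than
<literal>z3950_client</literal> keeps the sketch self-contained, since
no back-end server is needed.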

<title>Overview of filter types</title>
We now briefly consider each of the types of filter supported by
the core Metaproxy binary. This overview is intended to give a
flavour of the available functionality; more detailed information
about each type of filter is included below in the Module

The filters are here named by the string that is used as the
<literal>type</literal> attribute of a
<literal>&lt;filter&gt;</literal> element in the configuration
file to request them, with the name of the class that implements
them in parentheses. (The classname is not needed for normal
configuration and use of Metaproxy; it is useful only to

The filters are here listed in alphabetical order:

<title><literal>auth_simple</literal>
(mp::filter::AuthSimple)</title>
Simple authentication and authorisation. The configuration
specifies the name of a file that is the user register, which
lists <varname>username</varname>:<varname>password</varname>
pairs, one per line, colon separated. When a session begins, it
is rejected unless username and password are supplied, and match
a pair in the register. The configuration file may also specify
the name of another file that is the target register: this
lists <varname>username</varname>:<varname>dbname</varname>,<varname>dbname</varname>...
sets, one per line, with multiple database names separated by
commas. When a search is processed, it is rejected unless the
database to be searched is one of those listed as available to
that user.

<title><literal>backend_test</literal>
(mp::filter::Backend_test)</title>
A sink that provides dummy responses in the manner of the
<literal>yaz-ztest</literal> Z39.50 server. This is useful only
for testing. Seriously, you don't need this. Pretend you didn't
even read this section.

<title><literal>frontend_net</literal>
(mp::filter::FrontendNet)</title>
A source that accepts Z39.50 connections from a port
specified in the configuration, reads protocol units, and
feeds them into the next filter in the route. When the result is
received, it is returned to the origin.

<title><literal>http_file</literal>
(mp::filter::HttpFile)</title>
A sink that returns the contents of files from the local
filesystem in response to HTTP requests. (Yes, Virginia, this
does mean that Metaproxy is also a Web-server in its spare time. So
far it does not contain either an email-reader or a Lisp
interpreter, but that day is surely coming.)

<title><literal>log</literal>
(mp::filter::Log)</title>
Writes logging information to standard output, and passes on
the package unchanged.

<title><literal>multi</literal>
(mp::filter::Multi)</title>
Performs multicast searching.
See
<link linkend="multidb">the extended discussion</link>
of virtual databases and multi-database searching below.

<title><literal>query_rewrite</literal>
(mp::filter::QueryRewrite)</title>
Rewrites Z39.50 Type-1 and Type-101 (``RPN'') queries by a
three-step process: the query is transliterated from Z39.50
packet structures into an XML representation; that XML
representation is transformed by an XSLT stylesheet; and the
resulting XML is transliterated back into the Z39.50 packet

<title><literal>session_shared</literal>
(mp::filter::SessionShared)</title>
When this is finished, it will implement global sharing of
result sets (i.e. between threads and therefore between
clients), yielding performance improvements especially when
incoming requests are from a stateless environment such as a
web-server, in which the client process representing a session
might be any one of many. However:

This filter is not yet completed.

<title><literal>template</literal>
(mp::filter::Template)</title>
Does nothing at all, merely passing the packet on. (Maybe it
should be called <literal>nop</literal> or
<literal>passthrough</literal>?) This exists not to be used, but
to be copied - to become the skeleton of new filters as they are
written. As with <literal>backend_test</literal>, this is not
intended for civilians.

<title><literal>virt_db</literal>
(mp::filter::Virt_db)</title>
Performs virtual database selection: based on the name of the
database in the search request, a server is selected, and its
address added to the request in a <literal>VAL_PROXY</literal>
otherInfo packet. It will subsequently be used by a
<literal>z3950_client</literal> filter.
See
<link linkend="multidb">the extended discussion</link>
of virtual databases and multi-database searching below.

<title><literal>z3950_client</literal>
(mp::filter::Z3950Client)</title>
Performs Z39.50 searching and retrieval by proxying the
packages that are passed to it. Init requests are sent to the
address specified in the <literal>VAL_PROXY</literal> otherInfo
attached to the request: this may have been specified by the client,
or generated by a <literal>virt_db</literal> filter earlier in
the route. Subsequent requests are sent to the same address,
which is remembered at Init time in a Session object.

<title>Future directions</title>
Some other filters that do not yet exist, but which would be
useful, are briefly described. These may be added in future
releases (or may be created by third parties, as loadable
modules).

<term><literal>frontend_cli</literal> (source)</term>
Command-line interface for generating requests.

<term><literal>frontend_sru</literal> (source)</term>
Receive SRU (and perhaps SRW) requests.

<term><literal>sru2z3950</literal> (filter)</term>
Translate SRU requests into Z39.50 requests.

<term><literal>sru_client</literal> (sink)</term>
SRU searching and retrieval.

<term><literal>srw_client</literal> (sink)</term>
SRW searching and retrieval.

<term><literal>opensearch_client</literal> (sink)</term>
A9 OpenSearch searching and retrieval.
<chapter id="multidb">
<title>Virtual databases and multi-database searching</title>
<title>Introductory notes</title>
Two of Metaproxy's filters are concerned with multiple-database
operations. Of these, <literal>virt_db</literal> can work alone
to control the routing of searches to one of a number of servers,
while <literal>multi</literal> can work with the output of
<literal>virt_db</literal> to perform multicast searching, merging
the results into a unified result-set. The interaction between
these two filters is necessarily complex, reflecting the real
complexity of multicast searching in a protocol such as Z39.50
that separates initialisation from searching, with the database to
search known only during the latter operation.
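
One plausible arrangement of these filters within a route is sketched
below: <literal>virt_db</literal> selects targets, <literal>multi</literal>
works on its output, and <literal>z3950_client</literal> talks to the
selected servers. This ordering is inferred from the descriptions
above rather than copied from a tested configuration, and the IDs
<literal>frontend</literal>, <literal>virtual-databases</literal> and
<literal>backend</literal> are hypothetical names for filters assumed
to be defined in the filters section of the configuration file.
<screen><![CDATA[
<route id="start">
  <!-- source: accepts incoming Z39.50 connections -->
  <filter refid="frontend"/>
  <!-- maps the database name in each search to a server address -->
  <filter refid="virtual-databases"/>
  <!-- multicast searching and merging of results -->
  <filter type="multi"/>
  <!-- sink: proxies requests to the selected servers -->
  <filter refid="backend"/>
</route>
]]></screen>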

### Much, much more to say!
<chapter id="configuration">
<title>Configuration: the Metaproxy configuration file format</title>
<title>Introductory notes</title>
If Metaproxy is an interpreter providing operations on packages, then
its configuration file can be thought of as a program for that
interpreter. Configuration is by means of a single file, the name
of which is supplied as the sole command-line argument to the
<command>yp2</command> program.

The configuration files are written in XML. (But that's just an
implementation detail - they could just as well have been written
in YAML or Lisp-like S-expressions, or in a custom syntax.)

Since XML has been chosen, an XML schema,
<filename>config.xsd</filename>, is provided for validating
configuration files. This file is supplied in the
<filename>etc</filename> directory of the Metaproxy distribution. It
can be used by (among other tools) the <command>xmllint</command>
program supplied as part of the <literal>libxml2</literal>
<screen>
 xmllint --noout --schema etc/config.xsd my-config-file.xml
</screen>
(A recent version of <literal>libxml2</literal> is required, as
support for XML Schemas is a relatively recent addition.)

<title>Overview of XML structure</title>
All elements and attributes are in the namespace
<ulink url="http://indexdata.dk/yp2/config/1"/>.
This is most easily achieved by setting the default namespace on
the top-level element, as here:
<screen><![CDATA[
<yp2 xmlns="http://indexdata.dk/yp2/config/1">
]]></screen>
The top-level element is &lt;yp2&gt;. This contains a
&lt;start&gt; element, a &lt;filters&gt; element and a
&lt;routes&gt; element, in that order. &lt;filters&gt; is
optional; the other two are mandatory. All three are

The &lt;start&gt; element is empty, but carries a
<literal>route</literal> attribute, whose value is the name of
the route at which to start running - analogous to the name of the
start production in a formal grammar.

If present, &lt;filters&gt; contains zero or more &lt;filter&gt;
elements; filters carry a <literal>type</literal> attribute and
contain various elements that provide suitable configuration for
filters of that type. The filter-specific elements are described
below. Filters defined in this part of the file must carry an
<literal>id</literal> attribute so that they can be referenced

&lt;routes&gt; contains one or more &lt;route&gt; elements, each
of which must carry an <literal>id</literal> attribute. One of the
routes must have the ID value that was specified as the start
route in the &lt;start&gt; element's <literal>route</literal>
attribute. Each route contains zero or more &lt;filter&gt;
elements. These are of two types. They may be empty, but carry a
<literal>refid</literal> attribute whose value is the same as the
<literal>id</literal> of a filter previously defined in the
&lt;filters&gt; section. Alternatively, a filter within a route
may omit the <literal>refid</literal> attribute, but contain
configuration elements similar to those used for filters defined
in the &lt;filters&gt; section.
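
Putting these rules together, a complete configuration file might
look something like the sketch below. It is assembled from the
element names, attributes and example values given in this chapter
and has not been validated against <filename>config.xsd</filename>;
the IDs <literal>frontend</literal>, <literal>backend</literal> and
<literal>start</literal>, and the log message text, are arbitrary
names chosen for the illustration.
<screen><![CDATA[
<?xml version="1.0"?>
<yp2 xmlns="http://indexdata.dk/yp2/config/1">
  <!-- names the route at which processing starts -->
  <start route="start"/>
  <!-- filters defined here carry an id and are referenced below -->
  <filters>
    <filter id="frontend" type="frontend_net">
      <threads>10</threads>
      <port>@:9000</port>
    </filter>
    <filter id="backend" type="z3950_client">
      <timeout>30</timeout>
    </filter>
  </filters>
  <routes>
    <route id="start">
      <!-- reference to a filter defined above -->
      <filter refid="frontend"/>
      <!-- filter configured inline, so no refid is needed -->
      <filter type="log">
        <message>example</message>
      </filter>
      <filter refid="backend"/>
    </route>
  </routes>
</yp2>
]]></screen>
The route shows both forms of &lt;filter&gt; element: two that refer
back to the &lt;filters&gt; section by <literal>refid</literal>, and
one that carries its own configuration.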

<title>Filter configuration</title>
All &lt;filter&gt; elements have in common that they must carry a
<literal>type</literal> attribute whose value is one of the
supported ones, listed in the schema file and discussed below. In
addition, &lt;filter&gt;s occurring in the &lt;filters&gt; section
must have an <literal>id</literal> attribute, and those occurring
within a route must have either a <literal>refid</literal>
attribute referencing a previously defined filter or contain their
own configuration information.

In general, each filter recognises different configuration
elements within its element, as each filter has different
functionality. These are as follows:

<title><literal>auth_simple</literal></title>
<screen><![CDATA[
<filter type="auth_simple">
  <userRegister>../etc/example.simple-auth</userRegister>
</filter>
]]></screen>
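
The overview of <literal>auth_simple</literal> also mentions an
optional target register. A configuration naming both files might
look like the sketch below. Note that the element name
<literal>targetRegister</literal> is an assumption made by analogy
with <literal>userRegister</literal>, and the path
<filename>../etc/example.target-auth</filename> is a hypothetical
file name; check both against the schema before relying on them.
<screen><![CDATA[
<filter type="auth_simple">
  <!-- username:password pairs, one per line -->
  <userRegister>../etc/example.simple-auth</userRegister>
  <!-- ASSUMED element name; username:dbname,dbname... sets, one per line -->
  <targetRegister>../etc/example.target-auth</targetRegister>
</filter>
]]></screen>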

<title><literal>backend_test</literal></title>
<screen><![CDATA[
<filter type="backend_test"/>
]]></screen>

<title><literal>frontend_net</literal></title>
<screen><![CDATA[
<filter type="frontend_net">
  <threads>10</threads>
  <port>@:9000</port>
</filter>
]]></screen>

<title><literal>http_file</literal></title>
<screen><![CDATA[
<filter type="http_file">
  <mimetypes>/etc/mime.types</mimetypes>
  <documentroot>.</documentroot>
  <prefix>/etc</prefix>
</filter>
]]></screen>

<title><literal>log</literal></title>
<screen><![CDATA[
<filter type="log">
  <message>B</message>
</filter>
]]></screen>

<title><literal>multi</literal></title>
<screen><![CDATA[
<filter type="multi"/>
]]></screen>

<title><literal>query_rewrite</literal></title>
<screen><![CDATA[
<filter type="query_rewrite">
  <xslt>pqf2pqf.xsl</xslt>
</filter>
]]></screen>
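
The stylesheet named in the &lt;xslt&gt; element operates on the XML
representation of the query. Since that representation is not
described here, the sketch below is only the standard XSLT identity
transform, which copies its input unchanged and so would leave every
query as it was. It is offered as a possible starting point for a
rewriting stylesheet, and is not the content of the actual
<filename>pqf2pqf.xsl</filename> file.
<screen><![CDATA[
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- identity template: copy every node and attribute unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
]]></screen>
A real rewriting rule would be added as a further template that
matches the specific elements or attributes to be changed, leaving
the identity template to handle everything else.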

<title><literal>session_shared</literal></title>
<screen><![CDATA[
<filter type="session_shared">
</filter>
]]></screen>

<title><literal>template</literal></title>
<screen><![CDATA[
<filter type="template"/>
]]></screen>

<title><literal>virt_db</literal></title>
<screen><![CDATA[
<filter type="virt_db">
  <database>loc</database>
  <target>z3950.loc.gov:7090/voyager</target>
  <database>idgils</database>
  <target>indexdata.dk/gils</target>
</filter>
]]></screen>

<title><literal>z3950_client</literal></title>
<screen><![CDATA[
<filter type="z3950_client">
  <timeout>30</timeout>
</filter>
]]></screen>
<chapter id="progref">
<title>Metaproxy invocation</title>
The material in this chapter includes the man pages material.

<chapter id="moduleref">
<title>Reference guide to Metaproxy filters</title>
The material in this chapter includes the man pages material.

<chapter id="extensions">
<title>Writing extensions for Metaproxy</title>

<chapter id="classes">
<title>Classes in the Metaproxy source code</title>
<title>Introductory notes</title>
<emphasis>Stop! Do not read this!</emphasis>
You won't enjoy it at all.

This chapter contains documentation of the Metaproxy source code, and is
of interest only to maintainers and developers. If you need to
change Metaproxy's behaviour or write a new filter, then you will most
likely find this chapter helpful. Otherwise it's a waste of your
good time. Seriously: go and watch a film or something.
<citetitle>This is Spinal Tap</citetitle> is particularly good.

Still here? OK, let's continue.

In general, classes seem to be named big-endianly, so that
<literal>FactoryFilter</literal> is not a filter that filters
factories, but a factory that produces filters; and
<literal>FactoryStatic</literal> is a factory for the statically
registered filters (as opposed to those that are dynamically
loaded).

<title>Individual classes</title>
The classes making up the Metaproxy application are here listed by
class-name, with the names of the source files that define them in
parentheses.

<title><literal>mp::FactoryFilter</literal>
(<filename>factory_filter.cpp</filename>)</title>
A factory class that exists primarily to provide the
<literal>create()</literal> method, which takes the name of a
filter class as its argument and returns a new filter of that
type. To enable this, the factory must first be populated by
calling <literal>add_creator()</literal> for static filters (this
is done by the <literal>FactoryStatic</literal> class, see below)
and <literal>add_creator_dyn()</literal> for filters loaded
dynamically.

<title><literal>mp::FactoryStatic</literal>
(<filename>factory_static.cpp</filename>)</title>
A subclass of <literal>FactoryFilter</literal> which is
responsible for registering all the statically defined filter
types. It does this by knowing about all those filters'
structures, which are listed in its constructor. Merely
instantiating this class registers all the static classes. It is
for the benefit of this class that <literal>struct
yp2_filter_struct</literal> exists, and that all the filter
classes provide a static object of that type.

<title><literal>mp::filter::Base</literal>
(<filename>filter.cpp</filename>)</title>
The virtual base class of all filters. The filter API is, on the
surface at least, extremely simple: two methods.
<literal>configure()</literal> is passed a DOM tree representing
that part of the configuration file that pertains to this filter
instance, and is expected to walk that tree extracting relevant
information. And <literal>process()</literal> processes a
package (see below). That surface simplicity is a bit
misleading, as <literal>process()</literal> needs to know a lot
about the <literal>Package</literal> class in order to do

<title><literal>mp::filter::AuthSimple</literal>,
<literal>Backend_test</literal>, etc.
(<filename>filter_auth_simple.cpp</filename>,
<filename>filter_backend_test.cpp</filename>, etc.)</title>
Individual filters. Each of these is implemented by a header and
a source file, named <filename>filter_*.hpp</filename> and
<filename>filter_*.cpp</filename> respectively. All the header
files should be pretty much identical, in that they declare the
class, including a private <literal>Rep</literal> class and a
member pointer to it, and the two public methods. The only extra
information in any filter header is additional private types and
members (which should really all be in the <literal>Rep</literal>
anyway) and private methods (which should also remain known only
to the source file, but C++'s brain-damaged design requires this
dirty laundry to be exhibited in public. Thanks, Bjarne!)

The source file for each filter needs to supply:
<itemizedlist>
<listitem><para>
A definition of the private <literal>Rep</literal> class.
</para></listitem>
<listitem><para>
Some boilerplate constructors and destructors.
</para></listitem>
<listitem><para>
A <literal>configure()</literal> method that uses the
appropriate XML fragment.
</para></listitem>
<listitem><para>
Most important, the <literal>process()</literal> method that
does all the actual work.
</para></listitem>
</itemizedlist>

<title><literal>mp::Package</literal>
(<filename>package.cpp</filename>)</title>
Represents a package on its way through the series of filters
that make up a route. This is essentially a Z39.50 or SRU APDU
together with information about where it came from, which is
modified as it passes through the various filters.

<title><literal>mp::Pipe</literal>
(<filename>pipe.cpp</filename>)</title>
This class provides a compatibility layer so that we have an IPC
mechanism that works the same under Unix and Windows. It's not
particularly exciting.

<title><literal>mp::RouterChain</literal>
(<filename>router_chain.cpp</filename>)</title>

<title><literal>mp::RouterFleXML</literal>
(<filename>router_flexml.cpp</filename>)</title>

<title><literal>mp::Session</literal>
(<filename>session.cpp</filename>)</title>

<title><literal>mp::ThreadPoolSocketObserver</literal>
(<filename>thread_pool_observer.cpp</filename>)</title>

<title><literal>mp::util</literal>
(<filename>util.cpp</filename>)</title>
A namespace of various small utility functions and classes,
collected together for convenience. Most importantly, includes
the <literal>mp::util::odr</literal> class, a wrapper for YAZ's

<title><literal>mp::xml</literal>
(<filename>xmlutil.cpp</filename>)</title>
A namespace of various XML utility functions and classes,
collected together for convenience.

<title>Other Source Files</title>
In addition to the Metaproxy source files that define the classes
described above, there are a few additional files which are
briefly described here:

<term><literal>metaproxy_prog.cpp</literal></term>
The main function of the <command>yp2</command> program.

<term><literal>ex_router_flexml.cpp</literal></term>
Identical to <literal>metaproxy_prog.cpp</literal>: it's not clear why.

<term><literal>test_*.cpp</literal></term>
Unit-tests for various modules.

### Still to be described:
<literal>ex_filter_frontend_net.cpp</literal>,
<literal>filter_dl.cpp</literal>,
<literal>plainfile.cpp</literal>,
<literal>tstdl.cpp</literal>.

<!-- This is just a lame way to get some vertical whitespace at
the end of the document -->

<!-- Keep this comment at the end of the file
sgml-minimize-attributes:nil
sgml-always-quote-attributes:t
sgml-parent-document: "main.xml"
sgml-local-catalogs: nil
sgml-namecase-general:t
nxml-child-indent: 1
-->