1 <!-- $Id: book.xml,v 1.5 2006-03-31 16:05:27 mike Exp $ -->
3 <title>Metaproxy - User's Guide and Reference</title>
5 <firstname>Mike</firstname><surname>Taylor</surname>
8 <firstname>Adam</firstname><surname>Dickmeiss</surname>
12 <holder>Index Data</holder>
16 Metaproxy - universal Z39.50/SRU router, proxy and encapsulated metasearcher
21 <chapter id="introduction">
22 <title>Introduction</title>
26 <title>Overview</title>
28 <ulink url="http://indexdata.dk/metaproxy/">Metaproxy</ulink>
29 is a standalone program that acts as a universal router, proxy and
30 encapsulated metasearcher for information retrieval protocols such
31 as Z39.50 and SRU/SRW. To clients, it acts as a server of these
32 protocols: it can be searched, records can be retrieved from it,
33 etc. To servers, it acts as a client: it searches in them,
34 retrieves records from them, etc. It satisfies its clients'
35 requests by transforming them, multiplexing them, forwarding them
36 on to zero or more servers, merging the results, transforming
37 them, and delivering them back to the client.
40 Metaproxy is a more capable alternative to
41 <ulink url="http://indexdata.dk/yazproxy/">YAZ Proxy</ulink>,
42 being more powerful, flexible, configurable and extensible. Among
43 its many advantages over the older, more pedestrian work are
44 support for multiplexing (encapsulated metasearching), routing by
45 database name, authentication and authorisation, and serving local
46 files via HTTP. Equally significant, its modular architecture
47 facilitates the creation of pluggable modules implementing further
55 <chapter id="licence">
56 <title>The Metaproxy Licence</title>
58 <emphasis role="strong">
59 No decision has yet been made on the terms under which
60 Metaproxy will be distributed.
62 It is possible that, unlike
63 other Index Data products, Metaproxy may not be released under a
64 free-software licence such as the GNU GPL. Until a decision is
65 made and a public statement made, then, and unless it has been
66 delivered to you under other specific terms, please treat Metaproxy as
67 though it were proprietary software.
73 <chapter id="architecture">
74 <title>The Metaproxy Architecture</title>
76 The Metaproxy architecture is based on three concepts:
77 the <emphasis>package</emphasis>,
78 the <emphasis>route</emphasis>
79 and the <emphasis>filter</emphasis>.
86 A package is a request or response, encoded in some protocol,
87 issued by a client, making its way through Metaproxy, sent to or
88 received from a server, or sent back to the client.
91 The core of a package is the protocol unit - for example, a
92 Z39.50 Init Request or Search Response, or an SRU searchRetrieve
93 URL or Explain Response. In addition to this core, a package
94 also carries some extra information added and used by Metaproxy
98 In general, packages are doctored as they pass through
99 Metaproxy. For example, when the proxy performs authentication
100 and authorisation on a Z39.50 Init request, it removes the
101 authentication credentials from the package so that they are not
102 passed on to the back-end server; and when search-response
103 packages are obtained from multiple servers, they are merged
104 into a single unified package that makes its way back to the
113 Packages make their way through routes, which can be thought of
114 as programs that operate on the package data-type. Each
115 incoming package initially makes its way through a default
116 route, but may be switched to a different route based on various
117 considerations. Routes are made up of sequences of filters (see
126 Filters provide the individual instructions within a route, and
127 effect the necessary transformations on packages. A particular
128 configuration of Metaproxy is essentially a set of filters,
129 described by configuration details and arranged in order in one
130 or more routes. There are many kinds of filter - about a dozen
131 at the time of writing, with more appearing all the time - each
132 performing a specific function and configured by different
136 The word ``filter'' is sometimes used rather loosely, in two
137 different ways: it may be used to mean a particular
138 <emphasis>type</emphasis> of filter, as when we speak of ``the
139 auth_simple filter'' or ``the multi filter''; or it may be used
140 to mean a specific instance of a filter within a Metaproxy
141 configuration. For example, a single configuration will often
142 contain multiple instances of the z3950_client filter. In
143 operational terms, each of these is a separate filter. In practice,
144 context always makes it clear which sense of the word ``filter''
148 Extensibility of Metaproxy is primarily through the creation of
149 plugins that provide new filters. The filter API is small and
150 conceptually simple, but there are many details to master. See
152 <link linkend="extensions">extensions</link>.
158 Since packages are created and handled by the system itself, and
159 routes are conceptually simple, most of the remainder of this
160 document concentrates on filters. A brief overview of the
161 filter types follows, along with some thoughts on possible future
168 <chapter id="filters">
169 <title>Filters</title>
173 <title>Introductory notes</title>
175 It's useful to think of Metaproxy as an interpreter providing a small
176 number of primitives and operations, but operating on a very
177 complex data type, namely the ``package''.
180 A package represents a Z39.50 or SRW/U request (whether for Init,
181 Search, Scan, etc.) together with information about where it came
182 from. Packages are created by front-end filters such as
183 <literal>frontend_net</literal> (see below), which reads them from
184 the network; other front-end filters are possible. They then pass
185 along a route consisting of a sequence of filters, each of which
186 transforms the package and may also have side-effects such as
187 generating logging. Eventually, the route will yield a response,
188 which is sent back to the origin.
191 There are many kinds of filter: some that are defined statically
192 as part of Metaproxy, and others that may be provided by third parties
193 and dynamically loaded. They all conform to the same simple API
194 of essentially two methods: <function>configure()</function> is
195 called at startup time, and is passed a DOM tree representing that
196 part of the configuration file that pertains to this filter
197 instance: it is expected to walk that tree extracting relevant
198 information; and <function>process()</function> is called every
199 time the filter has to process a package.
202 While all filters provide the same API, there are different modes
203 of functionality. Some filters are sources: they create
205 (<literal>frontend_net</literal>);
206 others are sinks: they consume packages and return a result
207 (<literal>z3950_client</literal>,
208 <literal>backend_test</literal>,
209 <literal>http_file</literal>);
210 the others are true filters, that read, process and pass on the
211 packages they are fed
212 (<literal>auth_simple</literal>,
213 <literal>log</literal>,
214 <literal>multi</literal>,
215 <literal>query_rewrite</literal>,
216 <literal>session_shared</literal>,
217 <literal>template</literal>,
218 <literal>virt_db</literal>).
224 <title>Individual filters</title>
226 The filters are here named by the string that is used as the
227 <literal>type</literal> attribute of a
228 <literal><filter></literal> element in the configuration
229 file to request them, with the name of the class that implements
234 <title><literal>auth_simple</literal>
235 (mp::filter::AuthSimple)</title>
237 Simple authentication and authorisation. The configuration
238 specifies the name of a file that is the user register, which
239 lists <varname>username</varname>:<varname>password</varname>
240 pairs, one per line, colon separated. When a session begins, it
241 is rejected unless username and password are supplied, and match
242 a pair in the register.
245 ### discuss authorisation phase
250 <title><literal>backend_test</literal>
251 (mp::filter::Backend_test)</title>
253 A sink that provides dummy responses in the manner of the
254 <literal>yaz-ztest</literal> Z39.50 server. This is useful only
260 <title><literal>frontend_net</literal>
261 (mp::filter::FrontendNet)</title>
263 A source that accepts Z39.50 and SRW connections from a port
264 specified in the configuration, reads protocol units, and
265 feeds them into the next filter, eventually returning the
266 result to the origin.
271 <title><literal>http_file</literal>
272 (mp::filter::HttpFile)</title>
274 A sink that returns the contents of files from the local
275 filesystem in response to HTTP requests. (Yes, Virginia, this
276 does mean that Metaproxy is also a Web-server in its spare time. So
277 far it does not contain either an email-reader or a Lisp
278 interpreter, but that day is surely coming.)
283 <title><literal>log</literal>
284 (mp::filter::Log)</title>
286 Writes logging information to standard output, and passes on
287 the package unchanged.
292 <title><literal>multi</literal>
293 (mp::filter::Multi)</title>
295 Performs multicast searching. See the extended discussion of
296 multi-database searching below.
301 <title><literal>session_shared</literal>
302 (mp::filter::SessionShared)</title>
304 When this is finished, it will implement global sharing of
305 result sets (i.e. between threads and therefore between
306 clients), but it's not yet done.
311 <title><literal>template</literal>
312 (mp::filter::Template)</title>
314 Does nothing at all, merely passing the package on. (Maybe it
315 should be called <literal>nop</literal> or
316 <literal>passthrough</literal>?) This exists not to be used, but
317 to be copied - to become the skeleton of new filters as they are
323 <title><literal>virt_db</literal>
324 (mp::filter::Virt_db)</title>
326 Performs virtual database selection. See the extended discussion
327 of virtual databases below.
332 <title><literal>z3950_client</literal>
333 (mp::filter::Z3950Client)</title>
335 Performs Z39.50 searching and retrieval by proxying the
336 packages that are passed to it. Init requests are sent to the
337 address specified in the <literal>VAL_PROXY</literal> otherInfo
338 attached to the request: this may have been specified by the client,
339 or generated by a <literal>virt_db</literal> filter earlier in
340 the route. Subsequent requests are sent to the same address,
341 which is remembered at Init time in a Session object.
348 <title>Future directions</title>
350 Some other filters that do not yet exist, but which would be
351 useful, are briefly described. These may be added in future
357 <term><literal>frontend_cli</literal> (source)</term>
360 Command-line interface for generating requests.
365 <term><literal>srw2z3950</literal> (filter)</term>
368 Translate SRW requests into Z39.50 requests.
373 <term><literal>srw_client</literal> (sink)</term>
376 SRW searching and retrieval.
381 <term><literal>sru_client</literal> (sink)</term>
384 SRU searching and retrieval.
389 <term><literal>opensearch_client</literal> (sink)</term>
392 A9 OpenSearch searching and retrieval.
402 <chapter id="configuration">
403 <title>Configuration: the Metaproxy configuration file format</title>
407 <title>Introductory notes</title>
409 If Metaproxy is an interpreter providing operations on packages, then
410 its configuration file can be thought of as a program for that
411 interpreter. Configuration is by means of a single file, the name
412 of which is supplied as the sole command-line argument to the
413 <command>yp2</command> program.
416 The configuration files are written in XML. (But that's just an
417 implementation detail - they could just as well have been written
418 in YAML or Lisp-like S-expressions, or in a custom syntax.)
421 Since XML has been chosen, an XML schema,
422 <filename>config.xsd</filename>, is provided for validating
423 configuration files. This file is supplied in the
424 <filename>etc</filename> directory of the Metaproxy distribution. It
425 can be used by (among other tools) the <command>xmllint</command>
426 program supplied as part of the <literal>libxml2</literal>
430 xmllint --noout --schema etc/config.xsd my-config-file.xml
433 (A recent version of <literal>libxml2</literal> is required, as
434 support for XML Schemas is a relatively recent addition.)
439 <title>Overview of XML structure</title>
441 All elements and attributes are in the namespace
442 <ulink url="http://indexdata.dk/yp2/config/1"/>.
443 This is most easily achieved by setting the default namespace on
444 the top-level element, as here:
447 <yp2 xmlns="http://indexdata.dk/yp2/config/1">
450 The top-level element is <yp2>. This contains a
451 <start> element, a <filters> element and a
452 <routes> element, in that order. <filters> is
453 optional; the other two are mandatory. All three are
457 The <start> element is empty, but carries a
458 <literal>route</literal> attribute, whose value is the name of the
459 route at which to start running - analogous to the name of the
460 start production in a formal grammar.
463 If present, <filters> contains zero or more <filter>
464 elements; filters carry a <literal>type</literal> attribute and
465 contain various elements that provide suitable configuration for
466 filters of that type. The filter-specific elements are described
467 below. Filters defined in this part of the file must carry an
468 <literal>id</literal> attribute so that they can be referenced
472 <routes> contains one or more <route> elements, each
473 of which must carry an <literal>id</literal> attribute. One of the
474 routes must have the ID value that was specified as the start
475 route in the <start> element's <literal>route</literal>
476 attribute. Each route contains zero or more <filter>
477 elements. These are of two types. They may be empty, but carry a
478 <literal>refid</literal> attribute whose value is the same as the
479 <literal>id</literal> of a filter previously defined in the
480 <filters> section. Alternatively, a filter within a route
481 may omit the <literal>refid</literal> attribute, but contain
482 configuration elements similar to those used for filters defined
483 in the <filters> section.
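Putting these rules together, a complete configuration file might look
like the following sketch. The IDs ``frontend'', ``backend'' and
``start'' are arbitrary names chosen for this example, the
filter-specific elements are copied from the samples later in this
chapter, and the file has not been checked against
<filename>config.xsd</filename>. Note that the
<literal>z3950_client</literal> sink used here relies on the target
address being supplied with the request, as described in the filter
overview.
<screen><![CDATA[
<yp2 xmlns="http://indexdata.dk/yp2/config/1">
  <!-- the route at which processing starts -->
  <start route="start"/>
  <!-- filters defined here must carry an id -->
  <filters>
    <filter id="frontend" type="frontend_net">
      <threads>10</threads>
      <port>@:9000</port>
    </filter>
    <filter id="backend" type="z3950_client">
      <timeout>30</timeout>
    </filter>
  </filters>
  <routes>
    <route id="start">
      <!-- empty filter referring to a definition above -->
      <filter refid="frontend"/>
      <!-- filter configured inline, without refid -->
      <filter type="auth_simple">
        <userRegister>../etc/example.simple-auth</userRegister>
      </filter>
      <filter refid="backend"/>
    </route>
  </routes>
</yp2>
]]></screen>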
489 <title>Filter configuration</title>
491 All <filter> elements have in common that they must carry a
492 <literal>type</literal> attribute whose value is one of the
493 supported ones, listed in the schema file and discussed below. In
494 addition, <filter>s occurring in the <filters> section
495 must have an <literal>id</literal> attribute, and those occurring
496 within a route must have either a <literal>refid</literal>
497 attribute referencing a previously defined filter or contain its
498 own configuration information.
501 In general, each filter recognises different configuration
502 elements within its element, as each filter has different
503 functionality. These are as follows:
507 <title><literal>auth_simple</literal></title>
509 <filter type="auth_simple">
510 <userRegister>../etc/example.simple-auth</userRegister>
516 <title><literal>backend_test</literal></title>
518 <filter type="backend_test"/>
523 <title><literal>frontend_net</literal></title>
525 <filter type="frontend_net">
526 <threads>10</threads>
527 <port>@:9000</port>
533 <title><literal>http_file</literal></title>
535 <filter type="http_file">
536 <mimetypes>/etc/mime.types</mimetypes>
538 <documentroot>.</documentroot>
539 <prefix>/etc</prefix>
546 <title><literal>log</literal></title>
548 <filter type="log">
549 <message>B</message>
555 <title><literal>multi</literal></title>
557 <filter type="multi"/>
562 <title><literal>session_shared</literal></title>
564 <filter type="session_shared">
571 <title><literal>template</literal></title>
573 <filter type="template"/>
578 <title><literal>virt_db</literal></title>
580 <filter type="virt_db">
582 <database>loc</database>
583 <target>z3950.loc.gov:7090/voyager</target>
586 <database>idgils</database>
587 <target>indexdata.dk/gils</target>
594 <title><literal>z3950_client</literal></title>
596 <filter type="z3950_client">
597 <timeout>30</timeout>
606 <chapter id="multidb">
607 <title>Virtual databases and multi-database searching</title>
611 <title>Introductory notes</title>
613 Two of Metaproxy's filters are concerned with multiple-database
614 operations. Of these, <literal>virt_db</literal> can work alone
615 to control the routing of searches to one of a number of servers,
616 while <literal>multi</literal> can work with the output of
617 <literal>virt_db</literal> to perform multicast searching, merging
618 the results into a unified result-set. The interaction between
619 these two filters is necessarily complex, reflecting the real
620 complexity of multicast searching in a protocol such as Z39.50
621 that separates initialisation from searching, with the database to
622 search known only during the latter operation.
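A route combining the two filters might look like the following sketch.
It assumes that <literal>multi</literal> is placed directly after
<literal>virt_db</literal> and that a <literal>z3950_client</literal>
sink ends the route; the route ID is arbitrary, and the
database-to-target mappings inside <literal>virt_db</literal> are left
out here (their form is shown in the configuration chapter).
<screen><![CDATA[
<route id="start">
  <filter type="frontend_net">
    <threads>10</threads>
    <port>@:9000</port>
  </filter>
  <!-- selects the target server(s) by database name;
       the mappings are omitted in this sketch -->
  <filter type="virt_db">
    <!-- ... -->
  </filter>
  <!-- searches the selected targets and merges the results -->
  <filter type="multi"/>
  <filter type="z3950_client">
    <timeout>30</timeout>
  </filter>
</route>
]]></screen>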
625 ### Much, much more to say!
630 <chapter id="moduleref">
631 <title>Module Reference</title>
633 The material in this chapter includes the man pages material
638 <chapter id="extensions">
639 <title>Writing extensions for Metaproxy</title>
643 <chapter id="classes">
644 <title>Classes in the Metaproxy source code</title>
648 <title>Introductory notes</title>
650 <emphasis>Stop! Do not read this!</emphasis>
651 You won't enjoy it at all.
654 This chapter contains documentation of the Metaproxy source code, and is
655 of interest only to maintainers and developers. If you need to
656 change Metaproxy's behaviour or write a new filter, then you will most
657 likely find this chapter helpful. Otherwise it's a waste of your
658 good time. Seriously: go and watch a film or something.
659 <citetitle>This is Spinal Tap</citetitle> is particularly good.
662 Still here? OK, let's continue.
665 In general, classes seem to be named big-endianly, so that
666 <literal>FactoryFilter</literal> is not a filter that filters
667 factories, but a factory that produces filters; and
668 <literal>FactoryStatic</literal> is a factory for the statically
669 registered filters (as opposed to those that are dynamically
675 <title>Individual classes</title>
677 The classes making up the Metaproxy application are here listed by
678 class-name, with the names of the source files that define them in
683 <title><literal>mp::FactoryFilter</literal>
684 (<filename>factory_filter.cpp</filename>)</title>
686 A factory class that exists primarily to provide the
687 <literal>create()</literal> method, which takes the name of a
688 filter class as its argument and returns a new filter of that
689 type. To enable this, the factory must first be populated by
690 calling <literal>add_creator()</literal> for static filters (this
691 is done by the <literal>FactoryStatic</literal> class, see below)
692 and <literal>add_creator_dyn()</literal> for filters loaded
698 <title><literal>mp::FactoryStatic</literal>
699 (<filename>factory_static.cpp</filename>)</title>
701 A subclass of <literal>FactoryFilter</literal> which is
702 responsible for registering all the statically defined filter
703 types. It does this by knowing about all those filters'
704 structures, which are listed in its constructor. Merely
705 instantiating this class registers all the static classes. It is
706 for the benefit of this class that <literal>struct
707 yp2_filter_struct</literal> exists, and that all the filter
708 classes provide a static object of that type.
713 <title><literal>mp::filter::Base</literal>
714 (<filename>filter.cpp</filename>)</title>
716 The virtual base class of all filters. The filter API is, on the
717 surface at least, extremely simple: two methods.
718 <literal>configure()</literal> is passed a DOM tree representing
719 that part of the configuration file that pertains to this filter
720 instance, and is expected to walk that tree extracting relevant
721 information. And <literal>process()</literal> processes a
722 package (see below). That surface simplicity is a bit
723 misleading, as <literal>process()</literal> needs to know a lot
724 about the <literal>Package</literal> class in order to do
730 <title><literal>mp::filter::AuthSimple</literal>,
731 <literal>Backend_test</literal>, etc.
732 (<filename>filter_auth_simple.cpp</filename>,
733 <filename>filter_backend_test.cpp</filename>, etc.)</title>
735 Individual filters. Each of these is implemented by a header and
736 a source file, named <filename>filter_*.hpp</filename> and
737 <filename>filter_*.cpp</filename> respectively. All the header
738 files should be pretty much identical, in that they declare the
739 class, including a private <literal>Rep</literal> class and a
740 member pointer to it, and the two public methods. The only extra
741 information in any filter header is additional private types and
742 members (which should really all be in the <literal>Rep</literal>
743 anyway) and private methods (which should also remain known only
744 to the source file, but C++'s brain-damaged design requires this
745 dirty laundry to be exhibited in public. Thanks, Bjarne!)
748 The source file for each filter needs to supply:
753 A definition of the private <literal>Rep</literal> class.
758 Some boilerplate constructors and destructors.
763 A <literal>configure()</literal> method that uses the
764 appropriate XML fragment.
769 Most important, the <literal>process()</literal> method that
770 does all the actual work.
777 <title><literal>mp::Package</literal>
778 (<filename>package.cpp</filename>)</title>
780 Represents a package on its way through the series of filters
781 that make up a route. This is essentially a Z39.50 or SRU APDU
782 together with information about where it came from, which is
783 modified as it passes through the various filters.
788 <title><literal>mp::Pipe</literal>
789 (<filename>pipe.cpp</filename>)</title>
791 This class provides a compatibility layer so that we have an IPC
792 mechanism that works the same under Unix and Windows. It's not
793 particularly exciting.
798 <title><literal>mp::RouterChain</literal>
799 (<filename>router_chain.cpp</filename>)</title>
806 <title><literal>mp::RouterFleXML</literal>
807 (<filename>router_flexml.cpp</filename>)</title>
814 <title><literal>mp::Session</literal>
815 (<filename>session.cpp</filename>)</title>
822 <title><literal>mp::ThreadPoolSocketObserver</literal>
823 (<filename>thread_pool_observer.cpp</filename>)</title>
830 <title><literal>mp::util</literal>
831 (<filename>util.cpp</filename>)</title>
833 A namespace of various small utility functions and classes,
834 collected together for convenience. Most importantly, it includes
835 the <literal>mp::util::odr</literal> class, a wrapper for YAZ's
841 <title><literal>mp::xml</literal>
842 (<filename>xmlutil.cpp</filename>)</title>
844 A namespace of various XML utility functions and classes,
845 collected together for convenience.
852 <title>Other Source Files</title>
854 In addition to the Metaproxy source files that define the classes
855 described above, there are a few additional files which are
856 briefly described here:
860 <term><literal>metaproxy_prog.cpp</literal></term>
863 The main function of the <command>yp2</command> program.
868 <term><literal>ex_router_flexml.cpp</literal></term>
871 Identical to <literal>metaproxy_prog.cpp</literal>: it's not clear why.
876 <term><literal>test_*.cpp</literal></term>
879 Unit-tests for various modules.
885 ### Still to be described:
886 <literal>ex_filter_frontend_net.cpp</literal>,
887 <literal>filter_dl.cpp</literal>,
888 <literal>plainfile.cpp</literal>,
889 <literal>tstdl.cpp</literal>.
898 <!-- This is just a lame way to get some vertical whitespace at
899 the end of the document -->
908 <!-- Keep this comment at the end of the file
913 sgml-minimize-attributes:nil
914 sgml-always-quote-attributes:t
917 sgml-parent-document: "main.xml"
918 sgml-local-catalogs: nil
919 sgml-namecase-general:t