<section id="connectors">
<title>Connectors to non-standard databases</title>
<para>
- If you wish to connect to commercial or other databases which do not
- support open standards, please contact Index Data on
- <email>info@indexdata.com</email>. We have a
- proprietary framework for building connectors that enable Pazpar2
- to access
- thousands of online databases, in addition to the vast number of catalogs
- and online services that support the Z39.50/SRU/SRW/SOLR protocols.
+ If you need to access commercial or open access resources that don't support
+ Z39.50 or SRU, one approach would be to use a tool like <ulink
+ url="&url.simpleserver;">SimpleServer</ulink> to build a
+ gateway. An easier option is to use Index Data's <ulink
+ url="&url.mkc;">MasterKey Connect</ulink>
+ service, which exposes virtually <emphasis>any</emphasis> resource
+ through Z39.50/SRU and is therefore easy to integrate with Pazpar2.
+ The service is hosted, so all you have to do is let us
+ know which resources you are interested in, and we operate the gateways,
+ or Connectors, for you for a low annual charge.
+ Types of resources supported include
+ commercial databases, free online resources, and even local resources;
+ almost anything that can be accessed through a web-facing user
+ interface can be accessed in this way.
+ Contact <email>info@indexdata.com</email> for more information.
+ See <xref linkend="masterkey_connect"/> for an example.
</para>
</section>
-
+
<section id="name">
<title>A note on the name Pazpar2</title>
<para>
&sect-ajaxdev;
- <section id="nonstandard">
- <title>Connecting to non-standard resources</title>
- <para>
- Pazpar2 uses Z39.50 as its switchboard language -- i.e. as far as it
- is concerned, all resources speak Z39.50, its webservices derivatives,
- SRU/SRW and SOLR servers exposing Lucene indexes. It is, however, equipped
- to handle a broad range of different server behavior, through
- configurable query mapping and record normalization. If you develop
- configuration, stylesheets, etc., for a new type of resources, we
- encourage you to share your work. But you can also use Pazpar2 to
- connect to hundreds of resources that do not support standard
- protocols.
- </para>
-
- <para>
- For a growing number of resources, Z39.50 is all you need. Over the
- last few years, a number of commercial, full-text resources have
- implemented Z39.50. These can be used through Pazpar2 with little or
- no effort. Resources that use non-standard record formats will
- require a bit of XSLT work, but that's all.
- </para>
-
- <para>
- But what about resources that don't support Z39.50 at all?
- Some resources might support OpenSearch, private, XML/HTTP-based
- protocols, or something else entirely.
- Some databases exist only as web user interfaces and
- will require screen-scraping. Still others exist only as static
- files, or perhaps as databases supporting the OAI-PMH protocol.
- There is hope! Read on.
- </para>
-
- <para>
- Index Data continues to advocate the support of open standards. We
- work with database vendors to support standards, so you don't have
- to worry about programming against non-standard services. We also
- provide tools (see <ulink
- url="http://www.indexdata.com/simpleserver">SimpleServer</ulink>)
- which make it comparatively easy to build gateways against servers
- with non-standard behavior. Again, we encourage you to share any
- work you do in this direction.
- </para>
-
- <para>
- But the bottom line is that working with non-standard resources in
- metasearching is really, really hard. If you want to build a
- project with Pazpar2, and you need access to resources with
- non-standard interfaces, we can help. We run gateways to more than
- 2,000 popular, commercial databases and other resources,
- making it simple
- to plug them directly into Pazpar2. For a small annual fee per
- database, we can help you establish connections to your licensed
- resources. Meanwhile, you can help! If you build your own
- standards-compliant gateways, host them for others, or share the
- code! And tell your vendors that they can save everybody money and
- increase the appeal of their resources by supporting standards.
- </para>
-
- <para>
- There are those who will ask us why we are using Z39.50 as our
- switchboard language rather than a different protocol. Basically,
- we believe that Z39.50 is presently the most widely implemented
- information retrieval protocol that has the level of functionality
- required to support a good metasearching experience (structured
- searching, structured, well-defined results). It is also compact and
- efficient, and there is a very broad range of tools available to
- implement it.
- </para>
- </section>
-
<section id="unicode">
<title>Unicode Compliance</title>
<para>
</section>
+ <section id="relevance_ranking">
+ <title>Relevance ranking</title>
+ <para>
+ Pazpar2 uses a variant of the term frequency–inverse document frequency
+ (Tf-idf) ranking algorithm.
+ </para>
+ <para>
+ The Tf-part is straightforward to calculate and is based on the
+ documents that Pazpar2 fetches. The idf-part, however, is trickier
+ because the corpus at hand contains only the relevant documents and not
+ the irrelevant ones. Pazpar2 does not have the full corpus -- only the
+ documents that match a particular search.
+ </para>
+ <para>
+ Computation of the Tf-part is based on the normalized documents.
+ The length, the position and the terms are thus normalized at this point.
+ The computation is also performed for each document received from the
+ target, before merging takes place. The result of a TF-computation is
+ added to the TF-total of a cluster. Thus, if a document occurs twice,
+ the TF-part is doubled. That, however, can be adjusted, because the
+ TF-part may be divided by the number of documents in a cluster.
+ </para>
+ <para>
+ The algorithm used by Pazpar2 has two phases. In phase one,
+ Pazpar2 computes a tf-array. This is done as records are
+ fetched from the database. The rank weight
+ <literal>w</literal> and the rank tweaks <literal>lead</literal>,
+ <literal>follow</literal> and <literal>length</literal> are in use here.
+
+ </para>
+ <screen><![CDATA[
+ tf[1,2,..N] = 0;
+ foreach document in a cluster
+   foreach field
+     w[1,2,..N] = 0;
+     for i = 1, .. N: (each term)
+       foreach pos (where term i occurs in field)
+         // w is configured weight for field
+         // pos is position of term in field
+         w[i] += w / (1 + log2(1+lead*pos))
+         if (d > 0)
+           w[i] += w[i] * follow / (1+log2(d))
+       // length: length of field (number of terms that is)
+       if (length strategy is "linear")
+         tf[i] += w[i] / length;
+       else if (length strategy is "log")
+         tf[i] += w[i] / log2(length);
+       else if (length strategy is "none")
+         tf[i] += w[i];
+ ]]></screen>
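+ <para>
+ The phase-one pseudocode can be transcribed into a small runnable sketch
+ for experimenting with the rank tweaks. This is not Pazpar2's actual
+ implementation. The function name <literal>field_tf</literal> and the
+ input layout (a list of <literal>(pos, d)</literal> pairs per term) are
+ assumptions made for illustration. The text does not define
+ <literal>d</literal>; the sketch simply takes it as an input and assumes
+ it is either 0 or at least 1. The fallback for a zero divisor is also an
+ assumption.
+ </para>

```python
import math


def field_tf(term_positions, length, w=1.0, lead=0.0, follow=0.0,
             length_strategy="linear"):
    """Per-term tf contribution of one field (illustrative sketch only).

    term_positions: one list per query term, holding (pos, d) pairs.
      pos is the position of the term in the field; d is the value tested
      by the "follow" tweak (assumed to be 0 or >= 1).
    length: number of terms in the field.
    w: configured weight for the field.
    """
    tf = []
    for occurrences in term_positions:
        wi = 0.0
        for pos, d in occurrences:
            # Occurrences later in the field count less when lead > 0.
            wi += w / (1 + math.log2(1 + lead * pos))
            if d > 0:
                wi += wi * follow / (1 + math.log2(d))
        if length_strategy == "linear":
            divisor = length
        elif length_strategy == "log":
            divisor = math.log2(length) if length > 0 else 0
        else:  # "none"
            divisor = 1
        # Assumption: leave the weight undivided if the divisor would be 0.
        tf.append(wi / divisor if divisor > 0 else wi)
    return tf


def add_to_cluster(tf_total, field_values):
    """Add one field's tf values to the running tf-total of a cluster."""
    return [total + value for total, value in zip(tf_total, field_values)]
```

+ <para>
+ For example, a four-term field in which the first query term occurs
+ twice and the second not at all would be passed as
+ <literal>field_tf([[(0, 0), (2, 0)], []], 4)</literal>.
+ </para>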
+ <para>
+ In phase two, the idf-array is computed and the final score
+ is computed. This is done for each cluster as part of each show command.
+ The rank tweak <literal>cluster</literal> is in use here.
+ </para>
+ <screen><![CDATA[
+ // dococcur[i]: number of records where term occurs
+ // doctotal: number of records
+ for i = 1, .., N (each term)
+   if (dococcur[i] > 0)
+     idf[i] = log(1 + doctotal / dococcur[i])
+   else
+     idf[i] = 0;
+
+ relevance = 0;
+ for i = 1, .., N: (each term)
+   if (cluster is "yes")
+     tf[i] = tf[i] / cluster_size;
+   relevance += 100000 * tf[i] * idf[i];
+ ]]></screen>
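+ <para>
+ Phase two can be sketched in the same way. Again, this is an
+ illustration under stated assumptions rather than Pazpar2's code: the
+ function name <literal>relevance</literal> and its parameters are
+ hypothetical, and the sketch combines the two parts by multiplying tf
+ by idf, as Tf-idf schemes normally do, so that a term with an idf of 0
+ contributes nothing to the score.
+ </para>

```python
import math


def relevance(tf, dococcur, doctotal, cluster_size=1, cluster=True):
    """Score one cluster from its tf-array (illustrative sketch only).

    tf: accumulated tf value per query term for the cluster.
    dococcur: per term, the number of records in which the term occurs.
    doctotal: total number of records.
    cluster: when True, divide tf by the number of records in the cluster.
    """
    score = 0.0
    for tf_i, occur in zip(tf, dococcur):
        # Terms that occur in no record get an idf of 0.
        idf = math.log(1 + doctotal / occur) if occur > 0 else 0.0
        if cluster:
            tf_i = tf_i / cluster_size
        score += 100000 * tf_i * idf
    return score
```

+ <para>
+ With <literal>cluster=True</literal>, a cluster of two identical records
+ scores the same as a single record with the same per-record tf values,
+ which is the adjustment described for duplicated documents.
+ </para>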
+ </section> <!-- relevance_ranking -->
+
+ <section id="masterkey_connect">
+ <title>Pazpar2 and MasterKey Connect</title>
+ <para>
+ MasterKey Connect is a hosted connector, or gateway, service that exposes
+ whatever searchable resources you need. Since the service exposes all
+ resources using Z39.50 (or SRU), it is easy to set up Pazpar2 to use the
+ service. In particular, since all connectors expose essentially the same core
+ behavior, this is a good use case for Pazpar2's mechanism for managing
+ default behaviors across similar databases.
+ </para>
+ <para>
+ After installation of Pazpar2, the directory
+ <filename>/etc/pazpar2/settings/mkc</filename> (location may
+ vary depending on installation preferences) contains an example setup that
+ searches two different resources through a MasterKey Connect demo account.
+ The file <filename>mkc.xml</filename> contains default parameters that
+ will work for all MasterKey Connect resources (if you decide to become a
+ customer of the service, you will substitute your own account credentials
+ for the guest/guest demo credentials). The other files contain specific
+ information about a couple of demonstration resources.
+ </para>
+
+ <para>
+ To try the demo, create a symlink from
+ <filename>/etc/pazpar2/services-enabled/default.xml</filename>
+ to <filename>/etc/pazpar2/services-available/mkc.xml</filename>
+ and restart Pazpar2. You should now be able to search the two demo
+ resources using JSDemo or any user interface of your choice.
+ If you would like to learn more about MasterKey Connect, or to
+ try out the service for free against your favorite online resource,
+ contact us at <email>info@indexdata.com</email>.
+ </para>
+ </section>
</chapter> <!-- Using Pazpar2 -->