[Zebralist] relevancy ranking

marc marc at indexdata.dk
Wed Jan 31 21:51:46 CET 2007


Daine Mamacos wrote:
> How does the ranking algorithm work in zebra? Does it account for inverse
> document frequency of words and such? Can one apply a vector space model to
> this with the knowledge that a words relevancy is calculated inversely to
> its document frequency?
> 
> Thanks
> Daine.
> 

There is some information here
http://indexdata.com/zebra/doc/administration-ranking.tkl

One older thread on this:
http://lists.indexdata.dk/pipermail/zebralist/2003-April/000194.html

However, there is no actual info on the algorithm as such. It's in the 
source, but not that easily read:

index/rank1.c

So, to add a few comments:

Zebra does boolean queries and searches in specific adressed indexes 
(there are inverted indexes pointing from terms in the dictionaly to 
documents and term positions inside documents).

On top of this a, a TF/IDF like ranking algorithm has been build.

It works like this:
- first, the boolean query is dismantled into it's principal components,
   i.e. atomic queries where one term is looked up in one index.

   For example, the query

    @and @attr 2=102 @attr 1=1010 Utah @attr 1=1018 Springer

    is a boolean AND between the atomic parts

    @attr 2=102 @attr 1=1010 Utah
    and
    @attr 1=1018 Springer

- second, for each atomic query, the hit list of documents is computed

    for example, two hit lists for each index  @attr 1=1010  and  @attr
    1=1018 are computed

- third, each document in the hit list is assigned a score (_if_ ranking
   is enabled and requested in the query)  using a TF/IDF scheme

   for example, only the hit list of the index @attr 1=1010 is assigned
   scores, as this is the only boolean atomic part which usses the magic
   @attr 2=102 directive (which says, relevance ranking used in this part
   of the query)

- fourth, the atomic hist lists are merged according to the boolean
   conditions to a final hit list of documents to be returned

- fifth, the total score of a document is computed as a linear
   combination of the atomic scores of the atomic hit lists

   in this example, it's just the scores from one index.

   Another example:

    @attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 @attr 1=1010 city

   makes ranking/scoring in both indexes, where the final score is a
   linear combination of the @attr 1=4 index with weight 30 and the
   @attr 1=1010 index with weight 20  (the standard default weight is
   sqrt(1000) ~ 34 , as the Z39.50 standard prescribes that the top score
   is 1000 and the bottom score is 0, encoded in tintegers)



- finally, the final hit list is re-ordered according to scores.

Sounds complicated ?? It is, due to the fact that usual IR ranking only 
works on one single full-text index, whereas Zebra is a boolean search 
engine working on multiple, often different inverted indexes.

So, the crucial questions yet not answered is:

A) how do we invoke ranking in Zebra ?

B) how does the TD/IDF like algorithm looks like for each atomic query 
hit list ?

C) how is the linear combination of atomic scores done to give a total 
score ?

Unfortunately, the on-line documentation of the stable zebra release 2.0 
does not offer much help:
http://www.indexdata.dk/zebra/doc/


To answer A), B) and C), we have to look at the implementation. Go, and 
get the newest CVS tarball at

http://ftp.indexdata.dk/pub/snapshot/idzebra-2.0.10.tar.gz

and have a look at the documentation there, in the doc dir.

file://zebra/doc/administration-ranking.html

A)
This explains how to invoke and configure ranking

B) + C)
These details are found in

zebra/index/rank1.c

in the 'static int calc()' function. We ought to document this better, 
but we never made it.


static int calc (void *set_handle, zint sysno, zint staticrank,
                  int *stop_flag)
{
     int i, lo, divisor, score = 0;
     struct rank_set_info *si = (struct rank_set_info *) set_handle;

     if (!si->no_rank_entries)
         return -1;   /* ranking not enabled for any terms */

     for (i = 0; i < si->no_entries; i++)
     {
         yaz_log(log_level, "calc: i=%d rank_flag=%d lo=%d",
                 i, si->entries[i].rank_flag, si->entries[i].local_occur);
         if (si->entries[i].rank_flag && (lo = si->entries[i].local_occur))
             score += (8+log2_int (lo)) * si->entries[i].global_inv *
                 si->entries[i].rank_weight;
     }
     divisor = si->no_rank_entries * (8+log2_int 
(si->last_pos/si->no_entries));
     score = score / divisor;
     yaz_log(log_level, "calc sysno=" ZINT_FORMAT " score=%d", sysno, 
score);
     if (score > 1000)
         score = 1000;
     /* reset the counts for the next term */
     for (i = 0; i < si->no_entries; i++)
         si->entries[i].local_occur = 0;
     return score;
}


where lo = si->entries[i].local_occur is the local documents 
term-within-index frequency, si->entries[i].global_inv represents the 
IDF part (computed in static void *begin()), and
si->entries[i].rank_weight is the weight assigner per index (default 34, 
or set in the @attr 9=xyz magic)

Finally, the IDF part is computed as:

static void *begin (struct zebra_register *reg,
                     void *class_handle, RSET rset, NMEM nmem,
                     TERMID *terms, int numterms)
{
     struct rank_set_info *si =
         (struct rank_set_info *) nmem_malloc (nmem,sizeof(*si));
     int i;

     yaz_log(log_level, "rank-1 begin");
     si->no_entries = numterms;
     si->no_rank_entries = 0;
     si->nmem=nmem;
     si->entries = (struct rank_term_info *)
         nmem_malloc (si->nmem, sizeof(*si->entries)*numterms);
     for (i = 0; i < numterms; i++)
     {
         zint g = rset_count(terms[i]->rset);
         yaz_log(log_level, "i=%d flags=%s '%s'", i,
                 terms[i]->flags, terms[i]->name );
         if  (!strncmp (terms[i]->flags, "rank,", 5))
         {
             const char *cp = strstr(terms[i]->flags+4, ",w=");
             si->entries[i].rank_flag = 1;
             if (cp)
                 si->entries[i].rank_weight = atoi (cp+3);
             else
               si->entries[i].rank_weight = 34; /* sqrroot of 1000 */
             yaz_log(log_level, " i=%d weight=%d g="ZINT_FORMAT, i,
                      si->entries[i].rank_weight, g);
             (si->no_rank_entries)++;
         }
         else
             si->entries[i].rank_flag = 0;
         si->entries[i].local_occur = 0;  /* FIXME */
         si->entries[i].global_occur = g;
         si->entries[i].global_inv = 32 - log2_int (g);
         yaz_log(log_level, " global_inv = %d g = " ZINT_FORMAT,
                 (int) (32-log2_int (g)), g);
         si->entries[i].term = terms[i];
         si->entries[i].term_index=i;
         terms[i]->rankpriv = &(si->entries[i]);
     }
     return si;
}


where g = rset_count(terms[i]->rset) is the count of all documents in 
this specific index hit list, and the IDF part then is

  si->entries[i].global_inv = 32 - log2_int (g);


I hope this answers your question well enough.


Please feel free to inspect the tarball and ask further questions.


Your's Marc Cromme, Index Data



> 
> _______________________________________________
> Zebralist mailing list
> Zebralist at lists.indexdata.dk
> http://lists.indexdata.dk/cgi-bin/mailman/listinfo/zebralist
> 


-- 

Marc Cromme
M.Sc and Ph.D in Mathematical Modelling and Computation
Senior Developer, Project Manager

Index Data Aps
Købmagergade 43, 2
1150 Copenhagen K.
Denmark

tel: +45 3341 0100
fax: +45 3341 0101

http://www.indexdata.com

INDEX DATA Means Business
for Open Source and Open Standards







More information about the Zebralist mailing list