[Zebralist] relevancy ranking
marc
marc at indexdata.dk
Wed Jan 31 21:51:46 CET 2007
Daine Mamacos wrote:
> How does the ranking algorithm work in zebra? Does it account for inverse
> document frequency of words and such? Can one apply a vector space model to
> this with the knowledge that a words relevancy is calculated inversely to
> its document frequency?
>
> Thanks
> Daine.
>
There is some information here
http://indexdata.com/zebra/doc/administration-ranking.tkl
One older thread on this:
http://lists.indexdata.dk/pipermail/zebralist/2003-April/000194.html
However, there is no actual info on the algorithm as such. It's in the
source, but not that easily read:
index/rank1.c
So, to add a few comments:
Zebra does boolean queries and searches in specific adressed indexes
(there are inverted indexes pointing from terms in the dictionaly to
documents and term positions inside documents).
On top of this a, a TF/IDF like ranking algorithm has been build.
It works like this:
- first, the boolean query is dismantled into it's principal components,
i.e. atomic queries where one term is looked up in one index.
For example, the query
@and @attr 2=102 @attr 1=1010 Utah @attr 1=1018 Springer
is a boolean AND between the atomic parts
@attr 2=102 @attr 1=1010 Utah
and
@attr 1=1018 Springer
- second, for each atomic query, the hit list of documents is computed
for example, two hit lists for each index @attr 1=1010 and @attr
1=1018 are computed
- third, each document in the hit list is assigned a score (_if_ ranking
is enabled and requested in the query) using a TF/IDF scheme
for example, only the hit list of the index @attr 1=1010 is assigned
scores, as this is the only boolean atomic part which usses the magic
@attr 2=102 directive (which says, relevance ranking used in this part
of the query)
- fourth, the atomic hist lists are merged according to the boolean
conditions to a final hit list of documents to be returned
- fifth, the total score of a document is computed as a linear
combination of the atomic scores of the atomic hit lists
in this example, it's just the scores from one index.
Another example:
@attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 @attr 1=1010 city
makes ranking/scoring in both indexes, where the final score is a
linear combination of the @attr 1=4 index with weight 30 and the
@attr 1=1010 index with weight 20 (the standard default weight is
sqrt(1000) ~ 34 , as the Z39.50 standard prescribes that the top score
is 1000 and the bottom score is 0, encoded in tintegers)
- finally, the final hit list is re-ordered according to scores.
Sounds complicated ?? It is, due to the fact that usual IR ranking only
works on one single full-text index, whereas Zebra is a boolean search
engine working on multiple, often different inverted indexes.
So, the crucial questions yet not answered is:
A) how do we invoke ranking in Zebra ?
B) how does the TD/IDF like algorithm looks like for each atomic query
hit list ?
C) how is the linear combination of atomic scores done to give a total
score ?
Unfortunately, the on-line documentation of the stable zebra release 2.0
does not offer much help:
http://www.indexdata.dk/zebra/doc/
To answer A), B) and C), we have to look at the implementation. Go, and
get the newest CVS tarball at
http://ftp.indexdata.dk/pub/snapshot/idzebra-2.0.10.tar.gz
and have a look at the documentation there, in the doc dir.
file://zebra/doc/administration-ranking.html
A)
This explains how to invoke and configure ranking
B) + C)
These details are found in
zebra/index/rank1.c
in the 'static int calc()' function. We ought to document this better,
but we never made it.
static int calc (void *set_handle, zint sysno, zint staticrank,
int *stop_flag)
{
int i, lo, divisor, score = 0;
struct rank_set_info *si = (struct rank_set_info *) set_handle;
if (!si->no_rank_entries)
return -1; /* ranking not enabled for any terms */
for (i = 0; i < si->no_entries; i++)
{
yaz_log(log_level, "calc: i=%d rank_flag=%d lo=%d",
i, si->entries[i].rank_flag, si->entries[i].local_occur);
if (si->entries[i].rank_flag && (lo = si->entries[i].local_occur))
score += (8+log2_int (lo)) * si->entries[i].global_inv *
si->entries[i].rank_weight;
}
divisor = si->no_rank_entries * (8+log2_int
(si->last_pos/si->no_entries));
score = score / divisor;
yaz_log(log_level, "calc sysno=" ZINT_FORMAT " score=%d", sysno,
score);
if (score > 1000)
score = 1000;
/* reset the counts for the next term */
for (i = 0; i < si->no_entries; i++)
si->entries[i].local_occur = 0;
return score;
}
where lo = si->entries[i].local_occur is the local documents
term-within-index frequency, si->entries[i].global_inv represents the
IDF part (computed in static void *begin()), and
si->entries[i].rank_weight is the weight assigner per index (default 34,
or set in the @attr 9=xyz magic)
Finally, the IDF part is computed as:
static void *begin (struct zebra_register *reg,
void *class_handle, RSET rset, NMEM nmem,
TERMID *terms, int numterms)
{
struct rank_set_info *si =
(struct rank_set_info *) nmem_malloc (nmem,sizeof(*si));
int i;
yaz_log(log_level, "rank-1 begin");
si->no_entries = numterms;
si->no_rank_entries = 0;
si->nmem=nmem;
si->entries = (struct rank_term_info *)
nmem_malloc (si->nmem, sizeof(*si->entries)*numterms);
for (i = 0; i < numterms; i++)
{
zint g = rset_count(terms[i]->rset);
yaz_log(log_level, "i=%d flags=%s '%s'", i,
terms[i]->flags, terms[i]->name );
if (!strncmp (terms[i]->flags, "rank,", 5))
{
const char *cp = strstr(terms[i]->flags+4, ",w=");
si->entries[i].rank_flag = 1;
if (cp)
si->entries[i].rank_weight = atoi (cp+3);
else
si->entries[i].rank_weight = 34; /* sqrroot of 1000 */
yaz_log(log_level, " i=%d weight=%d g="ZINT_FORMAT, i,
si->entries[i].rank_weight, g);
(si->no_rank_entries)++;
}
else
si->entries[i].rank_flag = 0;
si->entries[i].local_occur = 0; /* FIXME */
si->entries[i].global_occur = g;
si->entries[i].global_inv = 32 - log2_int (g);
yaz_log(log_level, " global_inv = %d g = " ZINT_FORMAT,
(int) (32-log2_int (g)), g);
si->entries[i].term = terms[i];
si->entries[i].term_index=i;
terms[i]->rankpriv = &(si->entries[i]);
}
return si;
}
where g = rset_count(terms[i]->rset) is the count of all documents in
this specific index hit list, and the IDF part then is
si->entries[i].global_inv = 32 - log2_int (g);
I hope this answers your question well enough.
Please feel free to inspect the tarball and ask further questions.
Your's Marc Cromme, Index Data
>
> _______________________________________________
> Zebralist mailing list
> Zebralist at lists.indexdata.dk
> http://lists.indexdata.dk/cgi-bin/mailman/listinfo/zebralist
>
--
Marc Cromme
M.Sc and Ph.D in Mathematical Modelling and Computation
Senior Developer, Project Manager
Index Data Aps
Købmagergade 43, 2
1150 Copenhagen K.
Denmark
tel: +45 3341 0100
fax: +45 3341 0101
http://www.indexdata.com
INDEX DATA Means Business
for Open Source and Open Standards
More information about the Zebralist
mailing list