[Zebralist] indexing unicode in zebra

Paul POULAIN paul.poulain at free.fr
Mon Feb 25 14:33:37 CET 2008

Paul POULAIN wrote:
> Hello,
> I have a database containing MARC data in UTF-8, in several 
> scripts/languages: Greek, Chinese, Japanese, Hindi... (virtually all 
> Unicode characters, in fact)
OK, I simply tried replacing the charmap with an icu_chain, using:
<icu_chain locale="en">
   <transform rule="[:Control:] Any-Remove"/>
   <tokenize rule="l"/>
   <transform rule="[[:WhiteSpace:][:Punctuation:]] Remove"/>
   <casemap rule="l"/>
</icu_chain>

and it SEEMS to work better, at first glance. I can't be sure it's 100% 
OK, though, as it's only a 400-record test DB and I don't speak Chinese 
or Japanese, so I can't check that the results are meaningful.

The problem is that I really don't understand why it works or how to 
tune it...

Thanks for any enlightenment.
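
For what it's worth, here is how I currently read the four steps of the chain, as a rough Python sketch. It uses only the standard library, not ICU, so it is an approximation under assumptions: the function names are mine (hypothetical), control characters are taken to mean Unicode category Cc, and the tokenize step is replaced by a plain whitespace split, which is certainly cruder than what ICU does for Chinese or Japanese.

```python
import unicodedata


def remove_controls(text):
    # Rough equivalent of <transform rule="[:Control:] Any-Remove"/>:
    # drop every character whose Unicode general category is Cc.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc")


def tokenize(text):
    # Crude stand-in for <tokenize rule="l"/>. ICU uses its own boundary
    # rules; this sketch ASSUMES a plain whitespace split instead.
    return text.split()


def strip_space_and_punct(token):
    # Rough equivalent of
    # <transform rule="[[:WhiteSpace:][:Punctuation:]] Remove"/>:
    # drop whitespace and any character in a P* (punctuation) category.
    return "".join(
        ch
        for ch in token
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def lowercase(token):
    # Rough equivalent of <casemap rule="l"/>.
    return token.lower()


def index_terms(text):
    # Apply the four steps in the same order as the icu_chain above.
    terms = []
    for token in tokenize(remove_controls(text)):
        term = lowercase(strip_space_and_punct(token))
        if term:  # skip tokens that were only punctuation
            terms.append(term)
    return terms


if __name__ == "__main__":
    print(index_terms("Hello, World!"))
```

If this reading is right, tuning would mean changing one of the four rules (for instance the tokenize rule) rather than the order of the steps, but that is a guess on my side.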

Free software expert for information and documentation
Tel : 04 91 31 45 19
