[Oclist] Character set issues

Giannis Kosmas kosmas at lib.uoc.gr
Tue Mar 10 12:48:50 CET 2009


Hi everybody!

We make heavy use of most of the opencontent databases provided by 
Indexdata, as they are very useful in a metasearching environment. 
There seem to be some character set issues though, at least with the 
databases we have tried so far, namely dmoz, wikipedia and gutenberg.

More specifically, there are problems when someone searches with Greek 
text. Not all the records that come back from the server match the 
search criteria. For example, when I search gutenberg for the term 
"Αλκηστις" I expect to get records matching that Greek word, but I also 
get records with Russian text that contain no Greek text at all. I tried 
the opposite as well, searching for "История", the Russian word for 
"history", and got back records with Greek text too. So I believe this 
happens with queries expressed in any script that lies outside Latin-1. 
All of my search queries were encoded in UTF-8.
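
For anyone who wants to reproduce this, here is a minimal Python sketch 
that confirms both search terms lie outside Latin-1 and prints the UTF-8 
bytes a client would send. The helper name fits_latin1 is made up for 
illustration; the sketch only inspects the query strings and says nothing 
about how the server actually handles them.

```python
# The two search terms quoted in the message.
terms = ["Αλκηστις", "История"]


def fits_latin1(text):
    """Return True if every character of text can be encoded in ISO-8859-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


for term in terms:
    # Space-separated hex of the UTF-8 encoding (bytes.hex with a separator
    # needs Python 3.8 or later).
    utf8_hex = term.encode("utf-8").hex(" ")
    print(term, "| fits Latin-1:", fits_latin1(term), "| UTF-8:", utf8_hex)
```

Both terms should report that they do not fit in Latin-1, which is 
consistent with the guess that the problem is tied to scripts outside 
that range.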

Another problem appears when a record is requested in ISO 2709 with the 
MARC-8 character set: none of the accented Greek letters are shown. The 
same records are displayed correctly when requested as MARCXML. I hope 
this helps.
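
To pin down which characters go missing, it may help to list the accented 
letters in a field as returned in MARCXML and then look for them in the 
MARC-8 output. The sketch below is only an assumption about how one might 
do that check: the function name accented_letters and the sample string 
are hypothetical and not taken from any record.

```python
import unicodedata


def accented_letters(text):
    """Return the characters that decompose into a base letter plus combining marks."""
    found = []
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        # A precomposed accented letter becomes a base letter followed by
        # one or more combining marks under canonical decomposition.
        if len(decomposed) > 1 and any(
            unicodedata.combining(c) for c in decomposed[1:]
        ):
            found.append(ch)
    return found


# Hypothetical sample value; a real check would use text from a MARCXML record.
sample = "Άλκηστις"
print(accented_letters(sample))
```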

Giannis



