[Yazlist] Fields ending in combining diacritics

Adam Dickmeiss adam at indexdata.dk
Fri Mar 9 09:40:50 CET 2007


Adam Dickmeiss wrote:
> Gary Anderson wrote:
>> I am not sure how this will help.  In the application, the last 2 
>> bytes of the data string are 0xea and 0x1e - the diacritic and the 
>> record mark.  yaz_iconv seems to drop the diacritic because it doesn't 
>> have a trailing character, but it does process the record mark.  What 
>> I need is something that will tell me that this case has occurred.  It 
>> looks to me like yaz just drops the diacritic.
> I don't see a way the iconv interface could tell you this. I'm still a 
> little confused, so forgive me for asking: what is the behavior you 
> want? (Keep the diacritic?)
> 

If you do *not* want to keep the diacritic, in other words you want to 
be notified of an error, then EINVAL is probably the best fit, because 
the iconv man page says:

"EINVAL An  incomplete  multibyte  sequence  has been encountered in the 
input."

Case 1: If you pass

    .. 0xEA
you get EINVAL because no characters follow 0xEA (as far as iconv is 
concerned).

Case 2: If you pass

    .. 0xEA 0x1E
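that does not return an error. In fact YAZ currently converts this to 
the UTF-8 sequence:
       0x1E 0xCC 0x8A
because 0x1E is just a "character". The diacritic (0xCC 0x8A, U+030A) 
is attached to it and, as Unicode requires, placed after it.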
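
Since case 2 is not an error as far as the converter is concerned, a 
caller who wants to know about it would probably have to check the 
input before converting. A minimal sketch, assuming the data is in the 
default MARC-8 (ANSEL) set, where combining diacritics occupy 
0xE0-0xFE; the function name is made up for illustration and is not 
part of YAZ:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a YAZ function.
 * Returns 1 if the last data byte of a MARC-8 field is a combining
 * diacritic, i.e. a diacritic with no base character after it.
 * A trailing field terminator (0x1E) is skipped first.
 * Assumption: default ANSEL set, combining range 0xE0..0xFE. Data
 * that has switched to another character set via escape sequences
 * is not handled here. */
int ends_in_combining(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    if (len > 0 && buf[len - 1] == 0x1E)
        len--;                      /* ignore the field terminator */
    if (len == 0)
        return 0;                   /* nothing left to inspect */
    return buf[len - 1] >= 0xE0 && buf[len - 1] <= 0xFE;
}
```

Calling this on the string before it goes to the converter would flag 
all five of the ending values reported (0xE2, 0xE5, 0xE8, 0xEA, 0xF6).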

Unfortunately for case 1, YAZ currently returns 'unknown error'. That's 
no good. This has been fixed in the CVS version of YAZ.
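
For case 1, the caller-side check could look roughly like the sketch 
below. It uses the system iconv rather than yaz_iconv, and a truncated 
UTF-8 sequence as a stand-in for the dangling MARC-8 diacritic, since 
system iconv implementations generally do not know MARC-8. That 
yaz_iconv reports errors the same way is an assumption here; the 
function name and return convention are made up for illustration:

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert len bytes of UTF-8 to UTF-16LE and
 * then reset the descriptor to its initial state.
 * Returns 0 on success, -1 if the converter could not be opened,
 * otherwise the errno value set by iconv (EINVAL for an incomplete
 * sequence at the end of the input). */
int convert_chunk(const char *in, size_t len, char *out, size_t outsize)
{
    assert(in != NULL && out != NULL);

    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    char *inp = (char *)in;         /* iconv takes a non-const pointer */
    size_t inleft = len;
    char *outp = out;
    size_t outleft = outsize;
    int err = 0;

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        err = errno;                /* EINVAL, EILSEQ or E2BIG */
    else if (iconv(cd, NULL, NULL, &outp, &outleft) == (size_t)-1)
        err = errno;                /* flush failed, e.g. E2BIG */

    iconv_close(cd);
    return err;
}
```

Passing "a" followed by a lone 0xC3 should give EINVAL, while the 
complete sequence 0xC3 0xA9 should convert without error.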

/ Adam


> / Adam
> 
>>
>> My checking indicates that on completion of conversion of the record 
>> mark, the yaz_iconv library is left in its 'initial state'.  The next 
>> string converts just fine.
>> Gary
>>
>> Adam Dickmeiss wrote:
>>
>>> Gary Anderson wrote:
>>>
>>>> I am using the siconv interface.  I have a programmatic process that 
>>>> deals with very large files of records.
>>>>
>>>> Adam Dickmeiss wrote:
>>>>
>>>>> Gary Anderson wrote:
>>>>>
>>>>>> I recently ran some tests using records from the National Library 
>>>>>> of Canada.  Of the 600,000+ records in their name and subject 
>>>>>> authority file, six records had 670 tags where the subfield a data 
>>>>>> ended in a combining diacritic character with no following character.
>>>>>>
>>>>>> Submitting that data string 
>>>>>> (indicators+subfieldmark+subfieldcode+data+fieldmark) to siconvert 
>>>>>> resulted in an output string that did not contain the diacritic 
>>>>>> character.  It was dropped.  The field mark character was 
>>>>>> retained.  Can you suggest a means for notifying the caller when 
>>>>>> this condition occurs?  Byte counts don't really work because UTF8 
>>>>>> is one side or the other of the conversion transaction.
>>>>>>
>>>>>> The ending diacritic values were:  0xE2, 0xE5, 0xE8, 0xEA, and 0xF6.
>>>>>
>>>
>>> I think what you need to do is to "flush", i.e. reset to the 
>>> "initial state". 
>>> The flush would take place after a field or subfield ends.
>>>
>>> That's done by iconv and, hopefully, yaz_iconv by setting inbuf or 
>>> *inbuf to NULL, but outbuf to non-NULL, i.e.
>>>
>>> yaz_iconv(cd, 0, 0, &outbuf, &outbytesleft);
>>>
>>> From 'man 3 iconv':
>>> "
>>> A different case is when inbuf is NULL or *inbuf is NULL, but outbuf is
>>> not NULL and *outbuf is not NULL. In this case,  the  iconv()  function
>>> attempts  to set cd's conversion state to the initial state and store a
>>> corresponding shift sequence at *outbuf.  At most *outbytesleft  bytes,
>>> starting at *outbuf, will be written.  If the output buffer has no more
>>> room for this reset sequence,  it  sets  errno  to  E2BIG  and  returns
>>> (size_t)(-1).  Otherwise  it  increments  *outbuf  and decrements
>>> *outbytesleft by the number of bytes written.
>>> "
>>>
>>> Use YAZ 2.1.48 or later for this to work.
>>>
>>> / Adam
>>>
>>>>>>
>>>>> Did you use yaz-marcdump for the conversion?
>>>>>
>>>>> Or did you do something else (such as programming against the 
>>>>> siconv interface)?
>>>>>
>>>>> / Adam
>>>>>
>>>>>> Thanks
>>>>>> Gary
>>>>>>
>>>>>> _______________________________________________
>>>>>> Yazlist mailing list
>>>>>> Yazlist at lists.indexdata.dk
>>>>>> http://lists.indexdata.dk/cgi-bin/mailman/listinfo/yazlist
