[Yazlist] Fields ending in combining diacritics

Gary Anderson ganderson at bslw.com
Mon Mar 12 18:15:41 CET 2007


Adam,
The specific case I am dealing with is a 670 field containing only 1 
subfield a that ends with text like:  houses built above 50<degree 
mark><field mark>.  Obviously, the providing library has used the wrong 
character for the degree mark.  They used 0xea which is a combining 
Angstrom diacritic when they should have used 0xc0 - the degree mark.  
The initial records are in MARC8 encoding.  When I run the translation 
for this, I end up with no errors, but the diacritic character now 
follows the field mark.  What I am interested in is a way for the siconv 
library to catch this situation, since applying a diacritic to a control 
character  should not be allowed behavior. 
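
Until the library can report this, one option is for the caller to scan the MARC-8 bytes before conversion. What follows is only a minimal sketch in C. It assumes the default ANSEL set is designated as G1 (so bytes 0xE0-0xFE are combining characters) and that the field contains no escape sequences that redesignate it. The names find_dangling_diacritic and is_marc8_combining are hypothetical helpers, not part of YAZ.

```c
#include <assert.h>
#include <stddef.h>

/* MARC structural delimiters. */
#define MARC_RECORD_TERMINATOR 0x1D
#define MARC_FIELD_TERMINATOR  0x1E
#define MARC_SUBFIELD_DELIM    0x1F

/* Assumption: default ANSEL G1 set, where 0xE0..0xFE are combining marks. */
static int is_marc8_combining(unsigned char c)
{
    return c >= 0xE0 && c <= 0xFE;
}

static int is_marc_delimiter(unsigned char c)
{
    return c == MARC_RECORD_TERMINATOR ||
           c == MARC_FIELD_TERMINATOR ||
           c == MARC_SUBFIELD_DELIM;
}

/* Return the offset of the first combining byte that has no base character
 * after it (it is the last byte, or a MARC delimiter follows it).
 * Return -1 if no such byte is found. */
static long find_dangling_diacritic(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (!is_marc8_combining(buf[i]))
            continue;
        if (i + 1 == len || is_marc_delimiter(buf[i + 1]))
            return (long)i;
    }
    return -1;
}
```

Called on the raw field bytes (with or without the trailing field mark), a non-negative result marks a field that needs manual review before it is sent to the converter.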

In this vein, maybe you can enlighten me on a question - I am running a 
tag by tag conversion on records that I process.  Originally, I was not 
including the field mark character in the string sent to siconv.  I 
found, however, that there were some cases where the conversion state 
was left indeterminate, so I began to include the field mark in the 
input string.  That seems to have fixed all of the other problems except 
this one.  What is your recommended practice for converting records?  
Should I be including the field mark or not?

Gary
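
The per-field flush suggested in the quoted reply below could be applied roughly as follows. This is a sketch written against POSIX iconv(), on the assumption that yaz_iconv() takes the same arguments (the quoted call suggests it does). convert_field is a hypothetical name, not a YAZ function.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert one field, then reset the descriptor to its
 * initial state so that nothing carries over into the next field.
 * Returns the number of bytes written to out, or -1 with errno set by
 * iconv() (EINVAL: input ended mid-sequence, EILSEQ: invalid input,
 * E2BIG: output buffer too small). */
static long convert_field(iconv_t cd, const char *in, size_t in_len,
                          char *out, size_t out_cap)
{
    char *inp = (char *)in;      /* iconv() takes char **, input is not modified */
    size_t in_left = in_len;
    char *outp = out;
    size_t out_left = out_cap;

    assert(in != NULL && out != NULL);

    if (iconv(cd, &inp, &in_left, &outp, &out_left) == (size_t)-1) {
        int saved = errno;
        iconv(cd, NULL, NULL, NULL, NULL);   /* discard the partial state */
        errno = saved;
        return -1;
    }

    /* Flush: inbuf NULL with outbuf non-NULL writes any pending reset
     * sequence and returns the descriptor to its initial state. */
    if (iconv(cd, NULL, NULL, &outp, &out_left) == (size_t)-1)
        return -1;

    return (long)(out_cap - out_left);
}
```

With the reset done after every field, the conversion state should not leak from one field into the next, whether or not the field mark is part of the input string.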

Adam Dickmeiss wrote:

> Adam Dickmeiss wrote:
>
>> Gary Anderson wrote:
>>
>>> I am not sure how this will help.  In the application, the last 2 
>>> bytes of the data string are 0xEA and 0x1E - the diacritic and the 
>>> field mark.  yaz_iconv seems to drop the diacritic because it 
>>> doesn't have a trailing character, but it does process the field 
>>> mark.  What I need is something that will tell me that this case has 
>>> occurred.  It looks to me like yaz just drops the diacritic.
>>
>> I don't see a way the iconv interface could tell you this. I'm still 
>> a little confused, so forgive me for asking: what is the behavior 
>> you want? (Keep the diacritic?)
>>
>
> In case you want *not* to keep the diacritic, in other words you are 
> asking to be notified about an error .. then maybe it's best to use 
> EINVAL because the iconv man page says:
>
> "EINVAL An  incomplete  multibyte  sequence  has been encountered in 
> the input."
>
> Case 1:
> So if you pass
>    .. 0xEA
> you get EINVAL because no characters follow 0xEA (as far as iconv is 
> concerned).
>
> Case 2: If you pass
>
>    .. 0xEA 0x1E
> that would not return an error. In fact YAZ currently converts this 
> to the UTF-8 sequence:
>       0x1E 0xCC 0x8A
> because 0x1E is just a "character".
>
> Unfortunately for case 1, YAZ currently returns 'unknown error'. 
> That's no good. This has been fixed in the CVS version of YAZ.
>
> / Adam
>
>
>> / Adam
>>
>>>
>>> My checking indicates that on completion of conversion of the field 
>>> mark, the yaz_iconv library is left in its 'initial state'.  The 
>>> next string converts just fine.
>>> Gary
>>>
>>> Adam Dickmeiss wrote:
>>>
>>>> Gary Anderson wrote:
>>>>
>>>>> I am using the siconv interface.  I have a programmatic process 
>>>>> that deals with very large files of records.
>>>>>
>>>>> Adam Dickmeiss wrote:
>>>>>
>>>>>> Gary Anderson wrote:
>>>>>>
>>>>>>> I recently ran some tests using records from the National 
>>>>>>> Library of Canada.  Of the 600,000+ records in their name and 
>>>>>>> subject authority file, six records had 670 tags where the 
>>>>>>> subfield a data ended in a combining diacritic character with no 
>>>>>>> following character.
>>>>>>>
>>>>>>> Submitting that data string 
>>>>>>> (indicators+subfieldmark+subfieldcode+data+fieldmark) to 
>>>>>>> siconvert resulted in an output string that did not contain the 
>>>>>>> diacritic character.  It was dropped.  The field mark character 
>>>>>>> was retained.  Can you suggest a means for notifying the caller 
>>>>>>> when this condition occurs?  Byte counts don't really work 
>>>>>>> because UTF8 is one side or the other of the conversion 
>>>>>>> transaction.
>>>>>>>
>>>>>>> The ending diacritic values were:  0xE2, 0xE5, 0xE8, 0xEA, and 
>>>>>>> 0xF6.
>>>>>>
>>>>>>
>>>>
>>>> I think what you need to do is "flush", i.e. reset to the "initial 
>>>> state". The flush would take place after a field or subfield ends.
>>>>
>>>> That's done by iconv and, hopefully, yaz_iconv by setting inbuf or 
>>>> *inbuf to NULL, but outbuf to non-NULL, i.e.
>>>>
>>>> yaz_iconv(cd, 0, 0, &outbuf, &outbytesleft);
>>>>
>>>> From 'man 3 iconv':
>>>> "
>>>> A different case is when inbuf is NULL or *inbuf is NULL, but outbuf
>>>> is not NULL and *outbuf is not NULL. In this case, the iconv()
>>>> function attempts to set cd's conversion state to the initial state
>>>> and store a corresponding shift sequence at *outbuf. At most
>>>> *outbytesleft bytes, starting at *outbuf, will be written. If the
>>>> output buffer has no more room for this reset sequence, it sets errno
>>>> to E2BIG and returns (size_t)(-1). Otherwise it increments *outbuf
>>>> and decrements *outbytesleft by the number of bytes written.
>>>> "
>>>>
>>>> Use YAZ 2.1.48 or later for this to work.
>>>>
>>>> / Adam
>>>>
>>>>>>>
>>>>>> Did you use yaz-marcdump for the conversion?
>>>>>>
>>>>>> Or did you do something else (such as programming against the 
>>>>>> siconv interface)?
>>>>>>
>>>>>> / Adam
>>>>>>
>>>>>>> Thanks
>>>>>>> Gary
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Yazlist mailing list
>>>>>>> Yazlist at lists.indexdata.dk
>>>>>>> http://lists.indexdata.dk/cgi-bin/mailman/listinfo/yazlist
