[Yazlist] Error in latest SICONV.C

Adam Dickmeiss adam at indexdata.dk
Thu Mar 22 09:04:54 CET 2007


Gary Anderson wrote:
> Adam,
> 
> I've run through the latest siconv.c from your cvs repository.  It now 
> has additional problems with converting UTF8 to MARC8.  I ran a 
> simplified test.  The input string was (hex bytes, ignore spaces):
> 
> 61 E8 87 BA  E7 81 A3  E5 BE 97
> 
> The output string was:
> 
> 61 1B 24 31 21 54 2B 21 49 43

That might very well be with your test code, which I unfortunately don't 
know anything about.

But I think I can guess what goes on: you did not flush the output 
buffers with yaz_iconv(cd, 0,0, ..) .

Why is _separate_ flushing a good idea? What can't we call iconv once?

Suppose we did not have it (implicit flushing for every call of iconv) 
and we want to convert a file. We read chunks from disk (say 2K at a 
time). We call iconv. iconv flushes every time. But when it flushes it 
assumes that the input is complete. Sometimes it ain't. There will be 
input sequences "crossing" block boundaries. And so we'll end up with 
errors. So we just read the whole in memory? No? Can mmap do it?

The opposite is much more appealing . We read chunks of almost size we 
want. We don't care about the input.  We don't *know* the input. So we 
call iconv for each chunk . It's a streaming process. At EOF we flush.

Of course, if it's a small and *complete* buffer in memory it might be 
ackward.. We'd have two calls instead of one. But I am sure you can make 
a wrapper yaz_iconv_all(cd, a, b, c, d); which calls yaz_iconv(cd, a, b, 
c, d) +     yaz_iconv(cd, 0, 0, c, d);
This is what your patch effectively does , I think.

/ Adam

> Notice that only the first two hangul triples were converted and 
> output.  The ending ESC string was also not output.  This happened 
> because the normal flow of the logic does NOT call your flush function.  
> In fact in siconv.c version $Id: siconv.c,v 1.37 2007/03/20 21:37:32 
> adam Exp $, yaz_iconv exits at the point shown below.
> 
>    while (1)
>    {
>        unsigned long x;
>        size_t no_read;
> 
>        if (cd->unget_x)
>        {
>            x = cd->unget_x;
>            no_read = cd->no_read_x;
>        }
>        else
>        {
>            if (*inbytesleft == 0)
>            {
>                r = *inbuf - inbuf0;
>                break;    <-------------------This is the exit point.
>            }
> 
> I think what you really meant to do was this:
>    while (1)
>    {
>        unsigned long x;
>        size_t no_read;
> 
>        if (cd->unget_x)
>        {
>            x = cd->unget_x;
>            no_read = cd->no_read_x;
>        }
>        else
>        {
>            if (*inbytesleft == 0)
>            {
>               if (cd->flush_handle)     // Proposed fix to correctly 
> clear the buffers
>                   r = (*cd->flush_handle)(cd, outbuf, outbytesleft);  // 
> Proposed fix to correctly clear the buffers.
>                r = *inbuf - inbuf0;
>                break;    <-------------------This is the exit point.
>            }
> 
> Anyway, I tried the code with this fix, and it works properly.  Is this 
> the correct fix, or should this problem be fixed another way?
> 
> Gary
> 
> _______________________________________________
> Yazlist mailing list
> Yazlist at lists.indexdata.dk
> http://lists.indexdata.dk/cgi-bin/mailman/listinfo/yazlist




More information about the Yazlist mailing list