[Yazlist] yaz-marcdump option to convert MARC-8 to "combined UTF-8"

Larry E. Dixson ldix at loc.gov
Fri Dec 14 20:38:35 CET 2007


Tim,
I am not going to be much help to you.  We use yaz-marcdump to
convert _MARC_ records to and from UTF-8 and MARC-8.  We happen
to prefer the decomposed characters for our products and projects.

I think your real question is -- "how can one convert UTF-8
characters in MARC records from UTF-8 decomposed to UTF-8
precomposed?" -- Is that correct?

I have a colleague who has a tool to do just that for a Web
project.  He is away from the office today, but I will find
out more about that and let you know next week.
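
In the meantime, here is a minimal sketch of the decomposed-to-precomposed
step itself, assuming Python and its standard unicodedata module. It is not
my colleague's tool, and the function name to_precomposed is made up for
illustration. It applies Unicode Normalization Form C (NFC) to a string.
Note that an ISO2709 record stores field lengths in bytes, so the step
belongs on field values (for example in the MARCXML form), not on the raw
.mrc file.

```python
import unicodedata


def to_precomposed(text: str) -> str:
    """Return text in Unicode Normalization Form C (precomposed where possible).

    Hypothetical helper name; unicodedata.normalize is the standard-library call.
    """
    return unicodedata.normalize("NFC", text)


# "e" followed by U+0301 COMBINING ACUTE ACCENT: the decomposed form.
decomposed = "Cafe\u0301"
precomposed = to_precomposed(decomposed)

# The precomposed string is one code point shorter than the decomposed one.
print(len(decomposed), len(precomposed))
```

Characters that have no precomposed form in Unicode stay as base letter
plus combining mark, so the result can still contain combining characters.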

Have a good weekend.
Larry

--------------------------------------------------------------
Source File = 3 UTF-8 MARC records
--------------------------------------------------------------
    3,452 tscottu8.mrc   [3 UTF-8 records]
 
yaz-marcdump -f UTF-8 -t MARC-8 -o marc -l 9=32 tscottu8.mrc >tscottm8.mrc

    3,501 tscottm8.mrc   [resulting MARC-8 file]
    3,452 tscottu8.mrc

yaz-marcdump -f UTF-8 -t UTF-8 -o marcxml tscottu8.mrc >tscott.xml

    9,981 tscott.xml     [resulting MARCXML file]
    3,501 tscottm8.mrc
    3,452 tscottu8.mrc
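
A note on the -l option used above: the number after "9=" is the decimal
character code written to Leader/09 (the character coding scheme). The
short Python sketch below only illustrates that mapping. The names
leader09_value and SCHEMES are made up for this example and are not part
of yaz-marcdump. In MARC 21, a blank in Leader/09 means MARC-8 and "a"
means UCS/Unicode.

```python
def leader09_value(code: int) -> str:
    """Translate the decimal value given as -l 9=<code> into the leader character."""
    return chr(code)


# MARC 21 Leader/09 values: blank = MARC-8, "a" = UCS/Unicode.
SCHEMES = {" ": "MARC-8", "a": "UCS/Unicode"}

for code in (32, 97):
    ch = leader09_value(code)
    print(code, repr(ch), SCHEMES[ch])
```

So 9=32 fits a conversion to MARC-8, and 9=97 fits a conversion to UTF-8.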


On Fri, 14 Dec 2007, Tim Scott wrote:

> Thank you for that Larry. I can now construct the command line to
> theoretically convert the MARCXML back to ISO2709, but when I tried with
> my 3 record file, it failed.
> 
> My commands are:-
> 
>   yaz-marcdump -f MARC-8 -t UTF-8 -o marcxml -l 9=97 !src! >>!xml! 2>!rpt!
>   yaz-marcdump -v -i marcxml -o marc !xml! >!dst! 2>>!rpt!
> 
> Looking at the "XML", it is not well-formed XML because it has no
> encapsulating tag; instead it repeats the 'record' tag. Looking at the
> Schema*, it would appear that the file should be enclosed in
> <collection>..</collection>.
> 
> So, I added commands to encapsulate the file accordingly:-
>   echo ^<collection^>>!xml!
>   yaz-marcdump -f MARC-8 -t UTF-8 -o marcxml -l 9=97 !src! >>!xml! 2>!rpt!
>   echo ^</collection^>>!xml!
> 
> but all I get from yaz-marcdump is:
> 	yaz_marc_read_xml failed
> 
> It would appear that yaz-marcdump does not write or read the
> <collection> tag but without it I get, quite rightly:-
> 	<filename>:<line>: parser error : Extra content at the end of the document
> 	<record xmlns="http://www.loc.gov/MARC21/slim">
> 
> Can anyone offer advice ?
> 
> Thanks very much in advance and Happy Holidays to everybody.
> 
> Regards,
> Tim
> 
> * http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd
> 
> On Thu, 13 Dec 2007, Larry Dixson wrote:
> 
> Tim,
> I have a meeting in a few minutes, but attached is a more recent version
> of the yaz-marcdump man page.  This will help to answer your question
> about option -l (in the case you cited, changing Leader/09 to an "a"
> (decimal 97)).  You will also see the possible values
> for option -o.
> 
> Hope that's somewhat helpful.
> Larry
> 
> On Thu, 13 Dec 2007, Tim Scott wrote:
> 
> > Hi,
> >  
> > I'm wondering if there's a way to use yaz-marcdump to produce from a
> > MARC-8 ISO2709 file a UTF-8 encoded MARC21 file without the diacritics
> > simply becoming combining characters?
> >  
> > As I wrote this, I thought maybe by using an intermediate XML file and
> > then some other post-processor, and then reproducing the ISO2709
> > again.
> >  
> > Off I hunted and found 'charlint.pl' and 'UnicodeData.txt'
> >     http://dev.w3.org/cvsweb/charlint/
> >     ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
> >         respectively...
> >  
> > .. but found that charlint.pl complained about the UnicodeData.txt
> > file:-
> > [snip]
> > Reading data file, line 9000
> > Reading data file, line 10000
> > Problem with data file consistency, line 10478:
> >         9FBB;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;.
> > ...
> >  
> > A quick search on Google was particularly fruitless, but I'm off to 
> > try harder tomorrow.
> >  
> > I then wondered how I get the ISO2709 back again from the XML result, 
> > so I tried converting the marcxml that yaz-marcdump produced, eg:
> >     yaz-marcdump -f MARC-8 -t UTF-8 -o marcxml -l 9=97 iso2709file >xmlfile
> >         ... xmlfile looks OK
> >     yaz-marcdump -o marc -l 9=97 xmlfile >output
> >  
> > I've no idea what the "-l 9=97" does, I got this command from another 
> > forum.
> >  
> > The best manual page I could find for yaz-marcdump was on a French
> > site at:
> >     http://pwet.fr/man/linux/commandes/yaz_marcdump
> >  
> > .. and it doesn't appear to give me the answer.
> >  
> > Is there a man page or something that would give me the options for 
> > yaz-marcdump to achieve either the whole thing or just the last XML 2
> > ISO2709 part ?
> >  
> > <OT> Has anyone got a working UnicodeData.txt [link] ? </OT>
> >  
> > Thanks,
> > Tim
> >  
> > cc: Data Exchange file
> 
> 
> ------------------------------------------------------------
> Larry E. Dixson                    Internet:    ldix at loc.gov
> Network Development and MARC
>    Standards Office, LA327
> Library of Congress                Telephone: (202) 707-5807
> Washington, D.C.  20540-4402       Fax:       (202) 707-0115
> 
> _______________________________________________
> Yazlist mailing list
> Yazlist at lists.indexdata.dk
> http://lists.indexdata.dk/cgi-bin/mailman/listinfo/yazlist
> 



