[Yazlist] yaz-marcdump option to convert MARC-8 to "combined UTF-8"
Larry E. Dixson
ldix at loc.gov
Fri Dec 14 20:38:35 CET 2007
Tim,
I am not going to be much help to you. We use yaz-marcdump to
convert _MARC_ records to and from UTF-8 and MARC-8. We happen
to prefer the decomposed characters for our products and projects.
I think your real question is -- "how can one convert UTF-8
characters in MARC records from UTF-8 decomposed to UTF-8
precomposed?" -- Is that correct?
I have a colleague who has a tool to do just that for a Web
project. He is away from the office today, but I will find
out more about that and let you know next week.
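For reference, here is a minimal sketch of the decomposed-to-precomposed conversion described above. It assumes Python's standard `unicodedata` module and Unicode Normalization Form C (NFC); it is not a yaz-marcdump feature and not the colleague's tool. The `precompose` helper and the sample string are hypothetical, for illustration only.

```python
import unicodedata

def precompose(text: str) -> str:
    """Return text in Normalization Form C (precomposed where a composite exists)."""
    return unicodedata.normalize("NFC", text)

# Hypothetical sample: "r" + combining caron, "a" + combining acute.
decomposed = "Dvor\u030ca\u0301k"
composed = precompose(decomposed)

# Code point counts and UTF-8 byte lengths both change.
print(len(decomposed), len(composed))
print(len(decomposed.encode("utf-8")), len(composed.encode("utf-8")))
```

Because the UTF-8 byte length changes, this should be applied to field text (for example in the MARCXML form) and the ISO2709 record regenerated afterwards, not applied to the raw bytes of a .mrc file, whose leader and directory carry byte lengths and offsets.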
Have a good weekend.
Larry
--------------------------------------------------------------
Source File = 3 UTF-8 MARC records
--------------------------------------------------------------
3,452 tscottu8.mrc [3 UTF-8 records]
yaz-marcdump -f UTF-8 -t MARC-8 -o marc -l 9=32 tscottu8.mrc >tscottm8.mrc
3,501 tscottm8.mrc [resulting MARC-8 file]
3,452 tscottu8.mrc
yaz-marcdump -f UTF-8 -t UTF-8 -o marcxml tscottu8.mrc >tscott.xml
9,981 tscott.xml [resulting MARCXML file]
3,501 tscottm8.mrc
3,452 tscottu8.mrc
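
The `-l 9=32` and `-l 9=97` options in these commands set Leader/09 (character coding scheme) to the character with that decimal code: 32 is a blank (MARC-8) and 97 is "a" (UCS/Unicode). A small sketch of the same substitution in Python, where the `set_leader_byte` helper and the sample leader are hypothetical and only illustrate the idea:

```python
def set_leader_byte(leader: str, position: int, code: int) -> str:
    """Return a copy of a 24-character MARC leader with one position replaced."""
    if len(leader) != 24:
        raise ValueError("a MARC leader is exactly 24 characters")
    if not 0 <= position < 24:
        raise ValueError("leader positions run from 0 to 23")
    return leader[:position] + chr(code) + leader[position + 1:]

# Hypothetical leader of a MARC-8 record (Leader/09 is blank).
marc8_leader = "00714cam  2200205 a 4500"

# Equivalent in spirit to "-l 9=97": mark the record as Unicode.
utf8_leader = set_leader_byte(marc8_leader, 9, 97)

# Equivalent in spirit to "-l 9=32": mark the record as MARC-8 again.
restored = set_leader_byte(utf8_leader, 9, 32)

print(repr(utf8_leader))
print(repr(restored))
```

Changing Leader/09 only labels the record; the actual character conversion is what the -f and -t options request.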
On Fri, 14 Dec 2007, Tim Scott wrote:
> Thank you for that Larry. I can now construct the command line to
> theoretically convert the MARCXML back to ISO2709, but when I tried with
> my 3 record file, it failed.
>
> My commands are:-
>
> yaz-marcdump -f MARC-8 -t UTF-8 -o marcxml -l 9=97 !src! >>!xml! 2>!rpt!
> yaz-marcdump -v -i marcxml -o marc !xml! >!dst! 2>>!rpt!
>
> Looking at the "XML", it is not XML because it does not have an
> encapsulating tag, instead it repeats the 'record' tag. Looking at the
> Schema* it would appear that the file should be enclosed in
> <collection>..</collection>.
>
> So, I added commands to encapsulate the file accordingly:-
> echo ^<collection^>>!xml!
> yaz-marcdump -f MARC-8 -t UTF-8 -o marcxml -l 9=97 !src! >>!xml! 2>!rpt!
> echo ^</collection^>>!xml!
>
> but all I get from yaz-marcdump is:
> yaz_marc_read_xml failed
>
> It would appear that yaz-marcdump does not write or read the
> <collection> tag but without it I get, quite rightly:-
> <filename>:<line>: parser error : Extra content at the end of
> the document
> <record xmlns="http://www.loc.gov/MARC21/slim">
>
> Can anyone offer advice ?
>
> Thanks very much in advance and Happy Holidays to everybody.
>
> Regards,
> Tim
>
> * http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd
>
> On Thu, 13 Dec 2007, Larry Dixson wrote:
>
> Tim,
> I have a meeting in a few minutes, but attached is a more recent version
> of the yaz-marcdump man page. This will help to answer your question
> about option -l (in the case you cited, changing Leader/09 to an "a",
> decimal 97). You will also see the possible values for option -o.
>
> Hope that's somewhat helpful.
> Larry
>
> On Thu, 13 Dec 2007, Tim Scott wrote:
>
> > Hi,
> >
> > I'm wondering if there's a way to use yaz-marcdump to produce from a
> > MARC-8 ISO2709 file a UTF-8 encoded MARC21 file without the diacritics
> > simply becoming combining characters?
> >
> > As I wrote this, I thought maybe by using an intermediate XML file and
> > then some other post-processor, and then reproducing the ISO2709
> > again.
> >
> > Off I hunted and found 'charlint.pl' and 'UnicodeData.txt'
> > http://dev.w3.org/cvsweb/charlint/
> > ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt
> > respectively...
> >
> > .. but found that charlint.pl complained about the UnicodeData.txt
> > file:-
> > [snip]
> > Reading data file, line 9000
> > Reading data file, line 10000
> > Problem with data file consistency, line 10478:
> > 9FBB;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;.
> > ...
> >
> > A quick search on Google was particularly fruitless, but I'm off to
> > try harder tomorrow.
> >
> > I then wondered how I get the ISO2709 back again from the XML result,
> > so I tried converting the marcxml that yaz-marcdump produced, eg:
> > yaz-marcdump -f MARC-8 -t UTF-8 -o marcxml -l 9=97 iso2709file >xmlfile
> > ... xmlfile looks OK
> > yaz-marcdump -o marc -l 9=97 xmlfile >output
> >
> > I've no idea what the "-l 9=97" does, I got this command from another
> > forum.
> >
> > The best manual page I could find for yaz-marcdump was on a French
> > site at:
> > http://pwet.fr/man/linux/commandes/yaz_marcdump
> >
> > .. and it doesn't appear to give me the answer.
> >
> > Is there a man page or something that would give me the options for
> > yaz-marcdump to achieve either the whole thing or just the last XML 2
> > ISO2709 part ?
> >
> > <OT> Has anyone got a working UnicodeData.txt [link] ? </OT>
> >
> > Thanks,
> > Tim
> >
> > cc: Data Exchange file
>
>
> ------------------------------------------------------------
> Larry E. Dixson Internet: ldix at loc.gov
> Network Development and MARC
> Standards Office, LA327
> Library of Congress Telephone: (202) 707-5807
> Washington, D.C. 20540-4402 Fax: (202) 707-0115
>