[Net-z3950] Re: Diacritics display (was Net::Z3950 v0.46 problem)

Doran, Michael D doran at uta.edu
Mon Dec 6 23:15:36 CET 2004


Hi Yehui,

> I believe it's not a problem of displaying, since I
> checked the octal values and they are not the same.

The octal values of what characters?  What were the octal values?  And not the same as what?

> I tried to use yaz-client to retrieve the MARC record as Mike 
> suggested, but I got exactly the same result as Mike got, the
> French characters are not displaying correctly.

I've not seen your data, but if I understand what's happening in Mike Taylor's YAZ session log, I believe the results *were* correct even if they looked strange.  Let me explain...  (Please excuse me if you were already aware of all this, and I've missed the crux of the problem.)

> > 245 10 $a Extr{E3}eme-Occident
(Note: I've replaced the non-ASCII character with its underlying hex value and placed it in braces.)

In the above 245 field, the first word of the title should be rendered as "Extr{latin small letter e with circumflex}me".  In MARC-8 [1], the combining character for circumflex is hex E3, and precedes the base character that is being modified (in this case, an "e").  The "Latin small letter e with circumflex" doesn't display the way we want it to in most environments (such as the command line on a server) because we are viewing it through the "lens" of a different character set.  For instance, when viewed through the lens of Latin-1 (ISO-8859-1), hex E3 appears as "Latin small letter a with tilde" (i.e. "ã"), and the first title word looks like this: Extrãeme.  There are very few applications (mostly integrated library management systems) that have a rendering engine that will properly display MARC-8 combining characters [2].
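
To make the "lens" idea concrete, here is a minimal Python sketch (not taken from YAZ or from MARC::Charset).  It decodes the raw bytes of the title word as Latin-1, then runs a toy conversion that moves the combining mark after its base letter and composes the pair.  The one-entry MARC8_COMBINING table and the marc8_word_to_unicode() function are hypothetical names invented for this illustration; a real converter needs the full MARC-8 tables and escape-sequence handling.

```python
import unicodedata

# Raw bytes of the first title word as stored in the MARC-8 record:
# 0xE3 is the MARC-8 combining circumflex and comes BEFORE the base "e".
raw = b"Extr\xe3eme"

# Through the Latin-1 lens, 0xE3 is "a with tilde" (U+00E3).
as_latin1 = raw.decode("latin-1")

# Hypothetical one-entry table: MARC-8 combining byte -> Unicode combining mark.
MARC8_COMBINING = {0xE3: "\u0302"}  # COMBINING CIRCUMFLEX ACCENT


def marc8_word_to_unicode(data: bytes) -> str:
    """Toy converter: handles ASCII plus the combining bytes listed above."""
    out = []
    pending = []  # combining marks waiting for their base character
    for byte in data:
        if byte in MARC8_COMBINING:
            pending.append(MARC8_COMBINING[byte])
        elif byte < 0x80:
            # Unicode puts combining marks AFTER the base character.
            out.append(chr(byte))
            out.extend(pending)
            pending.clear()
        else:
            raise ValueError(f"byte 0x{byte:02X} is not covered by this sketch")
    if pending:
        raise ValueError("combining mark without a following base character")
    # NFC composes base + combining mark into a precomposed character.
    return unicodedata.normalize("NFC", "".join(out))


as_unicode = marc8_word_to_unicode(raw)
```

With these inputs, as_latin1 contains "a with tilde" followed by "e", while as_unicode contains the single character "e with circumflex" (U+00EA).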

> > Z> charset latin1 latin1
> > Character set negotiation : latin1

My guess is that the "Z> charset latin1 latin1" command didn't have the desired effect for a couple of reasons: according to the YAZ client documentation, character set negotiation should occur before the session is opened, and it doesn't apply to MARC records, regardless [3].

> > Z> marccharset latin1
> > Z> show 1

According to the YAZ client documentation, the "marccharset" command "Specifies character set for retrieved MARC records so that YAZ client can display them in a character suitable for your display."  I'm not sure if this implies a translation, or if it is merely asking the Z server to output in Latin-1 (for instance) *if* that is an available option from the server's standpoint.  Maybe somebody more knowledgeable will join the discussion.

Even if that *did* work, keep in mind that a character set translation is not necessarily a desirable thing -- by using combining diacritic characters, MARC-8 is capable of encoding thousands of possible characters, while Latin-1 by virtue of using precomposed diacritic characters is limited to a repertoire of fewer than 256 characters.  In other words, not all MARC-8 characters have a Latin-1 equivalent and you risk losing information in the conversion process.  My recommendation, if you choose to do a conversion to Latin-1, is to store the MARC records in your MySQL database encoded in the original MARC-8, and do the Latin-1 conversion on the fly for ephemeral display (e.g. on the web).  A better bet would probably be a conversion of the records to Unicode (UTF-8) with Ed Summers' MARC::Charset Perl module [4].
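
As a rough illustration of that information loss, the Python sketch below assumes the record text has already been converted to Unicode and encodes it to Latin-1 only at display time.  The function name to_latin1_for_display() is made up for this example; the point is just that anything without a Latin-1 equivalent degrades to "?", which is why the stored copy should stay in its original encoding.

```python
import unicodedata


def to_latin1_for_display(text: str) -> bytes:
    """Encode for ephemeral display; unmappable characters become '?'."""
    # Compose base + combining mark pairs first, so that characters which
    # DO exist in Latin-1 (such as e with circumflex) survive intact.
    composed = unicodedata.normalize("NFC", text)
    return composed.encode("latin-1", errors="replace")


# "e" + combining circumflex composes to U+00EA, which Latin-1 has.
ok = to_latin1_for_display("Extre\u0302me")

# "w" + combining circumflex composes to U+0175, which Latin-1 lacks.
lossy = to_latin1_for_display("w\u0302")
```

Here ok keeps the accented letter as the single byte 0xEA, while lossy comes back as a lone question mark.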

Whatever route you choose, I'm sure you are aware that in order to maintain database integrity within your MySQL database, there must be consistency regarding the character set encoding for the MARC record data internal to your database.  If you import a mix of records (e.g. Unicode, MARC-8, Latin-1 encodings) you will have a heck of a time trying to straighten it out.
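
One cheap first check before importing is the character coding scheme flag in the MARC 21 leader: position 09 is "a" for UCS/Unicode and blank for MARC-8.  The Python sketch below reads that flag from a raw record; declared_encoding() is a hypothetical helper name, and the flag only tells you what the record claims, not what the bytes actually contain.

```python
def declared_encoding(record: bytes) -> str:
    """Return the encoding a raw MARC 21 record declares in Leader/09."""
    if len(record) < 24:
        raise ValueError("a MARC 21 leader is 24 bytes long")
    flag = record[9:10]  # Leader/09: character coding scheme
    if flag == b"a":
        return "UCS/Unicode"
    if flag == b" ":
        return "MARC-8"
    return "unknown"
```

A harvesting script could call this on each record and refuse (or convert) anything that does not match the encoding chosen for the database.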

[1] MARC-21 records can be encoded in either the MARC-8 character set or the UCS/Unicode character set.  I'm not very knowledgeable about Z39.50 servers, but I believe that most ILMS implementations will provide records in MARC-8 character encoding.

[2] MARC-8 Environment: NONSPACING GRAPHIC CHARACTERS (DIACRITICS)
    http://www.loc.gov/marc/specifications/speccharmarc8.html#nonspace

[3] Index Data > YAZ > YAZ User's Guide and Reference > Commands
    http://www.indexdata.dk/yaz/doc/client.commands.tkl
    charset negotiationcharset [ outputcharset ]
	 NOTE - Since character set negotiation takes effect
	 in the Z39.50 Initialize Request you should issue
	 this command before command open is used.
	 NOTE - MARC records are not covered by Z39.50
	 character set negotiation. See marccharset.

[4] MARC::Charset - A module for doing MARC-8/UTF8 translation
    http://marcpm.sourceforge.net/MARC/Charset.html

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# doran at uta.edu
# http://rocky.uta.edu/doran/ 

> -----Original Message-----
> From: Yehui Zhang [mailto:yxz at press.uchicago.edu] 
> Sent: Monday, December 06, 2004 2:06 PM
> To: Doran, Michael D
> Cc: net-z3950 at indexdata.com
> Subject: RE: [Net-z3950] Re: a Net::Z3950 v0.46 problem
> 
> Hi Michael,
> 
> We only noticed the problem recently. And I believe it's not a problem of 
> displaying, since I checked the octal values and they are not the same. I 
> tried to use yaz-client to retrieve the MARC record as Mike suggested, but 
> I got exactly the same result as Mike got, the French characters are not 
> displaying correctly. Do you have any idea how this can be fixed?
> 
> Thank you very much for your help!
> Yehui Zhang
> --------------------
> Bibliovault Programmer
> University of Chicago Press
> (773)-702-9436
> 
> At 12:25 PM 12/6/2004 -0600, Doran, Michael D wrote:
> >Hi Yehui,
> >
> > > However we run into a problem recently. Some of the French
> > > characters are not appearing correctly after it's imported
> >
> >Do you mean 1) that you only noticed the problem recently, or 2) that 
> >diacritics of previously harvested and imported records appeared 
> >correctly, and you have only recently begun to have the problem of 
> >corrupted data?  If it's #2, what has changed since the process was working OK?
> >
> > > ... we are still getting the corrupted data.
> >
> >I assume that the character set of the harvested records is MARC-8.  By 
> >"corrupted" do you mean that 1) the characters do not display correctly, 
> >or 2) that you have verified that the underlying encoding is incorrect 
> >(i.e. by examining the hex values)?  If it's #1, what application are you 
> >using to display the records?
> >
> >-- Michael
> >
> ># Michael Doran, Systems Librarian
> ># University of Texas at Arlington
> ># 817-272-5326 office
> ># 817-688-1926 cell
> ># doran at uta.edu
> ># http://rocky.uta.edu/doran/
> >
> > > -----Original Message-----
> > > From: net-z3950-bounces at indexdata.dk
> > > [mailto:net-z3950-bounces at indexdata.dk] On Behalf Of Mike Taylor
> > > Sent: Monday, December 06, 2004 9:29 AM
> > > To: yxz at press.uchicago.edu
> > > Cc: erg at press.uchicago.edu; net-z3950 at indexdata.com
> > > Subject: [Net-z3950] Re: a Net::Z3950 v0.46 problem
> > >
> > > > Date:       Fri, 03 Dec 2004 13:04:32 -0600
> > > > From:       Yehui Zhang <yxz at press.uchicago.edu>
> > > >
> > > > > We have been using Z3950 module to harvest MARC records to our MySQL
> > > > > database. However we run into a problem recently. Some of the French
> > > > > characters are not appearing correctly after it's imported using
> > > > > function rawdata() or render(). I installed the newest version of
> > > > > Z3950 module which supports charset and language options, and set
> > > > > the charset to ISO-8859-1(and tried several other charset/language
> > > > > options), but we are still getting the corrupted data.
> > >
> > > Hi, Yehui.  I've had only a brief look at this, and it's not
> > > immediately obvious to me what's going on.  As I am sure you know,
> > > character-set issues are delicate at the best of times, and there
> > > isn't really enough information here to let me solve the problem.
> > >
> > > > It would be helpful if you could persuade the YAZ command-line client
> > > > ("yaz-client") to retrieve and display this MARC record correctly.
> > > > Despite some messing about with the "charset" and "marccharset"
> > > > commands, I still have not managed this.  (See attached script).  If
> > > > you can do so, I will be in a position to offer better advice; if you
> > > > can't, then the problem may be at the server's end.
> > >
> > > >  _/|_ _______________________________________________________________
> > > /o ) \/  Mike Taylor  <mike at indexdata.com>
> > > http://www.miketaylor.org.uk
> > > )_v__/\  "But what is it good for?" -- Engineer at the Advanced
> > >       Computing Systems Division of IBM, 1965 commenting on the
> > >       microchip.
> > >
> > > --
> > >
> > > $ yaz-client z3950.loc.gov:7090/Voyager
> > > Connecting...OK.
> > > Sent initrequest.
> > > Connection accepted by v3 target.
> > > ID     : 34
> > > Name   : Voyager LMS - Z39.50 Server (YAZ Proxy)
> > > Version: 1.13
> > > Options: search present
> > > Elapsed: 0.374033
> > > Z> find @attr 1=7 0226510646
> > > Sent searchRequest.
> > > Received SearchResponse.
> > > Search was a success.
> > > Number of hits: 1
> > > records returned: 0
> > > Elapsed: 0.163926
> > > Z> show 1
> > > Sent presentRequest (1+1).
> > > Records: 1
> > > [VOYAGER]Record type: USmarc
> > > 001 3468598
> > > 005 19941101162839.5
> > > 008 930209s1993    ilu      b    001 0 eng
> > > 035    $9 (DLC)   93016456
> > > 906    $a 7 $b cbc $c orignew $d 1 $e ocip $f 19 $g y-gencatlg
> > > 955    $a pc to la00 02-09-93; lf08 02-09-93; lh06 02-10-93
> > > (PQ143.U6 M...); lj04 02-11-93; aa00 02-12-93; CIP ver. lg15/
> > > lg06 11-01-93
> > > 010    $a    93016456
> > > 020    $a 0226510638
> > > 020    $a 0226510646 (pbk.)
> > > 040    $a DLC $c DLC $d DLC
> > > 043    $a n-us--- $a e-fr---
> > > 050 00 $a PQ143.U6 $b M37 1993
> > > 082 00 $a 840.9/3273 $2 20
> > > 100 1  $a Mathy, Jean-Philippe.
> > > 245 10 $a Extrãeme-Occident : $b French intellectuals and
> > > America / $c Jean-Philippe Mathy.
> > > 260    $a Chicago : $b University of Chicago Press, $c c1993.
> > > 300    $a ix, 307 p. ; $c 24 cm.
> > > > 504    $a Includes bibliographical references (p. 289-297) and index.
> > > > 650  0 $a French literature $y 20th century $x History and criticism.
> > > 651  0 $a United States $x Foreign public opinion, French.
> > > 651  0 $a France $x Intellectual life $y 20th century.
> > > 650  0 $a French literature $x American influences.
> > > 651  0 $a America $x In literature.
> > > 991    $b c-GenColl $h PQ143.U6 $i M37 1993 $p 00012523695 $t
> > > Copy 1 $w BOOKS
> > > nextResultSetPosition = 2
> > > Elapsed: 0.131160
> > > Z> charset latin1 latin1
> > > Character set negotiation : latin1
> > > Z> show 1
> > > Sent presentRequest (1+1).
> > > Records: 1
> > > [VOYAGER]Record type: USmarc
> > > 001 3468598
> > > 005 19941101162839.5
> > > 008 930209s1993    ilu      b    001 0 eng
> > > 035    $9 (DLC)   93016456
> > > 906    $a 7 $b cbc $c orignew $d 1 $e ocip $f 19 $g y-gencatlg
> > > 955    $a pc to la00 02-09-93; lf08 02-09-93; lh06 02-10-93
> > > (PQ143.U6 M...); lj04 02-11-93; aa00 02-12-93; CIP ver. lg15/
> > > lg06 11-01-93
> > > 010    $a    93016456
> > > 020    $a 0226510638
> > > 020    $a 0226510646 (pbk.)
> > > 040    $a DLC $c DLC $d DLC
> > > 043    $a n-us--- $a e-fr---
> > > 050 00 $a PQ143.U6 $b M37 1993
> > > 082 00 $a 840.9/3273 $2 20
> > > 100 1  $a Mathy, Jean-Philippe.
> > > 245 10 $a Extrãeme-Occident : $b French intellectuals and
> > > America / $c Jean-Philippe Mathy.
> > > 260    $a Chicago : $b University of Chicago Press, $c c1993.
> > > 300    $a ix, 307 p. ; $c 24 cm.
> > > > 504    $a Includes bibliographical references (p. 289-297) and index.
> > > > 650  0 $a French literature $y 20th century $x History and criticism.
> > > 651  0 $a United States $x Foreign public opinion, French.
> > > 651  0 $a France $x Intellectual life $y 20th century.
> > > 650  0 $a French literature $x American influences.
> > > 651  0 $a America $x In literature.
> > > 991    $b c-GenColl $h PQ143.U6 $i M37 1993 $p 00012523695 $t
> > > Copy 1 $w BOOKS
> > > nextResultSetPosition = 2
> > > Elapsed: 0.120964
> > > Z> marccharset latin1
> > > Z> show 1
> > > Sent presentRequest (1+1).
> > > Records: 1
> > > [VOYAGER]Record type: USmarc
> > > convert from latin1 to latin1
> > > 001 3468598
> > > 005 19941101162839.5
> > > 008 930209s1993    ilu      b    001 0 eng
> > > 035    $9 (DLC)   93016456
> > > 906    $a 7 $b cbc $c orignew $d 1 $e ocip $f 19 $g y-gencatlg
> > > 955    $a pc to la00 02-09-93; lf08 02-09-93; lh06 02-10-93
> > > (PQ143.U6 M...); lj04 02-11-93; aa00 02-12-93; CIP ver. lg15/
> > > lg06 11-01-93
> > > 010    $a    93016456
> > > 020    $a 0226510638
> > > 020    $a 0226510646 (pbk.)
> > > 040    $a DLC $c DLC $d DLC
> > > 043    $a n-us--- $a e-fr---
> > > 050 00 $a PQ143.U6 $b M37 1993
> > > 082 00 $a 840.9/3273 $2 20
> > > 100 1  $a Mathy, Jean-Philippe.
> > > 245 10 $a Extrãeme-Occident : $b French intellectuals and
> > > America / $c Jean-Philippe Mathy.
> > > 260    $a Chicago : $b University of Chicago Press, $c c1993.
> > > 300    $a ix, 307 p. ; $c 24 cm.
> > > > 504    $a Includes bibliographical references (p. 289-297) and index.
> > > > 650  0 $a French literature $y 20th century $x History and criticism.
> > > 651  0 $a United States $x Foreign public opinion, French.
> > > 651  0 $a France $x Intellectual life $y 20th century.
> > > 650  0 $a French literature $x American influences.
> > > 651  0 $a America $x In literature.
> > > 991    $b c-GenColl $h PQ143.U6 $i M37 1993 $p 00012523695 $t
> > > Copy 1 $w BOOKS
> > > nextResultSetPosition = 2
> > > Elapsed: 0.119284
> > > Z>
> > >
> > >
> > >
> > > _______________________________________________
> > > Net-z3950 mailing list
> > > Net-z3950 at indexdata.dk
> > > http://www.indexdata.dk/mailman/listinfo/net-z3950
> > >
> 
> 




