Handling Chinese-Language Bibliographic Data

A North American Perspective

Hong Kong Library Association, August 22nd, 2003

Clément Arsenault, assistant professor

École de bibliothéconomie et des sciences de l’information

Université de Montréal

©2003, Clément Arsenault 2

Overview

Multilingual / Multiscript Information Systems

Transliteration, Transcription and Romanization

Romanization Systems for Chinese

Transliteration in Bibliographic Records

Word Division

Parsing Chinese Text

Word Division for Bibliographic Control

A Retrieval Experiment


Multilingual / Multiscript Info. Systems

Integrate several languages and/or several scripts

10 major scripts, to write ~95% of all languages

• Japanese: Hiragana あいうえお…, Katakana アイウエオ…
• Korean: 가각갂갃간…
• Chinese: 甲乙丙丁…
• Roman: abcdeéèêœ…
• Greek: αβγδε…
• Cyrillic: авгдеж…
• Hebrew: אבגדה…
• Arabic: ا ب ت ث ج ح…
• Indic (11): अआइईउऊ…
• Thai: กขฃคฅฆ…


Multilingual / Multiscript Info. Systems

System contains records representing items in more than one language

System contains records that are, in total or in part, in more than one language

The system interface is in more than one language
• System prompts
• Command / query language

The system is able to display text in more than one script

The system allows the end user to build queries in more than one script


Multilingual / Multiscript Info. Systems

Non-Roman data in North American OPACs

[Diagram: for cataloguing and retrieval, is non-Roman data stored, displayed and indexed (yes / no), in Romanization or in the vernacular script?]


Chinese Language: Some Facts

Number of characters
• 9,353 in 1st century C.E.
• 47,043 in 1716 (康熙字典)
• ~60,000 in 1990 (漢語大字典)

Occurrence (share of running text covered)
• 1,000 characters: 90%
• 2,400 characters: 99%
• 3,800 characters: 99.9%
• 6,600 characters: 99.999%


Romanization Systems for Chinese

What is Transliteration?
• Script conversion
– Transliteration: script → script
– Transcription: sound → script

What is Romanization?
• Converting a script to the Roman script

Romanizing Chinese script
• Only transcription is possible

Romanization Systems for Chinese

What sounds?
• Vast number of regionalects / dialects
• Standard is Mandarin (based on Beijing)

Example: 茶
• cha: Northern
• zo: Suzhou
• dzo: Wenzhou
• te: Xiamen (Amoy)
• tssa: Guangzhou (Canton)

Romanization Systems for Chinese

What sounds?
• Then how to render it… ?

chun, ch’un, tchun, tchwun, tchoun, tchounne...

Romanization Systems for Chinese

Historical overview
• Fanqie method (early 1st millennium)
– 烃 = 土 + 丁 (tu + ding)
• Matteo Ricci & Father Nicolas Trigault (17th cent.)
• Hundreds of schemes developed since then
– Mostly developed by Westerners
– Wade (English)
– Wade-Giles (English/American)
– EFEO (French)
– Yale (American)
– Lessing-Othmer (German)
– …
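The fanqie example spells the reading of one character with two others: the first supplies the initial consonant, the second supplies the final. A minimal Python sketch of that idea on toneless modern pinyin follows. The `INITIALS` list and the `split_syllable` and `fanqie` helpers are illustrative names, and working on modern pinyin is a simplification, since historical fanqie described older pronunciations.

```python
# Pinyin initials, with two-letter initials first so "zh", "ch", "sh"
# are matched before "z", "c", "s".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def split_syllable(syllable):
    """Split a toneless pinyin syllable into (initial, final)."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable  # zero-initial syllable such as "an"

def fanqie(upper, lower):
    """Combine the initial of `upper` with the final of `lower`."""
    initial, _ = split_syllable(upper)
    _, final = split_syllable(lower)
    return initial + final

print(fanqie("tu", "ding"))  # 土 + 丁 gives the reading of 烃 -> ting
```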

Romanization Systems for Chinese

Historical overview
• Systems developed by Chinese
– Gwoyeu Romatzyh Pinin Faashyh (1928)
– Beifangxua Latinxua Sin Wenz (1931)
– Hanyu pinyin fang’an (1956)

Romanizing Chinese for bibliographic control in North America
• Wade-Giles (through October 2000)
• Pinyin (after October 2000)

Romanization Systems for Chinese

Wade-Giles vs Pinyin

Example: 唐宋全诗
Wade-Giles: T‘ang2 Sung4 ch‘üan2 shih1
Pinyin: Táng Sòng quán shī

Wade-Giles
• Used mostly in English-speaking countries
• Was used until 2000 at LC (and mainly in NA libraries)
• Rarely used in teaching anymore
• Heavy use of punctuation and diacritics

Pinyin
• Used internationally
• Used for many years in libraries in Europe and Australia
• Used for teaching
• Minimal use of punctuation and diacritics
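Converting between the two systems is essentially a syllable-by-syllable table lookup. The Python sketch below converts the Wade-Giles example into toneless pinyin. The table `WG_TO_PINYIN` is a tiny illustrative subset, and the function names are assumptions made for this example, not part of any library conversion tool.

```python
import re
import unicodedata

# Tiny, illustrative subset of a Wade-Giles -> pinyin syllable table.
_TABLE = {
    "t'ang": "tang", "tang": "dang",
    "sung": "song",
    "ch'üan": "quan", "chüan": "juan",
    "shih": "shi",
}
# Normalize the keys so that lookups do not depend on how "ü" was typed.
WG_TO_PINYIN = {unicodedata.normalize("NFC", k): v for k, v in _TABLE.items()}

def normalize(syllable):
    """Lowercase, unify apostrophe variants, drop trailing tone digits."""
    s = unicodedata.normalize("NFC", syllable.lower())
    s = re.sub("[‘’`]", "'", s)
    return re.sub(r"\d+$", "", s)

def wg_to_pinyin(text):
    """Convert space-separated Wade-Giles syllables to toneless pinyin.

    Syllables missing from the table are returned in square brackets.
    """
    converted = []
    for syllable in text.split():
        key = normalize(syllable)
        converted.append(WG_TO_PINYIN.get(key, "[" + key + "]"))
    return " ".join(converted)

print(wg_to_pinyin("T‘ang2 Sung4 ch‘üan2 shih1"))  # -> tang song quan shi
```

Note that the apostrophe carries meaning in Wade-Giles: in this table `t'ang` and `tang` map to different pinyin syllables.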


Transliteration in Bibliographic Records

Is transliteration necessary / useful?

Necessary for oral and written communications
• All 川崎 models come fully equipped.
• All Kawasaki models come fully equipped.


Transliteration in Bibliographic Records

Is there a need for Romanized fields in bibliographic records?
• In printed records?
• In electronic records?

A special case for Chinese
• Three major obstacles
– Filing: difficult to browse Chinese characters
– Data entry: users need to Romanize anyway
– 25% of sources in Roman only (Anderson 1972)


Transliteration in Bibliographic Records

Filing Chinese characters
• Number of strokes
• Semantic roots
– Then, number of strokes
• Based on shape
– 4-corners method
• By sound
– Romanization… (A–Z)
– Simplest and fastest method

Transliteration in Bibliographic Records

Data entry
• Keyboards (more than 700 methods)
– Special keyboards
– QWERTY or AZERTY keyboards (orthographic-based methods, phonetic-based methods)
• Special devices
– OCR
– Pressure sensitive tablets
– Voice recognition…


Word Division

Chinese is written without word delimiters

多接近大自然總是不錯的,因為人是從大自然而來的。

But Romanized Chinese could/should be…

Duo jie jin da zi ran zong shi bu cuo de, yin wei ren shi cong da zi ran er lai de.

Duo jiejin daziran zongshi bucuo de, yinwei ren shi cong daziran erlai de.

Word Division

Reasons for delimiting Romanized Chinese
• Syllabic structure is too simple for efficient retrieval
– ~1300 single syllables (~400 base syllables)
– “mā, má, mǎ, mà” indexed as “ma”
• Single syllables
– High level of ambiguity (homophones)
– Ambiguous 8 times out of 9
– Readability is almost nil
• Joined syllables
– Resolves ~95% of ambiguity cases (King 1983)
– Greatly improves readability
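The collapse of “mā, má, mǎ, mà” into a single index term “ma” can be reproduced with Unicode normalization. This is a minimal Python sketch, assuming the index simply drops the four tone diacritics; the `strip_tones` helper is an illustrative name.

```python
import unicodedata
from collections import defaultdict

# Combining marks used for the four pinyin tones: macron, acute, caron, grave.
TONE_MARKS = {"\u0304", "\u0301", "\u030c", "\u0300"}

def strip_tones(pinyin):
    """Remove tone marks but keep other diacritics such as the dots on ü."""
    decomposed = unicodedata.normalize("NFD", pinyin)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)

# Build a toneless index: four distinct syllables end up under one key.
index = defaultdict(set)
for syllable in ["mā", "má", "mǎ", "mà"]:
    index[strip_tones(syllable)].add(syllable)

print(list(index))       # the only index term
print(len(index["ma"]))  # number of toned syllables behind it
```

Dropping tones reduces the roughly 1300 toned syllables to about 400 index terms, which is why single-syllable indexing is so ambiguous.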

Word Division

But, no consistent rules… 中國話
• Zhong guo hua
• Zhong-guo hua
• Zhongguo hua
• Zhongguohua
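The different groupings of 中國話 come from different decisions about what counts as a word. One common way to make that decision mechanically is greedy longest-match against a word list. The Python sketch below assumes a small hypothetical lexicon (`LEXICON`) and an illustrative `aggregate` helper, and shows how the result changes with the lexicon.

```python
# Hypothetical lexicon of polysyllabic words, written as joined toneless pinyin.
LEXICON = {"zhongguo", "daziran", "jiejin", "zongshi", "bucuo", "yinwei", "erlai"}
MAX_WORD_SYLLABLES = 3

def aggregate(syllables, lexicon=LEXICON):
    """Greedy longest-match joining of syllables into lexicon words."""
    words, i = [], 0
    while i < len(syllables):
        longest = min(MAX_WORD_SYLLABLES, len(syllables) - i)
        for n in range(longest, 0, -1):
            candidate = "".join(syllables[i:i + n])
            # A single syllable is always accepted as a fallback.
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words

print(" ".join(aggregate("zhong guo hua".split())))
# -> zhongguo hua
print(" ".join(aggregate("zhong guo hua".split(), LEXICON | {"zhongguohua"})))
# -> zhongguohua
```

Two cataloguers working from different word lists would therefore produce different, equally defensible entries for the same title.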


Parsing Text

What is a word?
• Visual word
• Semantic/syntactic word...

Often based on conventions

Not always consistent (in Google, 4 Aug. 2003)
– earring (461,000) / ear ring (18,200)
– shoemaker (465,000) / shoe maker (21,100)
– bottleneck (419,000) / bottle neck (32,800)
– firefighter (687,000) / fire fighter (121,000)
– flowerpot (42,100) / flower pot (54,300)


Word Division for Bibliographic Control

1997: LC announces change to Pinyin

Use monosyllabic or polysyllabic transcription?

Monosyllabic division
• Consistent
• Increases recall
• Lowers precision
• Easier to convert from existing Wade-Giles
• Easier to generate Romanization from a string of Chinese characters

Polysyllabic division
• Difficult to be consistent
• Lack of established standard
• The proper format according to Hanyu pinyin fang’an (the PRC pinyin standard)
• Represents the nature of the language
• Improves readability when browsing
• Improves precision
• More effective in voice recognition / text-to-speech implementations
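The recall and precision trade-off between the two divisions can be seen on a toy index. In the Python sketch below, the two records and their polysyllabic groupings are illustrative assumptions, not entries from a real catalogue; `keyword_search` is a hypothetical helper that requires every query term to match a whole token.

```python
# Two illustrative titles, each romanized monosyllabically and polysyllabically.
RECORDS = {
    1: {"mono": "shang hai shi", "poly": "shanghai shi"},   # 上海史
    2: {"mono": "hai shang hua", "poly": "haishang hua"},   # 海上花
}

def keyword_search(terms, form):
    """Return ids of records whose `form` field contains every term as a token."""
    hits = []
    for record_id, fields in RECORDS.items():
        tokens = set(fields[form].split())
        if all(term in tokens for term in terms):
            hits.append(record_id)
    return hits

# A user looking for "Shanghai":
print(keyword_search(["shang", "hai"], "mono"))  # -> [1, 2]
print(keyword_search(["shanghai"], "poly"))      # -> [1]
```

The monosyllabic index returns both records (higher recall, lower precision), while the polysyllabic index returns only the record that actually contains the word.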


A Retrieval Experiment

Experiment designed to test the effect of syllable aggregation on retrieval

Part of a doctoral thesis at the University of Toronto


Statement of the Problem

Conversion to pinyin (1st Oct. 2000–1st Oct. 2001)
• No inclusion of tones
• Text division (syllable aggregation)
– Monosyllabic for common words (e.g., 东西 dong xi)
– Polysyllabic for proper nouns (e.g., 上海 Shanghai)

Consequences:
• Two different methods used together
– Confusing!!!
• Only ~400 index “terms” available for all common words
– Too few!!!


Statement of the Problem

Conversion from Wade-Giles to Pinyin
• Convert to monosyllabic?
• Convert to polysyllabic?

Potential impact on retrieval / browsing

Measure impact on retrieval
• Effectiveness (success in finding records)
• Efficiency (effort spent to find them, i.e., time)


Research Questions

Determine if using polysyllabic pinyin entries, over monosyllabic pinyin entries, in bibliographic records improves retrieval effectiveness and efficiency in known-item exact-title searches.

Determine if using polysyllabic pinyin entries, over monosyllabic pinyin entries, in bibliographic records improves retrieval effectiveness and efficiency in known-item keywords-in-title searches.


Research Questions

In other words: what is the effect of aggregation patterns on…

                Exact-title   Keywords
Efficiency      Q1            Q3
Effectiveness   Q2            Q4

Six variables were defined

Six hypotheses


Research Questions

Definitions

Exact-title search mode (with implied truncation)
• Request for “Gone with the wind”
• QUERY: “gone with the”

Keyword search mode
• Request for “Gone with the wind”
• QUERY: “wind” AND “gone”
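The two search modes defined above can be sketched in a few lines of Python. This is an illustration of the definitions, not the experiment's actual ASP code; the function names and the three sample titles are assumptions.

```python
def exact_title_search(query, titles):
    """Phrase match with implied truncation: the title must start with the query."""
    q = query.lower()
    return [t for t in titles if t.lower().startswith(q)]

def keyword_title_search(terms, titles):
    """Boolean AND: every term must occur as a word somewhere in the title."""
    wanted = [term.lower() for term in terms]
    return [t for t in titles if all(w in t.lower().split() for w in wanted)]

titles = ["Gone with the wind", "The wind in the willows", "Gone girl"]

print(exact_title_search("gone with the", titles))     # -> ['Gone with the wind']
print(keyword_title_search(["wind", "gone"], titles))  # -> ['Gone with the wind']
```

Exact-title searching depends on getting the first words right and in order, while keyword searching lets the user skip any word they cannot Romanize.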


Hypotheses

Effect of using polysyllabic transcription over monosyllabic

[Table: predictions for efficiency and effectiveness, each in phrase and keyword searches]


Methodology

Retrieval task:
• Search 2 lists of 20 titles (in Chinese characters) using:
– Wade-Giles Romanization (WG)
– Pinyin-monosyllabic Romanization (mPY)
– Pinyin-polysyllabic Romanization (pPY)
• Replicate using two search modes:
– Exact-title searching (phrase matching)
– Keyword searching
• Measure:
– Time to complete task (efficiency)
– Number of items/records found (effectiveness)
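As a rough illustration of how the two kinds of measure relate to one trial, the Python sketch below derives them from a single trial record. The record layout, the field names and the numbers are made up for this example and are not data from the experiment.

```python
TITLES_PER_LIST = 20  # from the task description: each list holds 20 titles

def summarize(trial):
    """Derive efficiency and effectiveness measures from one trial record."""
    found = trial["items_found"]
    return {
        # Efficiency: effort spent, measured as time.
        "completion_time_s": trial["seconds"],
        "time_per_item_s": trial["seconds"] / found if found else None,
        # Effectiveness: success in finding records.
        "success_rate": found / TITLES_PER_LIST,
    }

# Made-up trial, for illustration only.
example = {"participant": "P01", "system": "pPY", "seconds": 900, "items_found": 18}
print(summarize(example))
```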


Methodology: sampling

Purposive sample of 30 students
• Graduate students
• Native speakers of Chinese
• Good working knowledge of Romanization

Each participant was given CAN$20

30 participants × 2 tasks = 60 trials


Methodology: design and procedures

My main statistical design was a 2 × 3 randomized factorial design with unbalanced proportional data. Participants were replicated over factor A.

Trials per cell (factor A: search mode, factor B: Romanization)

A \ B      WG    mPY   pPY
X-title    6     12    12
KW         6     12    12

Cell means

A \ B      WG    mPY   pPY
X-title    µ11   µ12   µ13
KW         µ21   µ22   µ23


Methodology: apparatus

20 titles × 2 lists = 40 titles

3 databases of ca. 50K records (RLIN db)
• WG / mPY / pPY

Databases running on Microsoft Access

Interface in HTML format accessed with Web browser

ASP links interface to database and records transaction logs


Methodology: apparatus

Titles to be searched (with ID-number)

1. 颤栗 / 蒋伯潜 ____ - ____

2. 盐山新志:河北省 / 汪美瑞 ____ - ____

3. 生死场 / 顾宝民 ____ - ____

4. 西藏那曲地区土地资源 / 施其明 ____ - ____

Methodology: data collection

Transaction Log Analysis (TLA)

[Diagram: components of TLA (end-user, interface, logging program, database, logs)]

Interaction with external software components

[Diagram: Internet / WWW, Internet Information Server running ASP scripts on Win NT, HTML files, and ADO / ODBC connections to SQL Server]

[Diagram: the TLA components mapped onto the software components above, connected through ASP]

Generated logs

[Figure: sample of the generated transaction logs]

Statistical analysis

[Table: six measures compared pairwise (WG / mPY, WG / pPY, mPY / pPY) in exact-title and keyword modes, grouped under efficiency and effectiveness: completion time, time per item found, expected search length, number of queries, success rate, and success rate per query]

Results

Aggregation improves search efficiency for title searches

Keyword search is especially influenced by aggregation

Keyword search is especially important for Chinese titles, since it is not unusual for the pronunciation of one of the first characters in the title to be unknown

                Exact-title   Keywords
Efficiency      Q1            Q3
Effectiveness   Q2            Q4


Results

Using mono- and polysyllabic aggregation concurrently is a great source of confusion to end-users

Retrieval with Romanization works relatively well

Success rates for known-item searches vary between 72% and 91% depending on the Romanization system used


Results

However, for a non-negligible proportion of end users, Romanization-based retrieval poses real problems

Around 25% of the participants made between 50 and 80 Romanization errors during the retrieval task

[Histogram: number of participants (0 to 7) by number of errors, in bins of five from 0–4 to 80–84]

Results

Cause of errors

Aggregation
• A1: Two unlinked units were linked (e.g., dong xi / dongxi)
• A2: One linked unit was unlinked (e.g., Shanghai / Shang hai)

Romanization
• R1: Character was misread (e.g., 粟 su / 栗 li)
• R2: Romanization was misspelled (e.g., 林 was written ling instead of lin)
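Aggregation errors differ from Romanization errors in a way that is easy to test: the letters are right and only the spacing is wrong. The Python sketch below classifies a typed query against the expected entry on that basis. The `classify_error` helper is an illustrative assumption, and it cannot tell R1 from R2, because that requires knowing whether the character was misread or merely misspelled.

```python
def classify_error(typed, expected):
    """Classify a typed query against the expected Romanization.

    Returns None when the strings match, "A1" when units were wrongly linked,
    "A2" when a unit was wrongly split, "A1+A2" when both happened,
    and "R" when the letters themselves differ.
    """
    if typed == expected:
        return None
    if typed.replace(" ", "") == expected.replace(" ", ""):
        # Same letters, different spacing: an aggregation error.
        typed_spaces, expected_spaces = typed.count(" "), expected.count(" ")
        if typed_spaces < expected_spaces:
            return "A1"
        if typed_spaces > expected_spaces:
            return "A2"
        return "A1+A2"
    return "R"

print(classify_error("dongxi", "dong xi"))     # -> A1
print(classify_error("Shang hai", "Shanghai")) # -> A2
print(classify_error("ling", "lin"))           # -> R
```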


Results

Aggregation and Romanization errors

[Bar chart: aggregation errors and Romanization errors (average per query) for WG, mPY and pPY]

Results

Wade-Giles notation is more “forgiving”

Wade-Giles                 Pinyin
chen / ch’en               zhen / chen
chi / ch’i                 ji / qi
chu / ch’u / chü / ch’ü    zhu / chu / ju / qu
…                          …
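One way to read “forgiving”: if the search index ignores apostrophes and diacritics, which is an assumption made here and not something the slides state, then the Wade-Giles syllables in each row share one search key, while their pinyin counterparts stay distinct. The `search_key` helper in this Python sketch is illustrative.

```python
import unicodedata

def search_key(syllable):
    """Index key that ignores case, apostrophes and diacritics (assumed normalization)."""
    decomposed = unicodedata.normalize("NFD", syllable.lower())
    return "".join(ch for ch in decomposed if ch.isascii() and ch.isalpha())

pairs = {
    "Wade-Giles": [("chen", "ch'en"), ("chi", "ch'i"), ("chu", "ch'ü")],
    "Pinyin": [("zhen", "chen"), ("ji", "qi"), ("zhu", "qu")],
}

for system, examples in pairs.items():
    for a, b in examples:
        same = search_key(a) == search_key(b)
        print(system, a, b, "same key" if same else "different keys")
```

Under that assumption, a user who confuses two similar sounds still retrieves the record in Wade-Giles, whereas the same confusion in pinyin produces a different string and a failed search.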


Results

Analysis of Romanization errors

Romanization
• R1: character misread (e.g., 粟 su / 栗 li)
• R2: Romanization misspelled
– chen / cheng (dental nasal vs. velar nasal)
– cu / zu (voiced fricatives vs. unvoiced fricatives)
– hu / fu (glottal fricative vs. labiodental fricative)
– la / na (alveolar lateral vs. alveolar nasal)


Results

Phonetic confusion

[Bar chart: phonetic confusion (average per query, scale 0 to 0.2) for fricatives, nasals and other]

Conclusions

1. In KW mode, polysyllabic entries help improve precision

2. More aggregation errors in polysyllabic, but overall not overwhelming

3. Dual aggregation format is confusing to end-users (important source of error)

4. Still relatively high proportion of errors caused by confusion in Romanization


Further research

Project #1

Using vernacular script for retrieval
• What model???

[Diagram: input (Romanization or other input method), query (Romanization or 汉字) and data (Romanization or 汉字)]


Further research

Project #2

Using XML to encode non-Roman bibliographic data
• What is the viability of XML as a conversion format for bibliographic records containing non-Roman data?
• How can we use existing conversion schemas, for instance those developed at LC?
• Does XML offer the required flexibility for publishing non-Roman data on the Web with enhanced retrieval capabilities?


Further research

Benefits

Integration of resources created under a decentralized environment

Creation of specialized retrieval tools adapted to the specific nature of the data

Increased visibility for resources in non-Roman alphabets


Questions

Thank you!
