Handling Chinese-Language Bibliographic Data: A North American Perspective. Hong Kong Library Association, August 22nd, 2003. Clément Arsenault, assistant professor, École de bibliothéconomie et des sciences de l’information, Université de Montréal


Handling Chinese-Language Bibliographic Data

A North American Perspective

Hong Kong Library Association, August 22nd, 2003

Clément Arsenault, assistant professor

École de bibliothéconomie et des sciences de l’information

Université de Montréal

©2003, Clément Arsenault

Overview

Multilingual / Multiscript Information Systems

Transliteration, Transcription and Romanization

Romanization Systems for Chinese

Transliteration in Bibliographic Records

Word Division

Parsing Chinese Text

Word Division for Bibliographic Control

A Retrieval Experiment


Multilingual / Multiscript Info. Systems

Integrate several languages and/or several scripts

10 major scripts, to write ~95% of all languages
• Roman: abcdeéèêœ …
• Greek: αβγδε …
• Cyrillic: авгдеж …
• Hebrew: אבגדה …
• Arabic: ا ب ت ث ج ح …
• Indic (11): अआइईउऊ …
• Thai: กขฃคฅฆ …
• Chinese: 甲乙丙丁 …
• Japanese
  – Hiragana: あいうえお …
  – Katakana: アイウエオ …
• Korean: 가각갂갃간 …


Multilingual / Multiscript Info. Systems

System contains records representing items in more than one language

System contains records that are, in total or in part, in more than one language

The system interface is in more than one language
• System prompts
• Command / query language

The system is able to display text in more than one script

The system allows the end user to build queries in more than one script


Multilingual / Multiscript Info. Systems

Non-Roman data in North American OPACs

[Diagram: decision tree for non-Roman data asking Stored? Displayed? Indexed? (each yes / no), contrasting Romanization and vernacular script across cataloguing and retrieval]


Chinese Language: Some Facts

Number of characters
• 9,353 in 1st century C.E.
• 47,043 in 1716 (康熙字典)
• ~60,000 in 1990 (漢語大字典)

Occurrence
• 1,000 characters → 90%
• 2,400 characters → 99%
• 3,800 characters → 99.9%
• 6,600 characters → 99.999%
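The occurrence figures read as cumulative coverage: the share of running text accounted for by the N most frequent characters. A minimal sketch of how such a figure could be computed is given here. The `coverage` function and the toy sample are illustrative assumptions and are not the source of the percentages on the slide.

```python
from collections import Counter


def coverage(text, top_n):
    """Share of character occurrences covered by the top_n most frequent characters."""
    # Keep only characters in the basic CJK ideograph block; skip punctuation and spaces.
    chars = [c for c in text if "\u4e00" <= c <= "\u9fff"]
    if not chars:
        return 0.0
    counts = Counter(chars)
    covered = sum(n for _, n in counts.most_common(top_n))
    return covered / len(chars)


# Toy sample only: one sentence is far too small for real statistics.
sample = "多接近大自然總是不錯的,因為人是從大自然而來的。"
print(coverage(sample, 5))
```

On a real corpus the same calculation would be run for increasing values of `top_n` to trace the coverage curve.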


Romanization Systems for Chinese

What is Transliteration?
• Script conversion
  – Transliteration: script → script
  – Transcription: sound → script

What is Romanization?
• Converting a script to the Roman script

Romanizing Chinese script
• Only transcription is possible


Romanization Systems for Chinese

What sounds?
• Vast number of regionalects / dialects
• Standard is Mandarin (based on Beijing)

Example: 茶
• cha: Northern
• zo: Suzhou
• dzo: Wenzhou
• te: Xiamen (Amoy)
• tssa: Guangzhou (Canton)


Romanization Systems for Chinese

What sounds?
• Then how to render it… ?
  – chun
  – ch’un
  – tchun
  – tchwun
  – tchoun
  – tchounne
  – ...


Romanization Systems for Chinese

Historical overview
• Fanqie method (early 1st millennium)
  – 烃 = 土 + 丁 (tu + ding)
• Matteo Ricci & Father Nicolas Trigault (17th cent.)
• Hundreds of schemes developed since then
  – Mostly developed by Westerners
  – Wade (English) → Wade-Giles (English/American)
  – EFEO (French)
  – Yale (American)
  – Lessing-Othmer (German)
  – …


Romanization Systems for Chinese

Historical overview
• Systems developed by Chinese
  – Gwoyeu Romatzyh Pinin Faashyh (1928)
  – Beifangxua Latinxua Sin Wenz (1931)
  – Hanyu pinyin fang’an (1956)

Romanizing Chinese for bibliographic control in North America
• Wade-Giles (through October 2000)
• Pinyin (after October 2000)


Romanization Systems for Chinese

Wade-Giles vs Pinyin

Example: 唐宋全诗
• Wade-Giles: T‘ang2 Sung4 ch‘üan2 shih1
• Pinyin: Táng Sòng quán shī

Wade-Giles
• Used mostly in English-speaking countries
• Was used until 2000 at LC (and mainly in NA libraries)
• Rarely used in teaching anymore
• Heavy use of punctuation and diacritics

Pinyin
• Used internationally
• Used for many years in libraries in Europe and Australia
• Used for teaching
• Minimal use of punctuation and diacritics



Transliteration in Bibliographic Records

Is transliteration necessary / useful?
• Necessary for oral and written communications
  – All 川崎 models come fully equipped.
  – All Kawasaki models come fully equipped.


Transliteration in Bibliographic Records

Is there a need for Romanized fields in bibliographic records?
• In printed records?
• In electronic records?

A special case for Chinese
• Three major obstacles
  – Filing: difficult to browse Chinese characters
  – Data entry: users need to Romanize anyway
  – 25% of sources in Roman only (Anderson 1972)


Transliteration in Bibliographic Records

Filing Chinese characters
• Number of strokes
• Semantic roots
  – Then, number of strokes
• Based on shape
  – 4-corners method
• By sound
  – Romanization… (A–Z)
  – Simplest and fastest method


Transliteration in Bibliographic Records

Data entry
• Keyboards (more than 700 methods)
  – Special keyboards
  – QWERTY or AZERTY keyboards
    – orthographic-based methods
    – phonetic-based methods
• Special devices
  – OCR
  – Pressure sensitive tablets
  – Voice recognition…


Word Division

Chinese is written without word delimiters

多接近大自然總是不錯的,因為人是從大自然而來的。

But Romanized Chinese could/should be…
• Duo jie jin da zi ran zong shi bu cuo de, yin wei ren shi cong da zi ran er lai de.
• Duo jiejin daziran zongshi bucuo de, yinwei ren shi cong daziran erlai de.
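Going from the syllable-by-syllable rendering to the word-divided one requires some notion of what counts as a word. One common way to approximate this is greedy longest matching against a word list. The sketch below assumes a hypothetical mini-lexicon that covers only the example sentence; the function name `aggregate` and the `max_len` limit are illustrative, and the sketch does not handle the apostrophe that pinyin uses to separate ambiguous syllable boundaries.

```python
def aggregate(syllables, lexicon, max_len=4):
    """Join monosyllabic pinyin into words by greedy longest match against a lexicon."""
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first; a single syllable is always accepted.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


# Hypothetical mini-lexicon covering only the example sentence.
LEXICON = {"jiejin", "daziran", "zongshi", "bucuo", "yinwei", "erlai"}

mono = "duo jie jin da zi ran zong shi bu cuo de yin wei ren shi cong da zi ran er lai de"
print(" ".join(aggregate(mono.split(), LEXICON)))
```

The quality of the result depends entirely on the lexicon, which is one reason consistent word division is hard to achieve in practice.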


Word Division

Reasons for delimiting Romanized Chinese
• Syllabic structure is too simple for efficient retrieval
  – ~1300 single syllables (~400 base syllables)
  – “mā, má, mǎ, mà” indexed as “ma”
• Single syllables
  – High level of ambiguity (homophones)
  – Ambiguous 8 times out of 9
  – Readability is almost nil
• Joined syllables
  – Resolves ~95% of ambiguity cases (King 1983)
  – Greatly improves readability
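The collapse of toned syllables into a single index term can be illustrated with Unicode normalization. This is a minimal sketch, assuming tones are written as diacritics; `strip_tones` is an illustrative helper, not a description of how any particular catalogue system indexes its records.

```python
import unicodedata


def strip_tones(pinyin):
    """Remove combining marks after NFD decomposition (this also turns ü into u)."""
    decomposed = unicodedata.normalize("NFD", pinyin)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


toned = ["mā", "má", "mǎ", "mà"]
index_terms = {strip_tones(s) for s in toned}
print(index_terms)  # four toned syllables collapse into one index term
```

Four distinct syllables become one search key, which is why toneless single syllables are so ambiguous.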


Word Division

But, no consistent rules… 中國話
• Zhong guo hua
• Zhong-guo hua
• Zhongguo hua
• Zhongguohua
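To see why inconsistent division matters for retrieval, consider what a simple whitespace-based keyword index would store for each of the four forms. The sketch below is an assumption about a generic index, not a specific OPAC. `collapsed_key` is a hypothetical matching key that ignores division altogether.

```python
variants = ["Zhong guo hua", "Zhong-guo hua", "Zhongguo hua", "Zhongguohua"]


def keyword_terms(title):
    """Index terms under plain whitespace tokenization, lowercased."""
    return set(title.lower().split())


def collapsed_key(title):
    """Hypothetical matching key: drop spaces and hyphens so division no longer matters."""
    return "".join(ch for ch in title.lower() if ch.isalnum())


for v in variants:
    print(v, sorted(keyword_terms(v)), collapsed_key(v))
```

Each variant yields a different set of keyword terms, so a query typed in one form misses records entered in another. A collapsed key makes the forms match, but only for whole-string comparison, since it discards the word boundaries that keyword searching relies on.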


Parsing Text

What is a word?
• Visual word
• Semantic/syntactic word...

Often based on conventions

Not always consistent (in Google, 4 Aug. 2003)
– earring (461,000) vs. ear ring (18,200)
– shoemaker (465,000) vs. shoe maker (21,100)
– bottleneck (419,000) vs. bottle neck (32,800)
– firefighter (687,000) vs. fire fighter (121,000)
– flowerpot (42,100) vs. flower pot (54,300)


Word Division for Bibliographic Control

1997: LC announces change to Pinyin
• Use monosyllabic or polysyllabic transcription?

Monosyllabic division
• Consistent
• Increases recall
• Lowers precision
• Easier to convert from existing Wade-Giles
• Easier to generate Romanization from a string of Chinese characters

Polysyllabic division
• Difficult to be consistent
• Lack of established standard
• The proper format according to Hanyu pinyin fang’an (the PRC pinyin standard)
• Represents the nature of the language
• Improves readability when browsing
• Improves precision
• More effective in voice recognition / text-to-speech implementations


A Retrieval Experiment

Experiment designed to test the effect of syllable aggregation on retrieval

Part of a doctoral thesis at the University of Toronto


Statement of the Problem

Conversion to pinyin (1st Oct. 2000–1st Oct. 2001)
• No inclusion of tones
• Text division (syllable aggregation)
  – Monosyllabic for common words (e.g., 东西 dong xi)
  – Polysyllabic for proper words (e.g., 上海 Shanghai)

Consequences:
• Two different methods used together
  – Confusing!!!
• Only ~400 index “terms” available for all common words
  – Too few!!!


Statement of the Problem

Conversion from Wade-Giles to Pinyin
• Convert to monosyllabic?
• Convert to polysyllabic?

Potential impact on retrieval / browsing

Measure impact on retrieval
• Effectiveness (success in finding records)
• Efficiency (effort spent to find them, i.e., time)


Research Questions

Determine if using polysyllabic pinyin entries, over monosyllabic pinyin entries, in bibliographic records improves retrieval effectiveness and efficiency in known-item exact-title searches.

Determine if using polysyllabic pinyin entries, over monosyllabic pinyin entries, in bibliographic records improves retrieval effectiveness and efficiency in known-item keywords-in-title searches.


Research Questions

In other words
• What is the effect of aggregation patterns on…
  – Efficiency: Q1 (exact-title), Q3 (keywords)
  – Effectiveness: Q2 (exact-title), Q4 (keywords)

Six variables were defined

Six hypotheses


Research Questions

Definitions
• Exact-title search mode (with implied truncation)
  – Request for “Gone with the wind”
  – QUERY: “gone with the”
• Keyword search mode
  – Request for “Gone with the wind”
  – QUERY: “wind” AND “gone”
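The two search modes can be sketched as two matching functions. This is a simplified illustration under stated assumptions: titles are compared after lowercasing and whitespace normalization, exact-title mode is treated as a prefix match (implied truncation), and keyword mode as a Boolean AND over whole words. The function names and the extra sample titles are hypothetical and are not taken from the experimental system.

```python
def normalize(text):
    """Lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def exact_title_match(query, title):
    """Exact-title mode with implied truncation: the title must start with the query."""
    return normalize(title).startswith(normalize(query))


def keyword_match(terms, title):
    """Keyword mode: every term must occur as a word in the title (Boolean AND)."""
    words = set(normalize(title).split())
    return all(normalize(t) in words for t in terms)


# Sample titles for illustration only.
titles = ["Gone with the wind", "The wind in the willows", "Gone girl"]
print([t for t in titles if exact_title_match("gone with the", t)])
print([t for t in titles if keyword_match(["wind", "gone"], t)])
```

Exact-title mode depends on getting the first words right and in order, while keyword mode tolerates any order and lets the searcher skip words.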


Hypotheses

Effect of using polysyllabic transcription over monosyllabic

[Table: predictions for efficiency and effectiveness, each for phrase and keyword searches; cell values not recoverable]


Methodology

Retrieval task:
• Search 2 lists of 20 titles (in Chinese characters) using:
  – Wade-Giles Romanization (WG)
  – Pinyin-monosyllabic Romanization (mPY)
  – Pinyin-polysyllabic Romanization (pPY)
• Replicate using two search modes:
  – Exact-title searching (phrase matching)
  – Keyword searching
• Measure:
  – Time to complete task (efficiency)
  – Number of items/records found (effectiveness)
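The two raw measures (time and number of records found) can be turned into simple per-trial ratios. The sketch below is an assumption about how such ratios might be defined. The `Trial` record, its field names and the numbers are made up for illustration and are not data from the study.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One participant completing one list of titles (hypothetical log summary)."""
    system: str     # "WG", "mPY" or "pPY"
    mode: str       # "X-title" or "KW"
    seconds: float  # time to complete the task
    titles: int     # titles on the list
    found: int      # records found
    queries: int    # queries submitted


def success_rate(t):
    """Effectiveness: share of the listed titles that were found."""
    return t.found / t.titles


def time_per_item_found(t):
    """Efficiency: seconds spent per record found."""
    return t.seconds / t.found if t.found else float("inf")


def success_rate_per_query(t):
    """Records found per query submitted."""
    return t.found / t.queries if t.queries else 0.0


# Made-up numbers for illustration only; they are not results from the study.
trial = Trial(system="pPY", mode="KW", seconds=900.0, titles=20, found=18, queries=24)
print(success_rate(trial), time_per_item_found(trial), success_rate_per_query(trial))
```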


Methodology: sampling

Purposive sample of 30 students
• Graduate students
• Native speakers of Chinese
• Good working knowledge of Romanization

Each participant was given $20 CAN

30 participants × 2 tasks = 60 trials


Methodology: design and procedures

My main statistical design was a 2 × 3 randomized factorial design with unbalanced proportional data. Participants were replicated over factor A.

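Cell sizes (factor A: search mode; factor B: Romanization)
• X-title: WG 6, mPY 12, pPY 12
• KW: WG 6, mPY 12, pPY 12

Cell means
• X-title: WG µ11, mPY µ12, pPY µ13
• KW: WG µ21, mPY µ22, pPY µ23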


Methodology: apparatus

• 20 titles × 2 lists = 40 titles
• 3 databases of ca. 50K records (RLIN db)
  – WG / mPY / pPY
• Databases running on Microsoft Access
• Interface in HTML format accessed with Web browser
• ASP links interface to database and records transaction logs


Methodology: apparatus

Titles to be searched (with ID-number to be filled in)

1. 颤栗 / 蒋伯潜 ____ - ____

2. 盐山新志:河北省 / 汪美瑞 ____ - ____

3. 生死场 / 顾宝民 ____ - ____

4. 西藏那曲地区土地资源 / 施其明 ____ - ____


Methodology: data collection

Transaction Log Analysis (TLA)

[Diagram: components of TLA: end-user, interface, database, logging program, logs]


Methodology: data collection

Interaction with external software components

[Diagram: Internet / WWW, Internet Information Server with ASP scripts and HTML files on Win NT, ADO, ODBC, SQL Server]


Methodology: data collection

[Diagram: the TLA components (end-user, interface, database, logging program, logs) placed alongside the software components (Internet / WWW, Internet Information Server, ASP scripts, HTML files, Win NT, ADO, ODBC, SQL Server)]


Methodology: data collection

Generated logs


Statistical analysis

[Table: pairwise comparisons (WG / mPY, WG / pPY, mPY / pPY) for exact-title and keyword searches. Efficiency measures: completion time, time per item found, expected search length, number of queries. Effectiveness measures: success rate, success rate per query. Cell values not recoverable.]


Results
• Aggregation improves search efficiency for title searches
• Keyword search is especially influenced by aggregation
• Keyword search is especially important for Chinese titles, since it is not unusual that the pronunciation of one of the first characters in the title is unknown

• Efficiency: Q1 (x-title), Q3 (keywords)
• Effectiveness: Q2 (x-title), Q4 (keywords)


Results

Using mono- and polysyllabic aggregation concurrently is a great source of confusion to end-users

Retrieval with Romanization works relatively well

Success rates for known-item searches vary between 72% and 91% depending on the Romanization system used


Results

However, for a non-negligible proportion of end users, Romanization-based retrieval poses real problems

Around 25% of the participants made between 50 and 80 Romanization errors during the retrieval task

[Histogram: number of participants (axis 0 to 7) by number of errors, in bins of five from 0–4 to 80–84]


Results

Cause of errors

Aggregation
• A1: Two unlinked units were linked (e.g., dong xi / dongxi)
• A2: One linked unit was unlinked (e.g., Shanghai / Shang hai)

Romanization
• R1: Character was misread (e.g., 粟 su / 栗 li)
• R2: Romanization was misspelled (e.g., 林 was written ling instead of lin)


Results

Aggregation and Romanization errors

[Bar chart: aggregation errors and Romanization errors (average per query) for WG, mPY and pPY]


Results

Wade-Giles notation, more “forgiving”
• Wade-Giles chen / ch’en ↔ Pinyin zhen / chen
• Wade-Giles chi / ch’i ↔ Pinyin ji / qi
• Wade-Giles chu / ch’u / chü / ch’ü ↔ Pinyin zhu / chu / ju / qu
• …
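One reading of “forgiving” is that Wade-Giles syllables in the same row differ only by an apostrophe or an umlaut, so a searcher who drops or confuses them is still close to the target, whereas the pinyin equivalents are spelled with different initial letters. The sketch below uses the standard Wade-Giles to pinyin correspondences for the syllables shown; `loose_key` is a hypothetical helper that ignores the apostrophe and the umlaut.

```python
# Standard Wade-Giles to pinyin correspondences for the syllables on the slide.
WG_TO_PINYIN = {
    "chen": "zhen", "ch'en": "chen",
    "chi": "ji", "ch'i": "qi",
    "chu": "zhu", "ch'u": "chu",
    "chü": "ju", "ch'ü": "qu",
}


def loose_key(wg):
    """Hypothetical 'forgiving' key: ignore the aspiration apostrophe and the umlaut."""
    return wg.replace("'", "").replace("’", "").replace("ü", "u")


groups = {}
for wg, py in WG_TO_PINYIN.items():
    groups.setdefault(loose_key(wg), []).append(py)
print(groups)
```

Under the loose key, eight Wade-Giles syllables fall into three groups, while the corresponding pinyin forms remain eight different strings.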


Results

Analysis of Romanization errors
• R1: character misread (e.g., 粟 su / 栗 li)
• R2: Romanization misspelled
  – chen / cheng (dental nasal vs. velar nasal)
  – cu / zu (voiced fricatives vs. unvoiced fricatives)
  – hu / fu (glottal fricative vs. labiodental fricative)
  – la / na (alveolar lateral vs. alveolar nasal)


Results

Phonetic confusion

[Bar chart: phonetic confusion errors (average per query, axis 0 to 0.2) for fricatives, nasals and other]


Conclusions

1. In KW mode, polysyllabic entries help improve precision

2. More aggregation errors in polysyllabic, but overall not overwhelming

3. Dual aggregation format is confusing to end-users (important source of error)

4. Still relatively high proportion of errors caused by confusion in Romanization


Further research

Project #1: Using vernacular script for retrieval
• What model???

[Diagram: query, data and input options: Romanization, other input method, 汉字]


Further research

Project #2: Using XML to encode non-Roman bibliographic data

• What is the viability of XML as a conversion format for bibliographic records containing non-Roman data?

• How can we use existing conversion schema, for instance those developed at LC?

• Does XML offer the required flexibility for publishing non-Roman on the Web with enhanced retrieval capabilities?



Further research

Benefits
• Integration of resources created under a decentralized environment
• Creation of specialized retrieval tools adapted to the specific nature of the data
• Increased visibility for resources in non-Roman alphabets


Questions

Thank you!