Faculté des Lettres, Département de Linguistique
FipsRomanian: Towards a Romanian Version of the Fips Syntactic Parser
Violeta Seretan, Eric Wehrli, Luka Nerima, Gabriela Soare
LATL – Language Technology Laboratory
{violeta.seretan, eric.wehrli, luka.nerima, [email protected]}
Lexicon construction
• list of headwords (DEX, 1998)
• morphological generation: given a base word form, generates all its forms according to the appropriate inflection paradigm
• manual and semi-automatic insertion
• manual insertion for verbs (specific information: subcategorization, selectional features, thematic function, …)
• Current status:
– simple entries: 60K lexemes / 380K words (10K proper nouns)
– complex entries: multi-word expressions (compounds and collocations): de jur împrejurul “around”; problemă – a se pune “problem – to arise”
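The morphological generation step can be pictured with a minimal sketch. It assumes a paradigm is a table mapping feature bundles to endings; the paradigm name `fem_a`, the feature labels, and the function `generate_forms` are hypothetical and are not taken from the Fips lexicon tools. The single paradigm shown covers feminine nouns of the casă type only.

```python
# Illustrative paradigm table for feminine nouns of the "casă" type.
# The paradigm name and the feature labels are assumptions made for this sketch.
PARADIGMS = {
    "fem_a": {
        ("sg", "nom-acc", "indef"): "ă",
        ("sg", "gen-dat", "indef"): "e",
        ("sg", "nom-acc", "def"): "a",
        ("sg", "gen-dat", "def"): "ei",
        ("pl", "nom-acc", "indef"): "e",
        ("pl", "gen-dat", "indef"): "e",
        ("pl", "nom-acc", "def"): "ele",
        ("pl", "gen-dat", "def"): "elor",
    }
}


def generate_forms(base, paradigm, base_ending="ă"):
    """Return a {features: word form} table for a base form and a paradigm name."""
    if not base.endswith(base_ending):
        raise ValueError(f"{base!r} does not end in {base_ending!r}")
    stem = base[: -len(base_ending)]  # "casă" -> "cas"
    return {feats: stem + ending for feats, ending in PARADIGMS[paradigm].items()}


forms = generate_forms("casă", "fem_a")
for feats, form in forms.items():
    print(feats, form)
```

A real lexicon would need many such paradigms plus stem alternations, which this sketch leaves out.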
Grammar implementation
• Specifications (Soare, 2005)
• Customisation of FipsRomanian grammar for standard operations (syntactic transformations: relativization, interrogation, passivization, ...)
• Similarities and differences. Examples:
– clitic system
– wh-fronting
• Attachment rules: constraints on the main parser operation, Merge, which combines two adjacent structures into a larger structure
• Current status: about 100 rules specified; nearly half implemented and tested
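The role of attachment rules as constraints on Merge can be sketched as follows. This is an illustration under stated assumptions, not the Fips implementation (which is written in Component Pascal): the `Node` type, the `ATTACHMENT_RULES` table, and the category labels are hypothetical, and real rules would also check lexical and agreement features.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    category: str         # toy category label, e.g. "DP" or "VP"
    start: int            # index of the first token covered
    end: int              # index just past the last token covered
    children: tuple = ()


# Hypothetical attachment rules: (left category, right category) -> result category.
ATTACHMENT_RULES = {
    ("V", "DP"): "VP",    # a verb takes a DP to its right
    ("DP", "VP"): "TP",   # a DP attaches to the left of a verb phrase
}


def merge(left: Node, right: Node) -> Optional[Node]:
    """Combine two adjacent structures if some attachment rule licenses it."""
    if left.end != right.start:        # Merge only applies to adjacent structures
        return None
    result = ATTACHMENT_RULES.get((left.category, right.category))
    if result is None:                 # no rule allows this attachment
        return None
    return Node(result, left.start, right.end, (left, right))


vp = merge(Node("V", 1, 2), Node("DP", 2, 3))
tp = merge(Node("DP", 0, 1), vp)
blocked = merge(Node("DP", 0, 1), Node("DP", 2, 3))  # not adjacent
print(tp.category, tp.start, tp.end, blocked)
```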
Extending Fips to Romanian: two main tasks
Romanian language
Sample text
http://wt.jrc.it/lt/Acquis/
This Regulation shall enter into force on the twentieth day following that of its publication in the Official Journal of the European Union.
Prezentul regulament intră în vigoare în a douăzecea zi de la publicarea în Jurnalul Oficial al Uniunii Europene.
Orthography
• phonemic; Latin alphabet (since 1859)
• Diacritics: ă/ə, â/ɨ, î/ɨ; cedilla: ş/ʃ, ţ/ʦ
Morphology
• Case system inherited from Latin: nominative-accusative, genitive-dative, vocative
• Three grammatical genders: masculine, feminine, neuter
• Rich declension of determiners, nouns, adjectives, and verbs, e.g., about 35 forms for a verb
• The definite article is enclitic, i.e., suffixed to nouns and adjectives:
casă/house – casa/house-the
mare/big – marea/big-the
Vocabulary
• Latin origin (fundamental vocabulary)
• Slavic origin
• Neologisms: French, Italian, …
• Loanwords: Turkish, Greek, Hungarian, Albanian, ...
Syntax
• VSO language, relatively free word order
Europe - Romance languages
References
Bresnan, J. 2001. Lexical Functional Syntax. Blackwell, Oxford.
Chomsky, N. 1995. The Minimalist Program. MIT Press, Cambridge, Mass.
Călăcean, M. and J. Nivre. 2009. A data-driven dependency parser for Romanian. In Proceedings of the 7th International Workshop on Treebanks and Linguistic Theories (TLT 7), pages 65–76, Groningen, Holland.
1998. DEX – Dicţionarul explicativ al limbii române. Academia Română, Bucharest.
Seretan, V. 2008. Collocation extraction based on syntactic parsing. Ph.D. thesis, University of Geneva.
Soare, G. 2005. Romanian syntax. Technical report, University of Geneva.
Wehrli, E. 2007. Fips, a “deep” linguistic multilingual parser. In ACL 2007 Workshop on Deep Linguistic Processing, pages 120–127, Prague, Czech Republic.
Related work & Useful resources
Task-based evaluation
• Collocation extraction from parsed data (Seretan, 2008)
• Collocations are half idioms (of encoding, but not of decoding)
• Used by parser and in-house rule-based machine translation system
• Precision for top 2000 results: 30.3% (Precision for French data: 65.9%, top 500 results)
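The precision figures above are computed over the top-ranked extraction results. A minimal sketch of such a precision-at-k measure follows; the function name, the candidate list, and the gold annotations are hypothetical toy data, not the evaluation set behind the reported 30.3% and 65.9%.

```python
def precision_at_k(ranked_candidates, gold, k):
    """Share of the top-k ranked candidates that are annotated as true collocations."""
    top = ranked_candidates[:k]
    if not top:
        raise ValueError("no candidates to evaluate")
    hits = sum(1 for candidate in top if candidate in gold)
    return hits / len(top)


# Toy ranking: the first pair is the poster's own example, the rest are placeholders.
ranked = [
    ("problemă", "a se pune"),
    ("candidate-2-noun", "candidate-2-verb"),
    ("candidate-3-noun", "candidate-3-verb"),
    ("candidate-4-noun", "candidate-4-verb"),
]
# Hypothetical manual annotation: which candidates count as collocations.
gold = {("problemă", "a se pune"), ("candidate-3-noun", "candidate-3-verb")}

p_at_4 = precision_at_k(ranked, gold, k=4)
print(f"precision@4 = {p_at_4:.1%}")
```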
Parsing experiment
• data: journalistic texts, 1.05M words
• average sentence length: 26.9 tokens
• 16.2% full parses (FipsFrench, FipsEnglish: about 80%)
• average partial parse length: 5.3 tokens
• unknown words: 6.5% (of which 39.2% proper nouns)
• satisfactory lexical coverage
• grammatical coverage needs to be improved (work in progress!)
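To make the unknown-word figures concrete, the percentages can be turned into approximate counts. This is back-of-the-envelope arithmetic on the rounded numbers reported above, and it assumes the 6.5% rate is measured against the 1.05M-word corpus size; the variable names are illustrative.

```python
corpus_words = 1_050_000     # 1.05M words (rounded figure from the experiment)
unknown_rate = 0.065         # 6.5% unknown words
proper_noun_share = 0.392    # 39.2% of the unknown words are proper nouns

unknown_words = round(corpus_words * unknown_rate)
unknown_proper_nouns = round(unknown_words * proper_noun_share)
unknown_other = unknown_words - unknown_proper_nouns

print(unknown_words)         # about 68,000 unknown words
print(unknown_proper_nouns)  # about 27,000 of them proper nouns
print(unknown_other)         # about 41,000 other unknown words
```

Under that assumption, unknown proper nouns make up roughly 2.5% of the corpus (0.065 × 0.392 ≈ 0.025), so a little over 60% of the unknown words would have to be covered by extending the common-word lexicon.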
Preliminary results
FipsRomanian: Sample results
Screen captures: sample collocations extracted; lexicon interface; Fips interface (POS-tagging output; parsing output with subject, predicate, and direct object labels)
Fips: a multilingual parsing architecture (Wehrli, 2007)
Output
• Rich sentence representation:
– constituent structure
– predicate-argument table
– co-indexation chains
– intra-sentential pronoun resolution
Underlying theory
• Generative Grammar (Chomsky, 1995)
Similarities:
• Simpler Syntax (Culicover and Jackendoff, 2005)
• Lexical Functional Grammar (Bresnan, 2001)
Implementation
• Left-to-right, bottom-up tabular parsing algorithm, relying on detailed lexical information
• Language-independent core + language-specific implementation
• Component Pascal, OOP paradigm, BlackBox IDE
• Supported languages: French, English, German, Spanish, Italian, Greek; others in progress
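The general idea of a left-to-right, bottom-up tabular algorithm can be sketched with a small chart parser. This is a generic textbook-style illustration written in Python, not the Fips algorithm itself; the toy `LEXICON`, the binary `RULES`, and the category labels are assumptions made for the example.

```python
from collections import defaultdict

# Toy lexicon and binary combination rules (hypothetical, English for readability).
LEXICON = {"the": {"D"}, "parser": {"N"}, "text": {"N"}, "reads": {"V"}}
RULES = {("D", "N"): "DP", ("V", "DP"): "VP", ("DP", "VP"): "TP"}


def parse(tokens):
    """Fill a chart mapping (start, end) spans to the categories found for them."""
    chart = defaultdict(set)
    for end in range(1, len(tokens) + 1):          # read the input left to right
        # Bottom-up step 1: look the new word up in the lexicon.
        chart[(end - 1, end)] |= LEXICON.get(tokens[end - 1], set())
        # Bottom-up step 2: build every larger span ending at this word,
        # shortest first, from pairs of adjacent spans already in the chart.
        for start in range(end - 2, -1, -1):
            for mid in range(start + 1, end):
                for left in chart[(start, mid)]:
                    for right in chart[(mid, end)]:
                        if (left, right) in RULES:
                            chart[(start, end)].add(RULES[(left, right)])
    return chart


tokens = "the parser reads the text".split()
chart = parse(tokens)
print(chart[(0, len(tokens))])   # categories spanning the whole sentence
```

When no category spans the whole input, the chart still holds analyses of smaller spans, which is the situation reported as partial parses in the parsing experiment.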
Sample parse tree produced by Fips
• Data-driven dependency parser for Romanian based on MaltParser; it learns dependencies from manual annotations (Călăcean and Nivre, 2009). Problem: reduced treebank size and grammatical coverage (simple structures, no subordination, average sentence length only 9 words).
• Sketch Engine for Romanian: shallow parsing (POS patterns), http://www.sketchengine.co.uk/
• Dependency treebank construction, work in progress at the University of Iaşi, Romania
• Text processing webservices, RACAI – Research Institute for Artificial Intelligence, Romanian Academy, Bucharest, Romania. http://www.racai.ro/webservices/TextProcessing.aspx
• A repository of tools for Romanian: ConsILR - Consortium for the Romanian Language: Resources & Tools, research groups from Iaşi, Bucharest and Chişinău http://consilr.info.uaic.ro/