1. WP5 Tasks & Deliverables
2. Overview of parallel technology tools
3. Parallel corpora requirements
4. Survey of language resources
5. Work plan for t4-t14
6. Questions
Input: parallel corpora produced in WP4
Output: language resources for MT in WP7/WP8
WP5.1 Sub-sentential alignment (DCU, ELDA, ILSP)
WP5.2 Bilingual dictionary extraction (DCU, ILSP)
• D5.1 (t06): Report describing the inventory of parallel technology
tools to be developed and integrated in PANACEA and the
characteristics of the resources to be produced.
• D5.2 (t14) Aligners integrated into the platform, and documentation
• D5.3 (t22) Parallel, sententially aligned texts, cleaned and prepared
for training/building translational models (20-50 million words)
• D5.4 (t30) Final version of the Bilingual Dictionary Extractor
• D5.5 (t30) Sample of bilingual dictionaries produced: EN-FR and
• D5.6 (t30) Final version of the integrated Transfer Rules module, and
• D5.7 (t30) Sample of transfer rules produced for EN-DE.
• Bilingual dictionary extraction (WP5.2)
Align bilingual corpus (existing or output from WP4)
– Sentence
– Word
– Chunk / Syntactic
• GIZA++, BerkeleyAligner
• word packing (“compound rich” languages,
• Marker hypothesis: Marclator
• Syntactic: TreeAligner
– Integrate models: generative, syntactic,
– Extend range of language pairs
– Tune to text type, domain and genre
– Check/filter corpora acquired (comparability
– Baseline: phrase alignment in Moses
– Extrinsic evaluation (SMT in WP7)
Task: to derive bilingual dictionaries from an aligned parallel corpus
Methodology
– Expectation-Maximisation algorithm
– Additional techniques on top of word correspondences → precision, fine-cleaning → reduce human intervention
– Go beyond word level: MW translations (NPs, MWEs)
– Baseline: word alignment in Moses
– Evaluation?
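The Expectation-Maximisation step named above can be illustrated with a minimal sketch of IBM Model 1, the simplest word-based translation model trained with EM. This is only an illustration under stated assumptions, not the project's extractor: the function names `train_ibm1` and `extract_dictionary`, the toy EN-DE corpus, the 0.5 threshold, and the omission of the NULL word are all choices made here for brevity.

```python
def train_ibm1(pairs, iterations=20):
    """Estimate lexical translation probabilities t(f|e) with EM.

    pairs: list of (source_tokens, target_tokens).
    Simplification (assumption): no NULL source word is used.
    """
    # Only word pairs that co-occur in some sentence pair can get probability mass.
    t = {(f, e): 1.0 for src, tgt in pairs for e in src for f in tgt}
    for _ in range(iterations):
        count = {key: 0.0 for key in t}
        total = {}
        # E-step: distribute each target word over the source words of its sentence.
        for src, tgt in pairs:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] = total.get(e, 0.0) + frac
        # M-step: renormalise the expected counts per source word.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


def extract_dictionary(t, threshold=0.5):
    """Keep, for each source word, its most probable translation above a threshold."""
    best = {}
    for (f, e), p in t.items():
        if p >= threshold and p > best.get(e, ("", 0.0))[1]:
            best[e] = (f, p)
    return {e: f for e, (f, _) in best.items()}


# Toy sentence-aligned corpus (hypothetical data).
corpus = [
    ("the house".split(), "das Haus".split()),
    ("the book".split(), "das Buch".split()),
    ("a book".split(), "ein Buch".split()),
]
dictionary = extract_dictionary(train_ibm1(corpus))
```

On this toy corpus each English word ends up paired with its German counterpart. The "additional techniques" bullet corresponds to what the threshold only hints at: filtering and cleaning the raw word correspondences so that less human checking is needed.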
• Find criteria for lexical transfer selection
• structural transfer (Probst, Sánchez-Martínez, et al.)
– (matching of POS sequences, independent of lexical material)
• bilingual term extraction (Cabré 2001, Gamallo 2007)
– structural transfer
– lexical transfer
• simple lexical
• contextual lexical <- this is the task!
conditions for transfer selection
– with domain / subject area information („MEDICAL“)
– with locale / variant („EN_UK“ „DE_CH“)
– use information on local nodes (gender, number)
– use structural contexts (arguments, prepositions, subcategorisation frames & fillers) (main means of RMT)
– use conceptual environment for disambiguation
• using word sense disambiguation, statistical word alignment
• supervised learning of most important disambiguation
1. domain tag assignment
2. morphosyntactic tests
• local features on gender / number
• subcategorisation: Prepositions (for nouns and verbs)
• presence / absence of verb arguments (trans./intrans.)
• (relational Adj <-> compound specifier)
• source language concept clusters (SMT uses target
– Selection of disambiguation candidates (N, V, A)
– Creation of parallel corpora
– Creation of subcorpora for each translation
1. domain tags: do subcorpora differ in domain?
2. morphosyntactic:
• gender: do they differ in gender? in number?
• arguments: do they differ in transitivity? in subcategorised prepositions?
3. conceptual: Can different SL concept clusters be built to
• Verification with additional candidates or data
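The subcorpus step above can be sketched as follows: for one disambiguation candidate, sentence pairs are grouped by the translation that appears on the target side, and the domain tags of each group are then compared. This is a rough sketch under several assumptions: plain token matching stands in for statistical word alignment, and the names `build_subcorpora` and `domain_profile`, the `domain` field, and the toy EN-DE data are hypothetical.

```python
from collections import Counter, defaultdict


def build_subcorpora(pairs, candidate, translations):
    """Group sentence pairs containing `candidate` by the translation used.

    Assumption: whitespace tokenisation and exact token match replace
    proper word alignment; ambiguous pairs (0 or >1 hits) are skipped.
    """
    subcorpora = defaultdict(list)
    for pair in pairs:
        if candidate not in pair["src"].split():
            continue
        target_tokens = set(pair["tgt"].split())
        hits = [tr for tr in translations if tr in target_tokens]
        if len(hits) == 1:
            subcorpora[hits[0]].append(pair)
    return subcorpora


def domain_profile(subcorpora):
    """Count domain tags per translation to see whether subcorpora differ in domain."""
    return {tr: Counter(p["domain"] for p in ps) for tr, ps in subcorpora.items()}


# Toy data (hypothetical): EN "bank" translated as DE "Bank" or "Ufer".
pairs = [
    {"src": "the bank raised interest rates",
     "tgt": "die Bank erhöhte die Zinsen", "domain": "FINANCE"},
    {"src": "we walked along the river bank",
     "tgt": "wir gingen am Ufer entlang", "domain": "GEOGRAPHY"},
    {"src": "the bank approved the loan",
     "tgt": "die Bank genehmigte den Kredit", "domain": "FINANCE"},
    {"src": "the loan was approved",
     "tgt": "der Kredit wurde genehmigt", "domain": "FINANCE"},
]
subcorpora = build_subcorpora(pairs, "bank", ["Bank", "Ufer"])
profile = domain_profile(subcorpora)
```

If the profiles differ clearly between translations, as they do in this toy case, a domain tag is a usable condition for transfer selection. The morphosyntactic and conceptual tests would need annotated data and are not covered by this sketch.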
– Sentence Segmentiser, Tokeniser, Dictionary Lookup
• Parser to extract annotated subtrees
• Tree matching component
• target-sensitive word sense disambiguation
– similar for the target side …) (if time permits)
Quality:
– truly parallel (not comparable) corpora aligned at sentence level
– translation quality of aligned sentence pairs is essential for MT output
Linguistic pre-processing:
– tokenized plain text (plain PB-SMT)
– POS tagging, lemmatization (factored PB-SMT, EBMT)
– constituency and dependency parsing (syntax-motivated PB-SMT)
– for a baseline system: at least 1M sentence pairs (~20M words)
– for domain adaptation: 20K-200K sentence pairs (~400K-4M words)
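The size figures above imply roughly 20 words per sentence (20M words / 1M sentence pairs). A small sketch of how a candidate corpus could be checked against these thresholds follows. The thresholds come from the slide, but the function names, the "too small" label, and the decision to count source-side words only are assumptions made for illustration.

```python
BASELINE_MIN_PAIRS = 1_000_000    # "at least 1M sentence pairs"
ADAPTATION_MIN_PAIRS = 20_000     # lower end of "20K-200K sentence pairs"


def corpus_stats(pairs):
    """Return (sentence pairs, source-side words), using whitespace tokenisation."""
    n_pairs = 0
    n_words = 0
    for src, _tgt in pairs:
        n_pairs += 1
        n_words += len(src.split())
    return n_pairs, n_words


def intended_use(n_pairs):
    """Map a corpus size to the use it is large enough for."""
    if n_pairs >= BASELINE_MIN_PAIRS:
        return "baseline"
    if n_pairs >= ADAPTATION_MIN_PAIRS:
        return "domain adaptation"
    return "too small"


# Toy corpus (hypothetical data).
toy = [("the house is small", "das Haus ist klein"), ("a book", "ein Buch")]
stats = corpus_stats(toy)
words_per_sentence = 20_000_000 / 1_000_000
```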
EuroParl * JRC Acquis * News Commentary United Nations English-French OpenSubtitles
- numbers in millions of words from English to the target language
- in corpora denoted by * all language pairs available
News (WMT) Gigaword ILSP EL corpus
- numbers in millions of words
- monolingual parts of the parallel corpora also available
• A number of standard monolingual and parallel corpora available
for all language pairs, of sufficient size & quality
• Parliamentary proceedings and debates can be considered
• Monolingual web-crawled corpora available for English, French,
German, Italian (WaCky) – unspecified domain
• No web-crawled parallel data available at all (Resnik's Strand is
only a list of URLs, but quite outdated) – no fallback strategy
• EuroParl for baseline systems
– parliamentary proceedings and debates
– quite general domain suitable for adaptation
• Evaluation data to be selected as a subset from
web-crawled in-domain data (including 500-2000 sentence pairs for test set and dev test set)
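One way to carve the test set and dev test set out of the in-domain data is a seeded random split after removing exact duplicates. The sketch below assumes 1000 pairs per set (within the 500-2000 range stated above); the function name `split_eval_data`, the seed, and the synthetic data are hypothetical.

```python
import random


def split_eval_data(pairs, n_test=1000, n_dev=1000, seed=0):
    """Split sentence pairs into (test, dev, remainder) without overlap.

    Exact duplicate pairs are dropped first so that the same pair cannot
    end up in both an evaluation set and the remaining data.
    """
    unique = list(dict.fromkeys(pairs))  # keeps first occurrence, preserves order
    if n_test + n_dev > len(unique):
        raise ValueError("not enough sentence pairs for the requested split")
    random.Random(seed).shuffle(unique)
    test = unique[:n_test]
    dev = unique[n_test:n_test + n_dev]
    remainder = unique[n_test + n_dev:]
    return test, dev, remainder


# Synthetic in-domain data (hypothetical), including one exact duplicate.
in_domain = [(f"source sentence {i}", f"target sentence {i}") for i in range(5000)]
in_domain.append(in_domain[0])
test_set, dev_set, remainder = split_eval_data(in_domain)
```

A fixed seed keeps the split reproducible across partners. Near-duplicate detection, which web-crawled data usually needs as well, is outside this sketch.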
• Focus on translation from English to other languages
Official deadlines:
– t6 Report on parallel technology tools (D5.1)
– t14 Aligners integrated in the platform (D5.2)
Internal deadlines:
– t6 decision on MT language pairs and domains
– t12 resources to be included in the first evaluation produced (D4.3)
Assumption: general and in-domain monolingual and
Possible approaches:
– one system built from a mixture of the data
– two systems and a domain classifier (for sentences)
– two systems and system combination based on their n-
• Distribution of webservices across partners?
• Software requirements for webservices?
• Hardware specifications (no HW budget)?
• Example webservice wrapper?
• Rich text format support?
• Duplicate document/sentence detection?
• Distribution of webservices?
– TPC tools for one language on one site?
• MT tools integrated into the platform?
– alignment OK
– language modelling?
– phrase table extraction?
– Decoding?
– tuning?
• Only extrinsic automatic evaluation feasible
• Only extrinsic (MT) evaluation feasible