Speaker: Prof. Raphael Salkie, University of Brighton
Bilingual corpora, which pair texts in one language with their translations into another, are excellent research resources in a number of fields: linguistics, language engineering, machine translation, translation theory, bilingual lexicography, etc. Currently few such corpora are available, and they have only been exploited in limited ways. There are many reasons for this, and this paper addresses some of them by looking at one corpus in particular.
The Brighton INTERSECT corpus contains about 1.5 million words in French and English (along with about 800,000 words in German and English). It is manually sentence-aligned and covers a wide variety of text genres (fiction, newspapers, science, business, UN and EU documents, instructions, etc.).
For some types of research the corpus is useable, but it is inadequate in many ways:
- No lemmatisation
- Not TEI-conformant
- Some texts are not modern (i.e. not post-1945)
- Copyright issues
- Adding new texts is very labour-intensive
To enhance the corpus and solve these problems, we need to reduce the amount of manual labour involved. I argue firstly that we need to analyse the needs of users more carefully; secondly, that the central change should be paragraph-alignment; thirdly, that we need a few simple software tools that are not currently available. The result would be a huge improvement to the corpus, and the tools would also make it easier for other researchers to work with their own parallel texts.
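As an illustration only, the kind of simple tool argued for here might look like the sketch below. It is not the INTERSECT software or the speaker's method: the function names, the blank-line paragraph convention, and the `max_ratio` threshold are all assumptions made for this example. It pairs paragraphs of a source text and its translation by position, and flags pairs whose lengths differ so much that a human should check them.

```python
import re


def split_paragraphs(text):
    """Split a text into paragraphs, assuming blank lines separate them."""
    parts = re.split(r"\n\s*\n", text.strip())
    # Collapse internal line breaks and runs of whitespace in each paragraph.
    return [" ".join(p.split()) for p in parts if p.strip()]


def align_paragraphs(source, target, max_ratio=2.0):
    """Pair paragraphs by position (hypothetical helper, not a real library API).

    Returns a list of (source_paragraph, target_paragraph, plausible) tuples,
    where `plausible` is False when one paragraph is more than `max_ratio`
    times longer (in characters) than the other. The threshold is an
    arbitrary assumption, not an empirically tuned value.
    """
    src = split_paragraphs(source)
    tgt = split_paragraphs(target)
    if len(src) != len(tgt):
        # Positional pairing only makes sense when the counts match.
        raise ValueError(f"paragraph counts differ: {len(src)} vs {len(tgt)}")
    pairs = []
    for s, t in zip(src, tgt):
        ratio = max(len(s), len(t)) / max(1, min(len(s), len(t)))
        pairs.append((s, t, ratio <= max_ratio))
    return pairs


if __name__ == "__main__":
    english = "The cat sleeps.\n\nIt is raining today."
    french = "Le chat dort.\n\nIl pleut aujourd'hui."
    for en, fr, plausible in align_paragraphs(english, french):
        print(f"{en} | {fr} | plausible={plausible}")
```

Pairing by position is the crudest possible strategy. It works only when the translator kept the paragraph structure, which is why the sketch refuses to align texts with differing paragraph counts instead of guessing.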