Linguistic Resources (Level 1)
This course is jointly organised by Lars Borin (Göteborg University, http://svenska.gu.se/~svelb/) and Daniel Hardt (Copenhagen Business School, http://www.id.cbs.dk/~dh/).
The course will start in Göteborg during the week beginning 13 September 2004, followed by one meeting in Copenhagen on 5 November and one in Stockholm in January 2005.
Purpose
The purpose of this course is to provide a research-oriented introduction to linguistic resources, their uses, and their growing importance in the field of language technology.
Overview
The focus of the course will be on linguistic data resources, while linguistic algorithmic resources (a.k.a. tools) will be treated only incidentally, as needed to elucidate some aspect of the data resources. Thus delimited, linguistic resources basically come in three flavors:
- corpus resources (text corpora of written or spoken language, speech databases, digitized video, etc.);
- lexical resources;
- grammatical resources (these are the most difficult to treat separately from the tools for using them).
These resources are further, and orthogonally, defined by their
- modality: written language, spoken language, speech, sign language, multimodal;
- text/language type, genre, sublanguage, etc.;
- language(s): monolingual, bilingual, multilingual.
With respect to these resources, important general issues, of both theoretical and practical interest, are
- the purpose(s) for which the resource was created;
- (linguistic) annotation types and annotation schemes;
- storage, interchange and metadata formats, standardization, and general tools for working with the resources ('toolkits', 'workbenches');
- creation, acquisition, distribution and reuse of resources (including intellectual property issues).
Content
Topics to be covered will include:
Lexical Resources
- The Danish STO project
Text Corpora and Markup
- Tokenizing
- Part-of-Speech Tagging
- Parsing
Syntax Treebanks
- Danish Treebank
- Penn Treebank
Discourse Treebanks
- (RST) Discourse Treebank
- Penn Discourse Treebank
Parallel Corpora
- Building Parallel Corpora
- Using Parallel Corpora for MT
Spoken Language Corpora/Speech Technology