Overview
Current methods in corpus linguistics
Méthodes actuelles en linguistique de corpus
Bilingual seminar of the PhD school CUSO 2018 (Computer Science and Linguistics)
Speakers
Organized by
Topics
- Monday, Oct 22nd
- 13:30 - 15:30 Introduction to R for linguists: Part I (J. Savoy)
- 16:00 - 17:30 Introduction to R for linguists with exercises: Part II (J. Savoy)
- 17:30 - 18:00 PhD student presentations
- Tuesday, Oct 23rd
- 9:00 - 12:30 Le sens des mots [The meaning of words] (D. Labbé)
- 14:00 - 17:30 Classification problems in the analysis of textual data (A. Tuzzi)
- 18:00 - 18:30 PhD student presentations
- Wednesday, Oct 24th
- 9:00 - 12:30 Multivariate statistics in corpus-linguistic analyses (G. Desagulier)
Content
This seminar is bilingual: speakers present in either English or French.
This doctoral school aims to explore in depth recent methodologies
in tool-assisted corpus linguistics,
and to illustrate how these methodologies apply to research questions in syntax,
semantics, pragmatics and discourse analysis.
Le sens des mots [The meaning of words] (D. Labbé)
Language is a system of systems (a "structure") that lets its
users communicate by giving roughly the same meaning to the words
they use. How can these meanings be reconstructed from
large collections of electronic texts (corpora)? The talk presents
the lexicometric tools that offer interesting answers to
this question. After recalling how the lexicon of a language
is organized, and how it can be approached through large corpora,
the talk shows how to reconstruct, by computation, the lexical universes
that structure the vocabulary of these texts. The talk will use the
"Discours politique (français)" corpora, available online from the Centre de
Linguistique de Corpus of the Université de Neuchâtel.
Classification problems in the analysis of textual data (A. Tuzzi)
Clustering methods perform unsupervised classification: they partition a set of objects into a limited
number of clusters, grouping similar objects together and separating dissimilar objects into distinct
groups. These methods play an important role in the exploration of text corpora, where they help
arrange texts, words or other linguistic features into consistent groups.
The lecture addresses the clustering of texts and words within the specific
frame of bag-of-words approaches, which focus mainly on the lexical level and rely essentially on
word counts. Examples of text clustering and word clustering that have been used to answer
specific research questions will be presented.
In the case of text clustering, the lecture will discuss which measures
of similarity (or dissimilarity) should be adopted and which (and how many) words should be considered.
In the case of word clustering, the lecture will deal with diachronic corpora and the problem of
identifying words that portray similar temporal patterns in terms of occurrences.
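To make the text-clustering setting above concrete, here is a minimal Python sketch, not the lecturer's own method. It assumes whitespace tokenization, cosine dissimilarity between word-count profiles, and average-linkage agglomerative clustering. These are only one possible combination of the choices the lecture discusses. The function names and the four toy texts are invented for illustration.

```python
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Lower-case the text, split on whitespace, and count each word form."""
    return Counter(text.lower().split())


def cosine_dissimilarity(a, b):
    """Return 1 - cosine similarity between two word-count profiles."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty profile shares nothing with any other text
    return 1.0 - dot / (norm_a * norm_b)


def average_linkage(profiles, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[i] for i in range(len(profiles))]
    # Pairwise dissimilarities between texts, computed once.
    dist = {
        (i, j): cosine_dissimilarity(profiles[i], profiles[j])
        for i in range(len(profiles))
        for j in range(i + 1, len(profiles))
    }

    def between(c1, c2):
        # Average dissimilarity over all pairs of texts drawn from c1 and c2.
        pairs = [(min(i, j), max(i, j)) for i in c1 for j in c2]
        return sum(dist[p] for p in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        x, y = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda xy: between(clusters[xy[0]], clusters[xy[1]]),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe because y > x
    return [sorted(c) for c in clusters]


# Invented toy texts, for illustration only.
texts = [
    "the government raised taxes and the budget",
    "the budget and taxes divide the government",
    "the team won the match and the cup",
    "the cup match thrilled the team",
]
profiles = [bag_of_words(t) for t in texts]
clusters = average_linkage(profiles, n_clusters=2)
print(clusters)  # lists of text indices, one list per cluster
```

Changing the dissimilarity measure or restricting the vocabulary (for example, dropping very frequent function words such as "the") can change the resulting groups, which is why these choices matter in practice.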
Multivariate statistics in corpus-linguistic analyses (G. Desagulier)
When corpus observations are described by several variables, the data are collected in multidimensional tables.
I present four methods that are designed to explore and summarize such tables by means of summary statistics:
correspondence analysis, multiple correspondence analysis, principal component analysis, and
t-distributed Stochastic Neighbor Embedding.
These methods help generate hypotheses by providing informative clusters using the variable values that
characterize each observation.
Each method is illustrated with several corpus-based case studies at the syntax-semantics interface.
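As an illustration of one of the four methods named above, the following Python sketch approximates the first principal component of a small table. It is a simplified stand-in, not the speaker's implementation: it assumes PCA on the correlation matrix and replaces the full eigendecomposition that a statistics package would perform with plain power iteration. The function names and the data table (six observations described by three numeric variables) are invented.

```python
from math import sqrt


def standardize(rows):
    """Center each column on its mean and scale it to unit variance."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)) for j in range(p)]
    return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]


def first_component(rows, iterations=200):
    """Approximate the leading eigenvector of the correlation matrix."""
    z = standardize(rows)
    n, p = len(z), len(z[0])
    corr = [[sum(r[a] * r[b] for r in z) / (n - 1) for b in range(p)] for a in range(p)]
    # Power iteration: multiply by the matrix and renormalize, repeatedly.
    # Assumes the start vector is not orthogonal to the leading eigenvector.
    v = [1.0 / sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(corr[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(corr[a][b] * v[b] for b in range(p)) for a in range(p))
    # Coordinates of each observation on the first component.
    scores = [sum(r[j] * v[j] for j in range(p)) for r in z]
    # The trace of a correlation matrix is p, so eigenvalue / p is the
    # share of total variance carried by this component.
    return v, eigenvalue / p, scores


# Invented table: rows are observations, columns are three numeric variables.
table = [
    [2.0, 1.0, 9.0],
    [3.0, 2.0, 8.0],
    [4.0, 2.5, 6.0],
    [7.0, 5.0, 3.0],
    [8.0, 6.0, 2.0],
    [9.0, 6.5, 1.0],
]
loadings, explained, scores = first_component(table)
print(loadings)
print(explained)
print(scores)
```

Observations with scores of the same sign on the first component resemble each other on the variables that load most heavily on it. This is the sense in which such methods suggest clusters and hypotheses rather than test them.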
Registration for this seminar is free of charge but mandatory
(CUSO website).