Seminar of the CUSO PhD School

Current methods in corpus linguistics

October 22-24, 2018
Les Diablerets
 
 
 
 
Overview

Current methods in corpus linguistics
Méthodes actuelles en linguistique de corpus

Bilingual seminar of the CUSO PhD School 2018 (Computer Science and Linguistics)

Les Diablerets, October 22-24, 2018

Speakers

Prof. Guillaume Desagulier
Département d'Études des Pays Anglophones
Université de Paris 8

Prof. Dominique Labbé
Pacte - Sciences sociales
Université de Grenoble

Prof. Robert Morrissey
Romance Languages and Literatures
University of Chicago

Prof. Arjuna Tuzzi
Dipartimento di Filosofia, Sociologia, Pedagogia e Psicologia applicata (FISPPA)
Università degli Studi di Padova

Organized by

Prof. C. Rossari
Prof. J. Savoy
Prof. M. Hilpert
Dr. C. Ricci


Topics

  • Monday, Oct 22nd
      • 13:30 - 15:30 Introduction to R for linguists: Part I (J. Savoy)
      • 16:00 - 17:30 Introduction to R for linguists with exercises: Part II (J. Savoy)
      • 17:30 - 18:00 PhD student presentations
  • Tuesday, Oct 23rd
      • 9:00 - 12:30 Le sens des mots [The meaning of words] (D. Labbé)
      • 14:00 - 17:30 Classification problems in the analysis of textual data (A. Tuzzi)
      • 18:00 - 18:30 PhD student presentations
  • Wednesday, Oct 24th
      • 9:00 - 12:30 Multivariate statistics in corpus-linguistic analyses (G. Desagulier)

Content

The seminar is bilingual: speakers present in either English or French.

This doctoral school aims to deepen participants' knowledge of recent methodologies in tool-assisted corpus linguistics, and to illustrate how these methodologies apply to research questions in syntax, semantics, pragmatics and discourse analysis.

Le sens des mots [The meaning of words] (D. Labbé)

Language is a system of systems ("structure") that allows its users to communicate by giving roughly the same meaning to the words they use. How can these meanings be reconstructed from large collections of electronic texts (corpora)? The talk presents lexicometric tools that offer useful answers to this question. After recalling how the lexicon of a language is organized, and how it can be approached through large corpora, the talk shows how to reconstruct, by computation, the lexical universes that structure the vocabulary of these texts. It uses the "Discours politique (français)" corpora, available online from the Centre de Linguistique de Corpus of the Université de Neuchâtel.

Classification problems in the analysis of textual data (A. Tuzzi)

Clustering methods perform the unsupervised classification of a set of objects into a limited number of clusters: they group similar objects into consistent groups and separate dissimilar objects into distinct groups. Classification methods play an important role in the exploration of text corpora, where they help arrange texts, words or other linguistic features.
The lecture addresses the clustering of texts and words within the specific frame of bag-of-words approaches, focussing mainly on the lexical level and relying essentially on word counts. Examples of text clustering and word clustering that have been used to answer specific research questions will be illustrated.
In the case of text clustering, the lecture will discuss which measures of similarity (or dissimilarity) should be adopted and which (and how many) words should be considered.
In the case of word clustering, the lecture will deal with diachronic corpora and the problem of identifying words whose occurrences follow similar temporal patterns.
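
As a rough illustration of the text-clustering choices mentioned above, the following Python sketch groups four toy texts using bag-of-words profiles. It is not the lecturer's method: the texts are invented, the function names are hypothetical, and the choices made here (relative frequencies of the 10 most frequent words, Manhattan distance, average-linkage agglomerative clustering) are assumptions picked only to keep the example small.

```python
from collections import Counter
from itertools import combinations


def top_words(texts, n):
    """Return the n most frequent word forms over all texts (the chosen vocabulary)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return [word for word, _ in counts.most_common(n)]


def relative_frequencies(text, vocabulary):
    """Bag-of-words profile: relative frequency of each vocabulary word in the text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[word] / total for word in vocabulary]


def manhattan(p, q):
    """Dissimilarity between two profiles (sum of absolute differences)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def average_linkage(profiles, k):
    """Merge the two closest clusters repeatedly until only k clusters remain."""
    clusters = [[i] for i in range(len(profiles))]

    def distance(c1, c2):
        pairs = [(i, j) for i in c1 for j in c2]
        return sum(manhattan(profiles[i], profiles[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: distance(clusters[ab[0]], clusters[ab[1]]),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Invented toy corpus: two texts about a cat, two about markets.
texts = [
    "the cat sat on the mat",
    "the cat lay on the mat",
    "stocks fell as markets closed",
    "markets fell as stocks closed lower",
]
vocabulary = top_words(texts, 10)  # "how many words" is itself a modelling choice
profiles = [relative_frequencies(t, vocabulary) for t in texts]
clusters = average_linkage(profiles, 2)
print(clusters)
```

Changing the vocabulary size or swapping `manhattan` for another dissimilarity can change the grouping, which is the kind of sensitivity the lecture description points to.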

Multivariate statistics in corpus-linguistic analyses (G. Desagulier)

When corpus observations are described by several variables, the data are collected in multidimensional tables. I present four methods that are designed to explore and summarize such tables by means of summary statistics: correspondence analysis, multiple correspondence analysis, principal component analysis, and t-distributed stochastic neighbor embedding (t-SNE).
These methods help generate hypotheses by providing informative clusters using the variable values that characterize each observation. Each method is illustrated with several corpus-based case studies at the syntax-semantics interface.
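
To give a feel for one of the four methods, the sketch below computes the first principal component of a small table in plain Python. It is an illustration under stated assumptions, not material from the lecture: the data values are invented, the function names are hypothetical, and power iteration stands in for the eigendecomposition routines one would normally call from R or a numerical library.

```python
import math


def center(rows):
    """Subtract each column mean so that every variable has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered data."""
    n = len(centered)
    d = len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_component(cov, iterations=200):
    """Leading eigenvector of the covariance matrix, found by power iteration."""
    d = len(cov)
    v = [1.0] + [0.0] * (d - 1)  # assumed start vector, not orthogonal to the answer here
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Invented table: 4 observations (rows) described by 3 numeric variables (columns).
data = [
    [2.0, 1.0, 5.0],
    [4.0, 2.1, 4.0],
    [6.0, 2.9, 3.2],
    [8.0, 4.0, 1.9],
]
centered = center(data)
cov = covariance(centered)
pc1 = first_component(cov)

# Coordinates of each observation on the first principal component.
scores = [sum(x * w for x, w in zip(row, pc1)) for row in centered]

# Share of the total variance captured by the first component.
variance_pc1 = sum(pc1[i] * cov[i][j] * pc1[j] for i in range(3) for j in range(3))
share = variance_pc1 / sum(cov[i][i] for i in range(3))
print(pc1, scores, share)
```

In this toy table the first two variables rise together while the third falls, so a single component summarizes most of the variation; real corpus tables rarely reduce this cleanly.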

Registration for this seminar is free of charge but mandatory (via the CUSO website).