Inside the CHiC - CLEF 2013

Polish Track at CLEF 2013

 
Co-organized by the University of Neuchatel (Switzerland),
the University of Wroclaw, and Nicolaus Copernicus University (Poland)
 
Comments on the Polish Track at CLEF 2013

The Polish Track is a subtask of CHiC within the CLEF 2013 evaluation campaign.

Proposed tasks based on the Europeana corpus

General Comment

The Polish chapter of the Europeana corpus contains 1,093,705 documents (or CH object descriptors). Each document is identified by its ims:identifier tag, which must be used to refer uniquely to the documents returned in the resulting ranked list. According to the Lucene search engine (rounded, stopwords included), the mean length of an object descriptor is around 35 words.
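As a rough illustration of how such a mean length could be estimated, here is a minimal sketch. It assumes the descriptor texts have already been extracted as plain strings, and it uses a simple Unicode-aware word pattern rather than Lucene's analyzer, so the resulting figure would not match the Lucene value exactly. The function name and the two sample strings are illustrative only.

```python
import re

def mean_words_per_document(descriptors):
    """Return the mean number of word tokens per descriptor text."""
    if not descriptors:
        return 0.0
    # \w+ is Unicode-aware in Python 3, so Polish letters stay inside a token.
    counts = [len(re.findall(r"\w+", text)) for text in descriptors]
    return sum(counts) / len(counts)

# Two short strings taken from the example descriptor on this page.
sample = [
    "Maurycy Gottlieb 1856-1879",
    "Judaica - ikonografia",
]
print(mean_words_per_document(sample))
```

Applied to the whole collection, the same function would be called with one string per document, for instance the concatenation of all tag contents.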

After examining the tags available in the Europeana collection, we found the following ones to be of particular interest.

<dc:contributor>
<dc:creator>
<dc:date>
<dc:language>
<dc:subject>
<dc:title>
<dc:type>
<dcterms:alternative>
<dcterms:created>
<europeana:language>
<europeana:type>
<europeana:uri>
<europeana:year>
<ims:chic/ims:metadata/ims:fields/enrichment:concept_broader_label>
<ims:chic/ims:metadata/ims:fields/enrichment:concept_label>
<ims:chic/ims:metadata/ims:fields/enrichment:period_label>
<ims:chic/ims:metadata/ims:fields/enrichment:place_broader_label>

Here is an example of a CH object description.

<ims:fields>
<dc:contributor>Kopera, Feliks (1871-1952)</dc:contributor>
<dc:creator>Gottlieb, Maurycy (1856-1879)</dc:creator>
<dc:date>[1923]</dc:date>
<dc:language>pol</dc:language>
<dc:subject>18-19 w. - ikonografia</dc:subject>
<dc:subject>19-20 w. - ikonografia</dc:subject>
<dc:subject>Judaica - ikonografia</dc:subject>
<dc:subject>Malarstwo żydowskie - ikonografia</dc:subject>
<dc:title>Maurycy Gottlieb 1856-1879 : 26 reprodukcji według obrazów mistrza</dc:title>
<dc:type>grafika</dc:type>
<europeana:language>pl</europeana:language>
<europeana:type>IMAGE</europeana:type>
<europeana:uri>http://www.europeana.eu/resolve/record/92033/0970289D530CDAA11119BD4176B27D727C02A070</europeana:uri>
<europeana:year>1923</europeana:year></ims:fields>

An object can be described by only a subset of the possible tags. Not all tags are always present, and in many cases some tags are empty. Moreover, some tags appear several times in the description of a single object, each time with different content (e.g., the dc:subject tag). The dc:language and europeana:language tags both indicate the language used to describe an object, but they are not necessarily equivalent to one another. For some objects, the title field is written in another language (e.g., German, Yiddish, English, or undefined) while the dc:subject tags are written in Polish; in such cases an equivalent title in Polish is probably provided (at least to the best of our knowledge). Certain tags have short contents, such as the europeana:type tag, whose content can be either IMAGE or TEXT and corresponds to the type (but not to the medium) of the CH object at hand.
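These properties (missing tags, empty tags, repeated tags) matter when reading a descriptor programmatically. The following is a minimal sketch, not the organizers' tooling. It assumes the descriptor is available as an XML fragment like the example on this page. Because the fragment shows no namespace declarations, the sketch wraps it in a hypothetical record element that binds the prefixes; the dc and dcterms URIs are the standard Dublin Core ones, while the others are placeholders that must be replaced by the URIs declared in the actual collection files.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Prefix-to-URI bindings. The urn:example:* values are placeholders
# (assumptions), since the excerpt does not show the real declarations.
NAMESPACES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "europeana": "urn:example:europeana",
    "ims": "urn:example:ims",
    "enrichment": "urn:example:enrichment",
}

def parse_descriptor(fragment):
    """Map each prefixed tag name to the list of its non-empty values."""
    declarations = " ".join(f'xmlns:{p}="{u}"' for p, u in NAMESPACES.items())
    root = ET.fromstring(f"<record {declarations}>{fragment}</record>")
    prefix_of = {uri: prefix for prefix, uri in NAMESPACES.items()}
    fields = defaultdict(list)
    for element in root.iter():
        if len(element) > 0 or not element.tag.startswith("{"):
            continue  # skip container elements such as ims:fields
        uri, local_name = element.tag[1:].split("}", 1)
        text = (element.text or "").strip()
        if text:  # empty tags are ignored
            fields[f"{prefix_of[uri]}:{local_name}"].append(text)
    return dict(fields)

# Abridged from the example above; the empty dc:type is added for illustration.
sample = """<ims:fields>
<dc:creator>Gottlieb, Maurycy (1856-1879)</dc:creator>
<dc:language>pol</dc:language>
<dc:subject>Judaica - ikonografia</dc:subject>
<dc:subject>19-20 w. - ikonografia</dc:subject>
<dc:type></dc:type>
<europeana:language>pl</europeana:language>
<europeana:year>1923</europeana:year></ims:fields>"""

fields = parse_descriptor(sample)
print(fields["dc:subject"])        # both subject values, in document order
print(fields.get("dc:title", []))  # a missing tag yields an empty list
```

Storing a list per tag keeps repeated tags such as dc:subject intact, and reading with `get` avoids errors for tags that are absent from a given descriptor.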

After inspecting the Polish dataset, one can assume that the descriptions are correctly spelled. Occasional spelling errors may be encountered, but we estimate that this phenomenon is marginal. In the title field, however, we may encounter formulations reflecting the Polish language used in the 18th century.

On our dedicated Web page, you can find a stopword list for the Polish language as well as some stemmers for this language. These tools are a starting point; we strongly encourage you to develop your own linguistic tools for the Polish language.
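To show where a stopword list fits in an indexing pipeline, here is a minimal sketch. The small stopword set below is an illustrative sample of Polish function words chosen for this example; it is not the list distributed on the track's Web page, which should be loaded in its place. No stemming is applied, and the function name is hypothetical.

```python
import re

# Illustrative sample only; replace with the stopword list provided for the track.
SAMPLE_STOPWORDS = {"i", "w", "z", "na", "do", "według"}

def index_terms(text, stopwords=SAMPLE_STOPWORDS):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stopwords]

title = "Maurycy Gottlieb 1856-1879 : 26 reprodukcji według obrazów mistrza"
print(index_terms(title))
```

A stemmer would typically be applied to the surviving tokens as a further step, before they are added to the index.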

Organizers

Prof. Jacques Savoy
Dept. of Computer Science
University of Neuchatel (Switzerland)
Dr. Piotr Malak
Dept. of Computer Science
University of Neuchatel (Switzerland) and
Nicolaus Copernicus University (Poland)
Prof. Adam Pawlowski
Information Sciences Institute
University of Wroclaw (Poland)
For more information, please contact Prof. Jacques Savoy