Inside the CHiC - CLEF 2013

Polish Track at CLEF 2013

 
Co-organized by the University of Neuchatel (Switzerland),
the University of Wroclaw, and Nicolaus Copernicus University (Poland)
 

The Polish Track is a subtask of CHiC within the CLEF 2013 evaluation campaign.

A joint effort of the University of Neuchatel (Switzerland),
the University of Wroclaw, and Nicolaus Copernicus University (Poland)
within the CLEF Initiative.
Sponsored by the University of Neuchatel (Switzerland).

Proposed tasks based on the Europeana corpus

The following two tasks are ad-hoc monolingual IR tasks in Polish (both the documents and the topics are written in Polish).

You can download the Polish topics (zip file) or the corresponding English version (zip file).
  • Automatic monolingual IR in Polish (no manual modification)
  • Manual monolingual IR in Polish (you may manually modify the queries or the documents)
  • Information about the Europeana corpus
  • Download the Europeana collection
  • Further explanation of this Polish track

    This task is a continuation of the 2012 CHiC monolingual lab: topic descriptions written in Polish are used to retrieve cultural object descriptions also written in Polish. The Polish collection is part of the Europeana multilingual collection used in the 2012 and 2013 CHiC evaluation campaigns. A dedicated set of 50 Polish topic descriptions will be generated for this Polish task only (a subset of which will be reused in the CHiC 2013 multilingual task).

    The main objective of this task is to gain a better understanding of information retrieval for morphologically complex languages such as Polish. We know that the complex morphology of Polish may have an impact on both retrieval effectiveness and relevance. Can this aspect be ignored, under the assumption that morphological complexity will not affect retrieval performance? If not, how does retrieval effectiveness change when the system handles Polish morphology poorly or well?

    Two subtasks will be offered:

    Participants are allowed to submit either fully AUTOMATIC or fully MANUAL experiments (or both types separately). When submitting their results, the participants must specify the type of each run (either automatic or manual). A detailed description of the CHiC Europeana collection can be found in the CHiC multilingual task description. For the current task, only the Polish subset is needed.

    1. Automatic

    In the automatic mode, participants are free to use whichever tags they want for indexing the various cultural heritage (CH) objects (see our comments below on this matter). From the topic set, they may use the title section ONLY. Participants are free to automatically enrich the queries built from the topic titles and/or the document surrogates built from the CH object descriptions (e.g., using specific thesauri, dedicated ontologies or the web in general). Moreover, automatic blind feedback or query expansion mechanisms are allowed in the hope of improving the system ranking.

    2. Manual

    For a manual setting, the participants are free to use any source of knowledge / tools / strategies to modify and enrich the CH objects and/or the queries. No further user-system interaction is assumed after the first set of results is retrieved (but automatic blind feedback or query expansion mechanisms are allowed).

    Task definition

    In both cases (automatic and manual settings), these tasks are standard ad-hoc retrieval tasks, which measure information retrieval effectiveness with respect to user input in the form of queries. The ad-hoc setting is the standard setting for an information retrieval system. Within such a setting, the system is required to produce a relevance-ranked list of documents based entirely on the query at hand and the features of the collection. This list is produced without the system having any prior knowledge of either the user's needs or the context.

    Requirement for submission

    Participants are asked to submit at least one run in at least one subtask and at most five runs per subtask (that is, a participant may submit up to five automatic runs and five manual runs). For each run, the participant should indicate a priority value, which will be used to form the pool of documents from which the relevance assessments will be created.
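
As a rough illustration of how run priorities could feed into pooling, the sketch below builds a per-topic pool from the top-ranked documents of the selected runs. The pool depth, the convention that priority 1 is the highest, the cut-off `max_priority` and the run data structure are all assumptions made for this example; the procedure actually used by the organizers is not specified here.

```python
def build_pool(runs, depth=100, max_priority=1):
    """Union of the top-`depth` documents per topic over the selected runs.

    runs: list of dicts with keys
      'priority' -- int, 1 = highest priority (assumed convention)
      'results'  -- {topic_id: [doc_id, ...]} in rank order
    Runs whose priority number exceeds `max_priority` are left out (assumed rule).
    """
    pool = {}
    for run in runs:
        if run["priority"] > max_priority:
            continue
        for topic_id, ranked in run["results"].items():
            pool.setdefault(topic_id, set()).update(ranked[:depth])
    return pool


# Hypothetical runs from two participants.
demo_runs = [
    {"priority": 1, "results": {"CHIC-2013-PL-008": ["a", "b", "c"]}},
    {"priority": 2, "results": {"CHIC-2013-PL-008": ["z"]}},
    {"priority": 1, "results": {"CHIC-2013-PL-008": ["b", "d"]}},
]
demo_pool = build_pool(demo_runs, depth=2)
```

With a shallow pool, documents that only appear low in a ranking or only in low-priority runs are never assessed, which is why the priority value matters to participants.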

    Topics

    Topic descriptions consist of a mixture of topical and named-entity queries. The 50 short topics, given in title format only (e.g. "Fox hunting", "Images of castles in Warsaw"), tend to reflect information needs as expressed by real Europeana users. The topics for the CHiC Polish ad-hoc task will be in Polish, with an additional English translation. In their experiments, participants are allowed to use the Polish topics only. The English translations are simply there to give an overview of each topic's meaning. An example is given below:

    <topic lang="pl">
    <identifier>CHIC-2013-PL-008</identifier>
    <title>ruch robotniczy</title>
    </topic>

    <topic lang="en">
    <identifier>CHIC-2013-PL-008</identifier>
    <title>workers movement</title>
    </topic>
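
A minimal sketch of reading topics in this form with Python's standard library follows. It assumes the distributed file wraps the `<topic>` elements in a single root element; the `<topics>` root used here is hypothetical, and the real file layout may differ.

```python
import xml.etree.ElementTree as ET

# The two example topics, wrapped in a hypothetical <topics> root
# so that the text parses as one XML document.
SAMPLE = """<topics>
<topic lang="pl">
<identifier>CHIC-2013-PL-008</identifier>
<title>ruch robotniczy</title>
</topic>
<topic lang="en">
<identifier>CHIC-2013-PL-008</identifier>
<title>workers movement</title>
</topic>
</topics>"""


def load_topics(xml_text, lang="pl"):
    """Return {identifier: title} for the topics in the requested language."""
    root = ET.fromstring(xml_text)
    topics = {}
    for topic in root.iter("topic"):
        if topic.get("lang") != lang:
            continue
        identifier = topic.findtext("identifier", default="").strip()
        title = topic.findtext("title", default="").strip()
        topics[identifier] = title
    return topics


polish_topics = load_topics(SAMPLE)
```

Filtering on the `lang` attribute keeps the English translations out of the query set, in line with the rule that only the Polish topics may be used in experiments.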

    Expected results

    Participants are expected to submit relevance-ranked result lists for all 50 topics in a TREC-style format and using documents from the Polish chapter of the Europeana collection.
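
For orientation, the conventional TREC run format has six whitespace-separated columns per line: topic identifier, the literal `Q0`, document identifier, rank, score and run tag. The sketch below writes result lists in that layout. The document identifiers, scores and run tag are made up, and the exact column requirements for this track should be checked against the submission guidelines.

```python
def format_run(results, run_tag):
    """Turn {topic_id: [(doc_id, score), ...]} into TREC-style result lines.

    Documents are sorted by decreasing score and ranks start at 1.
    """
    lines = []
    for topic_id in sorted(results):
        ranked = sorted(results[topic_id], key=lambda pair: pair[1], reverse=True)
        for rank, (doc_id, score) in enumerate(ranked, start=1):
            lines.append(f"{topic_id} Q0 {doc_id} {rank} {score:.4f} {run_tag}")
    return lines


# Hypothetical retrieval output for one topic.
example_results = {"CHIC-2013-PL-008": [("doc_b", 7.25), ("doc_a", 9.5)]}
run_lines = format_run(example_results, "demoAuto1")
```

Each element of `run_lines` can be written to the submission file as one line.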

    Relevance assessments

    Relevance assessments will be done manually. First, an assumed information need is generated collaboratively for each query and then described (the resulting description will be used in later editions of this CHiC task). The pooled documents are then assessed for relevance with respect to the query and its information need. The assumed need is built around the perspective of an average user: we assume that the majority of users typing that particular query would like to obtain that particular piece of information.

    Evaluation metrics

    The evaluation metrics for the ad-hoc task will be the standard information retrieval measures of precision and recall, particularly the standard mean average precision (MAP) and precision@k measures.
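
The measures named above can be sketched as follows. This is an illustrative implementation of precision@k, average precision and MAP using their textbook definitions, not the official evaluation tool; in particular, averaging MAP over every topic listed in the relevance judgments is an assumption, and the toy data are invented.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant document.

    Relevant documents that are never retrieved contribute zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(run, qrels):
    """run: {topic: ranked doc list}; qrels: {topic: set of relevant docs}."""
    if not qrels:
        return 0.0
    scores = [average_precision(run.get(topic, []), rel) for topic, rel in qrels.items()]
    return sum(scores) / len(scores)


# Toy data: two topics with invented document identifiers.
toy_run = {"T1": ["d1", "d2", "d3", "d4"], "T2": ["x", "y"]}
toy_qrels = {"T1": {"d1", "d3"}, "T2": {"y"}}
toy_map = mean_average_precision(toy_run, toy_qrels)
```

For topic T1 the relevant documents appear at ranks 1 and 3, so its average precision is (1/1 + 2/3) / 2; for T2 it is 1/2, and `toy_map` is the mean of the two values.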

    Helpful links

  • Various language tools at UniNE
  • Overview of the Polish language
  • Our comments on the Polish corpus
  • Registration form
  • Information about submission

  • Tentative schedule (2013)

    February:  Polish corpus available
    March:  Topics release (in TD format, 50)
    End-April:  Runs due
    End-May:  Relevance assessments available
    Mid-June:  Working papers due
    September:  CLEF 2013 Conference

    Organizers

    Prof. Jacques Savoy
    Dept. of Computer Science
    University of Neuchatel (Switzerland)
    Dr. Piotr Malak
    Dept. of Computer Science
    University of Neuchatel (Switzerland) and
    Nicolaus Copernicus University (Poland)
    Prof. Adam Pawlowski
    Information Sciences Institute
    University of Wroclaw (Poland)
    For more information, please contact Prof. Jacques Savoy