This Polish track is a subtask inside the CHiC
under the CLEF 2013 evaluation campaign
Proposed tasks based on the Europeana corpus
The following two tasks are ad-hoc monolingual IR in Polish (both the documents and the topics are written in the Polish language)
You can download the
Polish topics (zip file)
(or the corresponding English version, a zip file)
Automatic monolingual IR in Polish (no manual modification)
Manual monolingual IR in Polish (you can manually modify the queries or the documents)
You can access some information about the Europeana corpus
To download the Europeana collection
More explanation about this Polish track
This task is a continuation of the 2012 CHiC monolingual lab: topic descriptions written in Polish are used to retrieve cultural object descriptions also written in Polish. The Polish collection is part of the Europeana multilingual collection used in the 2012 and 2013 CHiC evaluation campaigns. A dedicated set of 50 Polish topic descriptions will be generated only for this Polish task (a subset of which will be reused in the CHiC 2013 multilingual task).
The main objective of this task is to gain a better understanding of information retrieval for complex languages such as Polish. We know that the complex morphology of Polish may affect both retrieval effectiveness and relevance. Can this aspect be ignored at this level, under the assumption that morphological complexity will not affect retrieval performance? If not, what is the impact on retrieval effectiveness of a poorer or a better understanding of Polish morphology?
Two subtasks will be offered: automatic monolingual IR and manual monolingual IR, both in Polish.
Participants are allowed to submit either fully AUTOMATIC or fully MANUAL experiments (or both types separately). When submitting their results, the participants must specify the type of each run (either automatic or manual). A detailed description of the CHiC Europeana collection can be found in the CHiC multilingual task description. For the current task, only the Polish subset is needed.
In the automatic mode, participants are free to use any tags they want for indexing the various cultural heritage (CH) objects (see our comments below on this matter). From the topic set, they can use the title section ONLY. For both the topic titles and the CH object descriptions, participants are free to automatically enrich the corresponding queries and/or document surrogates (e.g., using specific thesauri, dedicated ontologies or the web in general). Moreover, automatic blind feedback or query expansion mechanisms are allowed in order to improve the system ranking.
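As an illustration of automatic blind feedback, here is a minimal Python sketch under stated assumptions: it is a plain term-frequency expansion, not a specific method required or endorsed by the track. The function name expand_query, its parameters and the toy documents are all hypothetical. The idea is to treat the top-ranked documents of a first retrieval pass as if they were relevant and to add their most frequent new terms to the query.

```python
from collections import Counter
from typing import Dict, List


def expand_query(
    query_terms: List[str],
    ranked_doc_ids: List[str],
    doc_terms: Dict[str, List[str]],
    top_docs: int = 2,
    new_terms: int = 2,
) -> List[str]:
    """Add the most frequent new terms from the top-ranked documents.

    All names here are hypothetical; documents are assumed to be
    already tokenized into lists of terms.
    """
    counts: Counter = Counter()
    for doc_id in ranked_doc_ids[:top_docs]:
        # Count only terms that are not already part of the query.
        counts.update(t for t in doc_terms.get(doc_id, []) if t not in query_terms)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    best = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:new_terms]
    return list(query_terms) + [term for term, _ in best]


# Toy data, invented for illustration only.
toy_docs = {
    "d1": ["ruch", "robotniczy", "strajk", "gdańsk"],
    "d2": ["strajk", "robotniczy", "związek"],
    "d3": ["zamek"],
}
expanded = expand_query(["ruch", "robotniczy"], ["d1", "d2", "d3"], toy_docs)
```

The expanded query would then be submitted for a second retrieval pass. Because no human inspects the first result list, a run produced this way would still count as automatic.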
For a manual setting, the participants are free to use any source of knowledge / tools / strategies to modify and enrich the CH objects and/or the queries. No further user-system interaction is assumed after the first set of results is retrieved (but automatic blind feedback or query expansion mechanisms are allowed).
In both cases (automatic and manual settings), these tasks are standard ad-hoc retrieval tasks, which measure information retrieval effectiveness with respect to user input in the form of queries. The ad-hoc setting is the standard setting for an information retrieval system. Within such a setting, the system is required to produce a relevance-ranked list of documents based entirely on the query at hand and the features of the collection. This list is produced without any prior knowledge of either the user's needs or the context.
Requirements for submission
Participants will be asked to submit at least one run in one subtask and a maximum of five runs for each subtask (that is, a participant may submit at most five automatic runs and five manual runs). For each run, the participant should indicate a priority value, which will be used to form the pool of documents from which the relevance assessments will be created.
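The pooling step can be sketched in Python as follows. This is only an illustration under assumptions that the track description does not state: that a lower priority value means a more important run, that only the highest-priority runs contribute, and that each contributes its top documents up to a fixed depth. The name build_pool and the parameters depth and max_runs are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# A run maps each topic id to its ranked list of document ids.
Run = Dict[str, List[str]]


def build_pool(
    runs: List[Tuple[int, Run]], depth: int, max_runs: int
) -> Dict[str, Set[str]]:
    """Union of the top `depth` documents of the `max_runs` best-priority runs.

    Assumption: a lower priority value means a more important run.
    """
    selected = sorted(runs, key=lambda item: item[0])[:max_runs]
    pool: Dict[str, Set[str]] = {}
    for _priority, run in selected:
        for topic_id, ranking in run.items():
            pool.setdefault(topic_id, set()).update(ranking[:depth])
    return pool


# Toy runs with hypothetical topic and document ids.
toy_runs = [
    (2, {"T1": ["a", "b", "c"]}),
    (1, {"T1": ["c", "d"]}),
    (3, {"T1": ["z"]}),
]
toy_pool = build_pool(toy_runs, depth=2, max_runs=2)
```

Only documents that end up in the pool are judged, so the priority value is how a participant indicates which runs should contribute to the assessments.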
Topic descriptions consist of a mixture of topical and named-entity queries. The 50 short topics in title-format only (e.g. "Fox hunting", "Images of castles in Warsaw") tend to reflect information needs as expressed by real Europeana users. The topics for the CHiC Polish ad-hoc task will be in Polish with an additional translation in English. Throughout their experiments, participants are allowed to use the Polish topics only. The English translated topics are simply there to provide an overview of the topic meaning. An example is given below:
<title >ruch robotniczy</title>
Participants are expected to submit relevance-ranked result lists for all 50 topics in a TREC-style format and using documents from the Polish chapter of the Europeana collection.
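For orientation, the Python sketch below writes a result list in the six-column layout commonly used with TREC tools (topic id, the literal Q0, document id, rank, score, run tag). Whether the CHiC submission system expects exactly this layout is an assumption; the submission guidelines are authoritative. The function name format_trec_run, the topic and document ids, and the run tag are hypothetical.

```python
from typing import Dict, List, Tuple


def format_trec_run(
    results: Dict[str, List[Tuple[str, float]]],
    run_tag: str,
    max_rank: int = 1000,
) -> List[str]:
    """Return one 'topic Q0 doc rank score tag' line per retrieved document.

    `results` maps a topic id to (document id, score) pairs in any order.
    """
    lines: List[str] = []
    for topic_id in sorted(results):
        # Rank by descending score; rank numbering starts at 1.
        ranked = sorted(results[topic_id], key=lambda pair: pair[1], reverse=True)
        for rank, (doc_id, score) in enumerate(ranked[:max_rank], start=1):
            lines.append(f"{topic_id} Q0 {doc_id} {rank} {score:.4f} {run_tag}")
    return lines


# Hypothetical ids, for illustration only.
run_lines = format_trec_run({"T1": [("docA", 0.5), ("docB", 1.25)]}, "demoRun")
```

Writing the run file is then a matter of joining run_lines with newlines and saving the result as plain text.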
Relevance assessments will be done manually. First, an assumed information need for the query is generated collaboratively and then described (the resulting description will be used for later editions of this CHiC task). The pooled documents are then assessed for relevance with respect to the query and the information need. This assumption is built around the perspective of an average user: we assume that the majority of users typing that particular query would like to obtain that particular piece of information.
The evaluation metrics for the ad-hoc task will be the standard information retrieval measures of precision and recall, particularly the standard mean average precision (MAP) and precision@k measures.
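The two named measures can be sketched in Python as follows, using their usual textbook definitions with binary relevance. This is not the official evaluation script; the function names and toy data are hypothetical, and each ranking is assumed to contain no duplicate document ids.

```python
from typing import Dict, List, Set


def precision_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in ranking[:k] if doc in relevant)
    return hits / k


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant document occurs.

    Relevant documents that are never retrieved contribute zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(
    run: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Average of the per-topic average precision over all judged topics."""
    if not qrels:
        return 0.0
    return sum(average_precision(run.get(t, []), qrels[t]) for t in qrels) / len(qrels)


# Toy data with hypothetical ids.
toy_run = {"T1": ["d1", "d2", "d3", "d4"], "T2": ["x"]}
toy_qrels = {"T1": {"d1", "d3"}, "T2": {"y"}}
```

For topic T1 the relevant documents sit at ranks 1 and 3, so its average precision is (1/1 + 2/3) / 2; topic T2 retrieves nothing relevant and scores 0, which pulls the mean down.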
Various language tools at UniNE
Overview of the Polish language
Our comments on the
Information about submission
Tentative schedule (2013)
February: Polish corpus available
March: Topics release (in TD format, 50)
End-April: Runs due
End-May: Relevance assessments available
Mid-June: Working papers due
September: CLEF 2013 Conference