Université de Neuchâtel

IR Multilingual Resources at UniNE
Have a look at the CLEF site (European languages) or the NTCIR site (Asian languages) for more information about multilingual retrieval.

Our stemming procedures and stopword lists are continuously being improved.
Grammars and other sources of information

Various on-line grammars and language courses are available. Want to learn a foreign language? Have a look at Language Resource Review. Various transliteration schemes are available from the Library of Congress.

Comments on stopword lists

In establishing a general stopword list for languages other than English, we followed the guidelines described in (Fox, 1990).

First, we sorted all word forms appearing in our corpora according to their frequency of occurrence and extracted the 200 most frequently occurring words.

Second, we inspected this list to remove all numbers (e.g., "1994", "1"), plus all nouns and adjectives more or less directly related to the main subjects of the underlying collections. For example, the German word "Prozent" (ranking 69) and the Italian noun "Italia" (ranking 87) were removed from the final list. From our point of view, such words can be useful as indexing terms in certain circumstances.

Third, we included some non-information-bearing words, even if they did not appear among the 200 most frequent words. For example, we added various personal or possessive pronouns (such as "meine" ("my" in German)), prepositions ("nello" ("in the" in Italian)) and conjunctions ("où" ("where" in French)).

The presence of homographs represents another debatable issue, and to some extent we had to make arbitrary decisions concerning their inclusion in stopword lists. For example, the French word "son" can be translated as "sound" or "his", and the French term "or" as "thus/therefore" or "gold".

The resulting stopword list thus contained a large number of pronouns, articles, prepositions and conjunctions. As in various English stopword lists, there were also some verbal forms ("sein" ("to be" in German), "essere" ("to be" in Italian), "sono" ("I am" in Italian)).
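The three steps used to build a stopword list can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure actually used at UniNE: the function name `build_stopword_list`, its parameters, and the toy token list are all assumptions, and the manual inspection of the candidate list (step two) is approximated by a hand-supplied set of topical words.

```python
from collections import Counter


def build_stopword_list(tokens, top_n=200, topical=(), extra=()):
    """Hypothetical helper following the three steps described above."""
    # Step 1: rank word forms by frequency and keep the top_n candidates.
    counts = Counter(t.lower() for t in tokens)
    candidates = [word for word, _ in counts.most_common(top_n)]

    # Step 2: drop numbers and words tied to the collection's main subjects.
    # The topical words are chosen by a human; here they are simply passed in.
    # Note: str.isdigit() only catches plain digit strings such as "1994".
    topical_set = {w.lower() for w in topical}
    kept = [w for w in candidates if not w.isdigit() and w not in topical_set]

    # Step 3: add non-information-bearing words missing from the top_n list.
    seen = set(kept)
    for word in extra:
        word = word.lower()
        if word not in seen:
            kept.append(word)
            seen.add(word)
    return kept


# Toy German-like example (made-up data, far smaller than a real corpus).
tokens = "der Prozent der 1994 und der und die Prozent 1994 die der".split()
stopwords = build_stopword_list(
    tokens, top_n=5, topical=["Prozent"], extra=["meine"]
)
print(stopwords)  # ['der', 'und', 'die', 'meine']
```

Homographs such as the French "son" or "or" are not handled by this sketch; whether to include them remains a manual decision, as noted above.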
Comments on stemmers

In proposing stemmers for languages other than English, we think that a "light" stemmer (removing inflections only for nouns and adjectives) presents some advantages. Our stemming procedure for French is described in (Savoy, 1999).

In Italian, the main inflectional rule is to modify the final character (e.g., «-o», «-a» or «-e») into another (e.g., «-i», «-e»). As a second rule, Italian morphology may also alter the final two letters (e.g., «-io» into «-i», «-co» into «-chi», «-ga» into «-ghe»).

In German, a few rules may be applied to obtain the plural form of words (e.g., "Frau" into "Frauen" (woman), "Bild" into "Bilder" (picture), "Sohn" into "Söhne" (son), "Apfel" into "Äpfel" (apple)). The suggested algorithms do not, however, account for person and tense variations, or for the morphological variations used by verbs. We think that indexing verbs for Italian, French or German is not of primary importance compared to nouns and adjectives.

References

Dolamic, L., Savoy, J. (2010). Comparative Study of Indexing and Search Strategies for the Hindi, Marathi and Bengali Languages. ACM Transactions on Asian Language Information Processing, 9(3).

Dolamic, L., Savoy, J. (2010). When Stopword Lists Make the Difference. Journal of the American Society for Information Science and Technology, 61(1), 200-203.

Savoy, J., Dolamic, L. (2009). How effective is Google's translation service in search? Communications of the ACM, 52(10), 139-143.

Dolamic, L., Savoy, J. (2009). Indexing and Searching Strategies for the Russian Language. Journal of the American Society for Information Science and Technology, 60(12), 2540-2547.

Dolamic, L., Savoy, J. (2009). Indexing and Stemming Approaches for the Czech Language. Information Processing & Management, 45(6), 714-720.

Savoy, J. (2008). Searching Strategies for the Bulgarian Language. IR Journal, 10(6), 509-529.

Abdou, S., Savoy, J. (2007). Monolingual experiments with Far East languages in NTCIR-6. In Proceedings of the Sixth NTCIR Workshop on Research in Information Retrieval, Automatic Text Summarization and Question Answering (to appear).

Savoy, J., Abdou, S. (2006). UniNE at CLEF-2006: Experiments with Monolingual, Bilingual, Domain-Specific and Robust Retrieval. In Peters, C., Clough, P., Gonzalo, J., Jones, G., Kluck, M., Magnini, B. (Eds.), CLEF-2006 (to appear).

Savoy, J. (2005). Comparative Study of Monolingual and Multilingual Search Models for Use with Asian Languages. ACM Transactions on Asian Language Information Processing, 4(2), 163-189.

Abdou, S., Savoy, J. (2005). Report on CLIR Task for the NTCIR-5 Evaluation Campaign. In Proceedings of the Fifth NTCIR Workshop on Research in Information Retrieval, Automatic Text Summarization and Question Answering.

Savoy, J., Berger, P. Y. (2005). Report on CLEF-2005 evaluation campaign: Monolingual, bilingual, and GIRT information retrieval. In Peters, C., Clough, P., Gonzalo, J., Jones, G., Kluck, M., Magnini, B. (Eds.), CLEF-2005 (to appear).

Savoy, J. (2004a). Selection and merging strategies for multilingual information retrieval. In Peters, C., Clough, P., Gonzalo, J., Jones, G., Kluck, M., Magnini, B. (Eds.), Advances in Cross-Language Retrieval CLEF-2004 (to appear).

Savoy, J. (2004b). Data fusion for effective European monolingual information retrieval. In Peters, C., Clough, P., Gonzalo, J., Jones, G., Kluck, M., Magnini, B. (Eds.), Advances in Cross-Language Retrieval CLEF-2004 (to appear).

Savoy, J. (2004c). Report on CLIR Task for the NTCIR-4 Evaluation Campaign. In Proceedings of the Fourth NTCIR Workshop on Research in Information Retrieval, Automatic Text Summarization and Question Answering (to appear).

Savoy, J. (2003a). Report on CLEF-2003 Multilingual Tracks. In Peters, C., Braschler, M., Gonzalo, J., Kluck, M. (Eds.), Results of the CLEF-2003, cross-language evaluation forum (to appear).

Savoy, J. (2003b). Report on CLEF-2003 monolingual tracks: Fusion of probabilistic models for effective monolingual retrieval. In Peters, C., Braschler, M., Gonzalo, J., Kluck, M. (Eds.), Results of the CLEF-2003, cross-language evaluation forum (to appear).

Savoy, J. (2002b). Report on CLEF-2002 Experiments: Combining multiple sources of evidence. In Peters, C., Braschler, M., Gonzalo, J., Kluck, M. (Eds.), Results of the CLEF-2002, cross-language evaluation forum (to appear).

Savoy, J. (2002a). Morphologie et recherche d'information. Technical report.

Savoy, J. (2001a). Report on CLEF-2001 Experiments. In Peters, C., Braschler, M., Gonzalo, J., Kluck, M. (Eds.), Results of the CLEF-2001, cross-language system evaluation campaign (pp. 11-19). Sophia-Antipolis: ERCIM.

Savoy, J. (2001b). Bilingual information retrieval: CLEF-2000 experiments. Proceedings Workshop-ECSQARU-2001 management of uncertainty and imprecision in multimedia information systems (pp. 53-63). Toulouse: IRIT.

Savoy, J. (1999). A stemming procedure and stopword list for general French corpora. Journal of the American Society for Information Science, 50(10), 944-952.

Savoy, J. (1993). Stemming of French words based on grammatical category. Journal of the American Society for Information Science, 44(1), 1-9.

Fox, C. (1990). A stop list for general text. ACM-SIGIR Forum, 24, 19-35.

Copyright

All the software and information given out on this Web site is covered by the BSD License (see http://www.opensource.org/licenses/bsd-license.html), with Copyright (c) 2005, Jacques Savoy.

Essentially, all this means is that you can do what you like with the code, except claim another Copyright for it, or claim that it is issued under a different license. The software is also issued without warranties, which means that if anyone suffers through its use, they cannot come back and sue you. You also have to alert anyone to whom you give this software or information to the fact that it is covered by the BSD license.

Prof. Jacques Savoy
University of Neuchatel
Computer Science Department
Rue Emile-Argand 11
CH-2000 Neuchâtel
Switzerland
Jacques.Savoy@unine.ch