Université
de Neuchâtel
IR Multilingual Resources at UniNE

See the CLEF site (European languages) or the NTCIR site (Asian languages) for further information about multilingual retrieval.

Our stemming procedures and stopword lists are continuously improved. For French and German, however, our solutions can be regarded as definitive.

Machine translation
Dictionaries
Corpora
Babel Fish (Systran system)
EuroWordNet (license available from ELRA)
Part-of-speech tagger (English) from University of Edinburgh (UK).
Reverso.net (in French)
Reverso Online (in English)
Other CLIR resources from the University of Maryland at College Park
 
Multilingual dictionaries
(from / to 15 languages)
CJK linguistics resources (for Asian languages).

The language
Stopword List
Stemmer
1,000 most frequent words
The English Language
in English (571 words from Smart)
  in English
The French Language
   Trésor de la langue française
   Dictionnaire de l'Académie
in French (463 words)
for French in C
more complex French stemmer
in French
The German Language
   German morphology
in German (603 words)
for German in C
more complex German stemmer
more information available
in German
The Italian Language
in Italian (430 words)
for Italian in C
more information available
in Italian
The Spanish and Portuguese Language
in Spanish (307 words)
in Spanish (351 words from Smart)
for Spanish in C in Spanish
in Portuguese (356 words)
or a variant (392 words)
for Portuguese in C   
The Finnish Language
in Finnish (747 words)
(older version, 1,134 words)
for Finnish in C in Finnish
The Swedish Language
(Svenska Akademiens Ordbok dictionary in Swedish)
in Swedish (386 words,
draft version)
for Swedish in C
in Swedish
The Arabic Language
   Resources for Arabic
in Arabic (162 words in UTF-8)
and character converter (in Perl)
simple stem2 or stem3 in C
(after using our character converter)
in Arabic (UTF-8)
A Russian grammar in Russian (420 words in UTF-8)
and character converter (in Perl)
for Russian in Perl
a light stemmer in Java
a more aggressive stemmer in Java
in Russian (UTF-8)
The Hungarian Language in Hungarian (737 words in UTF-8) for Hungarian in C
(after removing the accents)
  
The Bulgarian Language in Bulgarian (258 words in UTF-8) for Bulgarian (in Perl, in UTF-8)
A Romanian grammar in Romanian (282 words in UTF-8)
A Czech grammar in Czech (257 words in UTF-8) a light stemmer in Czech in Java
a more aggressive stemmer in Java
  
The Polish Language
   a Polish grammar (in pdf)
in Polish (138 words in UTF-8)
other list suggested by Kamil Wegrzynowicz (273 words in UTF-8)
a Polish stemmer suggested by Dawid Weiss another Polish stemmer written by Andrzej Bialecki   
The Persian (Farsi) Language in Persian ( words in UTF-8) for Persian (in Java with Arabic letters)
for Persian (in Java with Unicode)
  
The Hindi Language in Hindi (165 words in UTF-8)      
The Marathi Language in Marathi (99 words in UTF-8)      
The Bengali Language in Bengali (114 words in UTF-8)      

Grammars and other sources of information

Various on-line grammars and language courses are available.

Various transliteration schemes are available from the Library of Congress.

Comments on stopword lists

In establishing a general stopword list for languages other than English, we followed the guidelines described in (Fox, 1990).

First, we sorted all word forms appearing in our corpora according to their frequency of occurrence and we extracted the 200 most frequently occurring words.

Second, we inspected this list to remove all numbers (e.g., "1994", "1"), plus all nouns and adjectives more or less directly related to the main subjects of the underlying collections. For example, the German word "Prozent" (ranking 69) or the Italian noun "Italia" (ranking 87) were removed from the final list. From our point of view, such words can be useful as indexing terms in certain circumstances.

Third, we included some non-information-bearing words, even if they did not appear among the 200 most frequent words. For example, we added various personal or possessive pronouns (such as "meine" ("my" in German)), prepositions ("nello" ("in the" in Italian)) and conjunctions ("où" ("where" in French)).

The presence of homographs represents another debatable issue, and to some extent, we had to make arbitrary decisions concerning their inclusion in stopword lists. For example, the French word "son" can be translated as "sound" or "his", and the French term "or" as "thus/therefore" or "gold".

The resulting stopword list thus contained a large number of pronouns, articles, prepositions and conjunctions. As in various English stopword lists, there were also some verbal forms ("sein" ("to be" in German), "essere" ("to be" in Italian), "sono" ("I am" in Italian)).
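The three steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure actually used to produce the lists on this page: the function name build_stopword_list and its parameters are hypothetical, and the manual inspection of step two is approximated by a caller-supplied set of topical words.

```python
from collections import Counter


def build_stopword_list(tokens, topical_words, extra_function_words, top_n=200):
    """Sketch of the three-step stopword list construction (hypothetical helper).

    tokens               -- iterable of word forms taken from the corpus
    topical_words        -- words judged (by hand) to be related to the main
                            subjects of the collection, e.g. {"prozent", "italia"}
    extra_function_words -- non-information-bearing words added by hand,
                            e.g. {"meine", "nello", "où"}
    """
    # Step 1: sort word forms by frequency and keep the top_n most frequent.
    counts = Counter(token.lower() for token in tokens)
    candidates = [word for word, _ in counts.most_common(top_n)]

    # Step 2: remove numbers and collection-specific nouns and adjectives.
    kept = [w for w in candidates if not w.isdigit() and w not in topical_words]

    # Step 3: add function words even if they were not among the top_n.
    return sorted(set(kept) | set(extra_function_words))


if __name__ == "__main__":
    sample = "der die und der 1994 Prozent die der".split()
    print(build_stopword_list(sample, {"prozent"}, {"meine"}, top_n=5))
```

Decisions about homographs such as "son" or "or" remain manual in this sketch: they are expressed simply by including or omitting the word in extra_function_words or topical_words.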

Comments on stemmers

In proposing stemmers for languages other than English, we think that a "light" stemmer (removing inflections only for nouns and adjectives) presents some advantages. Our stemming procedure for French is described in (Savoy, 1999). In Italian, the main inflectional rule is to modify the final character (e.g., «-o», «-a» or «-e») into another (e.g., «-i», «-e»). As a second rule, Italian morphology may also alter the final two letters (e.g., «-io» into «-o», «-co» into «-chi», «-ga» into «-ghe»). In German, a few rules may be applied to obtain the plural form of words (e.g., "Frau" into "Frauen" (woman), "Bild" into "Bilder" (picture), "Sohn" into "Söhne" (son), "Apfel" into "Äpfel" (apple)). The suggested algorithms do not, however, account for person and tense variations, or for the other morphological variations of verbs; we think that indexing verbs for Italian, French or German is not of primary importance compared to nouns and adjectives.
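To make the idea of a "light" stemmer concrete, the following Python sketch conflates singular and plural forms using only the kinds of rules mentioned above. It is an assumption-laden illustration, not the C stemmers distributed on this page: the function names, the suffix lists and the minimum-length thresholds are all hypothetical choices made for the example.

```python
# Map German umlauts to their base vowels so that "Söhne" and "Sohn" can meet.
_UMLAUTS = str.maketrans({"ä": "a", "ö": "o", "ü": "u"})


def light_italian_stem(word):
    """Illustrative light stemmer for Italian nouns and adjectives."""
    w = word.lower()
    if len(w) < 4:  # leave very short words (often function words) untouched
        return w
    # Two-letter alternations: undo the inserted "h" or drop a final "-io".
    for suffix, replacement in (("chi", "c"), ("che", "c"),
                                ("ghi", "g"), ("ghe", "g"), ("io", "")):
        if w.endswith(suffix):
            return w[:-len(suffix)] + replacement
    # Main rule: the final vowel carries gender and number, so drop it.
    if w[-1] in "aeio":
        return w[:-1]
    return w


def light_german_stem(word):
    """Illustrative light stemmer for German plural forms."""
    w = word.lower().translate(_UMLAUTS)
    # Strip a common plural ending, keeping at least four letters of stem.
    for suffix in ("en", "er", "e", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 4:
            return w[:-len(suffix)]
    return w


if __name__ == "__main__":
    for singular, plural in (("banco", "banchi"), ("lunga", "lunghe")):
        print(singular, plural, light_italian_stem(singular), light_italian_stem(plural))
    for singular, plural in (("Frau", "Frauen"), ("Bild", "Bilder"),
                             ("Sohn", "Söhne"), ("Apfel", "Äpfel")):
        print(singular, plural, light_german_stem(singular), light_german_stem(plural))
```

As in the approach described above, verbs are not treated specially here; rules this simple will also over-stem some words, which is the usual trade-off of a light stemmer.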

Software available

References

Abdou, S., Savoy, J. (2007). Monolingual experiments with Far East languages in NTCIR-6. In Proceedings of the Sixth NTCIR Workshop on research in Information Retrieval, Automatic Text Summarization and Question Answering, (to appear), (version PDF).

Savoy, J., Abdou, S. (2006). UniNE at CLEF-2006: experiments with Monolingual, Bilingual and Domain-Specific and Robust Retrieval. In C. Peters, Clough, P., Gonzalo, J., Jones, G., Kluck, M., Magnini, B. (Ed.), CLEF-2006, (to appear), (version PDF).

Abdou, S., Savoy, J. (2005). Report on CLIR Task for the NTCIR-5 Evaluation Campaign. In Proceedings of the Fifth NTCIR Workshop on research in Information Retrieval, Automatic Text Summarization and Question Answering, (version PDF).

Savoy, J., Berger, P. Y. (2005). Report on CLEF-2005 evaluation campaign: Monolingual, bilingual, and GIRT information retrieval. In C. Peters, Clough, P., Gonzalo, J., Jones, G., Kluck, M., Magnini, B. (Ed.), CLEF-2005, (to appear), (version PDF).

Savoy, J. (2004a). Selection and merging strategies for multilingual information retrieval. In C. Peters, Clough, P., Gonzalo, J., Jones, G., Kluck, M., Magnini, B. (Ed.), Advances in Cross-Language Retrieval CLEF-2004, (to appear), (version PDF).

Savoy, J. (2004b). Data fusion for effective European monolingual information retrieval. In C. Peters, Clough, P., Gonzalo, J., Jones, G., Kluck, M., Magnini, B. (Ed.), Advances in Cross-Language Retrieval CLEF-2004, (to appear), (version PDF).

Savoy, J. (2004c). Report on CLIR Task for the NTCIR-4 Evaluation Campaign. In Proceedings of the Fourth NTCIR Workshop on research in Information Retrieval, Automatic Text Summarization and Question Answering, (to appear), (version PDF).

Savoy, J. (2003a). Report on CLEF-2003 Multilingual Tracks. In C. Peters, Braschler, M., Gonzalo, J., Kluck, M. (Ed.), Results of the CLEF-2003, cross-language evaluation forum, (to appear), (version PDF).

Savoy, J. (2003b). Report on CLEF-2003 monolingual tracks: Fusion of probabilistic models for effective monolingual retrieval. In C. Peters, Braschler, M., Gonzalo, J., Kluck, M. (Ed.), Results of the CLEF-2003, cross-language evaluation forum, (to appear), (version PDF).

Savoy, J. (2002b). Report on CLEF-2002 Experiments: Combining multiple sources of evidence. In C. Peters, Braschler, M., Gonzalo, J., Kluck, M. (Ed.), Results of the CLEF-2002, cross-language evaluation forum, (to appear), (version PDF, my presentation at CLEF-2002, and my presentation about Amaryllis).

Savoy, J. (2002a). Morphologie et recherche d'information. Technical report (version PDF).

Savoy, J. (2001a). Report on CLEF-2001 Experiments. In C. Peters, Braschler, M., Gonzalo, J., Kluck, M. (Ed.), Results of the CLEF-2001, cross-language system evaluation campaign, (pp. 11-19). Sophia-Antipolis: ERCIM (version PDF and my presentation at CLEF-2001).

Savoy, J. (2001b). Bilingual information retrieval: CLEF-2000 experiments. Proceedings Workshop-ECSQARU-2001 management of uncertainty and imprecision in multimedia information systems, (pp. 53-63). Toulouse: IRIT (version PDF).

Savoy, J. (1999). A stemming procedure and stopword list for general French corpora. Journal of the American Society for Information Science, 50(10), 944-952.

Savoy, J. (1993). Stemming of French words based on grammatical category. Journal of the American Society for Information Science, 44(1), 1-9.

Fox, C. (1990). A stop list for general text. ACM-SIGIR Forum, 24, 19-35.

Copyright

All the software/information given out on this Web site is covered by the BSD License (see http://www.opensource.org/licenses/bsd-license.html), with Copyright (c) 2005, Jacques Savoy.

Essentially, all this means is that you can do what you like with the code, except claim another Copyright for it, or claim that it is issued under a different license. The software is also issued without warranties, which means that if anyone suffers through its use, they cannot come back and sue you. You also have to alert anyone to whom you give this software/information to the fact that it is covered by the BSD license.



Prof. Jacques Savoy
University of Neuchatel
Computer Science Department
Rue Emile-Argand 11
CH-2009 Neuchâtel
Switzerland


+41 32 718 1375 (phone)
+41 32 718 2701 (fax)