IR Multilingual Resources at UniNE

For further information about multilingual retrieval, have a look at the CLEF site (European languages) or the NTCIR site (Asian languages).

Our stemming procedure and stopword list are enhanced continuously.

Machine translation	Dictionaries	Corpora
Babel Fish	Freedict.com	Linguistic Data Consortium
Google Translate	Foreignword.com	European Language Resource Association (ELRA)
FreeTranslation.com	Leo dictionaries	EuroWordNet (license available from ELRA)
InterTran	YourDictionary.com	Part-of-speech tagger (English) from Stanford NLP Group
Reverso.net (in French) Reverso Online (in English)	Multilingual dictionaries (from / to 15 languages)	CJK linguistics resources (for Asian languages).

The language	Stopword List	Stemmer	1,000 most frequent words
The English Language	in English (571 words from Smart)		in English
The French Language Trésor de la langue française Dictionnaire de l'Académie	in French (463 words)	for French in C more complex French stemmer	in French
The German Language German morphology	in German (603 words)	for German in C more complex German stemmer	in German
The Italian Language	in Italian (399 words)	for Italian in C	in Italian
The Spanish Language	in Spanish (307 words) in Spanish (351 words from Smart)	for Spanish in C	in Spanish
The Portuguese Language	in Portuguese (356 words) or a variant (392 words)	for Portuguese in C
The Finnish Language	in Finnish (747 words) (older version, 1,134 words)	for Finnish in C	in Finnish
The Swedish Language (Svenska Akademiens Ordbok dictionary in Swedish)	in Swedish (386 words, draft version)	for Swedish in C	in Swedish
The Arabic language Resources for Arabic	in Arabic (162 words in UTF-8) and character converter (in Perl)	simple stem2 or stem3 in C (after using our character converter)	in Arabic (UTF-8)
The Russian Language and a Russian grammar	in Russian (420 words in UTF-8) and character converter (in Perl)	for Russian in Perl a light stemmer in Java a more aggressive stemmer in Java	in Russian (UTF-8)
The Hungarian Language	in Hungarian (737 words in UTF-8)	for Hungarian in C (after removing the accents)
The Bulgarian language	in Bulgarian (258 words in UTF-8)	for Bulgarian (in perl, in UTF-8)
The Romanian language and a Romanian grammar	in Romanian (282 words in UTF-8)
The Czech language and a Czech grammar	in Czech (257 words in UTF-8)	a light stemmer in Czech in Java a more aggressive stemmer in Java
The Polish language and a Polish grammar or another grammar (in pdf) and CLEF campaign	in Polish (138 words in UTF-8) other list suggested by Kamil Wegrzynowicz (273 words in UTF-8)	our Polish light stemmer in Java (in UTF-8) another Polish stemmer suggested by Dawid Weiss and another Polish stemmer written by Andrzej Bialecki
The Persian (Farsi) language	in Persian ( words in UTF-8)	for Persian (in Java with Arabic letters) for Persian (in Java with Unicode)
The Hindi language	in Hindi (165 words in UTF-8)	our light Hindi stemmer (in Java) (or the pdf file) and the pseudo code
The Marathi language	in Marathi (99 words in UTF-8)	our light Marathi stemmer (in Java) (or the pdf file) and the pseudo code
The Bengali language	in Bengali (114 words in UTF-8)	our light Bengali stemmer (in Java) (or the pdf file) and the pseudo code

Grammars and other sources of information

Various on-line grammars and language courses are available.

Want to learn a foreign language? Have a look at Language Resource Review.

Various transliteration schemes are available from the Library of Congress.

Comments on stopword lists

In establishing a general stopword list for languages other than English, we followed the guidelines described in (Fox, 1990).

First, we sorted all word forms appearing in our corpora according to their frequency of occurrence and we extracted the 200 most frequently occurring words.

Second, we inspected this list to remove all numbers (e.g., "1994", "1"), plus all nouns and adjectives more or less directly related to the main subjects of the underlying collections. For example, the German word "Prozent" (ranking 69) and the Italian noun "Italia" (ranking 87) were removed from the final list. From our point of view, such words can be useful as indexing terms in certain circumstances.

Third, we included some non-information-bearing words, even if they did not appear among the 200 most frequent words. For example, we added various personal or possessive pronouns (such as "meine", "my" in German), prepositions ("nello", "in the" in Italian) and conjunctions ("où", "where" in French).

The presence of homographs represents another debatable issue, and to some extent, we had to make arbitrary decisions concerning their inclusion in stopword lists. For example, the French word "son" can be translated as "sound" or "his", and the French term "or" as "thus/therefore" or "gold".

The resulting stopword list thus contained a large number of pronouns, articles, prepositions and conjunctions. As in various English stopword lists, there were also some verbal forms ("sein", "to be" in German; "essere", "to be" in Italian; "sono", "I am" in Italian).
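The three steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure actually used to build the lists on this page: the function name `candidate_stopwords`, its parameters, the simple `\w+` tokenizer and the lowercasing are all assumptions, and the topical words removed in step two still have to be chosen by manual inspection.

```python
import re
from collections import Counter


def candidate_stopwords(documents, top_n=200, topical=(), extra=()):
    """Build a candidate stopword list (hypothetical helper, for illustration).

    documents: iterable of text strings (the corpus)
    top_n:     how many of the most frequent word forms to start from
    topical:   words judged by hand to be subject-related (step 2)
    extra:     function words to add even if they are not frequent (step 3)
    """
    counts = Counter()
    for text in documents:
        # Assumption: a crude Unicode-aware tokenizer on lowercased text.
        counts.update(re.findall(r"\w+", text.lower()))

    # Step 1: the top_n most frequent word forms.
    candidates = [word for word, _ in counts.most_common(top_n)]

    # Step 2: remove numbers and the manually flagged topical words.
    flagged = {word.lower() for word in topical}
    kept = [w for w in candidates if not w.isdigit() and w not in flagged]

    # Step 3: add non-information-bearing words missing from the list.
    for word in extra:
        word = word.lower()
        if word not in kept:
            kept.append(word)
    return kept
```

For a German collection, one might call `candidate_stopwords(docs, topical=["Prozent"], extra=["meine"])` and then review the result by hand, in particular for homographs.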

Comments on stemmers

In proposing stemmers for languages other than English, we think that a "light" stemmer (removing inflections only for nouns and adjectives) presents some advantages. Our stemming procedure for French is described in (Savoy, 1999). In Italian, the main inflectional rule is to modify the final character (e.g., «-o», «-a» or «-e») into another (e.g., «-i», «-e»). As a second rule, Italian morphology may also alter the final two letters (e.g., «-io» into «-o», «-co» into «-chi», «-ga» into «-ghe»). In German, a few rules may be applied to obtain the plural form of words (e.g., "Frau" into "Frauen" (woman), "Bild" into "Bilder" (picture), "Sohn" into "Söhne" (son), "Apfel" into "Äpfel" (apple)). The suggested algorithms, however, do not account for person and tense variations, or for the other morphological variations of verbs (we think that indexing verbs for Italian, French or German is not of primary importance compared to nouns and adjectives).
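To make the idea of a light stemmer concrete, here is a minimal Python sketch for Italian nouns and adjectives, written from the rules mentioned above. It is not the C stemmer distributed on this page: the function name `light_italian_stem`, the suffix table, and the minimum stem length of three letters are assumptions chosen for illustration. The aim is only that singular and plural forms end up with the same stem.

```python
# Two- and three-letter endings handled before the final-vowel rule.
# Assumption: "-chi"/"-che" and "-ghi"/"-ghe" are plurals of "-co"/"-ca"
# and "-go"/"-ga"; "-io" is reduced so that it matches its plural in "-i".
SUFFIX_RULES = (
    ("chi", "c"),
    ("che", "c"),
    ("ghi", "g"),
    ("ghe", "g"),
    ("io", ""),
)
FINAL_VOWELS = "aeio"


def light_italian_stem(word, min_stem=3):
    """Return a light stem for an Italian noun or adjective (sketch only)."""
    w = word.lower()

    # Rule 2: endings that alter more than the final letter.
    for suffix, replacement in SUFFIX_RULES:
        if w.endswith(suffix):
            stem = w[: -len(suffix)] + replacement
            # Leave very short words untouched.
            return stem if len(stem) >= min_stem else w

    # Rule 1: drop the final inflectional vowel (-o, -a, -e, -i).
    if w and w[-1] in FINAL_VOWELS and len(w) - 1 >= min_stem:
        return w[:-1]
    return w
```

With this sketch, "libro" and "libri" both reduce to "libr", and "banco" and "banchi" both reduce to "banc". Verb forms are deliberately not handled, in line with the light-stemming approach described above.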

References

Dolamic, L., Savoy, J. (2010). Comparative Study of Indexing and Search Strategies for the Hindi, Marathi and Bengali Languages. ACM Transactions on Asian Language Information Processing, 9(3) (version PDF).

Dolamic, L., Savoy, J. (2010). When Stopword Lists Make the Difference. Journal of the American Society for Information Science and Technology, 61(1), 200-203 (version PDF).

Savoy, J., Dolamic, L. (2009). How effective is Google's translation service in search? Communications of the ACM, 52(10), 139-143 (version PDF).

Dolamic, L., Savoy, J. (2009). Indexing and Searching Strategies for the Russian Language. Journal of the American Society for Information Science and Technology, 60(12), 2540-2547 (version PDF).

Dolamic, L., Savoy, J. (2009). Indexing and Stemming Approaches for the Czech Language. Information Processing & Management, 45(6), 714-720 (version PDF).

Savoy, J. (2008). Searching Strategies for the Bulgarian Language. IR Journal, 10(6), 509-529. (version PDF).

Abdou, S., Savoy, J. (2007). Monolingual experiments with Far East languages in NTCIR-6. In Proceedings of the Sixth NTCIR Workshop on research in Information Retrieval, Automatic Text Summarization and Question Answering, (to appear), (version PDF).

Savoy, J., Abdou, S. (2006). UniNE at CLEF-2006: experiments with Monolingual, Bilingual and Domain-Specific and Robust Retrieval. In C. Peters, Clough, P., Gonzalo, J., Jones, G., Kluck, M., Magnini, B. (Ed.), CLEF-2006, (to appear), (version PDF).

Savoy, J. (2005). Comparative Study of Monolingual and Multilingual Search Models for Use with Asian Languages. ACM Transactions on Asian Language Information Processing, 4(2), 163-189 (version PDF).

Abdou, S., Savoy, J. (2005). Report on CLIR Task for the NTCIR-5 Evaluation Campaign. In Proceedings of the Fifth NTCIR Workshop on research in Information Retrieval, Automatic Text Summarization and Question Answering, (version PDF).

Savoy, J., Berger, P. Y. (2005). Report on CLEF-2005 evaluation campaign: Monolingual, bilingual, and GIRT information retrieval. In C. Peters, Clough, P., Gonzalo, J., Jones, G., Kluck, M., Magnini, B. (Ed.), CLEF-2005, (to appear), (version PDF).

Savoy, J. (2004a). Selection and merging strategies for multilingual information retrieval. In C. Peters, Clough, P., Gonzalo, J., Jones, G., Kluck, M., Magnini, B. (Ed.), Advances in Cross-Language Retrieval CLEF-2004, (to appear), (version PDF).

Savoy, J. (2004b). Data fusion for effective European monolingual information retrieval. In C. Peters, Clough, P., Gonzalo, J., Jones, G., Kluck, M., Magnini, B. (Ed.), Advances in Cross-Language Retrieval CLEF-2004, (to appear), (version PDF).

Savoy, J. (2004c). Report on CLIR Task for the NTCIR-4 Evaluation Campaign. In Proceedings of the Fourth NTCIR Workshop on research in Information Retrieval, Automatic Text Summarization and Question Answering, (to appear), (version PDF).

Savoy, J. (2003a). Report on CLEF-2003 Multilingual Tracks. In C. Peters, Braschler, M., Gonzalo, J., Kluck, M. (Ed.), Results of the CLEF-2003, cross-language evaluation forum, (to appear), (version PDF).

Savoy, J. (2003b). Report on CLEF-2003 monolingual tracks: Fusion of probabilistic models for effective monolingual retrieval. In C. Peters, Braschler, M., Gonzalo, J., Kluck, M. (Ed.), Results of the CLEF-2003, cross-language evaluation forum, (to appear), (version PDF).

Savoy, J. (2002b). Report on CLEF-2002 Experiments: Combining multiple sources of evidence. In C. Peters, Braschler, M., Gonzalo, J., Kluck, M. (Ed.), Results of the CLEF-2002, cross-language evaluation forum, (to appear), (version PDF, my presentation at CLEF-2002, and my presentation about Amaryllis).

Savoy, J. (2002a). Morphologie et recherche d'information. Technical report (version PDF).

Savoy, J. (2001a). Report on CLEF-2001 Experiments. In C. Peters, Braschler, M., Gonzalo, J., Kluck, M. (Ed.), Results of the CLEF-2001, cross-language system evaluation campaign, (pp. 11-19). Sophia-Antipolis: ERCIM (version PDF and my presentation at CLEF-2001).

Savoy, J. (2001b). Bilingual information retrieval: CLEF-2000 experiments. Proceedings Workshop-ECSQARU-2001 management of uncertainty and imprecision in multimedia information systems, (pp. 53-63). Toulouse: IRIT (version PDF).

Savoy, J. (1999). A stemming procedure and stopword list for general French corpora. Journal of the American Society for Information Science, 50(10), 944-952.

Savoy, J. (1993). Stemming of French words based on grammatical category. Journal of the American Society for Information Science, 44(1), 1-9.

Fox, C. (1990). A stop list for general text. ACM-SIGIR Forum, 24, 19-35.

Copyright

Essentially, all this means is that you can do what you like with the code, except claim another Copyright for it, or claim that it is issued under a different license. The software is also issued without warranties, which means that if anyone suffers through its use, they cannot come back and sue you. You also have to alert anyone to whom you give this software/information to the fact that it is covered by the BSD license.

Prof. Jacques Savoy
University of Neuchatel
Computer Science Department
Rue Emile-Argand 11
CH-2000 Neuchâtel
Switzerland