Université de Neuchâtel
IR Multilingual Resources at UniNE
See the CLEF site (European languages) or the NTCIR site (Asian languages) for further information about multilingual retrieval.
Babel Fish (Systran system)
EuroWordNet (license available from ELRA)
Part-of-speech tagger (English) from the University of Edinburgh (UK)
Other CLIR resources from the University of Maryland at College Park
Multilingual dictionaries (from / to 15 languages)
CJK linguistics resources (for Asian languages)
| The English Language | in English (571 words from Smart) | in English |
| The French Language; Trésor de la langue française; Dictionnaire de l'Académie | in French (463 words) | for French in C; more complex French stemmer | in French |
| The German Language; German morphology | in German (603 words) | for German in C; more complex German stemmer; more information available | in German |
| The Italian Language | in Italian (430 words) | for Italian in C; more information available | in Italian |
| The Spanish and Portuguese Language | for Spanish in C | in Spanish |
| The Spanish and Portuguese Language | for Portuguese in C |
| The Finnish Language | for Finnish in C | in Finnish |
| The Swedish Language (Svenska Akademiens Ordbok dictionary in Swedish) | in Swedish (386 words, draft version) | for Swedish in C | in Swedish |
| La langue arabe (The Arabic Language); Resources for Arabic | in Arabic (162 words in UTF-8) and character converter (in Perl) | simple stem2 or stem3 in C (after using our character converter) | in Arabic (UTF-8) |
| A Russian grammar | in Russian (420 words in UTF-8) and character converter (in Perl) | for Russian in Perl; a light stemmer in Java; a more aggressive stemmer in Java | in Russian (UTF-8) |
| The Hungarian Language | in Hungarian (737 words in UTF-8) | for Hungarian in C (after removing the accents) |
| The Bulgarian Language | in Bulgarian (258 words in UTF-8) | for Bulgarian (in Perl, in UTF-8) |
| A Roumanian grammar | in Roumanian (282 words in UTF-8) |
| A Czech grammar | in Czech (257 words in UTF-8) | a light stemmer in Czech in Java; a more aggressive stemmer in Java |
| The Polish Language; a Polish grammar (in PDF) | in Polish (138 words in UTF-8); other list suggested by Kamil Wegrzynowicz (273 words in UTF-8) | a Polish stemmer suggested by Dawid Weiss; another Polish stemmer written by Andrzej Bialecki |
| The Persian (Farsi) Language | in Persian ( words in UTF-8) | for Persian (in Java with Arabic letters); for Persian (in Java with Unicode) |
| The Hindi Language | in Hindi (165 words in UTF-8) |
| The Marathi Language | in Marathi (99 words in UTF-8) |
| The Bengali Language | in Bengali (114 words in UTF-8) |
Various on-line grammars and language courses are available.
Various transliteration schemes are available from the Library of Congress.
In establishing a general stopword list for languages other than English, we followed the guidelines described in (Fox, 1990).
First, we sorted all word forms appearing in our corpora according to their frequency of occurrence and we extracted the 200 most frequently occurring words.
Second, we inspected this list to remove all numbers (e.g., "1994", "1"), plus all nouns and adjectives more or less directly related to the main subjects of the underlying collections. For example, the German word "Prozent" (ranking 69) or the Italian noun "Italia" (ranking 87) were removed from the final list. From our point of view, such words can be useful as indexing terms in certain circumstances.
Third, we included some non-information-bearing words, even if they did not appear among the 200 most frequent words. For example, we added various personal or possessive pronouns (such as "meine", "my" in German), prepositions ("nello", "in the" in Italian) and conjunctions ("où", "where" in French).
The presence of homographs represents another debatable issue, and to some extent we had to make arbitrary decisions concerning their inclusion in stopword lists. For example, the French word "son" can be translated as "sound" or "his", and the French term "or" as "thus/therefore" or "gold".
The resulting stopword list thus contained a large number of pronouns, articles, prepositions and conjunctions. As in various English stopword lists, there were also some verbal forms ("sein", "to be" in German; "essere", "to be" in Italian; "sono", "I am" in Italian).
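The three steps above can be sketched in a few lines of Python. This is only an illustrative outline, not the procedure actually used to build the lists on this page: the function name `build_stopword_list`, its parameters, and the toy token sequence are assumptions made for the example, and the manual inspection of step two is reduced here to a caller-supplied set of topical words.

```python
from collections import Counter


def build_stopword_list(tokens, topical_words, extra_function_words, top_n=200):
    """Sketch of a frequency-based stopword list (hypothetical helper).

    tokens               -- iterable of word forms taken from the corpus
    topical_words        -- words judged too collection-specific to be stopwords
    extra_function_words -- pronouns, prepositions, conjunctions added by hand
    top_n                -- number of most frequent forms to start from
    """
    counts = Counter(t.lower() for t in tokens)

    # Step 1: keep the top_n most frequent word forms.
    candidates = [word for word, _ in counts.most_common(top_n)]

    # Step 2: drop numbers and words tied to the subjects of the collection.
    topical = {w.lower() for w in topical_words}
    kept = [w for w in candidates if not w.isdigit() and w not in topical]

    # Step 3: add non-information-bearing words missing from the top_n.
    stopwords = set(kept) | {w.lower() for w in extra_function_words}
    return sorted(stopwords)


# Toy German example (assumed data, far smaller than a real corpus).
tokens = "der die der und 1994 prozent die der und 1994 prozent haus".split()
print(build_stopword_list(tokens, ["Prozent"], ["meine"], top_n=5))
```

In practice the second step remains a manual decision, which is why the topical words are passed in rather than detected automatically.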
In proposing stemmers for languages other than English, we think that a "light" stemmer (removing inflections only for nouns and adjectives) presents some advantages. Our stemming procedure for French is described in (Savoy, 1999). In Italian, the main inflectional rule is to modify the final character (e.g., «-o», «-a» or «-e») into another (e.g., «-i», «-e»). As a second rule, Italian morphology may also alter the final two letters (e.g., «-io» into «-o», «-co» into «-chi», «-ga» into «-ghe»). In German, a few rules may be applied to obtain the plural form of words (e.g., "Frau" into "Frauen" (woman), "Bild" into "Bilder" (picture), "Sohn" into "Söhne" (son), "Apfel" into "Äpfel" (apple)). The suggested algorithms, however, do not account for person and tense variations, or for the morphological variations used by verbs (we think that indexing verbs for Italian, French or German is not of primary importance compared to nouns and adjectives).
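To illustrate what a "light" stemmer of this kind might look like, here is a minimal Python sketch for Italian nouns and adjectives. It is not the C stemmer distributed on this page: the function name `light_stem_it`, the minimum word length, and the exact rule set are assumptions chosen for the example. It only undoes the «-co»/«-chi» and «-ga»/«-ghe» style alternations and then drops a final inflectional vowel, so that singular and plural forms map to the same stem.

```python
def light_stem_it(word: str) -> str:
    """Toy light stemmer for Italian (hypothetical, for illustration only)."""
    w = word.lower()

    # Assumption: leave very short words untouched to avoid over-stemming.
    if len(w) < 4:
        return w

    # Two-letter alternations: the plural inserts an "h" to keep the hard
    # consonant (banco -> banchi, riga -> righe).
    for suffix, replacement in (("chi", "c"), ("che", "c"),
                                ("ghi", "g"), ("ghe", "g")):
        if w.endswith(suffix):
            return w[:-len(suffix)] + replacement

    # Main rule: drop the final inflectional vowel.
    if w[-1] in "aeio":
        return w[:-1]
    return w


for singular, plural in (("libro", "libri"), ("banco", "banchi"), ("riga", "righe")):
    print(singular, plural, light_stem_it(singular), light_stem_it(plural))
```

Verb forms are deliberately ignored, in line with the view expressed above that nouns and adjectives matter most for indexing.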
Abdou, S., Savoy, J. (2007). Monolingual experiments with Far East languages in NTCIR-6. In Proceedings of the Sixth NTCIR Workshop on research in Information Retrieval, Automatic Text Summarization and Question Answering, (to appear), (version PDF).
Savoy, J., Abdou, S. (2006). UniNE at CLEF-2006: experiments with Monolingual, Bilingual and Domain-Specific and Robust Retrieval. In C. Peters, Clough, P., Gonzalo, J., Jones, G., Kluck, M., Magnini, B. (Ed.), CLEF-2006, (to appear), (version PDF).
Abdou, S., Savoy, J. (2005). Report on CLIR Task for the NTCIR-5 Evaluation Campaign. In Proceedings of the Fifth NTCIR Workshop on research in Information Retrieval, Automatic Text Summarization and Question Answering, (version PDF).
Savoy, J., Berger, P. Y. (2005). Report on CLEF-2005 evaluation campaign: Monolingual, bilingual, and GIRT information retrieval. In C. Peters, Clough, P., Gonzalo, J., Jones, G., Kluck, M., Magnini, B. (Ed.), CLEF-2005, (to appear), (version PDF).
Savoy, J. (2004a). Selection and merging strategies for multilingual information retrieval. In C. Peters, Clough, P., Gonzalo, J., Jones, G., Kluck, M., Magnini, B. (Ed.), Advances in Cross-Language Retrieval CLEF-2004, (to appear), (version PDF).
Savoy, J. (2004b). Data fusion for effective European monolingual information retrieval. In C. Peters, Clough, P., Gonzalo, J., Jones, G., Kluck, M., Magnini, B. (Ed.), Advances in Cross-Language Retrieval CLEF-2004, (to appear), (version PDF).
Savoy, J. (2004c). Report on CLIR Task for the NTCIR-4 Evaluation Campaign. In Proceedings of the Fourth NTCIR Workshop on research in Information Retrieval, Automatic Text Summarization and Question Answering, (to appear), (version PDF).
Savoy, J. (2003a). Report on CLEF-2003 Multilingual Tracks. In C. Peters, Braschler, M., Gonzalo, J., Kluck, M. (Ed.), Results of the CLEF-2003, cross-language evaluation forum, (to appear), (version PDF).
Savoy, J. (2003b). Report on CLEF-2003 monolingual tracks: Fusion of probabilistic models for effective monolingual retrieval. In C. Peters, Braschler, M., Gonzalo, J., Kluck, M. (Ed.), Results of the CLEF-2003, cross-language evaluation forum, (to appear), (version PDF).
Savoy, J. (2002b). Report on CLEF-2002 Experiments: Combining multiple sources of evidence. In C. Peters, Braschler, M., Gonzalo, J., Kluck, M. (Ed.), Results of the CLEF-2002, cross-language evaluation forum, (to appear), (version PDF, my presentation at CLEF-2002, and my presentation about Amaryllis).
Savoy, J. (2002a). Morphologie et recherche d'information. Technical report (version PDF).
Savoy, J. (2001a). Report on CLEF-2001 Experiments. In C. Peters, Braschler, M., Gonzalo, J., Kluck, M. (Ed.), Results of the CLEF-2001, cross-language system evaluation campaign, (pp. 11-19). Sophia-Antipolis: ERCIM (version PDF and my presentation at CLEF-2001).
Savoy, J. (2001b). Bilingual information retrieval: CLEF-2000 experiments. Proceedings Workshop-ECSQARU-2001 management of uncertainty and imprecision in multimedia information systems, (pp. 53-63). Toulouse: IRIT (version PDF).
Savoy, J. (1999). A stemming procedure and stopword list for general French corpora. Journal of the American Society for Information Science, 50(10), 944-952.
Savoy, J. (1993). Stemming of French words based on grammatical category. Journal of the American Society for Information Science, 44(1), 1-9.
Fox, C. (1990). A stop list for general text. ACM-SIGIR Forum, 24, 19-35.
All the software/information given out on this Web site is covered by the BSD License (see http://www.opensource.org/licenses/bsd-license.html), with Copyright (c) 2005, Jacques Savoy.
Essentially, all this means is that you can do what you like with the code, except claim another Copyright for it, or claim that it is issued under a different license. The software is also issued without warranties, which means that if anyone suffers through its use, they cannot come back and sue you. You also have to alert anyone to whom you give this software/information to the fact that it is covered by the BSD license.