Lemmatizer

The WMTrans Lemmatizer is a program that returns the citation form (lemma) of any valid word in a specified language, as used in POS taggers.
A valid word is a lexicalized word, that is, a word that has a lexical entry. If you need to recognize potentially valid words that may not have a lexical entry, please refer to the Unknown Word Lemmatizer. The result of a query is a list of the matching citation forms, each followed by its category. The category can be used as a filter during analysis.
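The relationship between a query result and a category filter can be sketched in Java as follows. This is only an illustration: the `Analysis` class and the `filterByCategory` method are hypothetical names chosen for this sketch and are not part of the WMTrans API, and the result list is written by hand instead of being returned by the Lemmatizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LemmaFilterSketch {

    /** One analysis: a citation form plus its category (hypothetical type, not the WMTrans API). */
    static final class Analysis {
        final String citationForm;
        final String category;

        Analysis(String citationForm, String category) {
            this.citationForm = citationForm;
            this.category = category;
        }

        @Override
        public String toString() {
            return citationForm + " (Cat " + category + ")";
        }
    }

    /** Keeps only the analyses whose category equals the requested one. */
    static List<Analysis> filterByCategory(List<Analysis> analyses, String category) {
        List<Analysis> kept = new ArrayList<>();
        for (Analysis a : analyses) {
            if (a.category.equals(category)) {
                kept.add(a);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hand-written stand-in for the result of the query "moegen":
        // one verb reading and one noun reading (\u00f6 is o-umlaut).
        List<Analysis> result = Arrays.asList(
                new Analysis("m\u00f6gen", "V"),
                new Analysis("m\u00f6gen", "N"));

        // Applying the filter (Cat N) leaves only the noun reading.
        for (Analysis a : filterByCategory(result, "N")) {
            System.out.println(a);
        }
    }
}
```

In this sketch the filter is applied after the analysis. Whether the Lemmatizer itself applies the filter during or after lookup is not specified here.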

Implementation

We currently offer two versions of the software:
  • A pure Java implementation, which runs on any platform and requires JRE 1.4 or later
  • On demand: a platform-specific shared library implementation (currently available for Linux), delivered with two different APIs (ANSI C/C++ and Java)
Both versions can be easily integrated into your own product. Please refer to the developer zone for information on how to install the chosen version and how to use the delivered APIs.

Dataset

Depending on the license agreement, the delivered dataset includes either a limited number of entries or the full set of entries defined so far. See the language-specific page for further details.

Available languages

The following languages are available:
  • English
  • German
  • Italian

Please see the language-specific features that the client application needs to take into account.

Analysis Examples

The Lemmatizer analyzes any word form and returns a list of all corresponding citation forms together with their categories. (The Java version also offers an API function that returns only the citation forms.) The following examples show possible analysis interactions with the WMTrans Lemmatizer. The formal output syntax is described in the WMTrans developer zone.
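To illustrate how a citation-forms-only result relates to the full result, the following Java sketch reduces a list of (citation form, category) pairs to the distinct citation forms. It is not the API function mentioned above: the class name, the method name `citationFormsOnly`, and the two-element array layout are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

public class CitationFormsSketch {

    /**
     * Reduces (citation form, category) pairs to the distinct citation forms,
     * keeping first-seen order. Each pair is a two-element String array:
     * index 0 is the citation form, index 1 the category (an assumed layout).
     */
    static List<String> citationFormsOnly(List<String[]> analyses) {
        LinkedHashSet<String> forms = new LinkedHashSet<>();
        for (String[] analysis : analyses) {
            forms.add(analysis[0]);
        }
        return new ArrayList<>(forms);
    }

    public static void main(String[] args) {
        // Hand-written stand-in for the full result of the query "moegen":
        // the same citation form appears once per category (\u00f6 is o-umlaut).
        List<String[]> analyses = Arrays.asList(
                new String[] {"m\u00f6gen", "V"},
                new String[] {"m\u00f6gen", "N"});

        // Both readings share one citation form, so a single entry remains.
        System.out.println(citationFormsOnly(analyses));
    }
}
```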

German Examples

query   -> ging
result  -> gehen
             (Cat V)
           

query   -> moegen
result  -> mögen
             (Cat V)(Flach ouml),
             (Cat N)(Flach ouml)

query   -> moegen   Filter: (Cat N)
result  -> mögen
             (Cat N)(Flach ouml)
           

English Examples

query   -> did
result  -> do
             (Cat V)
           

query   -> cat's   Filter: (Cat N)
result  -> cat
             (Cat N)(Contraction N+'s/Clitic),
             (Cat N)(Contraction N+have/V),
             (Cat N)(Contraction N+be/V)

Italian Examples

query   -> andai
result  -> andare
             (Cat V)
           

query   -> cacciandolo
result  -> cacciare
             (Cat V)(Contraction lo/Pron+V)
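
A client application may want to split an analysis line such as the ones shown in the examples above into its individual features. The following Java sketch does this under the assumption that each feature is written as a parenthesized group consisting of a name, a space, and a value, as in the examples. The formal output syntax is defined in the WMTrans developer zone and may differ; the class and method names here are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FeatureNotationSketch {

    // One "(Name value)" group: a name without spaces or parentheses,
    // whitespace, then the value up to the closing parenthesis.
    private static final Pattern FEATURE =
            Pattern.compile("\\(([^\\s()]+)\\s+([^()]*)\\)");

    /** Returns the features of one analysis line as name/value pairs, in order of appearance. */
    static Map<String, String> parseFeatures(String line) {
        Map<String, String> features = new LinkedHashMap<>();
        Matcher m = FEATURE.matcher(line);
        while (m.find()) {
            features.put(m.group(1), m.group(2).trim());
        }
        return features;
    }

    public static void main(String[] args) {
        // Analysis line taken from the English example for "cat's".
        System.out.println(parseFeatures("(Cat N)(Contraction N+have/V)"));
        // prints {Cat=N, Contraction=N+have/V}
    }
}
```

Text outside the parenthesized groups, such as the comma that separates alternative analyses, is ignored by this sketch.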