Lemmatizer
The WMTrans Lemmatizer is a program that returns the citation form of any valid word for a specified language, as used in POS taggers. A valid word is a lexicalized word, that is, a word that has a lexical entry. If you need to recognize potentially valid words that may not have a lexical entry, please refer to the Unknown Word Lemmatizer. The result of a query is a list of the corresponding citation forms, each followed by its category. The category can be used as a filter during analysis.
Implementation
We currently offer two versions of the software:
- A pure Java implementation, which runs on any platform and requires JRE 1.4 or later to be installed
- On demand: a platform-specific shared library implementation (currently available for Linux), delivered with two different APIs (ANSI C/C++ and Java)
Dataset
Depending on the license agreement, the delivered dataset includes either a limited number of entries or the full set of entries defined so far. See the language-specific page for further details.
Available languages
The following languages are available:
- English
- German
- Italian
Each language has specific features that the client application needs to take into account; please see the language-specific pages for details.
Analysis Examples
The Lemmatizer analyzes any word form and returns a list of all corresponding citation forms, each together with its category (the Java version also offers an API function that returns only the citation forms). The following examples show possible analysis interactions with the WMTrans Lemmatizer. The formal output syntax is described in the WMTrans developer zone.
German Examples
query -> ging
result -> gehen
(Cat V)
query -> moegen
result -> mögen
(Cat V)(Flach ouml),
(Cat N)(Flach ouml)
query -> moegen Filter: (Cat N)
result -> mögen
(Cat N)(Flach ouml)
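The effect of a category filter, as in the moegen query above, can be illustrated with a minimal Java sketch. This is not the WMTrans API: the class names Analysis and CategoryFilter and the method sampleMoegen are hypothetical, the analyses are written by hand rather than returned by the Lemmatizer, and the code assumes a current Java version rather than JRE 1.4.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical type for illustration only; not part of the WMTrans API.
// One analysis pairs a citation form with its category.
final class Analysis {
    final String citationForm;
    final String category; // e.g. "V" or "N"

    Analysis(String citationForm, String category) {
        this.citationForm = citationForm;
        this.category = category;
    }
}

// Hypothetical helper showing what a category filter does to a result list.
final class CategoryFilter {
    // Keeps only the analyses whose category equals the requested one.
    static List<Analysis> filter(List<Analysis> analyses, String category) {
        List<Analysis> kept = new ArrayList<>();
        for (Analysis a : analyses) {
            if (a.category.equals(category)) {
                kept.add(a);
            }
        }
        return kept;
    }

    // Hand-written stand-in for the two analyses shown for "moegen".
    // "m\u00f6gen" is "mögen", escaped to avoid source-encoding issues.
    static List<Analysis> sampleMoegen() {
        List<Analysis> result = new ArrayList<>();
        result.add(new Analysis("m\u00f6gen", "V"));
        result.add(new Analysis("m\u00f6gen", "N"));
        return result;
    }
}
```

Calling CategoryFilter.filter(CategoryFilter.sampleMoegen(), "N") keeps only the noun analysis, which mirrors the filtered query above.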
English Examples
query -> did
result -> do
(Cat V)
query -> cat's Filter: (Cat N)
result -> cat
(Cat N)(Contraction N+'s/Clitic),
(Cat N)(Contraction N+have/V),
(Cat N)(Contraction N+be/V)
Italian Examples
query -> andai
result -> andare
(Cat V)
query -> cacciandolo
result -> cacciare
(Cat V)(Contraction lo/Pron+V)
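A client application may want to split the annotations shown in the examples above into feature names and values. The following Java sketch assumes only the pattern visible in those examples, namely a sequence of parenthesized groups, each holding a feature name, a space, and a value. The formal output syntax is described in the WMTrans developer zone and may differ; the class name AnnotationParser is hypothetical and not part of the WMTrans API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the WMTrans API.
final class AnnotationParser {
    // One parenthesized group: feature name, whitespace, value up to ")".
    // Assumption: values never contain a closing parenthesis.
    private static final Pattern GROUP =
            Pattern.compile("\\((\\S+)\\s+([^)]*)\\)");

    // Returns the features of one analysis in the order they appear.
    // Text outside the groups, such as a trailing comma, is ignored.
    static Map<String, String> parse(String annotation) {
        Map<String, String> features = new LinkedHashMap<>();
        Matcher m = GROUP.matcher(annotation);
        while (m.find()) {
            features.put(m.group(1), m.group(2));
        }
        return features;
    }
}
```

For the annotation of cacciandolo, AnnotationParser.parse("(Cat V)(Contraction lo/Pron+V)") yields a map with the entries Cat and Contraction, so the category can be read with get("Cat").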