WMTrans Unknown Word Lemmatizer

The Unknown Word Lemmatizer analyzes any valid word of a specified language and returns its citation form, in a manner similar to a POS tagger.

In addition to the features provided by the Lemmatizer, the Unknown Word Lemmatizer can recognize unknown (i.e. non-lexicalized) words based on word formation rules. This is particularly useful for languages with highly productive word formation. See the general Unknown Word Products introduction for further explanation.

Analysis Example

Like the lexicalized word Lemmatizer, which we also offer as a product, the Unknown Word Lemmatizer analyzes any word form and returns a list of all corresponding citation forms, each with its category. It tolerates input written without special characters (e.g. the German word mögen written as moegen) and records this with a special feature in the delivered output.
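
The tolerance for input written without special characters can be pictured as follows. This is only a minimal sketch of one possible approach, not the product's actual method: the function name special_character_candidates and the transcription table are assumptions made for illustration, and only lowercase transcriptions are handled. Each candidate spelling would then be looked up in turn.

```python
from itertools import product

# Assumed ASCII transcriptions of German special characters (illustrative only).
TRANSCRIPTIONS = {"ae": "ä", "oe": "ö", "ue": "ü", "ss": "ß"}


def special_character_candidates(word: str) -> list[str]:
    """Return every spelling obtained by optionally restoring special characters.

    Hypothetical helper: scans the word left to right and, for each
    transcription it meets, keeps both the typed and the restored variant.
    """
    pieces = []  # one list of alternatives per stretch of the word
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in TRANSCRIPTIONS:
            pieces.append([pair, TRANSCRIPTIONS[pair]])
            i += 2
        else:
            pieces.append([word[i]])
            i += 1
    # Combine the alternatives; the spelling as typed always comes first.
    return ["".join(choice) for choice in product(*pieces)]


print(special_character_candidates("moegen"))
print(special_character_candidates("saenger"))
```

Keeping the typed spelling among the candidates matters because a sequence such as "ue" is often a genuine part of the word rather than a transcription.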

The Unknown Word Lemmatizer is a superset of the lexicalized word Lemmatizer. If a word form does not belong to any lexicalized entry (i.e. the Lemmatizer cannot find it among our lexicalized words), a second API function lets you analyze its structure, segmentations and word formations, and associate one or more word formation rules with one or more of the corresponding citation forms. Refer to the API description for details on how to use it and integrate it into your program.

Here is an example of a possible analysis interaction using the Unknown Word Lemmatizer. The output syntax is the same as that of the lexicalized word Lemmatizer. The two API function calls distinguish the results of the Lemmatizer from those of the Unknown Word Lemmatizer. The formal output description can be found in the WMTrans developer zone.

Note: The unknown word analysis delivers the same type of information as the lexicalized word analysis. If you need a version that delivers more information about the analysis (segmentation, rules fired, word formation or derivation, linking elements, etc.), please contact us.
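
One way a client program might combine the two function calls is to try the lexicalized lookup first and fall back to the unknown word analysis only when nothing is found. The sketch below illustrates that pattern under stated assumptions: lookup_lexicalized, analyze_unknown and lemmatize are hypothetical names, and the two toy tables stand in for the real API, whose actual functions are documented in the API description.

```python
Reading = tuple[str, str]  # (citation form, category), e.g. ("singen", "V")

# Toy stand-ins for the two API functions (hypothetical, for illustration).
# The entries are taken from the examples shown in this document.
TOY_LEXICALIZED: dict[str, list[Reading]] = {
    "sang": [("sang", "N"), ("singen", "V")],
}
TOY_UNKNOWN: dict[str, list[Reading]] = {
    "aufgesunken": [("aufsinken", "V")],
    "abbausicheres": [("abbausicher", "A")],
}


def lookup_lexicalized(word: str) -> list[Reading]:
    """Stand-in for the lexicalized word function call."""
    return TOY_LEXICALIZED.get(word, [])


def analyze_unknown(word: str) -> list[Reading]:
    """Stand-in for the unknown word function call."""
    return TOY_UNKNOWN.get(word, [])


def lemmatize(word: str) -> tuple[str, list[Reading]]:
    """Return the source of the analysis and the readings it produced."""
    readings = lookup_lexicalized(word)
    if readings:
        return "lexicalized", readings
    # Only words without a lexicalized entry reach the unknown word analysis.
    return "unknown", analyze_unknown(word)


print(lemmatize("sang"))
print(lemmatize("aufgesunken"))
```

Returning the source alongside the readings keeps the distinction between lexicalized and rule-based results visible to the caller.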

Lexicalized Word Function Call

query   -> sang
result  -> sang
              (Cat N)
              singen
              (Cat V)
           

query   -> sang   Filter: (Cat V)
result  -> singen
              (Cat V)
           

query   -> saenger
result  -> sänger
              (Cat N)(Flach auml)
           


Unknown Word Function Call

query   -> aufsinken
result  -> aufsinken
               (Cat V)
           

query   -> aufgesunken
result  -> aufsinken
               (Cat V)
           

query   -> skandalgeschüttelten
result  -> skandalgeschüttelt
               (Cat A)
           

query   -> abbausicheres
result  -> abbausicher
               (Cat A)
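
Results in the layout shown above could be read into a program roughly as follows. This sketch is inferred only from the examples on this page and is not the formal output description, which is available in the WMTrans developer zone. The names parse_result and filter_by_category are hypothetical, and the sketch assumes that each citation form is followed by a line of parenthesized feature pairs.

```python
import re

# One parenthesized feature such as "(Cat V)" or "(Flach auml)".
FEATURE = re.compile(r"\(([^()\s]+)\s+([^()]+)\)")


def parse_result(text: str) -> list[tuple[str, dict[str, str]]]:
    """Parse a result block into (citation form, features) entries."""
    entries: list[tuple[str, dict[str, str]]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("result") and "->" in line:
            # Drop the "result  ->" prefix and keep the first citation form.
            line = line.split("->", 1)[1].strip()
        if not line:
            continue
        if line.startswith("("):
            # A feature line belongs to the most recent citation form.
            if entries:
                entries[-1][1].update(FEATURE.findall(line))
        else:
            entries.append((line, {}))
    return entries


def filter_by_category(entries, category: str):
    """Keep only the entries whose Cat feature equals the given category."""
    return [entry for entry in entries if entry[1].get("Cat") == category]


sample = """result  -> sang
              (Cat N)
              singen
              (Cat V)
"""
parsed = parse_result(sample)
print(parsed)
print(filter_by_category(parsed, "V"))
```

Filtering on the client side in this way mirrors the effect of the Filter: (Cat V) query in the lexicalized word examples.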