Canoo LanguageTools: Experience the power!
The Canoo Languagetools are software components for smart text processing, available in three languages: German, English, and Italian. Typical uses include integration into search engines, software for text indexing, text mining, language learning, hyperlink generation, spell checking, grammar checking, word stemming, and machine translation applications.
Canoo Unknown Word Tools:
Unknown Word Analyzer, Lemmatizer & Recognizer
In addition to the features provided by the corresponding simple products (Analyzer, Recognizer, and Lemmatizer), the Unknown Word Tools can analyze and recognize unknown (i.e. not lexicalized) words based on word formation rules. This is especially useful for languages with highly productive word formation. Consider, for example, German: out of more than 200,000 entries, only 11% are simple base entries. This shows the potential of considering word formation rules: the current versions of the Unknown Word Tools recognize 95% of the words occurring in our test corpus.
Performance
The analysis of unknown words is more complex than the simple retrieval of lexicalized elements from a finite-state machine. To reduce the performance gap between the analysis of lexicalized words and that of unknown words, all Unknown Word products use an internal cache. The cache stores the most frequently requested queries. When a new query matches a query stored in the cache, the stored results are returned without any further analysis. By default the cache works transparently, but the developer can adjust some cache settings, for example by disabling the cache, changing its default size, or storing its state on a persistent device for reuse in later sessions.
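The caching behaviour described above can be pictured with a small sketch. It is not Canoo's implementation: the class name `CachedAnalyzer`, the `Function<String, List<String>>` standing in for the real analysis, and the least-recently-used eviction policy are all assumptions made for illustration (the product description only says that the most frequently requested queries are kept). The sketch also uses Java 8 syntax for brevity.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical wrapper, not part of the Canoo API.
class CachedAnalyzer {
    private final Function<String, List<String>> analyzer; // stand-in for the real analysis
    private final Map<String, List<String>> cache;
    private int misses = 0;

    CachedAnalyzer(Function<String, List<String>> analyzer, final int capacity) {
        this.analyzer = analyzer;
        // accessOrder = true makes the map iterate from least to most recently used,
        // so removeEldestEntry evicts the least recently used query.
        this.cache = new LinkedHashMap<String, List<String>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, List<String>> eldest) {
                return size() > capacity;
            }
        };
    }

    List<String> analyze(String word) {
        List<String> hit = cache.get(word);
        if (hit != null) {
            return hit; // cached result, no further analysis
        }
        misses++;
        List<String> result = analyzer.apply(word);
        cache.put(word, result);
        return result;
    }

    int misses() {
        return misses;
    }

    public static void main(String[] args) {
        CachedAnalyzer a = new CachedAnalyzer(w -> Collections.singletonList(w), 100);
        a.analyze("abbausicheres");
        a.analyze("abbausicheres"); // served from the cache
        System.out.println(a.misses()); // prints 1
    }
}
```

In this sketch a capacity of 0 effectively disables the cache, because every entry is evicted as soon as it is stored.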
Overgeneration
Considering ad hoc word formation during unknown word analysis automatically leads to the problem of overgeneration. We manage to minimize overgeneration using three filter levels:
- Word Formation Rule Level
As for other Canoo Languagetools products, the data is generated using the Canoo Languagetools authoring tool which contains a complete model of word formation. Using this tool, we can tune the derivation and word formation rules that we need for the Unknown Word product. The fine tuning is based on a large text corpus analysis. For more information on our Word Manager authoring tool, see publications.
- Generation Level
During the generation of relevant data on words from our authoring tool we can recognize and eliminate a wide range of elements, which are not relevant for the unknown word analysis.
- Runtime Level
Some elements derived by overgenerating can only be found during runtime analysis. A filter checks and eliminates generated results at runtime level.
Test Results
We tested the coverage of our Unknown Word Products suite for German using a test corpus. The test corpus includes a broad range of text types, such as fiction, newspaper texts, and scientific and technical documents. The current versions of the Unknown Word products recognize an average of 95.2% of the words occurring in the test corpus. Most of the words among the remaining 4.8% are proper names, foreign words in quotations, and uncommon technical terms (note: we did not eliminate any words from the texts before submitting them to the analysis). The proportion of unknown words is higher in technical reports and in short newspaper articles containing many proper names and foreign-language quotes, whereas it is considerably lower in common descriptive writing.
Implementation
We offer the software as a pure Java implementation which runs on any platform. The only prerequisite is JRE 1.5 or higher. A small and clear API simplifies integration into your own product.
Available languages
Unknown Word Analyzer
The Unknown Word Analyzer returns the morphosyntactic information for a word or an unknown word (i.e. a word not found in the lexicon): e.g., citation form, word category, gender, case, tense, auxiliary verbs together with all possible decompositions and derivations and the categories of the respective elements. All features can be used as filters during the analysis.
Typical applications include
Intelligent Text Processing such as Text Analysis and Text Understanding, Summarization, Machine Translation, Parsing, Linguistic Annotation
Example
In order to facilitate the integration of the product, the Java API returns structured Java objects (see developerzone for details). These objects may be easily converted into simple strings, by means of a method included in the API. The following example presents only these strings instead of the underlying objects. Further examples are given in the developerzone.
query -> abbausicheres
result ->
<inflection>
abbausicher
(Cat A)(Degree Pos)(AForm es)(ID 0)
</inflection>
<wf>
abbau + sicheres
(Cat A),
(WFRule
Compounding.A-Comp.N+A.
No-Umlaut.N+A_No_Linking_Element)
1: abbau (Cat N)
2: sicher (Cat A)
</wf>
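The Java API returns structured objects, so string parsing is normally unnecessary. If only the string form shown above is available (for example in log output), the feature notation can be split into name/value pairs along the following lines. This is a minimal sketch: the class `FeatureStrings` and its `parse` method are hypothetical helpers, not part of the Canoo API, and the assumed syntax `(Name Value)` is inferred from the examples on this page. If a name occurs more than once in a reading, the last value wins in this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Canoo API.
class FeatureStrings {
    // One "(Name Value)" group: a name without spaces, then the value up to ")".
    private static final Pattern FEATURE =
            Pattern.compile("\\(([^\\s()]+)\\s+([^()]+)\\)");

    // Returns the features in the order in which they appear in the string.
    static Map<String, String> parse(String features) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = FEATURE.matcher(features);
        while (m.find()) {
            result.put(m.group(1), m.group(2).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {Cat=A, Degree=Pos, AForm=es, ID=0}
        System.out.println(parse("(Cat A)(Degree Pos)(AForm es)(ID 0)"));
    }
}
```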
Unknown Word Lemmatizer
The Unknown Word Lemmatizer returns the citation form and category of a word or an unknown word (i.e. a word not found in the lexicon) based on the possible words from which it has been derived or composed. The word category can be used as a filter during the analysis.
Typical applications include
Indexing, Information Retrieval, Intelligent Search, Partial Parsing, Categorization, Linguistic Annotation
Examples
In order to facilitate the integration of the product, the Java API returns structured Java objects (see developerzone for details). These objects may be easily converted into simple strings, by means of a method included in the API. The following examples present only these strings instead of the underlying objects. Further examples are given in the developerzone.
query -> skandalgeschüttelten
result -> skandalgeschüttelt
(Cat A)
query -> aufgesunken
result -> aufsinken
(Cat V)
Unknown Word Recognizer
The Unknown Word Recognizer can perform two tests. In both cases, the result of a query is a simple yes/no answer (in the form of "1/0" or "true/false").
- Like the simple Recognizer, it may be used to determine whether a given word (be it inflected or in citation form) is a valid word form.
- In the second type of test, it may be used to determine whether an unknown word (i.e. a word not found in the lexicon) can be decomposed into known words or derived from known words.
Typical applications include
Spell Checker
Examples
Further examples and explanations are provided in the developerzone.
query -> skandalgeschüttelten
result -> true
query -> sdfsdfsd
result -> false
Canoo Language Analyzers:
Inflection Analyzer, Lemmatizer & Recognizer
Canoo Language Analyzers simplify the processing of natural language: the Inflection Analyzer returns the citation form and morphosyntactic classification of any valid word, in a format used by language analysis programs. The Lemmatizer returns the citation form of any valid word for a specified language, as used in POS taggers. The Recognizer is a program able to recognize any valid word, be it inflected or in citation form.
Inflection Analyzer
The Inflection Analyzer returns the citation form and morphosyntactic classification of any valid word, in a format used by language analysis programs. A query result provides a list of citation forms, followed by a list of morphosyntactic features related to the analyzed word form. All features can be used as filters during the analysis.
We offer the software as a pure Java implementation which runs on any platform (requires JRE 1.5 or higher). A small and clear API simplifies the integration into your own product. (Upon request, the product is also available as a platform-specific shared-library implementation for Linux.)
The product is available for German, English, and Italian.
Typical applications include
Intelligent Text Processing such as Text Analysis and Text Understanding, Summarization, Machine Translation, Parsing, Linguistic Annotation
Examples
In order to facilitate the integration of the product, the Java API returns structured Java objects (see developerzone for details). These objects may be easily converted into simple strings, by means of a method included in the API. The following examples present only these strings instead of the underlying objects.
German
query > ging
result > gehen
(Cat V)(Aux sein)(Mod Ind)(Temp Impf)(Pers 1st)(Num SG)(ID 0-1),
(Cat V)(Aux sein)(Mod Ind)(Temp Impf)(Pers 3rd)(Num SG)(ID 0-1)
English
query > did
result > do
(Cat V)(Variety BCE)(Tense Past)(ID 0-1)
Italian
query > andai
result > andare
(Cat V)(Aux essere)(Mod Ind)(Temp Pass-Rem)(Pers 1st)(Num SG)(ID 0-1)
Lemmatizer
The Lemmatizer is a program that returns the citation form and category of any valid word, as used in POS taggers. The result of a query is a list of corresponding citation forms, each followed by the corresponding category. The category can be used as a filter during the analysis.
We offer the software as a pure Java implementation which runs on any platform (requires JRE 1.5 or higher). A small and clear API simplifies the integration into your own product. (Upon request, the product is also available as a platform-specific shared-library implementation for Linux.)
The product is available for German, English, and Italian.
Typical applications include
Indexing, Information Retrieval, Intelligent Search, Partial Parsing, Categorization, Linguistic Annotation
Examples
In order to facilitate the integration of the product, the Java API returns structured Java objects (see developerzone for details). These objects may be easily converted into simple strings, by means of a method included in the API. The following examples present only these strings instead of the underlying objects.
German
query > moegen
result > mögen
(Cat V)(Flach ouml),
(Cat N)(Flach ouml)
English
query > cat's Filter: (Cat N)
result > cat
(Cat N)(Contraction N+'s/Clitic),
(Cat N)(Contraction N+have/V),
(Cat N)(Contraction N+be/V)
Italian
query > cacciandolo
result > cacciare
(Cat V)(Contraction lo/Pron+V)
Recognizer
The Recognizer is a program able to determine whether a given word (be it inflected or in citation form) is a valid word form. The result of a query is a simple yes/no answer (in the form of "1/0" or "true/false").
We offer the software as a pure Java implementation which runs on any platform (requires JRE 1.5 or higher). A small and clear API simplifies the integration into your own product. (Upon request, the product is also available as a platform-specific shared-library implementation for Linux.)
The product is available for German, English, and Italian.
Typical applications include
Spell Checker
Examples
Explanations for these examples are provided in the developerzone.
German
query > moegen
result > true
query > moexyzgen
result > false
English
query > cat's
result > true
query > moexyzgen
result > false
Italian
query > cacciandolo
result > true
query > moexyzgen
result > false
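The spell-checking use case mentioned above can be sketched as follows. The real Recognizer API is documented in the developerzone and is not reproduced here; in this sketch a plain `Predicate<String>` stands in for it, and the class `SpellCheckSketch`, the method `unknownWords`, the toy word set, and the tokenization rule (letters and apostrophes form a token) are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch; the Predicate stands in for a yes/no recognizer query.
class SpellCheckSketch {
    // Returns every token of the text that the recognizer rejects.
    static List<String> unknownWords(String text, Predicate<String> recognizer) {
        List<String> rejected = new ArrayList<>();
        // Split on anything that is neither a letter nor an apostrophe.
        for (String token : text.split("[^\\p{L}']+")) {
            if (!token.isEmpty() && !recognizer.test(token)) {
                rejected.add(token);
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        // Toy stand-in for the lexicon, using words from the examples above.
        Set<String> known = new HashSet<>(Arrays.asList("cat's", "moegen"));
        // prints [moexyzgen]
        System.out.println(unknownWords("cat's moexyzgen moegen", known::contains));
    }
}
```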
Canoo Analyzer/Generator products:
Two products combined into one
The Canoo Analyzer/Generator products process words in two complementary directions: the Inflection Analyzer/Generator analyzes and generates all the inflected forms of a particular lexeme, and the Word Formation Analyzer/Generator analyzes and generates the first level of word formation history for any legal lexeme.
Inflection Analyzer/Generator
The Inflection Analyzer/Generator analyzes and generates the inflected forms of a particular lexeme. The result of an analysis query is a list of citation forms, followed by a list of morphosyntactic features related to the analyzed word form. The result of a generation query is a list of word forms, followed by a list of morphosyntactic features related to each single word form.
We offer the software as a pure Java implementation which runs on any platform (requires JRE 1.5 or higher). A small and clear API simplifies the integration into your own product. (Upon request, the product is also available as a platform-specific shared-library implementation for Linux.)
The product is available for German, English, and Italian.
Typical applications include
- Intelligent Text Processing such as Text Analysis and Text Understanding, Summarization, Machine Translation, Parsing, Linguistic Annotation
- E-Learning Applications
Examples
In order to facilitate the integration of the product, the Java API returns structured Java objects (see developerzone for details). These objects may be easily converted into simple strings, by means of a method included in the API. The following examples present only these strings instead of the underlying objects.
Analysis Examples
German
query -> ging
result -> gehen
(Cat V)(Aux sein)(Mod Ind)(Temp Impf)(Pers 1st)(Num SG)(ID 0-1),
(Cat V)(Aux sein)(Mod Ind)(Temp Impf)(Pers 3rd)(Num SG)(ID 0-1)
English
query -> did
result -> do
(Cat V)(Variety BCE)(Tense Past)(ID 0-1)
Italian
query -> andai
result -> andare
(Cat V)(Aux essere)(Mod Ind)(Temp Pass-Rem)(Pers 1st)(Num SG)(ID 0-1)
Generation Examples
German
query -> haus
result ->
häuser
(Cat N)(Gender N)(Num PL)(Case Nom)(ID 0-1),
(Cat N)(Gender N)(Num PL)(Case Gen)(ID 0-1),
(Cat N)(Gender N)(Num PL)(Case Acc)(ID 0-1)
häusern
(Cat N)(Gender N)(Num PL)(Case Dat)(ID 0-1)
haeuser
(Cat N)(Gender N)(Num PL)(Case Nom)(Flach auml)(ID 0-1),
(Cat N)(Gender N)(Num PL)(Case Gen)(Flach auml)(ID 0-1),
(Cat N)(Gender N)(Num PL)(Case Acc)(Flach auml)(ID 0-1)
haeusern
(Cat N)(Gender N)(Num PL)(Case Dat)(Flach auml)(ID 0-1)
haus
(Cat N)(Gender N)(Num SG)(Case Nom)(ID 0-1),
(Cat N)(Gender N)(Num SG)(Case Dat)(ID 0-1),
(Cat N)(Gender N)(Num SG)(Case Acc)(ID 0-1)
hause
(Cat N)(Gender N)(Num SG)(Case Dat)(ID 0-1)
hauses
(Cat N)(Gender N)(Num SG)(Case Gen)(ID 0-1)
English
query -> damn
result ->
damn
(Cat V)(Variety BCE)(VForm Infinitive)(ID 0-1),
(Cat V)(Variety BCE)(Tense Present)(VForm Base)(ID 0-1)
damned
(Cat V)(Variety BCE)(Tense Past)(ID 0-1),
(Cat V)(Variety BCE)(VForm Past_Participle)(ID 0-1)
damning
(Cat V)(Variety BCE)(VForm ing_Participle)(ID 0-1)
damns
(Cat V)(Variety BCE)(Tense Present)(VForm s)(ID 0-1)
Italian
query -> andare Filter: (Mod Ind)(Pers 1st)
result ->
andai
(Cat V)(Aux essere)(Mod Ind)(Temp Pass-Rem)(Pers 1st)(Num SG)(ID 0-1)
andammo
(Cat V)(Aux essere)(Mod Ind)(Temp Pass-Rem)(Pers 1st)(Num PL)(ID 0-1)
andavo
(Cat V)(Aux essere)(Mod Ind)(Temp Impf)(Pers 1st)(Num SG)(ID 0-1)
andavamo
(Cat V)(Aux essere)(Mod Ind)(Temp Impf)(Pers 1st)(Num PL)(ID 0-1)
vado
(Cat V)(Aux essere)(Mod Ind)(Temp Pres)(Pers 1st)(Num SG)(ID 0-1)
andiamo
(Cat V)(Aux essere)(Mod Ind)(Temp Pres)(Pers 1st)(Num PL)(ID 0-1)
andrò
(Cat V)(Aux essere)(Mod Ind)(Temp Fut)(Pers 1st)(Num SG)(ID 0-1)
andremo
(Cat V)(Aux essere)(Mod Ind)(Temp Fut)(Pers 1st)(Num PL)(ID 0-1)
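The Italian example above passes the filter (Mod Ind)(Pers 1st) and receives only readings that carry both features. The product applies such filters itself during analysis and generation; the sketch below merely illustrates one plausible reading of the filter semantics, namely that a reading is kept when it contains every feature named in the filter. The class `FeatureFilter` and its methods are hypothetical and work on the string form rather than on the API's own objects.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of filter matching; not part of the Canoo API.
class FeatureFilter {
    private static final Pattern FEATURE = Pattern.compile("\\(([^()]+)\\)");

    // Collects each "(Name Value)" group as one normalized "Name Value" string.
    static Set<String> features(String s) {
        Set<String> out = new HashSet<>();
        Matcher m = FEATURE.matcher(s);
        while (m.find()) {
            out.add(m.group(1).trim().replaceAll("\\s+", " "));
        }
        return out;
    }

    // Assumed semantics: a reading passes if it has every feature of the filter.
    static boolean matches(String reading, String filter) {
        return features(reading).containsAll(features(filter));
    }

    public static void main(String[] args) {
        String filter = "(Mod Ind)(Pers 1st)";
        String vado = "(Cat V)(Aux essere)(Mod Ind)(Temp Pres)(Pers 1st)(Num SG)(ID 0-1)";
        System.out.println(matches(vado, filter)); // prints true
    }
}
```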
Word Formation Analyzer/Generator
The Word Formation Analyzer/Generator analyzes and generates the first level of word formation history for any legal lexeme. The tool expects the input lexeme to be in its citation form. The result of an analysis query is a list of source lexemes, from which the given lexeme derives. The result of a generation query is a list of derived lexemes, created by derivation and word formation. All features can be used as filters during the analysis and generation.
We offer the software as a pure Java implementation which runs on any platform (requires JRE 1.5 or higher). A small and clear API simplifies the integration into your own product. (Upon request, the product is also available as a platform-specific shared-library implementation for Linux.)
The product is available for German, English, and Italian.
Typical applications include
- E-Learning Applications
- Information Retrieval and Intelligent Search (e.g. query expansion)
Examples
In order to facilitate the integration of the product, the Java API returns structured Java objects (see developerzone for details). These objects may be easily converted into simple strings, by means of a method included in the API. The following examples present only these strings instead of the underlying objects.
Analysis Examples
German
query -> kennenlernen
result ->
kennen (Cat V)(Aux haben)
lernen (Cat V)(Aux haben)
English
query -> countdown
result ->
count (Cat V)(Variety BCE)
down (Cat Adv)(Variety BCE)
Italian
query -> appartenenza
result ->
appartenere (Cat V)(Aux avere)(Aux essere)
Generation Examples
German
query -> mahnen
result ->
abmahnen (Cat V)(Aux haben)
anmahnen (Cat V)(Aux haben)
einmahnen (Cat V)(Aux haben)
ermahnen (Cat V)(Aux haben)
gemahnen (Cat V)(Aux haben)
gemahnt (Cat A)(Lexeme mahnen)
mahnbescheid (Cat N)(Gender M)
mahnbrief (Cat N)(Gender M)
mahnend (Cat A)
mahner (Cat N)(Gender M)
mahngebühr (Cat N)(Gender F)
mahnmal (Cat N)(Gender N)(Plural e), (Cat N)(Gender N)(Plural er)
mahnruf (Cat N)(Gender M)
mahnschreiben (Cat N)(Gender N)
mahnstütte (Cat N)(Gender F)
mahnung (Cat N)(Gender F)
mahnverfahren (Cat N)(Gender N)
mahnwache (Cat N)(Gender F)
mahnwort (Cat N)(Gender N)
mahnzeichen (Cat N)(Gender N)
mahnzettel (Cat N)(Gender M)
vermahnen (Cat V)(Aux haben)
English
query -> appear
result ->
apparent (Cat A)(Variety BCE)
appearance (Cat N)(Variety BCE)
disappear (Cat V)(Variety BCE)
pre-appear (Cat V)(Variety BCE)
re-appear (Cat V)(Variety BCE)
reappear (Cat V)(Variety BCE)
Italian
query -> bosco
result ->
abbracciabosco (Cat N)(Gender M)
boscaglia (Cat N)(Gender F)
boscaiolo (Cat N)(Gender M)
boschetto (Cat N)(Gender M)
boschivo (Manner Qual)(Cat A)(Manner Qual)
boscoso (Manner Qual)(Cat A)(Manner Qual)
diboscare (Cat V)(Aux avere)
disboscare (Cat V)(Aux avere)
guardaboschi (Cat N)(Gender V)
imboscare (Cat V)(Aux avere)
imboschire (Cat V)(Aux avere)
sottobosco (Cat N)(Gender M)
tagliaboschi (Cat N)(Gender M)
Canoo Transducer Compiler
The Transducer Compiler is a standalone program that reads a text input file containing pairs of citation forms and word forms to compile and generate an optimized finite state transducer structure.
This software is available as platform-specific implementation (for Linux). We offer compilers for all Canoo Languagetools products.
Input File Examples
An input *.src file consists of a sequence of lines, each of which must contain three elements: a citation form, a word form, and an index that refers to an entry in the feature table, which must be delivered separately (file *.tab).
bauen baue 1
bauen baust 1
bauen baut 1
bauen bauens 2
...
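To make the three-element line format concrete, the following sketch reads one such line into a small value object. It is only an illustration of the input format described above, not the compiler's own parser: the class `SrcLine` is hypothetical, and whitespace is assumed to be the field separator, as the example lines suggest.

```java
// Hypothetical reader for one *.src line; whitespace-separated fields are assumed.
class SrcLine {
    final String citationForm;
    final String wordForm;
    final int featureIndex; // reference into the separate *.tab feature table

    SrcLine(String citationForm, String wordForm, int featureIndex) {
        this.citationForm = citationForm;
        this.wordForm = wordForm;
        this.featureIndex = featureIndex;
    }

    // Parses "citationForm wordForm index"; rejects lines with a different shape.
    static SrcLine parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected 3 fields: " + line);
        }
        return new SrcLine(parts[0], parts[1], Integer.parseInt(parts[2]));
    }

    public static void main(String[] args) {
        SrcLine entry = parse("bauen baust 1");
        // prints bauen -> baust (features 1)
        System.out.println(entry.citationForm + " -> " + entry.wordForm
                + " (features " + entry.featureIndex + ")");
    }
}
```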



