Canoo LanguageTools: Experience the power!

The Canoo Languagetools are software components for smart text processing, available in three languages: German, English, and Italian. Typical uses include integration into search engines, software for text indexing, text mining, language learning, hyperlink generation, spell checking, grammar checking, word stemming, and machine translation applications.

  • The Unknown Word Tools analyze and recognize unknown (i.e. not lexicalized) words based on word formation rules.
  • The Inflection Analyzer returns the citation form and morphosyntactic classification for language analysis.
  • The Lemmatizer returns the citation form of any valid word, as used in POS taggers.
  • The Recognizer recognizes any valid word.
  • The Inflection Analyzer/Generator analyzes and generates all the inflected forms of a particular lexeme. For generation, all features can be used as filters.
  • The Word Formation Analyzer/Generator analyzes and generates the first level of word formation history for any legal lexeme.
  • The Transducer Compiler is a standalone program that reads a text input file containing pairs of citation forms and word forms to compile and generate an optimized finite state transducer structure.

Canoo Unknown Word Tools:
Unknown Word Analyzer, Lemmatizer & Recognizer

In addition to the features provided by the corresponding simple products (Analyzer, Recognizer, and Lemmatizer), the Unknown Word Tools can analyze and recognize unknown (i.e. not lexicalized) words based on word formation rules. This is especially useful for languages with highly productive word formation. Consider, for example, German: out of more than 200,000 entries, only 11% are simple base entries. This shows the potential of considering word formation rules: the current versions of the Unknown Word Tools recognize 95% of the words occurring in our test corpus.



Performance

The analysis of unknown words is more complex than the simple retrieval of lexicalized elements from a finite-state machine. To reduce the performance gap between the analysis of lexicalized words and that of unknown words, all Unknown Word products transparently use an internal cache. The cache stores the most frequently requested queries. When a new query matches a query stored in the cache, the stored results are returned without any further analysis. The cache works transparently by default, but the developer can adjust some cache settings, such as disabling the cache, changing its default size, or saving its state to persistent storage for reuse in later sessions.
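
The caching idea described above can be illustrated with a small sketch. It is not the Canoo implementation: the class name `QueryCache`, the `analyzer` function, and the eviction policy are assumptions made for illustration. The text speaks of the most frequently requested queries; this sketch swaps in a simpler least-recently-used policy based on `LinkedHashMap`, and it assumes Java 8 or higher for `java.util.function.Function`.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical bounded cache placed in front of an expensive analysis function. */
class QueryCache {
    private final int capacity;
    private final Function<String, List<String>> analyzer; // stand-in for the real analysis
    private final Map<String, List<String>> entries;
    private int misses = 0;

    QueryCache(int capacity, Function<String, List<String>> analyzer) {
        this.capacity = capacity;
        this.analyzer = analyzer;
        // accessOrder = true: iteration order runs from least to most recently used.
        this.entries = new LinkedHashMap<String, List<String>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, List<String>> eldest) {
                // Evict the least recently used entry once the limit is exceeded.
                return size() > QueryCache.this.capacity;
            }
        };
    }

    /** Returns the cached result if present; otherwise runs the analysis and stores it. */
    List<String> analyze(String query) {
        List<String> cached = entries.get(query);
        if (cached != null) {
            return cached;
        }
        misses++;
        List<String> result = analyzer.apply(query);
        entries.put(query, result);
        return result;
    }

    /** Number of queries that had to be analyzed because they were not cached. */
    int misses() {
        return misses;
    }
}
```

With a capacity of 0, every new entry is evicted immediately, which effectively disables the cache in this sketch.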


Overgeneration

Considering ad hoc word formation during unknown word analysis inevitably leads to the problem of overgeneration. We minimize overgeneration using three filter levels:

  • Word Formation Rule Level
    As for other Canoo Languagetools products, the data is generated using the Canoo Languagetools authoring tool which contains a complete model of word formation. Using this tool, we can tune the derivation and word formation rules that we need for the Unknown Word product. The fine tuning is based on a large text corpus analysis. For more information on our Word Manager authoring tool, see publications.

  • Generation Level
    When generating the relevant word data from our authoring tool, we can recognize and eliminate a wide range of elements that are not relevant for unknown word analysis.

  • Runtime Level
    Some overgenerated elements can only be detected during runtime analysis. A filter checks the generated results at runtime and eliminates them.

Test Results

We tested the coverage of our Unknown Word Products suite for German using a test corpus. The test corpus includes a broad range of text types, such as fiction, newspaper texts, and scientific and technical documents. The current versions of the Unknown Word products recognize an average of 95.2% of the words occurring in the test corpus. Most of the words in the remaining 4.8% are proper names, foreign words in quotations, and uncommon technical terms (note: we did not eliminate any words from the texts before submitting them to the analysis). The share of unknown words is higher in technical reports and in short newspaper articles containing many proper names and foreign-language quotes, whereas it is considerably lower in common descriptive writing.


Implementation

We offer the software as a pure Java implementation which runs on any platform. The only prerequisite is JRE 1.5 or higher. A small and clear API simplifies integration into your own product.


Available languages

German


Unknown Word Analyzer

The Unknown Word Analyzer returns the morphosyntactic information for a word or an unknown word (i.e. a word not found in the lexicon): e.g., citation form, word category, gender, case, tense, auxiliary verbs together with all possible decompositions and derivations and the categories of the respective elements. All features can be used as filters during the analysis.

Typical applications include

Intelligent Text Processing such as Text Analysis and Text Understanding, Summarization, Machine Translation, Parsing, Linguistic Annotation

Example

In order to facilitate the integration of the product, the Java API returns structured Java objects (see developerzone for details). These objects may be easily converted into simple strings, by means of a method included in the API. The following example presents only these strings instead of the underlying objects. Further examples are given in the developerzone.

query   -> abbausicheres
result  -> 
           <inflection>
            abbausicher
              (Cat A)(Degree Pos)(AForm es)(ID 0)
           </inflection>
           <wf>
            abbau + sicheres
              (Cat A),
              (WFRule 
               Compounding.A-Comp.N+A.
               No-Umlaut.N+A_No_Linking_Element)
            1: abbau (Cat N)
            2: sicher (Cat A)
           </wf> 

Unknown Word Lemmatizer

The Unknown Word Lemmatizer returns the citation form and category of a word or an unknown word (i.e. a word not found in the lexicon) based on the possible words from which it has been derived or composed. The word category can be used as a filter during the analysis.

Typical applications include

Indexing, Information Retrieval, Intelligent Search, Partial Parsing, Categorization, Linguistic Annotation

Examples

In order to facilitate the integration of the product, the Java API returns structured Java objects (see developerzone for details). These objects may be easily converted into simple strings, by means of a method included in the API. The following examples present only these strings instead of the underlying objects. Further examples are given in the developerzone.

query   -> skandalgeschüttelten
result  -> skandalgeschüttelt
             (Cat A)

query   -> aufgesunken
result  -> aufsinken
             (Cat V)


Unknown Word Recognizer

The Unknown Word Recognizer can perform two tests. In both cases, the result of a query is a simple yes/no answer (in the form of "1/0" or "true/false").

  1. Like the simple Recognizer, it may be used to determine whether a given word (be it inflected or in citation form) is a valid word form.
  2. In the second type of test, it may be used to determine whether an unknown word (i.e. a word not found in the lexicon) can be decomposed into known words or derived from known words.

Typical applications include

Spell Checker

Examples

Further examples and explanations are provided in the developerzone.

query   -> skandalgeschüttelten
result  -> true

query   -> sdfsdfsd
result  -> false
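
The two tests described above can be illustrated with a deliberately naive sketch. It is not how the Unknown Word Recognizer works: the product relies on word formation rules, whereas this example only checks whether a word can be split into entries of a toy full-form lexicon. The class name `NaiveCompoundRecognizer` and the lexicon contents are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical recognizer that uses plain dictionary lookups instead of word formation rules. */
class NaiveCompoundRecognizer {
    private final Set<String> lexicon;

    NaiveCompoundRecognizer(Set<String> lexicon) {
        this.lexicon = lexicon;
    }

    /** Test 1: is the word itself listed in the lexicon? */
    boolean isKnown(String word) {
        return lexicon.contains(word.toLowerCase(Locale.ROOT));
    }

    /** Test 2: can the word be split into two or more lexicon entries? */
    boolean isDecomposable(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (int i = 1; i < w.length(); i++) {
            if (lexicon.contains(w.substring(0, i)) && splitsIntoKnown(w.substring(i))) {
                return true;
            }
        }
        return false;
    }

    /** True if the remainder is a lexicon entry or a sequence of lexicon entries. */
    private boolean splitsIntoKnown(String rest) {
        if (lexicon.contains(rest)) {
            return true;
        }
        for (int i = 1; i < rest.length(); i++) {
            if (lexicon.contains(rest.substring(0, i)) && splitsIntoKnown(rest.substring(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Toy lexicon with two entries; a real lexicon would be far larger.
        Set<String> lexicon = new HashSet<>(Arrays.asList("abbau", "sicheres"));
        NaiveCompoundRecognizer recognizer = new NaiveCompoundRecognizer(lexicon);
        System.out.println(recognizer.isDecomposable("abbausicheres"));
        System.out.println(recognizer.isDecomposable("sdfsdfsd"));
    }
}
```

A purely dictionary-based split like this accepts any concatenation of known entries, which is exactly the kind of overgeneration that rule-based filtering is meant to reduce.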
           

Canoo Language Analyzers:
Inflection Analyzer, Lemmatizer & Recognizer

Canoo Language Analyzers simplify the processing of natural language: the Inflection Analyzer returns the citation form and morphosyntactic classification of any valid word, in a format used by language analysis programs. The Lemmatizer returns the citation form of any valid word for a specified language, as used in POS taggers. The Recognizer is a program able to recognize any valid word, be it inflected or in citation form.



Inflection Analyzer

The Inflection Analyzer returns the citation form and morphosyntactic classification of any valid word, in a format used by language analysis programs. A query result provides a list of citation forms, followed by a list of morphosyntactic features related to the analyzed word form. All features can be used as filters during the analysis.

We offer the software as a pure Java implementation which runs on any platform (requires JRE 1.5 or higher). A small and clear API simplifies the integration into your own product. (Upon request, the product is also available as a platform-specific shared-library implementation for Linux.)

The product is available for German, English, and Italian.

Typical applications include

Intelligent Text Processing such as Text Analysis and Text Understanding, Summarization, Machine Translation, Parsing, Linguistic Annotation

Examples

In order to facilitate the integration of the product, the Java API returns structured Java objects (see developerzone for details). These objects may be easily converted into simple strings, by means of a method included in the API. The following examples present only these strings instead of the underlying objects.

German
query   > ging
result  > gehen
            (Cat V)(Aux sein)(Mod Ind)(Temp Impf)(Pers 1st)(Num SG)(ID 0-1),
            (Cat V)(Aux sein)(Mod Ind)(Temp Impf)(Pers 3rd)(Num SG)(ID 0-1)

English
query   > did
result  > do
            (Cat V)(Variety BCE)(Tense Past)(ID 0-1)

Italian
query   > andai
result  > andare
            (Cat V)(Aux essere)(Mod Ind)(Temp Pass-Rem)(Pers 1st)(Num SG)(ID 0-1)
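
The result strings in the examples above follow a regular `(Name Value)` pattern, so an application that only has the strings can split them back into features. The following sketch assumes that pattern holds for every feature; the class name `FeatureStrings` is hypothetical and not part of the Canoo API, which already returns structured objects. Values are collected in lists because a feature name may occur more than once in a single result.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that parses "(Name Value)(Name Value)..." feature strings. */
class FeatureStrings {
    // One feature: "(" name, whitespace, value without parentheses, ")".
    private static final Pattern FEATURE = Pattern.compile("\\(([^\\s()]+)\\s+([^()]+)\\)");

    /** Returns feature names in order of appearance, each mapped to all of its values. */
    static Map<String, List<String>> parse(String line) {
        Map<String, List<String>> features = new LinkedHashMap<>();
        Matcher m = FEATURE.matcher(line);
        while (m.find()) {
            features.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(m.group(2).trim());
        }
        return features;
    }

    public static void main(String[] args) {
        String line = "(Cat V)(Aux sein)(Mod Ind)(Temp Impf)(Pers 1st)(Num SG)(ID 0-1),";
        for (Map.Entry<String, List<String>> e : parse(line).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```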


Lemmatizer

The Lemmatizer is a program that returns the citation form and category of any valid word, as used in POS taggers. The result of a query is a list of corresponding citation forms, each followed by its category. The category can be used as a filter during the analysis.

We offer the software as a pure Java implementation which runs on any platform (requires JRE 1.5 or higher). A small and clear API simplifies the integration into your own product. (Upon request, the product is also available as a platform-specific shared-library implementation for Linux.)

The product is available for German, English, and Italian.

Typical applications include

Indexing, Information Retrieval, Intelligent Search, Partial Parsing, Categorization, Linguistic Annotation

Examples

In order to facilitate the integration of the product, the Java API returns structured Java objects (see developerzone for details). These objects may be easily converted into simple strings, by means of a method included in the API. The following examples present only these strings instead of the underlying objects.

    German
    query   > moegen
    result  > mögen
                (Cat V)(Flach ouml),
                (Cat N)(Flach ouml)

    English
    query   > cat's   Filter: (Cat N)
    result  > cat
                (Cat N)(Contraction N+'s/Clitic),
                (Cat N)(Contraction N+have/V),
                (Cat N)(Contraction N+be/V)

    Italian
    query   > cacciandolo
    result  > cacciare
                (Cat V)(Contraction lo/Pron+V)
    


Recognizer

The Recognizer is a program able to determine whether a given word (be it inflected or in citation form) is a valid word form. The result of a query is a simple yes/no answer (in the form of "1/0" or "true/false").

We offer the software as a pure Java implementation which runs on any platform (requires JRE 1.5 or higher). A small and clear API simplifies the integration into your own product. (Upon request, the product is also available as a platform-specific shared-library implementation for Linux.)

The product is available for German, English, and Italian.

Typical applications include

Spell Checker

Examples

Explanations for these examples are provided in the developerzone.

    German
    query   > moegen
    result  > true

    query   > moexyzgen
    result  > false


    English
    query   > cat's
    result  > true

    query   > moexyzgen
    result  > false


    Italian
    query   > cacciandolo
    result  > true

    query   > moexyzgen
    result  > false
    

Canoo Analyzer/Generator products:
Two products combined into one

The Canoo Analyzer/Generator products offer you the processing of words in two complementary directions: The Inflection Analyzer/Generator analyzes and generates all the inflected forms of a particular lexeme. The Word Formation Analyzer/Generator analyzes and generates the first level of word formation history for any legal lexeme.



Inflection Analyzer/Generator

The Inflection Analyzer/Generator analyzes and generates the inflected forms of a particular lexeme. The result of an analysis query is a list of citation forms, followed by a list of morphosyntactic features related to the analyzed word form. The result of a generation query is a list of word forms, followed by a list of morphosyntactic features related to each single word form.

We offer the software as a pure Java implementation which runs on any platform (requires JRE 1.5 or higher). A small and clear API simplifies the integration into your own product. (Upon request, the product is also available as a platform-specific shared-library implementation for Linux.)

The product is available for German, English, and Italian.

Typical applications include

  • Intelligent Text Processing such as Text Analysis and Text Understanding, Summarization, Machine Translation, Parsing, Linguistic Annotation
  • E-Learning Applications

Examples

In order to facilitate the integration of the product, the Java API returns structured Java objects (see developerzone for details). These objects may be easily converted into simple strings, by means of a method included in the API. The following examples present only these strings instead of the underlying objects.

Analysis Examples

German
query   -> ging
result  -> gehen
             (Cat V)(Aux sein)(Mod Ind)(Temp Impf)(Pers 1st)(Num SG)(ID 0-1),
             (Cat V)(Aux sein)(Mod Ind)(Temp Impf)(Pers 3rd)(Num SG)(ID 0-1)

English
query   -> did
result  -> do
             (Cat V)(Variety BCE)(Tense Past)(ID 0-1)

Italian
query   -> andai
result  -> andare
             (Cat V)(Aux essere)(Mod Ind)(Temp Pass-Rem)(Pers 1st)(Num SG)(ID 0-1)

Generation Examples

German
query   -> haus
result  -> häuser
             (Cat N)(Gender N)(Num PL)(Case Nom)(ID 0-1),
             (Cat N)(Gender N)(Num PL)(Case Gen)(ID 0-1),
             (Cat N)(Gender N)(Num PL)(Case Acc)(ID 0-1)
           häusern
             (Cat N)(Gender N)(Num PL)(Case Dat)(ID 0-1)
           haeuser
             (Cat N)(Gender N)(Num PL)(Case Nom)(Flach auml)(ID 0-1),
             (Cat N)(Gender N)(Num PL)(Case Gen)(Flach auml)(ID 0-1),
             (Cat N)(Gender N)(Num PL)(Case Acc)(Flach auml)(ID 0-1)
           haeusern
             (Cat N)(Gender N)(Num PL)(Case Dat)(Flach auml)(ID 0-1)
           haus
             (Cat N)(Gender N)(Num SG)(Case Nom)(ID 0-1),
             (Cat N)(Gender N)(Num SG)(Case Dat)(ID 0-1),
             (Cat N)(Gender N)(Num SG)(Case Acc)(ID 0-1)
           hause
             (Cat N)(Gender N)(Num SG)(Case Dat)(ID 0-1)
           hauses
             (Cat N)(Gender N)(Num SG)(Case Gen)(ID 0-1)

    
English
query   -> damn
result  -> damn
             (Cat V)(Variety BCE)(VForm Infinitive)(ID 0-1),
             (Cat V)(Variety BCE)(Tense Present)(VForm Base)(ID 0-1)
           damned
             (Cat V)(Variety BCE)(Tense Past)(ID 0-1),
             (Cat V)(Variety BCE)(VForm Past_Participle)(ID 0-1)
           damning
             (Cat V)(Variety BCE)(VForm ing_Participle)(ID 0-1)
           damns
             (Cat V)(Variety BCE)(Tense Present)(VForm s)(ID 0-1)


Italian
query   -> andare   Filter: (Mod Ind)(Pers 1st)
result  -> andai
             (Cat V)(Aux essere)(Mod Ind)(Temp Pass-Rem)(Pers 1st)(Num SG)(ID 0-1)
           andammo
             (Cat V)(Aux essere)(Mod Ind)(Temp Pass-Rem)(Pers 1st)(Num PL)(ID 0-1)
           andavo
             (Cat V)(Aux essere)(Mod Ind)(Temp Impf)(Pers 1st)(Num SG)(ID 0-1)
           andavamo
             (Cat V)(Aux essere)(Mod Ind)(Temp Impf)(Pers 1st)(Num PL)(ID 0-1)
           vado
             (Cat V)(Aux essere)(Mod Ind)(Temp Pres)(Pers 1st)(Num SG)(ID 0-1)
           andiamo
             (Cat V)(Aux essere)(Mod Ind)(Temp Pres)(Pers 1st)(Num PL)(ID 0-1)
           andrò
             (Cat V)(Aux essere)(Mod Ind)(Temp Fut)(Pers 1st)(Num SG)(ID 0-1)
           andremo
             (Cat V)(Aux essere)(Mod Ind)(Temp Fut)(Pers 1st)(Num PL)(ID 0-1)
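
The effect of a filter such as `(Mod Ind)(Pers 1st)` in the Italian example above can be sketched as a simple subset check: a form is kept only if its features contain every name/value pair of the filter. The classes `GeneratedForm` and `FeatureFilter` are hypothetical and only illustrate the idea; the actual API and its filter handling are described in the developerzone.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical holder for one generated word form and its features. */
class GeneratedForm {
    final String form;
    final Map<String, String> features;

    /** Features are passed as alternating name/value strings. */
    GeneratedForm(String form, String... nameValuePairs) {
        this.form = form;
        this.features = new HashMap<>();
        for (int i = 0; i + 1 < nameValuePairs.length; i += 2) {
            features.put(nameValuePairs[i], nameValuePairs[i + 1]);
        }
    }
}

class FeatureFilter {
    /** Keeps only the forms whose features include every required name/value pair. */
    static List<GeneratedForm> apply(List<GeneratedForm> forms, Map<String, String> required) {
        List<GeneratedForm> kept = new ArrayList<>();
        for (GeneratedForm f : forms) {
            if (f.features.entrySet().containsAll(required.entrySet())) {
                kept.add(f);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<GeneratedForm> forms = new ArrayList<>();
        forms.add(new GeneratedForm("andai",
                "Cat", "V", "Mod", "Ind", "Temp", "Pass-Rem", "Pers", "1st", "Num", "SG"));
        // Illustrative non-matching form; it does not appear in the examples above.
        forms.add(new GeneratedForm("andasti",
                "Cat", "V", "Mod", "Ind", "Temp", "Pass-Rem", "Pers", "2nd", "Num", "SG"));

        Map<String, String> filter = new HashMap<>();
        filter.put("Mod", "Ind");
        filter.put("Pers", "1st");

        for (GeneratedForm f : apply(forms, filter)) {
            System.out.println(f.form);
        }
    }
}
```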


Word Formation Analyzer/Generator

The Word Formation Analyzer/Generator analyzes and generates the first level of word formation history for any legal lexeme. The tool expects the input lexeme to be in its citation form. The result of an analysis query is a list of source lexemes, from which the given lexeme derives. The result of a generation query is a list of derived lexemes, created by derivation and word formation. All features can be used as filters during the analysis and generation.

We offer the software as a pure Java implementation which runs on any platform (requires JRE 1.5 or higher). A small and clear API simplifies the integration into your own product. (Upon request, the product is also available as a platform-specific shared-library implementation for Linux.)

The product is available for German, English, and Italian.

Typical applications include

  • E-Learning Applications
  • Information Retrieval and Intelligent Search (e.g. query expansion)

Examples

In order to facilitate the integration of the product, the Java API returns structured Java objects (see developerzone for details). These objects may be easily converted into simple strings, by means of a method included in the API. The following examples present only these strings instead of the underlying objects.

Analysis Examples

German
query   -> kennenlernen
result  -> kennen
             (Cat V)(Aux haben)
           lernen
             (Cat V)(Aux haben)

English
query   -> countdown
result  -> count
             (Cat V)(Variety BCE)
           down
             (Cat Adv)(Variety BCE)

Italian
query   -> appartenenza
result  -> appartenere
             (Cat V)(Aux avere)(Aux essere)

Generation Examples

German
query   -> mahnen
result  -> abmahnen
             (Cat V)(Aux haben)
           anmahnen
             (Cat V)(Aux haben)
           einmahnen
             (Cat V)(Aux haben)
           ermahnen
             (Cat V)(Aux haben)
           gemahnen
             (Cat V)(Aux haben)
           gemahnt
             (Cat A)(Lexeme mahnen)
           mahnbescheid
             (Cat N)(Gender M)
           mahnbrief
             (Cat N)(Gender M)
           mahnend
             (Cat A)
           mahner
             (Cat N)(Gender M)
           mahngebühr
             (Cat N)(Gender F)
           mahnmal
             (Cat N)(Gender N)(Plural e),
             (Cat N)(Gender N)(Plural er)
           mahnruf
             (Cat N)(Gender M)
           mahnschreiben
             (Cat N)(Gender N)
           mahnstätte
             (Cat N)(Gender F)
           mahnung
             (Cat N)(Gender F)
           mahnverfahren
             (Cat N)(Gender N)
           mahnwache
             (Cat N)(Gender F)
           mahnwort
             (Cat N)(Gender N)
           mahnzeichen
             (Cat N)(Gender N)
           mahnzettel
             (Cat N)(Gender M)
           vermahnen
             (Cat V)(Aux haben)


English
query   -> appear
result  -> apparent
             (Cat A)(Variety BCE)
           appearance
             (Cat N)(Variety BCE)
           disappear
             (Cat V)(Variety BCE)
           pre-appear
             (Cat V)(Variety BCE)
           re-appear
             (Cat V)(Variety BCE)
           reappear
             (Cat V)(Variety BCE)


Italian
query   -> bosco
result  -> abbracciabosco
             (Cat N)(Gender M)
           boscaglia
             (Cat N)(Gender F)
           boscaiolo
             (Cat N)(Gender M)
           boschetto
             (Cat N)(Gender M)
           boschivo
             (Manner Qual)(Cat A)(Manner Qual)
           boscoso
             (Manner Qual)(Cat A)(Manner Qual)
           diboscare
             (Cat V)(Aux avere)
           disboscare
             (Cat V)(Aux avere)
           guardaboschi
             (Cat N)(Gender V)
           imboscare
             (Cat V)(Aux avere)
           imboschire
             (Cat V)(Aux avere)
           sottobosco
             (Cat N)(Gender M)
           tagliaboschi
             (Cat N)(Gender M)

Canoo Transducer Compiler

The Transducer Compiler is a standalone program that reads a text input file containing pairs of citation forms and word forms to compile and generate an optimized finite state transducer structure.

This software is available as a platform-specific implementation (for Linux). We offer compilers for all Canoo Languagetools products.


Input File Examples

An input *.src file consists of a sequence of lines, each of which must contain three elements: a citation form, a word form, and an index that refers to an entry in the feature table. The feature table must be supplied separately (file *.tab).

bauen baue 1
bauen baust 1
bauen baut 1
bauen bauens 2
...
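
The line format shown above can be read with a few lines of Java. This is only a sketch of the input format, not part of the Transducer Compiler: the class name `SrcEntry` is hypothetical, and the sketch assumes that the three fields are separated by whitespace and that neither form contains spaces.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical representation of one line of a *.src file. */
class SrcEntry {
    final String citationForm;
    final String wordForm;
    final int featureIndex; // reference into the separately supplied *.tab feature table

    SrcEntry(String citationForm, String wordForm, int featureIndex) {
        this.citationForm = citationForm;
        this.wordForm = wordForm;
        this.featureIndex = featureIndex;
    }

    /** Parses a line such as "bauen baust 1"; rejects lines without exactly three fields. */
    static SrcEntry parse(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length != 3) {
            throw new IllegalArgumentException(
                    "Expected 3 fields but found " + fields.length + ": " + line);
        }
        // Integer.parseInt throws NumberFormatException if the index is not a number.
        return new SrcEntry(fields[0], fields[1], Integer.parseInt(fields[2]));
    }

    /** Parses all non-blank lines in order. */
    static List<SrcEntry> parseAll(List<String> lines) {
        List<SrcEntry> entries = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                entries.add(parse(line));
            }
        }
        return entries;
    }
}
```

Calling `SrcEntry.parse("bauen bauens 2")` yields the citation form `bauen`, the word form `bauens`, and the feature index 2.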