Language Tools for German



Available Datasets

Datasets for all types of Inflection Analyzers, Generators and Analyzer/Generators:

  • Evaluation license: all nouns, verbs, adjectives and adverbs starting with the letters 'a' and 's', plus all other types of entries (for a total of about 40'000 entries).
  • Full license: all currently available German words (currently almost 300'000).

Datasets for the German Wordformation Analyzer/Generator:

  • Evaluation license: all derivation level entries relating to all nouns, verbs, adjectives and adverbs starting with the letters 'a' and 's', plus all other types of entries (a total of about 30'000 relations).
  • Full license: all derivation level entries for all entries (currently 350'000 relations).

Datasets for the Unknown Word Products:

  • Evaluation license: all German entries and ad hoc analyzed combinations beginning with the characters 'a' and 's' (where the single words used for the combination begin with 'a' or 's').
  • Full license: all relevant word formation rules and a base of all currently available lexicalized German words (currently more than 210'000).


Language Specific Features


Special Characters

Every kind of analyzer accepts input written without German special characters (e.g. the German word "mögen" written as "moegen") and records this in the delivered output with the special "Flach" feature.


query   -> moegen
result  -> mögen
              (Cat V)(Flach ouml),
              (Cat N)(Flach ouml)

The German-specific attribute "Flach" tags forms which, according to the dictionary, do not exist. The Lemmatizer nevertheless recognizes them, because they are the forms that result when text is entered without a language-specific keyboard. For example, Kaese is the "Flach"-attributed version of the German word Käse.
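The relationship between a dictionary form and its "Flach" form can be sketched in a few lines of Python. This is only an illustration of the conventional German transliteration, not the analyzer's own code: the function name `flatten` and the mapping table are assumptions, and the source only shows the cases ö -> oe and ä -> ae.

```python
# Conventional ASCII transliteration of German special characters.
# Assumption: the analyzers accept these spellings; only "oe" and "ae"
# are actually demonstrated in the documentation above.
FLAT_MAP = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}

# str.maketrans accepts a dict with single-character keys.
_FLAT_TABLE = str.maketrans(FLAT_MAP)


def flatten(word: str) -> str:
    """Return the spelling a user without a German keyboard would type."""
    return word.translate(_FLAT_TABLE)


print(flatten("mögen"))  # moegen
print(flatten("Käse"))   # Kaese
```

A client application could use such a helper to generate test queries; the reverse direction (from "moegen" back to "mögen") is what the analyzers themselves provide.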


Spelling Reform

Lexemes and word forms affected by the German spelling reform (1996-2006) are marked with special features. These features allow old and new spelling variants to be filtered at word level.


Language Tools for English



Available Datasets

The following datasets are deliverable for all types of Analyzers, Generators and Analyzer/Generators:

  • Evaluation license: all nouns, verbs, adjectives and adverbs starting with the letters 'a' to 'd', plus all other types of entries (a total of about 11'000 entries).
  • Full license: all currently available English words (currently more than 50'000), with contraction elements analysis.

The following datasets are available for the English Word Formation Analyzer/Generator:

  • Evaluation license: all derivation level entries relating to all nouns, verbs, adjectives and adverbs starting with the letters 'a' to 'd', plus all other types of entries (a total of about 7'000 relations).
  • Full license: all derivation level entries for all entries (currently 43'000 relations).

Language Specific Features

Here are some English-specific features that need to be considered by your client application in order to make the best use of our data analyzers.


British and American English

Our English analyzers distinguish between different spelling variants. We adopted British Common English (BCE) as the standard spelling. Special features mark American and British spelling variants.

The features are:
  • (SpellVar BCE): British Common English spelling
  • (SpellVar AE): exclusive American spelling variant, used instead of BCE spelling. Example: BCE colour, AE color
  • (SpellVar ae): optional American spelling variant, used as well as BCE spelling. Example: BCE travelled, ae traveled
  • (SpellVar be): optional British spelling variant, used as well as BCE spelling. Example: BCE realise, be realize

With this information you can set a filter to analyze your text according to your specific criteria. SpellVar features differ from Variety features: Variety features mark regional varieties of lexical items, such as the American word "billfold" for BCE "wallet", or "mailman" for "postman".
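A SpellVar filter on the client side might look like the following sketch. The representation of an analysis as a Python dictionary, the function name `filter_by_spellvar` and the decision to let entries without a SpellVar feature pass through are all assumptions made for illustration; only the four feature values come from the list above.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical client-side representation of one analysis result.
Analysis = Dict[str, str]

# Spelling variants a British-oriented application might accept.
BRITISH: Set[str] = {"BCE", "be"}


def filter_by_spellvar(analyses: Iterable[Analysis], allowed: Set[str]) -> List[Analysis]:
    """Keep analyses whose SpellVar value is allowed.

    Assumption: an analysis without a SpellVar feature is spelling-neutral
    and is therefore kept.
    """
    return [a for a in analyses if "SpellVar" not in a or a["SpellVar"] in allowed]


results = [
    {"form": "colour", "SpellVar": "BCE"},
    {"form": "color", "SpellVar": "AE"},
    {"form": "traveled", "SpellVar": "ae"},
    {"form": "realize", "SpellVar": "be"},
]

british_forms = [a["form"] for a in filter_by_spellvar(results, BRITISH)]
print(british_forms)  # ['colour', 'realize']
```

Which values belong in the allowed set depends on the application; the set shown here is one possible choice, not a recommendation from the vendor.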


Contractions

The English version is able to analyze and recognize word forms with apostrophes:

  • Possessive forms of nouns; this includes singular word forms like "entry's", as well as plural word forms like "points'", including exceptions.
  • Contractions of auxiliary + not such as "doesn't", "haven't".

Please note: If you require a single analysis of a word form with an apostrophe, do not use the apostrophe character as a separator within your application.
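As an illustration of this note, the sketch below tokenizes English text without splitting at apostrophes, so that forms such as "cat's" or "points'" reach the analyzer in one piece. The regular expression and the function name `tokenize` are assumptions for this example and are not part of the product; the pattern only handles the plain ASCII apostrophe and unaccented letters.

```python
import re
from typing import List

# A token is a run of letters, optionally followed by apostrophe + letters
# (cat's, doesn't) and optionally by one trailing apostrophe (points').
_TOKEN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*'?")


def tokenize(text: str) -> List[str]:
    """Split text into word tokens, keeping apostrophes inside the token."""
    return _TOKEN.findall(text)


tokens = tokenize("The cat's toy doesn't match the points' total.")
print(tokens)
# ['The', "cat's", 'toy', "doesn't", 'match', 'the', "points'", 'total']
```

Note that a trailing apostrophe is ambiguous in running text, since it may also be a closing quotation mark; a real application would need its own rule for that case.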

Here is an example for the Lemmatizer:


query   -> cat's
filter  -> (Cat N)
result  -> cat
               (Cat N)(Contraction N+'s/Clitic)
               (Cat N)(Contraction N+have/V)
               (Cat N)(Contraction N+be/V)

The Contraction Feature

The contraction feature specifies the contraction elements contained in the answer. The above example shows the Lemmatizer results for the query "cat's". The individual entities within the contraction feature are separated by the character '+'. An "open" entity, i.e. one that could potentially be filled by any entry of the same category (subject to specific restrictions), is described by its category alone. An entity that comes from a finite set of possibilities is specified by the pair citation form "/" category.
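The structure of a contraction value can be made explicit with a small parser. The following Python sketch follows the description above ('+' separates entities, "/" separates citation form and category); the names `Entity` and `parse_contraction` are hypothetical and not part of the delivered tools.

```python
from typing import List, NamedTuple, Optional


class Entity(NamedTuple):
    citation_form: Optional[str]  # None for an "open" entity
    category: str


def parse_contraction(value: str) -> List[Entity]:
    """Split a contraction value such as "N+'s/Clitic" into its entities."""
    entities = []
    for part in value.split("+"):
        # Split at the last "/": left side is the citation form, right side the category.
        form, sep, category = part.rpartition("/")
        # Without a "/" the entity is "open" and consists of the category only.
        entities.append(Entity(form if sep else None, category))
    return entities


for value in ("N+'s/Clitic", "N+have/V", "N+be/V"):
    print(parse_contraction(value))
```

For "N+'s/Clitic" this yields an open noun entity followed by the clitic "'s"; for "N+have/V" the second entity is the verb "have".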


Language Tools for Italian



Available Datasets

The following datasets are deliverable for all types of Analyzers, Generators and Analyzer/Generators:

  • Evaluation license: all nouns, verbs, adjectives and adverbs starting with the letters 'a' to 'c', plus all other types of entries (a total of more than 13'000 entries).
  • Full license: all currently available Italian words (50'000+), with contraction elements analysis.

The following datasets are available for the Italian Wordformation Analyzer/Generator:

  • Evaluation license: all derivation level entries relating to all nouns, verbs, adjectives and adverbs starting with the letters 'a' to 'c', plus all other types of entries (a total of about 10'000 relations).
  • Full license: all derivation level entries for all entries (currently 40'000+ relations).

Language Specific Features

Here are some Italian-specific features that need to be considered by your client application in order to make the best use of our data analyzers.


Contractions

The Italian version is able to analyze and recognize cliticized word forms, like "dammelo", "dimmelo", "vattene", etc., where forms of several lexemes are combined into a single graphic word. This is a very useful feature when analyzing text, because clitics are used very often in Italian.

Here is an example for the Lemmatizer:


query   -> vattene
result  -> andare
               (Cat V)
               (Contraction ti/Pron+ne/Pron+V)

Here is an example for the Analyzer:


query   -> spiegatemela
result  -> spiegare
              (Cat V)(Aux avere)
              (Mod Imp)(Pers 2nd)(Num PL)
              (Contraction mi/Pron+la/Pron+V)
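A client application usually has to turn such a feature listing into a data structure. The sketch below assumes that each feature is written as "(Name value)" exactly as in the printed examples and that a feature name occurs at most once per analysis; the function name `parse_features` is hypothetical, and the actual delivery format of the tools may differ.

```python
import re
from typing import Dict

# One feature: "(" name, whitespace, value up to the closing ")".
_FEATURE = re.compile(r"\(([^\s()]+)\s+([^()]*)\)")


def parse_features(text: str) -> Dict[str, str]:
    """Map feature names to values for a single analysis."""
    return {name: value.strip() for name, value in _FEATURE.findall(text)}


analysis = parse_features(
    """(Cat V)(Aux avere)
       (Mod Imp)(Pers 2nd)(Num PL)
       (Contraction mi/Pron+la/Pron+V)"""
)
print(analysis["Mod"], analysis["Pers"], analysis["Num"])  # Imp 2nd PL
print(analysis["Contraction"])                              # mi/Pron+la/Pron+V
```

A dictionary is sufficient here because the example shows one analysis; a query with several analyses would need one dictionary per analysis.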



The Contraction Feature

The contraction feature specifies the contraction elements contained in the reply. The above examples show the results for the queries "vattene" and "spiegatemela". The individual entities within the contraction feature are separated by the character '+'. An "open" entity, i.e. one that could potentially be filled by any entry of the same category (subject to specific restrictions), is described by its category alone. An entity that comes from a finite set of possibilities is specified by the pair citation form "/" category.