German
Available Datasets
Here is a list of all datasets currently available for German.
Datasets for all types of Inflection Analyzers, Generators and Analyzer/Generators:
- Evaluation license: all simple entries for German (about 20'000 entries).
- Limited license: all simple entries for German, plus a set of the most frequently used complex German words (a total of about 100'000 entries).
- Full license: all currently available German words (currently more than 210'000).
Datasets for the German Wordformation Analyzer/Generator:
- Evaluation license: all first derivation level entries based on all simple verbs (about 20'000 relations).
- Limited license: all first derivation level entries based on all simple entries (about 150'000 relations).
- Full license: all derivation level entries for all entries (currently 310'000 relations).
Datasets for the Unknown Word Products:
- Evaluation license: all German entries and ad hoc analyzed combinations beginning with the character 'a' or 's' (the single words used for a combination must likewise begin with 'a' or 's').
- Full license: all relevant word formation rules and a base of all currently available lexicalized German words (currently more than 210'000).
Language Specific Features
Here are some German-specific features that need to be considered by your client application, in order to make the best use of our data analyzers.
Special Characters
Every kind of analyzer tolerates input that is written without special characters (e.g. the German word "mögen" written as "moegen") and records the substitution with the special "Flach" feature in the delivered output.
query -> moegen
result -> mögen
(Cat V)(Flach ouml),
(Cat N)(Flach ouml)
The German-specific attribute "Flach" tags forms that, according to the dictionary, do not exist. The Lemmatizer nevertheless recognizes them, because they are the forms that result when text is entered without a language-specific keyboard. For example, Kaese is the "Flach"-attributed version of the German word Käse.
| Attribute | Value | Meaning (same as the HTML entity) |
| Flach | auml | ä |
| Flach | ouml | ö |
| Flach | uuml | ü |
| Flach | agrave | à |
| Flach | ograve | ò |
| Flach | ugrave | ù |
| Flach | aacute | á |
| Flach | oacute | ó |
| Flach | uacute | ú |
| Flach | acirc | â |
| Flach | ocirc | ô |
| Flach | ucirc | û |
| Flach | ccedil | ç |
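A client application may want to read the "Flach" feature back out of a result. The following is a minimal sketch only: it assumes that each reading arrives as a string of "(Name Value)" pairs, as in the sample output above, and the helper names `parse_features` and `flach_character` are hypothetical, not part of any delivered API. Because the Flach values carry the same meaning as HTML entities, the sketch resolves them with Python's standard `html.entities` table.

```python
import re
from html.entities import name2codepoint
from typing import Dict, Optional

# Matches one "(Name Value)" pair, e.g. "(Flach ouml)".
# Assumption: readings look like the sample output shown above.
FEATURE = re.compile(r"\(([^\s()]+)\s+([^()]+)\)")


def parse_features(reading: str) -> Dict[str, str]:
    """Collect all "(Name Value)" pairs of one reading into a dict."""
    return {name: value.strip() for name, value in FEATURE.findall(reading)}


def flach_character(features: Dict[str, str]) -> Optional[str]:
    """Return the special character named by the Flach value, if any."""
    value = features.get("Flach")
    if value is None or value not in name2codepoint:
        return None
    return chr(name2codepoint[value])


if __name__ == "__main__":
    # The two readings returned for the query "moegen" in the sample above.
    for reading in ["(Cat V)(Flach ouml),", "(Cat N)(Flach ouml)"]:
        features = parse_features(reading)
        print(features, flach_character(features))
```

A reading without a "Flach" feature yields `None`, which lets the client tell forms that were entered with their special characters apart from tolerated substitutes.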
Spelling Reform
| Attribute | Meaning |
| OCapRule | "Spelling Rule" |
| ORule | "Spelling Rule" |
| OSepRule | "Spelling Rule" |
| Ortho | "Spelling Variant" |
Lexemes and wordforms affected by the German spelling reform in 1998 are marked with special features. These features allow old and new spelling variants to be filtered at word level.
Features with the following attributes indicate the new or changed spelling rule that causes the new spelling variants:
- OCapRule
- ORule
- OSepRule
Features with the attribute "Ortho" indicate the type of spelling variant:
| Attribute | Value | Variant | Example |
| Ortho | New | new, no preference | aufwändig |
| Ortho | Old | old, no preference | aufwendig |
| Ortho | New-HV | new, main | essenziell |
| Ortho | Old-NV | old, secondary | essentiell |
| Ortho | Old-HV | old, main | Delphin |
| Ortho | New-NV | new, secondary | Delfin |
| Ortho | New-Only | new, only | Tipp |
| Ortho | Old-Only | old, only | Telefon |
| Ortho | Old-Obs | old, obsolete | Tip, Telephon |
| Ortho | CH | Swiss | Fuss (German standard: Fuß) |
| Ortho | NZZ | Neue Zürcher Zeitung | Crème |
Note:
All New* values indicate variants that were introduced by the spelling reform. These variants are incorrect according to the old spelling.
All Old* values indicate variants that existed before the reform. All Old* variants except those marked Old-Obs are still correct according to the new spelling rules.
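The note above can be turned into a word-level filter. This is a sketch under stated assumptions: the function names are hypothetical, the client is assumed to have already extracted the "Ortho" value of each wordform, and the values CH and NZZ are treated as undetermined because the note does not say how they relate to the old and new spelling.

```python
from typing import Optional


def valid_in_new_spelling(ortho: str) -> Optional[bool]:
    """New* is valid; Old* is valid unless it is Old-Obs."""
    if ortho.startswith("New"):
        return True
    if ortho.startswith("Old"):
        return ortho != "Old-Obs"
    return None  # CH, NZZ: not covered by the note (assumption)


def valid_in_old_spelling(ortho: str) -> Optional[bool]:
    """Old* existed before the reform; New* did not."""
    if ortho.startswith("Old"):
        return True
    if ortho.startswith("New"):
        return False
    return None  # CH, NZZ: not covered by the note (assumption)


if __name__ == "__main__":
    # Examples and values taken from the Ortho table above.
    entries = [
        ("Tipp", "New-Only"),
        ("Tip", "Old-Obs"),
        ("Delphin", "Old-HV"),
        ("Delfin", "New-NV"),
    ]
    new_spelling = [w for w, o in entries if valid_in_new_spelling(o)]
    old_spelling = [w for w, o in entries if valid_in_old_spelling(o)]
    print("new spelling:", new_spelling)
    print("old spelling:", old_spelling)
```

Returning `None` rather than a guess for CH and NZZ leaves the decision about those variants to the client application.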