WMTrans Unknown Word Products
The Unknown Word Products consist of three related products, one for each of the corresponding simple products (Inflection Analyzer, Lemmatizer, Recognizer).
Each analyzes its input and returns the requested output, following the syntax of the corresponding simple product. In addition to the features of the simple products, the Unknown Word Products can analyze and recognize unknown (i.e. not lexicalized) words based on word formation rules. This is particularly useful for languages with highly productive word formation. Consider German: of the more than 200,000 entries currently lexicalized in the Canoo dictionary, only 11% are simple base entries. This shows the potential of applying word formation rules during analysis, as the Unknown Word Products do.
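To make the idea of rule-based analysis concrete, here is a minimal Java sketch of one kind of word formation rule, compounding. It is a toy and does not reflect the WMTrans rule model or API: the class `CompoundSketch`, the four-entry lexicon, and the rule that an optional linking "s" may join two parts are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical illustration only; not part of any WMTrans product. */
final class CompoundSketch {
    // Toy lexicon standing in for the lexicalized entries (assumption).
    static final Set<String> LEXICON = Set.of("arbeit", "zimmer", "buch", "handlung");

    /** Returns the lexicalized parts of the word, or an empty list if no split is found. */
    static List<String> split(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (LEXICON.contains(w)) {
            return List.of(w);
        }
        for (int i = 1; i < w.length(); i++) {
            String head = w.substring(0, i);
            if (!LEXICON.contains(head)) {
                continue;
            }
            String rest = w.substring(i);
            List<String> tail = split(rest);
            // Assumed rule: an optional linking "s" may join two parts.
            if (tail.isEmpty() && rest.startsWith("s") && rest.length() > 1) {
                tail = split(rest.substring(1));
            }
            if (!tail.isEmpty()) {
                List<String> parts = new ArrayList<>();
                parts.add(head);
                parts.addAll(tail);
                return parts;
            }
        }
        return List.of();
    }

    public static void main(String[] args) {
        System.out.println(split("Arbeitszimmer")); // [arbeit, zimmer]
        System.out.println(split("Buchhandlung"));  // [buch, handlung]
        System.out.println(split("Xylophon"));      // []
    }
}
```

Neither "Arbeitszimmer" nor "Buchhandlung" is in the toy lexicon, yet both are recognized because their parts are. A rule this permissive will also accept splits that no speaker would produce, which is why overgeneration has to be filtered.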
Performance
Analyzing unknown words is more complex than simply retrieving lexicalized elements from a finite-state machine. To reduce the performance gap between the analysis of lexicalized words and that of unknown words, all Unknown Word Products use an internal cache. The cache stores the most frequently requested queries. If a new query matches a query stored in the cache, the stored results are returned without any further analysis. By default the cache works transparently, but the developer can adjust some cache settings, such as disabling the cache, changing its default size, or storing its state on a persistent device for reuse in later sessions. For details, see the description of each Unknown Word Product API.
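The caching behaviour described above can be sketched in Java as follows. This is not the WMTrans cache or its API: the class `QueryCacheSketch` and its methods are hypothetical. It also uses least-recently-used eviction, a common stand-in that only approximates "most frequently requested", and it leaves out disabling and persistence.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical illustration only; not part of the WMTrans API. */
final class QueryCacheSketch {
    private final int capacity;
    private final Function<String, List<String>> analyzer;
    private final Map<String, List<String>> cache;
    private int analyzerCalls = 0;

    QueryCacheSketch(int capacity, Function<String, List<String>> analyzer) {
        this.capacity = capacity;
        this.analyzer = analyzer;
        // accessOrder = true: iteration order runs from least to most recently used.
        this.cache = new LinkedHashMap<String, List<String>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, List<String>> eldest) {
                // Evict the least recently used entry once the capacity is exceeded.
                return size() > QueryCacheSketch.this.capacity;
            }
        };
    }

    /** Returns cached results when available; otherwise runs the full analysis. */
    List<String> analyze(String word) {
        List<String> hit = cache.get(word);
        if (hit != null) {
            return hit;
        }
        analyzerCalls++;
        List<String> result = analyzer.apply(word);
        cache.put(word, result);
        return result;
    }

    int analyzerCalls() {
        return analyzerCalls;
    }

    public static void main(String[] args) {
        // The lambda stands in for the expensive unknown-word analysis.
        QueryCacheSketch cached = new QueryCacheSketch(2, w -> List.of(w.toLowerCase()));
        cached.analyze("Arbeitszimmer");
        cached.analyze("Arbeitszimmer"); // served from the cache
        System.out.println(cached.analyzerCalls()); // prints 1
    }
}
```

The caller sees the same results whether or not a query was cached, which is what makes the cache transparent. Only the number of full analyses changes.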
Overgeneration
Allowing ad hoc word formation during unknown word analysis inevitably leads to overgeneration. We minimize overgeneration by filtering at three levels:
- Word Formation Rule Level
As for other WMTrans products, the data is generated using the WMTrans authoring tool, which contains a complete model of word formation. Using this tool, we can tune the derivation and word formation rules needed for the Unknown Word Products. The fine-tuning is based on the analysis of a large text corpus. For more information on our Word Manager authoring tool, see publications.
- Generation Level
While generating the relevant word data from our WMTrans authoring tool, we can recognize and eliminate a wide range of elements that are not relevant for unknown word analysis.
- Runtime Level
Some overgenerated elements can only be detected during runtime analysis. A filter checks the generated results at runtime and eliminates such elements.
Test Results
We tested the coverage of our Unknown Word Products suite for German using a test corpus. The test corpus includes a broad range of text types: fiction, newspaper texts, and scientific and technical documents. The current versions of the Unknown Word Products recognize an average of 95.2% of the words occurring in the test corpus. Most of the remaining 4.8% of unknown words are proper names, foreign words in quotations, and uncommon technical terms. (Note: we did not remove any words from the texts before submitting them to the analysis.) The proportion of unknown words is higher in technical reports and in short newspaper articles containing many proper names and foreign-language quotations, and considerably lower in common descriptive writing.
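A coverage figure of this kind is the share of corpus tokens that the recognizer accepts. The sketch below shows that arithmetic only. `CoverageSketch` and the `recognizer` predicate are hypothetical names, and how the actual test tokenized and counted words is not stated here.

```java
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical illustration only; not the procedure used for the reported figures. */
final class CoverageSketch {
    /** Percentage of tokens accepted by the recognizer; 0.0 for an empty token list. */
    static double coverage(List<String> tokens, Predicate<String> recognizer) {
        if (tokens.isEmpty()) {
            return 0.0;
        }
        long recognized = tokens.stream().filter(recognizer).count();
        return 100.0 * recognized / tokens.size();
    }

    public static void main(String[] args) {
        // Toy recognizer (assumption): accepts everything except one proper name.
        Predicate<String> recognizer = token -> !token.equals("Canoo");
        List<String> tokens = List.of("das", "Arbeitszimmer", "von", "Canoo");
        System.out.println(coverage(tokens, recognizer)); // prints 75.0
    }
}
```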
Implementation
We currently offer a pure Java implementation, which runs on any platform. The only prerequisite is JRE 1.3 or higher.
A small and clear API simplifies integration into your own product.
Please refer to the developer zone for information on how to install and use each product API.
Dataset
Depending on the license agreement, the dataset delivered includes either a limited number of entries (free evaluation version), or the full set of entries defined so far. See the license page for further details.