WMTrans Unknown Word Recognizer:
API Description
The Unknown Word Recognizer API consists of three methods of the Java class AdHocRecognizer.
- The first is a static method that loads the required data and returns an instance of the class AdHocRecognizer. Two overloads of it change the default cache behaviour.
- The second recognizes lexicalized words. It must be invoked on the AdHocRecognizer object. It offers the same functionality as the lexicalized word Recognizer: it reports whether a form is a correct lexicalized form.
- The third analyzes unknown words. If the word does not correspond to a lexicalized lexeme form, the form is analyzed with word formation rules; if a rule matches, the answer is "yes" (true), meaning that the form is a potentially correct form.
All methods throw a generic exception in case of failure.
A client program must import the class com.canoo.wmtrans.AdHocRecognizer, as shown in IntegrationDemo.java.
Instantiation
public static AdHocRecognizer instance(
    String surfFsmName, String surfTabName,
    String wfFsmName, String wfTabName, String wfTripleName,
    String suffixTraName,
    String lexicalizedFsmName, String lexicalizedTableName,
    String postFilterName) throws Exception;
The method loads the data into memory. Note that it creates a singleton, i.e. repeated calls to the method always return the same AdHocRecognizer instance. If you need to load different instances within the same program, please contact us.
Parameters:
- surfFsmName: Automaton file for surface information (delivered as XYZnAdHoc-surface.fsa)
- surfTabName: Feature table file for surface information (delivered as XYZnAdHoc-surface.tab)
- wfFsmName: Transducer for word formation information (delivered as XYZnAdHoc-rules.tra)
- wfTabName: Feature table for the word formation rules (delivered as XYZnAdHoc-rules.tab)
- wfTripleName: Help table for the word formation rules (delivered as XYZnAdHoc-rules.triple)
- suffixTraName: Suffix transducer (delivered as XYZnAdHoc-citsuffix.tra)
- lexicalizedFsmName: Lexicalized data transducer (delivered as XYZnAdHoc-lemmatizer.tra)
- lexicalizedTableName: Feature table for the lexicalized data (delivered as XYZnAdHoc-lemmatizer.tab)
- postFilterName: Post filters (delivered as XYZnAdHoc-postfilters.xml)
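Because instance() takes nine positional String arguments, it is easy to swap two file names by accident. The following is a minimal sketch of a helper that builds the nine paths in the order of the parameter list above. The class AdHocDataFiles and its method are hypothetical and not part of the WMTrans API; the sketch also assumes that all delivered files sit in one directory and share one product prefix (shown as XYZn in the list).

```java
import java.io.File;

// Hypothetical helper, not part of the WMTrans API.
final class AdHocDataFiles {
    // File name endings in the positional order of the instance() parameters.
    private static final String[] SUFFIXES = {
        "AdHoc-surface.fsa",     // surfFsmName
        "AdHoc-surface.tab",     // surfTabName
        "AdHoc-rules.tra",       // wfFsmName
        "AdHoc-rules.tab",       // wfTabName
        "AdHoc-rules.triple",    // wfTripleName
        "AdHoc-citsuffix.tra",   // suffixTraName
        "AdHoc-lemmatizer.tra",  // lexicalizedFsmName
        "AdHoc-lemmatizer.tab",  // lexicalizedTableName
        "AdHoc-postfilters.xml"  // postFilterName
    };

    private AdHocDataFiles() {}

    // Assumption: every file lives in dataDir and starts with the same prefix.
    static String[] inParameterOrder(String dataDir, String prefix) {
        String[] paths = new String[SUFFIXES.length];
        for (int i = 0; i < SUFFIXES.length; i++) {
            paths[i] = new File(dataDir, prefix + SUFFIXES[i]).getPath();
        }
        return paths;
    }
}
```

The returned array elements can then be passed to instance() one by one, from index 0 (surfFsmName) to index 8 (postFilterName).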
How to change the cache behaviour
Each unknown word product has an internal transparent cache, used to improve performance. The cache manages frequently used queries, allowing fast retrieval without further analysis. The instance() method described above has two additional overloads, which accept optional information about the internal cache behaviour.
Creating a persistent cache
To specify the file path in which a persistent state of the cache is stored:
- Use the first overload. The path may be absolute or relative to the starting directory. A persistent cache can be accessed by different analysis sessions. The default cache is temporary, i.e. it does not keep the state of one session for the next analysis session. This overload turns the default temporary cache into a persistent cache.
The parameters are the same as in the base instance() version, followed by one additional "String" parameter that specifies the cache file path.
public static AdHocRecognizer instance(
    String surfFsmName, String surfTabName,
    String wfFsmName, String wfTabName, String wfTripleName,
    String suffixTraName,
    String lexicalizedFsmName, String lexicalizedTableName,
    String postFilterName,
    String pathname) throws Exception;
If the process does not have enough privileges to read the file, the information is ignored, and the cache will be created without persistent state.
If the process does not have enough privileges to create or to change a file, the persistent state will not be created or changed.
Every change to the default settings is reported at load time by a message on standard error. Note that the first version of the persistent data is created only after the first 200 elements have been inserted into the cache. If the program adds fewer than 200 elements to the cache during your analysis, they will not be stored.
Changing the default cache size
To change the default cache size:
- Use the second overload, which takes one further parameter: an "int" value in the last position, after the String pathname. The current default size is 10 000; pass a different int value to change it.
To disable the cache entirely, pass an int value <= 0.
public static AdHocRecognizer instance(
    String surfFsmName, String surfTabName,
    String wfFsmName, String wfTabName, String wfTripleName,
    String suffixTraName,
    String lexicalizedFsmName, String lexicalizedTableName,
    String postFilterName,
    String pathname, int maxCacheSize) throws Exception;
If you need to change the cache size temporarily, pass null as the String pathname argument and the new int value as the last argument.
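The documented effect of the two cache arguments can be summarized in a small model. This is only a sketch of the behaviour described above, not the recognizer's actual implementation: the class CacheSettingsSketch, its enum and its methods are hypothetical, and the model ignores the file privilege cases, which can also prevent a persistent state from being read or written.

```java
// Hypothetical model of the documented cache rules; not part of the WMTrans API.
final class CacheSettingsSketch {
    enum Mode { DISABLED, TEMPORARY, PERSISTENT }

    // Values stated in the text above.
    static final int DEFAULT_MAX_CACHE_SIZE = 10_000;
    static final int FIRST_PERSIST_THRESHOLD = 200;

    private CacheSettingsSketch() {}

    // Mode implied by the pathname and maxCacheSize arguments of instance().
    static Mode modeFor(String pathname, int maxCacheSize) {
        if (maxCacheSize <= 0) {
            return Mode.DISABLED;   // a size <= 0 disables the cache
        }
        // A null pathname keeps the default temporary cache.
        return pathname == null ? Mode.TEMPORARY : Mode.PERSISTENT;
    }

    // True once enough elements were inserted for a first persistent version.
    static boolean persistedAtLeastOnce(Mode mode, int insertedElements) {
        return mode == Mode.PERSISTENT
            && insertedElements >= FIRST_PERSIST_THRESHOLD;
    }
}
```

For instance, under this model a null pathname with a size of 500 stays a temporary cache, while any pathname combined with a size of 0 leaves the cache disabled.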
Lexicalized Word
public boolean recognizeAsLexicalizedForm(String query, String filter) throws Exception;
The method analyzes a word, whether in inflected or citation form, and reports whether it is a valid lexicalized word form.
Parameters:
- query: String containing the word form to be analyzed and recognized
- filter: String containing the feature filter (null if no filter is needed)
Unknown Word
public boolean recognizeAsUnknownForm(String query, String filter) throws Exception;
The method analyzes an unknown (non-lexicalized) form and reports whether it is a potentially valid word form.
Parameters:
- query: String containing the word form to be analyzed and recognized
- filter: String containing the feature filter (null if no filter is needed)
Combining API Functions
The two API analysis functions are complementary. The first one (recognizeAsLexicalizedForm) recognizes only lexicalized forms; the second one (recognizeAsUnknownForm) recognizes only non-lexicalized forms. To recognize all kinds of forms correctly, use both functions in the following way (also shown in the delivered IntegrationDemo.java source file):
AdHocRecognizer recognizer = AdHocRecognizer.instance(....);
...
boolean result = recognizer.recognizeAsLexicalizedForm(form, filter);
if (!result)
    result = recognizer.recognizeAsUnknownForm(form, filter);
...
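The same two-step check can be wrapped in a single helper so that callers cannot forget the fallback. The sketch below is one possible way to do this and compiles without the WMTrans library: the interface FormRecognizer and the class CombinedRecognition are hypothetical stand-ins, and only the two method signatures are taken from the API described here. In a real client, a thin adapter would delegate both interface methods to the AdHocRecognizer instance.

```java
// Hypothetical stand-in for the two analysis methods of AdHocRecognizer,
// so that the sketch compiles without the WMTrans library.
interface FormRecognizer {
    boolean recognizeAsLexicalizedForm(String query, String filter) throws Exception;
    boolean recognizeAsUnknownForm(String query, String filter) throws Exception;
}

// Hypothetical helper, not part of the WMTrans API.
final class CombinedRecognition {
    private CombinedRecognition() {}

    // Tries the lexicalized check first and falls back to the unknown word
    // analysis only when the form is not a lexicalized one.
    static boolean recognizeAnyForm(FormRecognizer recognizer,
                                    String form, String filter) throws Exception {
        if (recognizer.recognizeAsLexicalizedForm(form, filter)) {
            return true;
        }
        return recognizer.recognizeAsUnknownForm(form, filter);
    }
}
```

Trying the lexicalized check first mirrors the order used in the snippet above, so the word formation analysis runs only for forms the lexicon does not cover.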