WMTrans Unknown Word Analyzer:
API Description
The Unknown Word Analyzer API consists of five methods of the Java class AdHocAnalyzer.
- The first one is a static method used to load the required data. It delivers an instance of the class AdHocAnalyzer. Two further overloads of it change the default cache behaviour.
- The second one is used to retrieve lexicalized words. It must be called on the AdHocAnalyzer object. It provides the same functionality as the lexicalized word Inflection Analyzer, delivering citation forms and full inflectional information on the analyzed word form.
- The third to fifth ones are used to analyze unknown words. If the word does not correspond to a lexicalized lexeme form, the form is analyzed with word formation rules, and if a rule matches, a rich set of information is delivered. All three methods analyze the form as an unknown form. They differ only in the type of output delivered.
A client program must import the class com.canoo.wmtrans.AdHocAnalyzer, as shown in IntegrationDemo.java.
Instantiation
public static AdHocAnalyzer instance (
String surfFsmName, String surfTabName,
String wfFsmName, String wfTabName, String wfTripleName,
String suffixTraName, String suffixTabName,
String lexicalizedFsmName,String lexicalizedTableName,
String postFilterName) throws Exception;
public static AdHocAnalyzer instance (
String surfFsmName, String surfTabName,
String wfFsmName, String wfTabName, String wfTripleName,
String suffixTraName, String suffixTabName,
String lexicalizedFsmName,String lexicalizedTableName,
String postFilterName, String licenseAbsolutePath) throws Exception;
The method loads the data into memory. Note that this creates a singleton, i.e. different calls to the same method always return the same instance. If you need to load different instances within the same program, use one of the following corresponding instantiation methods:
public static AdHocAnalyzer newInstance(...) throws Exception;
public static AdHocAnalyzer
newInstance(..., String licenseAbsolutePath) throws Exception;
Parameters:
- Automaton file for surface information (delivered as XYZnAdHoc-surface.fsa)
- Feature table file for surface (delivered as XYZnAdHoc-surface.tab)
- Transducer for word formation information (delivered as XYZnAdHoc-rules.tra)
- Feature table to word formation rules (delivered as XYZnAdHoc-rules.tab)
- Help table to word formation rules (delivered as XYZnAdHoc-rules.triple)
- Suffix transducer (delivered as XYZnAdHoc-citsuffix.tra)
- Suffix feature table (delivered as XYZnAdHoc-citsuffix.tab)
- Lexicalized data transducer (delivered as XYZnAdHoc-analyzer.tra)
- Feature table (delivered as XYZnAdHoc-analyzer.tab)
- Post filters (delivered as XYZnAdHoc-postfilters.xml)
- License file path (if not provided, the license must be made available on the classpath)
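The parameter order above can be sketched as a small helper that builds the ten data file paths from a directory and a product prefix. This is only an illustration: the class AdHocDataFiles, the method resolve and the directory name "data" are assumptions made for the example, and "XYZnAdHoc" is the placeholder prefix used in the list above, not a real product name.

```java
import java.nio.file.Paths;

final class AdHocDataFiles {
    // File name suffixes, in the order of the instance() parameters listed above.
    static final String[] SUFFIXES = {
        "-surface.fsa", "-surface.tab",
        "-rules.tra", "-rules.tab", "-rules.triple",
        "-citsuffix.tra", "-citsuffix.tab",
        "-analyzer.tra", "-analyzer.tab",
        "-postfilters.xml"
    };

    // Builds one path per parameter: <dataDir>/<prefix><suffix>.
    static String[] resolve(String dataDir, String prefix) {
        String[] paths = new String[SUFFIXES.length];
        for (int i = 0; i < SUFFIXES.length; i++) {
            paths[i] = Paths.get(dataDir, prefix + SUFFIXES[i]).toString();
        }
        return paths;
    }

    public static void main(String[] args) {
        // "data" and "XYZnAdHoc" are placeholders for this sketch.
        String[] f = resolve("data", "XYZnAdHoc");
        for (String path : f) {
            System.out.println(path);
        }
        // With the library on the classpath, the array maps onto the call as:
        // AdHocAnalyzer analyzer = AdHocAnalyzer.instance(
        //     f[0], f[1], f[2], f[3], f[4], f[5], f[6], f[7], f[8], f[9]);
    }
}
```

Keeping the suffixes in one array makes it harder to swap two String arguments by accident, since all ten parameters have the same type.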
How to change the cache behaviour
Each unknown word product has an internal transparent cache, used to improve performance. The cache stores the results of frequently used queries, so that they can be retrieved quickly without further analysis. The instance() and newInstance() methods each have an additional overload that takes optional information on the internal cache behaviour: the cache file path, used to store the persistent state, and the maximal cache size.
public static AdHocAnalyzer
instance (..., String cachePath, int maxCacheSize) throws Exception;
Creating a persistent cache
To specify the file path in which a persistent state of the cache is stored:
- Use the overload variant and pass the file path, either absolute or relative to the starting directory.
A persistent cache can be accessed by different analysis sessions. The default cache is a temporary cache, i.e. it does not keep the state of one session for the next analysis session. Passing a file path turns the default temporary cache into a persistent cache.
A null value means that you only want a temporary cache.
If the process does not have enough privileges to read the file, the information is ignored, and the cache will be created without persistent state.
If the process does not have enough privileges to create or to change a file, the persistent state will not be created or changed.
Note that the first version of the persistent data is created after the first 200 elements have been inserted into the cache. If the program adds fewer than 200 elements to the cache during your analysis, they will not be stored.
Changing the default cache size
To change the default cache size:
- Pass an int value as the second added parameter, after the cache path name. The current default size is 10 000. To disable the cache entirely, pass an int value <= 0.
If you need to change the cache size while keeping a temporary cache, pass null as the String path name argument and the new int value as the last argument.
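The combinations of the two cache arguments described above can be summarized in a short sketch. The class CacheArgs and its describe method are hypothetical and exist only for this illustration; they are not part of the AdHocAnalyzer API and say nothing about how the library implements its cache. The file name "wm.cache" is likewise a made-up example.

```java
final class CacheArgs {
    // Default size stated in the text above.
    static final int DEFAULT_MAX_SIZE = 10_000;

    // Returns a description of the cache that the given argument pair requests,
    // following the rules in the text: size <= 0 disables the cache,
    // a null path requests a temporary cache, any other path a persistent one.
    static String describe(String cachePath, int maxCacheSize) {
        if (maxCacheSize <= 0) {
            return "cache disabled";
        }
        String kind = (cachePath == null)
                ? "temporary"
                : "persistent (" + cachePath + ")";
        return kind + " cache, max " + maxCacheSize + " elements";
    }

    public static void main(String[] args) {
        // Corresponds to instance(..., null, 50000)
        System.out.println(describe(null, 50_000));
        // Corresponds to instance(..., "wm.cache", 10000)
        System.out.println(describe("wm.cache", DEFAULT_MAX_SIZE));
        // Corresponds to instance(..., "wm.cache", 0)
        System.out.println(describe("wm.cache", 0));
    }
}
```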
Lexicalized Word
public String[]
analyzeAsLexicalizedForm(String query, String filter)
throws Exception;
The method retrieves the citation forms and the full inflection information for a word form of a lexicalized lexeme.
Parameters:
- String containing the word form to be recognized
- String containing the feature filter
The return value contains the results. If the form is not part of any lexicalized lexeme, the function returns null.
Unknown Word
public String[]
analyzeAsUnknownForm(String query, String filter)
throws Exception;
The method analyzes the unknown (not lexicalized) form and generates all possible inflection and word formation information derived from the analysis through word formation rules. The two remaining methods have the same functionality, but they both deliver reduced information: only inflection or only word formation. See the output syntax description to understand how to interpret the results.
Parameters:
- String containing the form to be analyzed
- String containing the feature filter
public String[]
analyzeAsUnknownFormOnlyInflection(String query, String filter)
throws Exception;
This method has the same functionality and parameters as "analyzeAsUnknownForm", but it delivers only a subset of the available information: the information on inflection.
public String[]
analyzeAsUnknownFormOnlyWF(String query, String filter)
throws Exception;
This method has the same functionality and parameters as "analyzeAsUnknownForm", but it delivers only a subset of the available information: the information on word formation.
Combining API Functions
The two API analysis functions (lexicalized and unknown word) are complementary. The first function (analyzeAsLexicalizedForm) only retrieves lexicalized forms; the second function (analyzeAsUnknownForm) only analyzes non-lexicalized forms.
In order to correctly retrieve, analyze and interpret all kinds of forms, you need to use both functions in the following way (also shown in the delivered IntegrationDemo.java source file):
AdHocAnalyzer analyzer = AdHocAnalyzer.instance(....);
...
String[] results = analyzer.analyzeAsLexicalizedForm(form, filter);
if (results == null)
    results = analyzer.analyzeAsUnknownForm(form, filter);
...
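The same lookup-then-fallback pattern can be tried without the library by hiding the two methods behind a small interface. In the sketch below, the interface FormAnalyzer, the class StubAnalyzer and the result strings "LEX:..." and "UNK:..." are all invented for illustration; they do not reflect the real output syntax. With the real library, the two interface methods would simply delegate to an AdHocAnalyzer object.

```java
final class FallbackDemo {
    // Hypothetical stand-in for the two AdHocAnalyzer methods used here.
    interface FormAnalyzer {
        String[] analyzeAsLexicalizedForm(String query, String filter) throws Exception;
        String[] analyzeAsUnknownForm(String query, String filter) throws Exception;
    }

    // Tries the lexicalized lookup first; falls back to unknown word analysis
    // only when the lookup returns null, as described in the text.
    static String[] analyze(FormAnalyzer analyzer, String form, String filter)
            throws Exception {
        String[] results = analyzer.analyzeAsLexicalizedForm(form, filter);
        if (results == null) {
            results = analyzer.analyzeAsUnknownForm(form, filter);
        }
        return results;
    }

    // Stub that treats only "house" as lexicalized. The result strings are
    // placeholders, not real analyzer output.
    static final class StubAnalyzer implements FormAnalyzer {
        @Override
        public String[] analyzeAsLexicalizedForm(String query, String filter) {
            return "house".equals(query) ? new String[] {"LEX:" + query} : null;
        }

        @Override
        public String[] analyzeAsUnknownForm(String query, String filter) {
            return new String[] {"UNK:" + query};
        }
    }

    public static void main(String[] args) throws Exception {
        FormAnalyzer analyzer = new StubAnalyzer();
        System.out.println(analyze(analyzer, "house", "")[0]);
        System.out.println(analyze(analyzer, "houselet", "")[0]);
    }
}
```

Wrapping the pattern in one method keeps the null check in a single place, so callers cannot forget the fallback step.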