WMTrans Unknown Word Lemmatizer:
API Description

The Unknown Word Lemmatizer API consists of three methods of the Java class AdHocLemmatizer.
  • The first is a static method that loads the required data and returns an instance of the class AdHocLemmatizer. Two further overloads of it change the default cache behaviour.
  • The second retrieves lexicalized words. It is invoked on the AdHocLemmatizer object and offers the same functionality as the lexicalized word Lemmatizer, delivering information on citation forms and categories.
  • The third analyzes unknown words. If the word does not correspond to a form of a lexicalized lexeme, it is analyzed using word formation rules; if a rule matches, a citation form with the corresponding features is delivered.

All methods throw a generic exception in case of failure.
A client program must import the class com.canoo.wmtrans.AdHocLemmatizer, as shown in IntegrationDemo.java.

Instantiation


public static 
  AdHocLemmatizer instance
    (String surfFsmName, String surfTabName, 
     String wfFsmName, String wfTabName, String wfTripleName, 
     String suffixTraName, 
     String lexicalizedFsmName,String lexicalizedTableName, 
     String postFilterName) throws Exception; 

The method loads the data into memory. Note that it creates a singleton: repeated calls to the method always return the same AdHocLemmatizer instance. If you need to load different instances within the same program, please contact us.

Parameters:

  1. Automaton file for surface information (delivered as XYZnAdHoc-surface.fsa)
  2. Feature table file for surface (delivered as XYZnAdHoc-surface.tab)
  3. Transducer for word formation information (delivered as XYZnAdHoc-rules.tra)
  4. Feature table for the word formation rules (delivered as XYZnAdHoc-rules.tab)
  5. Auxiliary table for the word formation rules (delivered as XYZnAdHoc-rules.triple)
  6. Suffix transducer (delivered as XYZnAdHoc-citsuffix.tra)
  7. Lexicalized data transducer (delivered as XYZnAdHoc-lemmatizer.tra)
  8. Feature table (delivered as XYZnAdHoc-lemmatizer.tab)
  9. Post filters (delivered as XYZnAdHoc-postfilters.xml)

The return value is the AdHocLemmatizer object, on which the analysis methods are later invoked.
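
Because the nine file names follow a common pattern, the argument list can be built from a data directory and a product prefix. The following is only a sketch: the class AdHocDataFiles and its method paths() are hypothetical helpers, not part of the delivered API, and the prefix "XYZnAdHoc" is the placeholder used in the list above, to be replaced by the prefix of the delivered files.

```java
import java.nio.file.Paths;

// Hypothetical helper, not part of the delivered API.
final class AdHocDataFiles {
    // File name suffixes in the parameter order of AdHocLemmatizer.instance(...).
    private static final String[] SUFFIXES = {
        "-surface.fsa", "-surface.tab",
        "-rules.tra", "-rules.tab", "-rules.triple",
        "-citsuffix.tra",
        "-lemmatizer.tra", "-lemmatizer.tab",
        "-postfilters.xml"
    };

    // Builds the nine file paths for the given directory and product prefix.
    static String[] paths(String dataDir, String prefix) {
        String[] result = new String[SUFFIXES.length];
        for (int i = 0; i < SUFFIXES.length; i++) {
            result[i] = Paths.get(dataDir, prefix + SUFFIXES[i]).toString();
        }
        return result;
    }

    public static void main(String[] args) {
        // "data" is an assumed directory name used for illustration.
        String[] p = paths("data", "XYZnAdHoc");
        for (String path : p) {
            System.out.println(path);
        }
        // With the delivered library on the classpath, the call would then be:
        // AdHocLemmatizer lemmatizer = AdHocLemmatizer.instance(
        //     p[0], p[1], p[2], p[3], p[4], p[5], p[6], p[7], p[8]);
    }
}
```

Keeping the suffixes in one array makes it harder to swap two arguments by accident, since all nine parameters share the type String.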

How to change the cache behaviour

Each unknown word product has an internal, transparent cache that improves performance. The cache holds frequently used queries, so that their results can be retrieved quickly without further analysis. The instance() method described above has two additional overloads, which accept optional settings for the internal cache.

Creating a persistent cache

To store a persistent state of the cache in a file:
  • Use the first overload to specify the file path (absolute, or relative to the starting directory) in which the persistent state of the cache is stored. By default the cache is temporary, i.e. it does not keep the state of one analysis session for the next. This overload turns it into a persistent cache, which can be accessed by different analysis sessions.
    The parameters are the same as for the base instance() version, followed by one additional String parameter that specifies the cache file path.
    public static 
      AdHocLemmatizer instance
       (String surfFsmName, String surfTabName, 
        String wfFsmName, String wfTabName, String wfTripleName, 
        String suffixTraName, 
        String lexicalizedFsmName,String lexicalizedTableName, 
        String postFilterName,
        String pathname) throws Exception; 
    
    If the process does not have sufficient privileges to read the file, the information is ignored and the cache is created without persistent state.
    If the process does not have sufficient privileges to create or modify the file, the persistent state is not created or updated.
    Every change to the default settings is reported at load time by a message on standard error. Note that the first version of the persistent data is written only after the first 200 elements have been inserted into the cache. If your analysis adds fewer than 200 elements to the cache, they are not stored.
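
Since insufficient privileges do not stop loading, a client may want to check the cache path itself before passing it to instance(). The following sketch uses only the standard java.nio.file API; the class CachePathCheck and its method usable() are hypothetical names, and the check is an approximation of what the file system will allow, not a description of how the product tests the file.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the delivered API.
final class CachePathCheck {
    // Returns true if an existing cache file can be read and written,
    // or if a missing one could be created in its parent directory.
    static boolean usable(String pathname) {
        Path file = Paths.get(pathname).toAbsolutePath();
        if (Files.exists(file)) {
            return Files.isRegularFile(file)
                && Files.isReadable(file)
                && Files.isWritable(file);
        }
        Path parent = file.getParent();
        return parent != null
            && Files.isDirectory(parent)
            && Files.isWritable(parent);
    }

    public static void main(String[] args) {
        // "adhoc-cache.bin" is an assumed file name used for illustration.
        String pathname = args.length > 0 ? args[0] : "adhoc-cache.bin";
        if (usable(pathname)) {
            System.out.println("Cache file can be used: " + pathname);
        } else {
            System.out.println("Cache state would not persist at: " + pathname);
        }
    }
}
```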

Changing the default cache size

To change the default cache size:

  • The second overload adds an int parameter in the last position, after the String pathname, which sets the maximum cache size. The current default size is 10 000. To disable the cache entirely, pass an int value <= 0.
    public static 
      AdHocLemmatizer instance
        (String surfFsmName, String surfTabName, 
        String wfFsmName, String wfTabName, String wfTripleName, 
        String suffixTraName, 
        String lexicalizedFsmName,String lexicalizedTableName, 
        String postFilterName,
        String pathname, int maxCacheSize) throws Exception; 
    
    To change the cache size while keeping the default temporary cache, pass null as the String pathname argument and the new int value as the last argument.

    Lexicalized Word

    
    public String[] 
        analyzeAsLexicalizedForm(String query, String filter) 
        throws Exception;
    

    The method retrieves the citation forms for a word form of a lexicalized lexeme.

    Parameters:

    1. String containing the word form to be recognized
    2. String containing the feature filter
    The return value contains the results. If the form is not part of any lexicalized lexeme, the function returns null.

    Unknown Word

    
    public String[] 
        analyzeAsUnknownForm(String query, String filter) 
          throws Exception;
    

    The method analyzes an unknown (not lexicalized) form and generates the possible citation forms and category information.

    Parameters:

    1. String containing the form to be analyzed
    2. String containing the feature filter
    The return value contains the results. If the form cannot be analyzed with the current word formation rules, the function returns null. If the form is a lexicalized form, the function also returns null.

    Combining API Functions

    The two API analysis functions are complementary. The first one (analyzeAsLexicalizedForm) retrieves only lexicalized forms; the second one (analyzeAsUnknownForm) analyzes only forms that are not lexicalized. To retrieve, analyze and interpret all kinds of forms correctly, use both functions in the following way (also shown in the delivered IntegrationDemo.java source file):
    AdHocLemmatizer lemmatizer = AdHocLemmatizer.instance(....);
    ...
    String[] 
      results = lemmatizer.analyzeAsLexicalizedForm(form,filter);
    if (results == null)
       results = lemmatizer.analyzeAsUnknownForm(form,filter);
    ...
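
The fallback pattern above can also be wrapped in a single method that never returns null. The following sketch compiles without the delivered library: the interface FormAnalyzer is a hypothetical stand-in that mirrors the two analysis methods of AdHocLemmatizer, and the demo analyzer uses made-up data. Its result strings and the empty filter string are placeholders and do not reflect the real output or filter format.

```java
import java.util.Arrays;

// Hypothetical stand-in for the two analysis methods of AdHocLemmatizer,
// so that the sketch compiles without the delivered library.
interface FormAnalyzer {
    String[] analyzeAsLexicalizedForm(String query, String filter) throws Exception;
    String[] analyzeAsUnknownForm(String query, String filter) throws Exception;
}

final class CombinedLookup {
    // Tries the lexicalized lookup first and falls back to the unknown word
    // analysis; returns an empty array if neither method delivers a result.
    static String[] analyze(FormAnalyzer analyzer, String form, String filter)
            throws Exception {
        String[] results = analyzer.analyzeAsLexicalizedForm(form, filter);
        if (results == null) {
            results = analyzer.analyzeAsUnknownForm(form, filter);
        }
        return results != null ? results : new String[0];
    }

    // Toy analyzer with made-up data; the result strings are placeholders.
    static FormAnalyzer demoAnalyzer() {
        return new FormAnalyzer() {
            public String[] analyzeAsLexicalizedForm(String query, String filter) {
                return query.equals("houses") ? new String[] {"lexicalized:house"} : null;
            }
            public String[] analyzeAsUnknownForm(String query, String filter) {
                return query.endsWith("ish") ? new String[] {"unknown:" + query} : null;
            }
        };
    }

    public static void main(String[] args) throws Exception {
        FormAnalyzer demo = demoAnalyzer();
        for (String form : new String[] {"houses", "greenish", "xyz"}) {
            System.out.println(form + " -> " + Arrays.toString(analyze(demo, form, "")));
        }
    }
}
```

In a real client, a thin adapter that forwards both methods to the AdHocLemmatizer object would take the place of the demo analyzer.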