Training and Using Your Own Models

Most functions provided by the Text Mining Plugin use the OpenNLP library (version 1.5) and combine probabilistic and dictionary-based approaches. The plugin ships with pre-built probabilistic models and dictionaries, which are included in the plugin zip file under the classes folder. You can train your own models or create your own dictionaries and have Datameer use them by replacing the supplied files. Note that you might need to restart Datameer for the plugin to load the replaced files.
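As an illustration of the replacement step, the following Python sketch writes a copy of the plugin zip file in which one entry is swapped for a custom file. It is a minimal sketch, not an official Datameer tool: the function name replace_zip_entry is hypothetical, and the entry path classes/en-ner-organization.bin is an assumption based on the classes folder mentioned above, so check the actual entry names in your plugin zip file first.

```python
import zipfile


def replace_zip_entry(plugin_zip, entry_name, replacement_file, output_zip):
    """Write a copy of plugin_zip to output_zip with one entry replaced.

    entry_name is the path inside the archive, for example
    "classes/en-ner-organization.bin" (assumed layout, verify it first).
    output_zip must be a different path than plugin_zip.
    """
    with zipfile.ZipFile(plugin_zip) as src:
        # Fail early instead of silently producing an unchanged copy.
        if entry_name not in src.namelist():
            raise KeyError(f"{entry_name} not found in {plugin_zip}")
        with zipfile.ZipFile(output_zip, "w", zipfile.ZIP_DEFLATED) as dst:
            for item in src.infolist():
                if item.filename == entry_name:
                    # Store the custom file under the original entry name.
                    dst.write(replacement_file, arcname=entry_name)
                else:
                    # Copy every other entry unchanged.
                    dst.writestr(item, src.read(item.filename))
    return output_zip
```

A call might look like replace_zip_entry("text-mining-plugin.zip", "classes/en-ner-organization.bin", "my-organization-model.bin", "text-mining-plugin-custom.zip"), where all file names are placeholders. Keep a backup of the original zip file so that you can restore the supplied models.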

Training and Using Custom Models

Each function that uses a maximum entropy model has a corresponding OpenNLP model file (e.g., en-ner-organization.bin for EXTRACT_ORGANIZATION) in the classes folder of the plugin zip file. These model files capture the statistics the algorithm needs to perform the corresponding task. You can either use one of the models from the OpenNLP Tools Site or train your own. To train your own model from labeled data, refer to the OpenNLP Documentation Page.
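Labeled data for the OpenNLP name finder is typically one whitespace-tokenized sentence per line, with each entity wrapped in <START:type> and <END> markers. The Python sketch below shows one way to produce such lines from tokens and entity spans. The function name to_opennlp_line and the span representation are hypothetical, and the type label organization is an assumption chosen to match the model name above. The OpenNLP documentation remains the authoritative description of the training format and of the trainer itself.

```python
def to_opennlp_line(tokens, spans):
    """Render one sentence in the OpenNLP name finder training format.

    tokens: list of token strings for a single sentence.
    spans:  list of (start, end, entity_type) tuples over token indices,
            with end exclusive. Spans are assumed not to overlap.
    """
    starts = {start: entity_type for start, end, entity_type in spans}
    ends = {end for start, end, entity_type in spans}
    parts = []
    for index, token in enumerate(tokens):
        if index in starts:
            parts.append(f"<START:{starts[index]}>")
        parts.append(token)
        # A span ends after the token just emitted.
        if index + 1 in ends:
            parts.append("<END>")
    return " ".join(parts)


if __name__ == "__main__":
    sentence = ["Acme", "Corp", "opened", "an", "office", "in", "Berlin", "."]
    print(to_opennlp_line(sentence, [(0, 2, "organization")]))
    # prints: <START:organization> Acme Corp <END> opened an office in Berlin .
```

A file of such lines can then be passed to the OpenNLP name finder trainer to produce a .bin model file that replaces the supplied one.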

Creating and Using Custom Dictionaries

Some functions combine maximum entropy models and dictionaries to annotate words. Dictionaries are provided as simple lists of words, one word per line (e.g., en-ner-organization.list for EXTRACT_ORGANIZATION). The algorithm then combines the annotations from the statistical model and the dictionary. If there are other organizations you want Datameer to recognize, you can provide a new dictionary or extend the existing one. Note that this only works for functions that already have a dictionary file associated with them.
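Extending a dictionary amounts to appending entries to the list file while keeping one entry per line. The following Python sketch does this and skips blank lines and exact duplicates. The function name extend_dictionary is hypothetical, and UTF-8 encoding is an assumption. Whether the plugin matches entries case-sensitively is not stated here, so the sketch only removes exact duplicates.

```python
from pathlib import Path


def extend_dictionary(list_path, new_entries, encoding="utf-8"):
    """Add entries to a one-entry-per-line dictionary file.

    Existing entries keep their order, new entries are appended,
    and blank lines and exact duplicates are dropped.
    Returns the merged list of entries.
    """
    path = Path(list_path)
    existing = path.read_text(encoding=encoding).splitlines() if path.exists() else []
    seen = set()
    merged = []
    for entry in list(existing) + list(new_entries):
        entry = entry.strip()
        if entry and entry not in seen:
            seen.add(entry)
            merged.append(entry)
    # Write one entry per line, ending with a trailing newline.
    path.write_text("\n".join(merged) + "\n", encoding=encoding)
    return merged
```

For example, extend_dictionary("en-ner-organization.list", ["Acme", "Globex"]) would add two placeholder organization names to an extracted copy of the dictionary. The updated file then needs to be placed back into the plugin zip file in place of the original.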