Mar 5, 2011

Weka stemming

Since yesterday I've been going crazy trying to get Weka (a data mining tool) to work with a stemmer (Snowball). I finally found a solution thanks to another blogger. The process is as follows:
  1. Download the Snowball stemmer .jar from the Weka website (link).
  2. Put it into your Weka root directory.
  3. Modify the RunWeka.ini file in that directory so that the classpath line looks like this: cp=%CLASSPATH%;snowball.jar
Now, when you use the StringToWordVector filter, choose the SnowballStemmer option under "stemmer". Then click the text box next to it in the filter UI. A popup should appear where you enter the language you need; in my case I used the Spanish stemmer, so I entered "spanish" without the quotes.

Note that you can also give the filter a stopword list, so those words are removed when it processes the text. To do so, click the stopwords text field in the filter GUI. A file chooser window should open, letting you pick your stopword list. The file format is one word per line, and lines starting with "#" are treated as comments. Keep this in mind when creating your own stopword list. There are also some ready-made lists available on the net.
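
If you want to check a stopword file before handing it to Weka, the format described above is easy to read yourself. The following is only a rough sketch in plain Java and does not use Weka at all: the class name StopwordList and its methods are made up for this example. Trimming whitespace and skipping blank lines are my own assumptions; the post only states "one word per line" and "#" comments.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; this is NOT a Weka class.
class StopwordList {

    // Turns the lines of a stopword file into a set of words.
    // Format: one word per line, lines starting with "#" are comments.
    static Set<String> parse(List<String> lines) {
        Set<String> words = new LinkedHashSet<>(); // keeps file order, drops duplicates
        for (String raw : lines) {
            String line = raw.trim(); // assumption: surrounding whitespace is not significant
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // assumption: blank lines are skipped; "#" lines are comments
            }
            words.add(line);
        }
        return words;
    }

    // Reads a stopword file from disk (assumed here to be UTF-8 encoded).
    static Set<String> load(Path file) throws IOException {
        return parse(Files.readAllLines(file, StandardCharsets.UTF_8));
    }
}
```

For example, calling StopwordList.load on a file containing the lines "# Spanish stopwords", "el" and "la" would give back a set with just "el" and "la". If the set comes out empty or contains whole sentences, the file probably does not follow the one-word-per-line format.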