Stop words and stemmer in Java

I’m thinking of adding stop-word removal to my similarity program, followed by a stemmer (Porter 1 or 2, depending on which is easier to implement).

I read my text from files as whole lines and save them as one long string. So suppose I have two strings, e.g.

String one = "I decided buy something from the shop.";
String two = "Nevertheless I decidedly bought something from a shop.";

Now that I have those strings:

Stemming: Can I just run the stemmer algorithm directly on the string, save the result as a String, and then continue with the similarity computation as I did before adding the stemmer, with something like one.stem()?

Stop words: How does this work? Do I just use one.replaceAll("I", ""), or is there a specific way to do this? I want to keep working with a string, and end up with a string that I can pass to the similarity algorithms. Wikipedia doesn’t say a lot.

Hope you can help me out! Thanks.

Edit: This is for a school project where I’m writing a paper on similarity between different algorithms, so I don’t think I’m allowed to use Lucene or other libraries that do the work for me. Plus, I would like to understand how it works before I start using libraries like Lucene. Hope it’s not too much of a bother ^^

I want a Java Arabic stemmer

I’m looking for a Java stemmer for Arabic. I found a library called AraMorph, but its output is hard to control and it reshapes words in ways I don’t want. Is there any other stemmer for Arabic?

How to remove stop words in java?

I want to remove stop words in java. So, I read stop words from text file. and store Set Set<String> stopWords = new LinkedHashSet<String>(); BufferedReader br = new BufferedReader(new Fi

Is there a java implementation of Porter2 stemmer

Do you know any java implementation of the Porter2 stemmer(or any better stemmer written in java)? I know that there is a java version of Porter(not Porter2) here : http://tartarus.org/~martin/PorterS

Java Lucene English Stemmer?

I need help indexing and searching english text using Java Lucene over Google App Engine. The only solution I have found so far was the SnowballAnalyzer (in the contrib packages), but it only supports

Adding words to SQL Server Full Text Stemmer

I’ve dug around for a few hours now and cannot find an option to do this. What I would like to do is to add words to the stemmer used by Full Text in SQL Server. I work for an agency that would like t

how to remove stop words in english using java program

How to remove stop words in english using java program. Please help me with simplest program or suggest me some ideas. Thanks in advance

how to add custom stop words using lucene in java

I am using lucene to remove English Stop words but my requirement is remove English stop words and Custom stop words. Below is my code to remove English stop words using lucene. My Sample Code: publi

Remove stop words in Java: help needed

Im using a method to remove stop word defined in a file, that will rip off those words from the query string that i pass to this method… The code is working fine Now what i need to do is … If the

Removing stop words is not tokenizing correctly

I using java program to remove stop words in a word file. But my stop word removal not removing special characters. And I want to remove all the stop words and other unnecessary words,special characte

How to stop NLTK stemmer from removing the trailing "e"?

I’m using NLTK stemmer to remove grammatical variations of a stem word. However, the Porter or Snowball stemmers remove the trailing e of the original form of a noun or verb, e.g., Profile becomes Pro

Answers

Yes, you can wrap any stemmer so that you can write something like

String stemmedString = stemmer.stemAndRemoveStopwords(inputString, stopWordList);

Internally, your stemAndRemoveStopwords would:

  • place all stop words in a Set (or a Map) for fast lookup
  • initialize an empty StringBuilder to hold the output string
  • iterate over all words in the input string, and for each word
    • look it up in the stop-word set; if found, continue to the top of the loop
    • otherwise, stem it using your preferred stemmer, and append it to the output string
  • return the output string
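
The steps above can be sketched roughly as follows. This is only an illustration under some assumptions: the class name StopwordStemmer and the toyStem method are made up for this example, and toyStem is a placeholder that merely strips a trailing "ed" or "ly". It is not the Porter algorithm, so pass in your own stemmer instead.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.UnaryOperator;

class StopwordStemmer {

    // Hypothetical stand-in for a real stemmer: strips a trailing "ed" or "ly".
    // This is NOT Porter stemming; replace it with your own implementation.
    static String toyStem(String word) {
        if (word.length() > 4 && (word.endsWith("ed") || word.endsWith("ly"))) {
            return word.substring(0, word.length() - 2);
        }
        return word;
    }

    // Drops stop words, stems the remaining words, and joins them with single spaces.
    static String stemAndRemoveStopwords(String input,
                                         Set<String> stopWords,
                                         UnaryOperator<String> stemmer) {
        StringBuilder out = new StringBuilder(input.length());
        for (String word : input.trim().split("\\s+")) {
            if (word.isEmpty() || stopWords.contains(word)) {
                continue; // skip stop words (and the empty token from blank input)
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(stemmer.apply(word));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "I", "the"));
        String one = "I decided buy something from the shop.";
        System.out.println(stemAndRemoveStopwords(one, stopWords, StopwordStemmer::toyStem));
    }
}
```

The lookup is case-sensitive and punctuation stays attached to words (for example "shop."), so you would probably want to normalize the tokens before comparing them.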

If you’re not implementing this for academic reasons you should consider using the Lucene library. In either case it might be good for reference. It has classes for tokenization, stop word filtering, stemming and similarity. Here’s a quick example using Lucene 3.0 to remove stop words and stem an input string:

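import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public static String removeStopWordsAndStem(String input) throws IOException {
    // Stop words to drop; matching is case-sensitive here.
    Set<String> stopWords = new HashSet<String>();
    stopWords.add("a");
    stopWords.add("I");
    stopWords.add("the");

    // Tokenize, then chain the stop-word filter and the Porter stemmer.
    TokenStream tokenStream = new StandardTokenizer(
            Version.LUCENE_30, new StringReader(input));
    tokenStream = new StopFilter(true, tokenStream, stopWords);
    tokenStream = new PorterStemFilter(tokenStream);

    // Collect the surviving terms into a space-separated string.
    StringBuilder sb = new StringBuilder();
    TermAttribute termAttr = tokenStream.getAttribute(TermAttribute.class);
    while (tokenStream.incrementToken()) {
        if (sb.length() > 0) {
            sb.append(" ");
        }
        sb.append(termAttr.term());
    }
    return sb.toString();
}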

Used on your strings like this:

public static void main(String[] args) throws IOException {
    String one = "I decided buy something from the shop.";
    String two = "Nevertheless I decidedly bought something from a shop.";
    System.out.println(removeStopWordsAndStem(one));
    System.out.println(removeStopWordsAndStem(two));
}

Yields this output:

decid bui someth from shop
Nevertheless decidedli bought someth from shop

You don’t have to deal with the whole text. Just split it, apply your stopword filter and stemming algorithm, then build the string again using a StringBuilder:

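// stopwordFilter and stemmer stand for your own implementations;
// check(word) is assumed to return true when the word is NOT a stop word.
StringBuilder builder = new StringBuilder(text.length());
String[] words = text.split("\\s+"); // Split on runs of whitespace.
for (String word : words) {
    if (stopwordFilter.check(word)) { // Apply stopword filter.
        word = stemmer.stem(word); // Apply stemming algorithm.
        if (builder.length() > 0) {
            builder.append(' '); // Keep words separated in the output.
        }
        builder.append(word);
    }
}
text = builder.toString();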
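
One thing the whitespace split does not handle is punctuation and case: "shop." will not match a stop word or stem the same way as "shop". It also shows why one.replaceAll(...) on the raw string is risky, since it removes the pattern inside other words too. Below is a small sketch of token-level filtering, assuming you are happy to lowercase everything and treat any non-letter as a separator. The class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class TokenFilterDemo {

    // Lowercases the text and splits on runs of non-letter characters,
    // so "shop." becomes "shop" and "I" becomes "i".
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) { // split can yield an empty leading token
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Removes whole tokens that are stop words; stopWords must be lowercase.
    static String removeStopWords(String text, Set<String> stopWords) {
        StringBuilder sb = new StringBuilder();
        for (String token : tokenize(text)) {
            if (stopWords.contains(token)) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(token);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "i", "the"));
        String two = "Nevertheless I decidedly bought something from a shop.";
        // Naive approach: also deletes "the" inside "Nevertheless".
        System.out.println(two.replaceAll("the", ""));
        // Token-based approach: only whole words are removed.
        System.out.println(removeStopWords(two, stopWords));
    }
}
```

If you need the original casing for later steps, keep the original token alongside the normalized one and use only the normalized form for the stop-word lookup.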