Removing words from a text file in Java

Question

I am having some problems with this Java task.

I have two files: hello.txt and stopwords.txt. I want to remove every word listed in stopwords.txt from hello.txt, and then print the frequencies of the top n words in the updated hello.txt to the console.

I know how to do this in Python, but not in Java. I believe a hash map would be the best approach for this.

Thank you very much!

I have attempted to use this code, but I am not getting any output:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.*;

public class practice {

    public static void main(String[] args) throws IOException {
        ArrayList stopword = new ArrayList<>();
        try {
            FileInputStream fis = new FileInputStream("stopwords.txt");
            byte b[] = new byte[fis.available()];
            fis.read(b);
            fis.close();
            String data[] = new String(b).trim().split("\n");
            for (int i = 0; i < data.length; i++) {
                stopword.add(data[i].trim());
            }
            FileInputStream fis2 = new FileInputStream("hello.txt");
            byte b1[] = new byte[fis2.available()];
            fis2.read(b);
            fis2.close();
            String data1[] = new String(b1).trim().split("\n");
//                  String myFile="";
            for(int i = 0; i < data1.length; i++) {
                String myFile = "";
                String s2[] = data[i].split("/s");
                for (int j = 0; j < s2.length; j++) {
                    if (!(stopword.contains(s2[j].trim().toLowerCase()))) {
                        myFile = myFile+s2[j]+" ";
                    }
                }
                System.out.println(myFile+"\n");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        File file = new File("hello.txt");
        try (Scanner sc = new Scanner(new FileInputStream(file))) {
            int count=0;
            while(sc.hasNext()){
                sc.next();
                count++;
            }
            System.out.println("Number of words for new file: " + count);
        }
    }
}

`fis.read(b);` Java may not read the entire file. You have to check the return value and keep reading if the required amount has not been read. Also, don't try to read binary; for strings, use an InputStreamReader to decode the bytes. — markspace, Nov 04 '22 at 03:25
Can you [edit] your question and post sample contents of files _hello.txt_ and _stopwords.txt_? — Abra, Nov 04 '22 at 03:50
"_...and have the frequency of the top n elements_" - are you referring to the top _n_ elements that were removed, or to the frequency of the words present in the saved file? — hfontanez, Nov 04 '22 at 05:03

hfontanez · Accepted Answer · 2022-11-04T05:06:53.317

Given a file hello.txt containing

remove leave remove leave remove leave re move remov e leave remove hello remove world!

and a file stopwords.txt containing

remove world

Using the Files class, I can read the entire contents of each file into a (normalized) string. Then I can use replaceAll() from the String class to remove each stop word from the text. My example doesn't save the new string back to the file, but this can easily be done by adding the following lines:

byte[] strToBytes = helloTxt.getBytes();
Files.write(Paths.get("hello.txt"), strToBytes);

The code to read the file and replace all found stop words:

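import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class RemoveWords {
    public static void main (String[] args) {
        try {
            // per @markspace's comment: read each file straight into a String
            String helloTxt = Files.readString(Paths.get("hello.txt"), Charset.defaultCharset());
            String stopWordsTxt = Files.readString(Paths.get("stopwords.txt"), Charset.defaultCharset());
            
            // stop words are separated by whitespace (spaces or line breaks)
            String[] stopWords = stopWordsTxt.split("\\s");
            
            // replaceAll() treats stopWord as a regex and removes every occurrence of it
            for (String stopWord : stopWords) {
                helloTxt = helloTxt.replaceAll(stopWord, "");
            }
            
            System.out.println(helloTxt);
            
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}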

Outputs

 leave  leave  leave re move remov e leave  hello  !
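
One caveat with the replaceAll() approach: it removes the stop word wherever it occurs, including inside longer words ("removed" would become "d"), and a stop word containing regex metacharacters would be interpreted as a pattern. If only whole words should go, a token-based filter is one option. The following is a minimal sketch, not part of the original answer; the class name StopWordFilter and the helpers removeStopWords and normalize are hypothetical names chosen for illustration, and it assumes words are separated by whitespace:

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper class, for illustration only.
class StopWordFilter {

    // Keeps only the tokens whose normalized form is not a stop word.
    // Runs of whitespace in the input collapse to single spaces in the result.
    static String removeStopWords(String text, Set<String> stopWords) {
        return Arrays.stream(text.trim().split("\\s+"))
                .filter(token -> !stopWords.contains(normalize(token)))
                .collect(Collectors.joining(" "));
    }

    // Lowercases the token and strips everything except letters, digits
    // and apostrophes, so that "World!" is compared as "world".
    static String normalize(String token) {
        return token.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{Nd}']", "");
    }

    public static void main(String[] args) {
        String text = "remove leave remove leave remove leave re move remov e leave remove hello remove world!";
        Set<String> stopWords = Set.of("remove", "world");
        // prints: leave leave leave re move remov e leave hello
        System.out.println(removeStopWords(text, stopWords));
    }
}
```

Unlike the replaceAll() version, this drops "world!" as a whole token (punctuation included) and leaves no doubled spaces behind. The stop words are assumed to be stored in lowercase.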

To calculate the frequency of words, you may want to check out this solution I came up with for another use case.
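
For the "top n" part of the question, a HashMap works as the asker suspected. The following is a minimal sketch under my own assumptions, not the linked solution: the class name WordFrequency and the methods countWords and topN are hypothetical, words are compared in lowercase, and ties are broken alphabetically.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper class, for illustration only.
class WordFrequency {

    // Counts how often each word occurs, ignoring case and punctuation.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Split on anything that is not a letter, digit or apostrophe.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}']+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns the n most frequent entries: highest count first,
    // alphabetical order among words with the same count.
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> counts, int n) {
        Comparator<Map.Entry<String, Integer>> byCountDescThenWord = (a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        };
        return counts.entrySet().stream()
                .sorted(byCountDescThenWord)
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String cleaned = "leave leave leave re move remov e leave hello";
        for (Map.Entry<String, Integer> entry : topN(countWords(cleaned), 3)) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
        // prints:
        // leave: 4
        // e: 1
        // hello: 1
    }
}
```

In practice, the string passed to countWords would be the text left after the stop words have been removed.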

Rather than `readAllBytes` I think `readString` is a better choice. `String(byte[])` is deprecated. https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/nio/file/Files.html#readString(java.nio.file.Path,java.nio.charset.Charset) — markspace, Nov 04 '22 at 04:01
@markspace edited the answer per your comment. Thanks. However, I am using Java 17 and I don't see `String(byte[])` constructor marked as deprecated. — hfontanez, Nov 04 '22 at 04:22
Constructor `String(byte[])` is not deprecated (a few overloaded constructors receiving a byte array were deprecated more than 20 years ago, so it's not a recent change). However, generating an array of bytes and then copying these bytes into the underlying array of a `String` introduces a redundant step. `Files.readString()` is a good choice here. — Alexander Ivanchenko, Nov 04 '22 at 09:44
That's a good answer. As a word of advice, I think a couple of words about NIO.2 would be useful for OP, since they're using the legacy I/O API. Both APIs contain lots of classes, and many beginners tend to confuse the `File` and `Files` classes, so one introductory sentence and a link for further reading could be helpful for the questioner and readers. — Alexander Ivanchenko, Nov 04 '22 at 10:01
