Tagging large files with Stanford's Part-Of-Speech Tagger

Question

I am using Java and IntelliJ IDEA to run Stanford's POS tagger, which I set up by following this tutorial: (http://new.galalaly.me/index.php/2011/05/tagging-text-with-stanford-pos-tagger-in-java-applications/). It runs correctly, but it outputs only about two paragraphs of text even when I give it far more input (my file contains 774 KB of text).

At the bottom of the tutorial it states this for memory problems:

It turns out that the problem is that Eclipse allocates only 256MB of memory by default. Right-click on the Project -> Run As -> Run Configurations -> go to the Arguments tab -> under VM arguments type -Xmx2048m. This will set the allocated memory to 2GB, and all the tagger files should run now.

I have configured IntelliJ to use 4GB of memory per this answer: How to increase IDE memory limit in IntelliJ IDEA on Mac?

Yet, it did not change the amount of outputted text in the slightest.

What else could be causing this to happen?

(link to original site of the POS tagger: https://nlp.stanford.edu/software/tagger.shtml)

EDIT:

I have pasted my Main class below. TaggedWord is a helper class that parses and organizes the relevant pieces of data retrieved from the tagger.

package com.company;
import java.io.*;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class Main {

    public static void main(String[] args) {

        File infile = new File("C:\\Users\\TEST\\Desktop\\input.txt");
        File outfile = new File("C:\\Users\\TEST\\Desktop\\output.txt");
        MaxentTagger tagger = new MaxentTagger("tagger/english-left3words-distsim.tagger");
        FileWriter fw;
        BufferedWriter bw;
        List<TaggedWord> taggedWords;

        try {
            //read in entire text file to String
            String fileContents = new Scanner(infile).useDelimiter("\\Z").next();

            //erase contents of outfile from previous run
            PrintWriter pw = new PrintWriter(outfile);
            pw.close();

            //tag file contents with parts of speech
            String fileContentsTagged = tagger.tagString(fileContents);

            taggedWords = processTaggedWords(fileContentsTagged);

            fw = new FileWriter(outfile, true); //true = append
            bw = new BufferedWriter(fw);

            String uasiContent = "";
            boolean firstWord = true;
            for (TaggedWord tw : taggedWords) {
                String englishWord = tw.getEng_word();
                String uasiWord = translate(englishWord);
                if (!tw.isPunctuation()) {
                    uasiContent += uasiWord + " ";
                }
                else {
                    //remove last space
                    uasiContent = uasiContent.substring(0, uasiContent.length() - 1);
                    uasiContent += uasiWord + " ";
                }
            }
            bw.write(uasiContent);
            bw.close();
        }
        catch (FileNotFoundException e1) {
            System.out.println("File not found.");
        }
        catch (IOException e) {
            System.out.print("Error writing to file.");
        }
    }  //end main

EDIT2:

I have now changed the code that reads the file into a String so that it uses a while loop, but it still gives me the same results:

        //read in entire text file to String
        String fileContents = "";
        Scanner sc = new Scanner(infile).useDelimiter("\\Z");
        while (sc.hasNext()) {
            fileContents += sc.next();
        }

Insufficient memory would give you an out-of-memory error rather than partial content. Can you post the relevant sections of your code where you feed in the text and where you handle the output? — Adnan S, Mar 11 '18 at 02:21
You need to iterate through the scanner results - see answer below - let me know if you run into any issues — Adnan S, Mar 11 '18 at 02:44
Hello Ahmed, I did as you suggested but it still reads in only a small amount. Could it be I'm using the wrong delimiter? — nhershy, Mar 11 '18 at 03:49
How are you trying to split the file? By newline/return? By periods? I am not familiar with the \\Z delimiter. Another option is to use a Reader instead of a Scanner; see SO question 16104616 — Adnan S, Mar 11 '18 at 03:53
The file is a book I got from Project Gutenberg. I copied the file as-is from the site: https://www.gutenberg.org/files/98/98-0.txt It always reads in up to "Chapter V The Wine-shop" — nhershy, Mar 11 '18 at 03:56
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/166607/discussion-between-ahmed-s-and-nhershy). — Adnan S, Mar 11 '18 at 03:59

score 1 · Answer 1 · answered Mar 11 '18 at 02:44

Your Scanner is only called once, so it reads just the beginning of the input file. To continue, declare the Scanner as a stand-alone variable and then iterate with a while loop on its hasNext() method. See the Scanner documentation for an example of declaring and iterating through a scanner.
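
The loop described above can be sketched as follows. Because the asker reports that looping alone did not change the result, the sketch also checks `Scanner.ioException()`: a Scanner treats an IOException from its underlying source as end of input and does not rethrow it, so a read error (for example, a decoding failure) can look like a short file. The class name `ReadWholeFile`, the method names, and the choice of UTF-8 are assumptions for illustration, not something confirmed in this thread. `readWithFiles` is an alternative that avoids Scanner altogether.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Scanner;

public class ReadWholeFile {

    // Collects every token the Scanner yields, then rethrows any
    // IOException that the Scanner recorded instead of throwing.
    public static String readWithScanner(Path path) throws IOException {
        StringBuilder contents = new StringBuilder();
        // "UTF-8" is an assumption about the input file's encoding.
        try (Scanner sc = new Scanner(path, "UTF-8")) {
            sc.useDelimiter("\\Z");
            while (sc.hasNext()) {
                contents.append(sc.next());
            }
            IOException swallowed = sc.ioException();
            if (swallowed != null) {
                throw swallowed;
            }
        }
        return contents.toString();
    }

    // Alternative without Scanner: read all bytes, then decode them once.
    public static String readWithFiles(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Demo input written to a temporary file so the sketch runs on its own.
        Path input = Files.createTempFile("tagger-input", ".txt");
        String sample = "\u201CIt was the best of times,\u201D he said.";
        Files.write(input, sample.getBytes(StandardCharsets.UTF_8));

        System.out.println("Scanner read " + readWithScanner(input).length() + " chars");
        System.out.println("Files read " + readWithFiles(input).length() + " chars");

        Files.delete(input);
    }
}
```

If `readWithScanner` throws, the exception message should indicate why reading stopped early. If both methods return the full text, the truncation is more likely happening later, in the tagging or output code.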
