I have a Java application that uses openCSV to read a very large file. I put the 4th column into a HashSet (eventually another column or two will be added, if that makes a difference) and write the set out to a new file. This all seems to work fine, but I discovered it is only reading part of the file (131,544 lines of 272,948). Is this a limitation of openCSV or of Java in general, and is there a way to get around it?

My code for reference:

public static void main(String[] args) throws IOException {
    String itemsFile = new String();        
    String outFile = new String();
    itemsFile = "items.txt";        
    outFile = "so.txt";
    CSVReader reader = null;
    try {
        reader = new CSVReader(new FileReader(itemsFile), '\t');
    } catch (FileNotFoundException e) {
        System.out.println(e.getMessage());
        e.printStackTrace();
    }

    String[] nextLine;
    HashSet<String> brands = new HashSet<>();               
    while ((nextLine = reader.readNext()) != null) {
        brands.add(nextLine[4]);            
    }               

    String[] brandArray = new String[brands.size()];
    Iterator<String> it = ((HashSet<String>) brands).iterator();
    int listNum = 0;
    while (it.hasNext()) {
        Object brand = (Object) it.next();
        brandArray[listNum] = (String) brand;
        listNum++;
    }

    CSVWriter writer = new CSVWriter(new FileWriter(outFile), '\n');
    writer.writeNext(brandArray);           
    writer.close();
}

I apologize if my code is messy; this is my first real "completed" Java application. Any assistance is much appreciated.

I've even tried removing those lines from the txt file to make sure it isn't hanging up on some character, but it seems to stop at that line anyway.

Hirthas
  • Have you printed the size of the collections to better understand what happens? Have you tried putting a breakpoint in your program where it reaches the last line read, to see what goes wrong? – assylias Feb 20 '13 at 20:21
  • Also, you add items to a HashSet, which can't contain duplicates. So if the same string is found more than once, it will only be added once. That's most likely what is happening. Replace HashSet with ArrayList and see if it works better. – assylias Feb 20 '13 at 20:23
  • @assylias I have tried changing to an ArrayList but I get the same result. I am using a HashSet because I do not want duplicates. I figured out what line it stopped on by adding a counter to the while loop that adds values to the HashSet. I will try adding a breakpoint though and see what happens. – Hirthas Feb 20 '13 at 20:29
  • Ah ok - how do you know it's not reading it all then? – assylias Feb 20 '13 at 20:31
  • @assylias When looking at the output file, I am missing about 23 expected values that appear after that line. – Hirthas Feb 20 '13 at 20:39
  • You can use Apache Commons CSV; it has streaming support. – diyoda_ Jul 06 '15 at 20:00
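
The question raised in the comments (duplicates collapsing in the set versus rows not being read at all) can be separated by counting both numbers side by side. The following is only a minimal sketch using the standard library, not openCSV; the class name `ReadDiagnostics` and the method `count` are hypothetical, and it assumes tab-separated input with no quoted fields.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper (not part of openCSV): counts rows read and distinct column values.
final class ReadDiagnostics {
    private ReadDiagnostics() {}

    /** Returns {rowsRead, distinctValues} for the zero-based column of tab-separated input. */
    static int[] count(Reader source, int column) {
        int rows = 0;
        Set<String> distinct = new HashSet<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                rows++;
                // A limit of -1 keeps trailing empty fields instead of dropping them.
                String[] fields = line.split("\t", -1);
                if (column < fields.length) {
                    distinct.add(fields[column]);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new int[] {rows, distinct.size()};
    }
}
```

If the row count matches the number of lines in the file, the smaller output is just the set removing duplicates; if the row count itself is short, the reader is the place to look.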

2 Answers

OK, I figured this out thanks to user @Michael in chat. Apparently openCSV can't handle such a large file because it is not streaming. So I looked into streaming the file instead, and it works great.

Here's the end code:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashSet;

public static void main(String[] args) throws IOException {
    String fileName = "items.txt";
    String outputFile = "so.txt";
    HashSet<String> brand = new HashSet<>();

    // Read one line at a time; try-with-resources closes the reader.
    try (BufferedReader myInput = new BufferedReader(
            new InputStreamReader(new FileInputStream(fileName)))) {
        String thisLine;
        while ((thisLine = myInput.readLine()) != null) {
            String[] line = thisLine.split("\t");
            // Column 2 is only collected when column 20 is "1".
            if (line[20].equals("1")) {
                if (!line[2].equals("") && !line[2].equals(" ")) {
                    if (line[2].indexOf("'") > -1) {
                        System.out.println(line[2]);
                        // Note: "\'" is the same string as "'", so this replace
                        // changes nothing; "\\'" would insert a backslash.
                        line[2] = line[2].replace("'", "\'");
                        System.out.println(line[2]);
                    }
                    brand.add(line[2]);
                }
            }
            // Columns 3 and 4 are collected whenever they are non-blank.
            if (!line[3].equals("") && !line[3].equals(" ")) {
                line[3] = line[3].replace("'", "\'");
                brand.add(line[3]);
            }
            if (!line[4].equals("") && !line[4].equals(" ")) {
                if (line[4].indexOf("'") > -1) {
                    System.out.println(line[4]);
                    line[4] = line[4].replace("'", "\'");
                    System.out.println(line[4]);
                }
                brand.add(line[4]);
            }
        }
    }

    String[] brands = brand.toArray(new String[brand.size()]);

    // Write the values as one comma-separated line of single-quoted strings.
    try (BufferedWriter bw = new BufferedWriter(new FileWriter(outputFile))) {
        for (int i = 0; i < brands.length; i++) {
            if (i == 0) {
                bw.write("'" + brands[i] + "'");
            } else {
                bw.write(",'" + brands[i] + "'");
            }
        }
    } catch (Exception e) {
        System.out.println(e.getMessage());
        e.printStackTrace();
    }
}

Thanks for everyone's help on this.

Hirthas

For me the issue was a bug in OpenCSV 3.4 that occurs when the end of a line coincides with the end of the BufferedReader's buffer.

This test shows the bug:

    @Test
    void readWithBufferSize() throws IOException {

        for (int bufferSize = 2; bufferSize <= 3; bufferSize++) {
            // A <CR> <LF> B <NULL>
            byte[] content = {65, 13, 10, 66, 0};

            InputStream is = new ByteArrayInputStream(content);
            BufferedReader bfReader = new BufferedReader(new InputStreamReader(is), bufferSize);
            CSVReader reader = new CSVReader(bfReader);

            List<String> rows = new ArrayList<>();
            String[] cols;
            while((cols = reader.readNext()) != null) {
                rows.add(String.join(",", cols));
            }

            System.out.printf("buffer size: %d rows: %s%n", bufferSize, String.join(",", rows));
            // this fails for bufferSize = 3
            assert (rows.size() == 2);
        }
    }
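
As a baseline for comparison, the same bytes can be read with plain `BufferedReader.readLine()`, which is documented to treat a carriage return followed immediately by a line feed as a single line terminator. This is only a sketch with a hypothetical class name (`BufferBoundaryBaseline`); it does not exercise OpenCSV at all, it just shows what the JDK reader returns for each buffer size.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reads all lines from a byte array with a given buffer size.
final class BufferBoundaryBaseline {
    private BufferBoundaryBaseline() {}

    static List<String> readLines(byte[] content, int bufferSize) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(content),
                        StandardCharsets.US_ASCII),
                bufferSize)) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

Calling `readLines(new byte[] {65, 13, 10, 66, 0}, 3)` should yield two lines regardless of where the buffer boundary falls, so a different row count from `CSVReader` on the same input points at the CSV layer rather than at the underlying reader.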
D-rk