
Hello, I have a problem in which I have to read a huge CSV file, extract the first field from each line, and then store only the unique values in a file. I have written a program using threads that implements the producer-consumer pattern.

The class CSVLineStripper does what the name suggests: it takes a line from the CSV, extracts the first field, and adds it to a queue. CSVLineProcessor then takes each field from the queue and checks it against an ArrayList of the fields seen so far, so that only unique ones are kept. The ArrayList is used only for reference; every unique field is written to a file.

All fields are stripped correctly. When I run about 3000 lines, the output is all correct. When I run the program on all lines, which number around 7,00,000+, I get incomplete records: about 1000 unique values are missing. Every field is enclosed in double quotes. What is weird is that the last field in the generated file is an incomplete word, with the closing double quote missing. Why is this happening?

import java.util.*;
import java.io.*;
class CSVData
{
    Queue <String> refererHosts = new LinkedList <String> ();
    Queue <String> uniqueReferers = new LinkedList <String> (); // final writable queue of unique referers

    private int finished = 0;
    private int safety = 100;
    private String line = "";
    public CSVData(){}
    public synchronized String getCSVLine() throws InterruptedException{
        int i = 0;
        while(refererHosts.isEmpty()){
            if(i < safety){
                wait(10);
            }else{
                return null;
            }
            i++;
        }
        finished = 0;
        line = refererHosts.poll();
        return line;
    }

    public synchronized void putCSVLine(String CSVLine){
        if(finished == 0){ 
            refererHosts.add(CSVLine);
            this.notifyAll();
        }
    }
}
class CSVLineStripper implements Runnable //Producer
{
    private CSVData cd;
    private BufferedReader csv;
    public CSVLineStripper(CSVData cd, BufferedReader csv){ // CONSTRUCTOR
        this.cd = cd;
        this.csv = csv;
    }
    public void run() {
        System.out.println("Producer running");
        String line = "";
        String referer = "";
        String [] CSVLineFields;
        int limit = 700000;
        int lineCount = 1;

        try {
            while((line = csv.readLine()) != null){
                CSVLineFields     = line.split(",");
                referer         = CSVLineFields[0];
                cd.putCSVLine(referer);
                lineCount++;
                if(lineCount >= limit){
                    break;
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println("<<<<<< PRODUCER FINISHED >>>>>>>");
    }

    private String printString(String [] str){
        String string = "";
        for(String s: str){
            string = string + " "+s;
        }
        return string;
    }
}

class CSVLineProcessor implements Runnable
{
    private CSVData cd;
    private FileWriter fw = null;
    private BufferedWriter bw = null;

    public CSVLineProcessor(CSVData cd, BufferedReader bufferedReader){ // CONSTRUCTOR
        this.cd = cd;
        try {
            this.fw = new FileWriter("unique_referer_dump.txt");
        } catch (IOException e) {
            e.printStackTrace();
        }
        this.bw = new BufferedWriter(fw);
    }
    public void run() {
        System.out.println("Consumer Started");
        String CSVLine = "";
        int safety = 10000;
        ArrayList <String> list = new ArrayList <String> ();

        while(CSVLine != null || safety <= 10000){
               try {
                CSVLine = cd.getCSVLine();
                if(!list.contains(CSVLine)){
                    list.add(CSVLine);
                    this.CSVDataWriter(CSVLine);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            if(CSVLine == null){
                break;
            }else{
                safety++;
            }
        }

        System.out.println("<<<<<< CONSUMER FINISHED >>>>>>>");
        System.out.println("Unique referers found in 30000 records "+list.size());
    }  
    private void CSVDataWriter(String referer){
        try {
            bw.write(referer+"\n");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}


public class RefererCheck2 
{
    public static void main(String [] args) throws InterruptedException
    {
        String pathToCSV = "/home/shantanu/DEV_DOCS/Contextual_Work/excite_domain_kw_site_wise_click_rev2.csv";
        CSVResourceHandler csvResHandler = new CSVResourceHandler(pathToCSV);
        CSVData cd = new CSVData();
        CSVLineProcessor consumer     = new CSVLineProcessor(cd, csvResHandler.getCSVFileHandler());
        CSVLineStripper producer     = new CSVLineStripper(cd, csvResHandler.getCSVFileHandler());
        Thread consumerThread = new Thread(consumer);
        Thread producerThread = new Thread(producer);
        producerThread.start();
        consumerThread.start();
    }
}

This is how a sample input is:

"xyz.abc.com","4432"."clothing and gifts","true"
"pqr.stu.com","9537"."science and culture","false"
"0.stu.com","542331"."education, studies","false"
"m.dash.com","677665"."technology, gadgets","false"

Producer stores in queue:

"xyz.abc.com"
"pqr.stu.com"
"0.stu.com"
"m.dash.com"

The consumer writes the unique values to the file, but on opening the file one sees:

"xyz.abc.com"
"pqr.stu.com"
"0.st
Shades88

1 Answer


A couple of things: you are breaking after 700k lines, not 7m. Also, you are not flushing your BufferedWriter, so the last chunk of output could be incomplete. Add a flush at the end and close all your resources. A debugger is a good idea :)
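As a minimal sketch of the flush-and-close advice, and not the asker's actual code, the example below writes lines through a BufferedWriter inside a try-with-resources block. The class name FlushOnClose and the method writeLines are hypothetical names chosen for illustration. Closing the writer flushes whatever is still sitting in its buffer, which is the step that appears to be missing in CSVLineProcessor.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the question's code.
class FlushOnClose {
    // Writes each entry on its own line. The writer is closed when the try
    // block exits (normally or via an exception), and close() flushes any
    // buffered characters to disk before releasing the file.
    static void writeLines(List<String> lines, Path out) throws IOException {
        try (BufferedWriter bw = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            for (String line : lines) {
                bw.write(line);
                bw.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for unique_referer_dump.txt.
        Path out = Files.createTempFile("unique_referer_dump", ".txt");
        writeLines(Arrays.asList("\"xyz.abc.com\"", "\"pqr.stu.com\"", "\"0.stu.com\""), out);
        System.out.println("Wrote " + Files.readAllLines(out, StandardCharsets.UTF_8).size() + " lines to " + out);
    }
}
```

In the question's code, which keeps the writer in a field, the equivalent change would be to call bw.flush() or bw.close() once the consumer loop has finished.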

Joelio
What utter stupidity! Yes, I had not flushed the BufferedWriter. It worked like magic. Thanks a 7 million. – Shades88 Jun 27 '12 at 19:08