
Assume that I have the following CSV file containing the action log of an application. This CSV may contain 3-4 million rows.

Company, ActionsType, Action
ABC, Downloaded, Tutorial 1
ABC, Watched, Tutorial 2
PQR, Subscribed, Tutorial 1
ABC, Watched, Tutorial 2
PQR, Subscribed, Tutorial 3
XYZ, Subscribed, Tutorial 1
XYZ, Watched, Tutorial 3
PQR, Downloaded, Tutorial 1

Is there any way, using Java, to aggregate this data by grouping on the company name and showing the ActionsType counts as columns, like below?

Company, Downloaded, Watched, Subscribed
ABC, 1, 2, 0
PQR, 1, 0, 2
XYZ, 0, 1, 1

I thought of loading the CSV file into a List using OpenCSV, but is that efficient for a CSV file with millions of rows?

Tharik Kanaka
  • if you have the memory sure – MihaiC Nov 27 '14 at 09:33
  • Idk how good it is, but that's a computer vision library; CSV is not its primary goal. I'd look into databases for what you're trying to do http://stackoverflow.com/a/11624706/995891 – zapl Nov 27 '14 at 09:34
  • do you want to do this only with java? how about a database solution? – MihaiC Nov 27 '14 at 09:35
  • A simple algorithm that puts data into a `Map<String, Map<String, Integer>>` line by line (`BufferedReader#readLine()`, some string split() or regex) would work as well. I'd think that you could do that in less than 100 lines of code. – zapl Nov 27 '14 at 09:41
  • @zapl my bad, it should be OpenCSV, not OpenCV. Yes, it is pivoting, but it has to be done with a CSV file. I took a look at that, but is it possible with millions of rows? – Tharik Kanaka Nov 27 '14 at 09:41
  • @MihaiC Is there any other way if memory is not enough? The thing is, all the data is stored in the CSV; otherwise I could have written a query – Tharik Kanaka Nov 27 '14 at 09:42
  • @TharikKanaka you can load data from CSV into database first. Easily done with sqlldr for Oracle Database or BCP for sqlserver – MihaiC Nov 27 '14 at 09:44
  • You don't actually need to store the entire data in memory or a database and the grouping of your data should be really simple as well, https://gist.github.com/anonymous/702eee972e901b624116 should almost work. – zapl Nov 27 '14 at 10:03

5 Answers


Loading everything into a List is definitely inefficient if all you need is an aggregation. For very large data sets, check out MapReduce.

Here's a plain Java solution without MapReduce:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

public class CSVMapper {

    public String transformCsv (String csvFile) {
        return csvMapToString(getCsvMap(csvFile));
    }

    private HashMap<String, Integer[]> getCsvMap (String csvFile) {
        // <K,V> := <Company, [Downloaded, Watched, Subscribed]>
        HashMap<String, Integer[]> csvMap = new HashMap<String, Integer[]>();

        // Read the CSV line by line; only one counter array per company is kept in memory,
        // so the full data set never has to be held as parsed objects.
        try (BufferedReader reader = new BufferedReader(new StringReader(csvFile))) {
            String csvLine;
            while ((csvLine = reader.readLine()) != null) {
                String[] csvColumns = csvLine.split(",");
                if (csvColumns.length < 2) {
                    continue; // skip blank or malformed lines
                }
                String company = csvColumns[0].trim();
                if (company.equals("Company")) {
                    continue; // skip the header row
                }
                String actionsType = csvColumns[1].trim();

                Integer[] columnValues = csvMap.get(company);
                if (columnValues == null) {
                    columnValues = new Integer[] { 0, 0, 0 };
                    csvMap.put(company, columnValues);
                }
                columnValues[0] += actionsType.equals("Downloaded") ? 1 : 0;
                columnValues[1] += actionsType.equals("Watched")    ? 1 : 0;
                columnValues[2] += actionsType.equals("Subscribed") ? 1 : 0;
            }
        }
        catch (IOException e) {
            //TODO: handle the exception properly (log or rethrow)
        }
        return csvMap;
    }

    private String csvMapToString (HashMap<String, Integer[]> csvMap) {
        StringBuilder newCsvFile = new StringBuilder();
        newCsvFile.append("Company, Downloaded, Watched, Subscribed\n");
        for (Map.Entry<String, Integer[]> entry : csvMap.entrySet()) {
            Integer[] columnValues = entry.getValue();
            newCsvFile.append(entry.getKey())
                      .append(", ").append(columnValues[0])
                      .append(", ").append(columnValues[1])
                      .append(", ").append(columnValues[2])
                      .append('\n');
        }
        return newCsvFile.toString();
    }

    public static void main (String[] args) {
        String csvFile = "Company, ActionsType, Action\n" +
                     "ABC, Downloaded, Tutorial 1\n" +
                     "ABC, Watched, Tutorial 2\n" +
                     "PQR, Subscribed, Tutorial 1\n" +
                     "ABC, Watched, Tutorial 2\n" +
                     "PQR, Subscribed, Tutorial 3\n" +
                     "XYZ, Subscribed, Tutorial 1\n" +
                     "XYZ, Watched, Tutorial 3\n" +
                     "PQR, Downloaded, Tutorial 1";

        System.out.println( (new CSVMapper()).transformCsv(csvFile) );
    }
}
antimatter

Since you are dealing with millions of entries in the CSV, I don't think using Java to parse said file is the best course of action here.

I would instead use Java or .NET to create another application which would do the following:

If you have an Oracle database:

a) load all the data from the CSV into the database using sqlldr, by invoking it as an external process. More info about sqlldr: http://www.thegeekstuff.com/2012/06/oracle-sqlldr/
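A minimal sketch of what that external call could look like from Java. The credentials, control file and log file names here are placeholders for illustration, not anything prescribed by sqlldr:

import java.io.IOException;

public class SqlLoaderInvoker {

    public static void main(String[] args) throws IOException, InterruptedException {
        // Assumed names: actions.ctl describes how actions.csv maps onto the TMP_CSV table;
        // scott/tiger@ORCL is a placeholder for real credentials and TNS alias.
        ProcessBuilder pb = new ProcessBuilder(
                "sqlldr",
                "userid=scott/tiger@ORCL",
                "control=actions.ctl",
                "log=actions.log",
                "direct=true");          // direct path load is faster for millions of rows
        pb.inheritIO();                  // show sqlldr output in this process's console

        Process process = pb.start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IllegalStateException("sqlldr failed with exit code " + exitCode);
        }
    }
}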

b) after loading is complete, the program would run a query to extract the data as needed by you, something like this:

WITH T AS (SELECT COMPANY, ACTIONSTYPE FROM tmp_csv)
SELECT *
  FROM T PIVOT (COUNT (1)
         FOR ACTIONSTYPE
         IN ('Downloaded', 'Watched', 'Subscribed'))

c) do what you want with the ResultSet of the query: save it to another CSV, or create a new table with it so you can query the results from there (the latter is easily achieved by modifying the query with a CREATE TABLE AS statement before it)
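A rough JDBC sketch of steps b) and c), assuming the staging table is called tmp_csv and the result should go to a new CSV file; the connection string and credentials are placeholders:

import java.io.FileWriter;
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PivotExporter {

    public static void main(String[] args) throws Exception {
        // Aliases in the IN list give the pivoted columns predictable names
        String sql =
            "WITH T AS (SELECT COMPANY, ACTIONSTYPE FROM tmp_csv) " +
            "SELECT * FROM T PIVOT (COUNT(1) FOR ACTIONSTYPE " +
            "IN ('Downloaded' AS DOWNLOADED, 'Watched' AS WATCHED, 'Subscribed' AS SUBSCRIBED))";

        // Placeholder connection details
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:oracle:thin:@localhost:1521:ORCL", "scott", "tiger");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql);
             PrintWriter out = new PrintWriter(new FileWriter("aggregated.csv"))) {

            out.println("Company, Downloaded, Watched, Subscribed");
            while (rs.next()) {
                out.println(rs.getString("COMPANY") + ", " + rs.getInt("DOWNLOADED")
                        + ", " + rs.getInt("WATCHED") + ", " + rs.getInt("SUBSCRIBED"));
            }
        }
    }
}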

If you have a SQL Server database:

Same process as with Oracle, but use the bcp utility to load the data from the initial CSV. More info here: http://msdn.microsoft.com/en-us/library/ms162802.aspx

You can run this application every time the log CSV is regenerated and keep the data persisted in the database, where you can query it however you want, build reports, etc. Since the processing is pushed to the database, it is far more efficient than a pure Java solution.

If you insist on a pure Java solution, I recommend using parallelism to read and process the records from the CSV, merging the partial results and writing the output to a new CSV file.
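If you go that route, here is a minimal sketch using Java 8 parallel streams (the file name is an assumption): each line is parsed independently and the per-company, per-action counts are merged by the collector.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ParallelAggregator {

    public static void main(String[] args) throws Exception {
        // Count occurrences per (Company, ActionsType) pair in parallel
        try (Stream<String> lines = Files.lines(Paths.get("actions.csv"))) {
            Map<String, Map<String, Long>> counts = lines
                    .skip(1)                                // skip the header row
                    .parallel()
                    .map(line -> line.split(","))
                    .filter(cols -> cols.length >= 2)
                    .collect(Collectors.groupingBy(
                            cols -> cols[0].trim(),         // Company
                            Collectors.groupingBy(
                                    cols -> cols[1].trim(), // ActionsType
                                    Collectors.counting())));

            System.out.println("Company, Downloaded, Watched, Subscribed");
            counts.forEach((company, actions) -> System.out.println(
                    company + ", " + actions.getOrDefault("Downloaded", 0L)
                            + ", " + actions.getOrDefault("Watched", 0L)
                            + ", " + actions.getOrDefault("Subscribed", 0L)));
        }
    }
}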

MihaiC

If you can store your CSV entries in MongoDB, for instance (convert the rows to JSON first), you could use its map/reduce data processing.
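A rough illustration with the 2.x-era MongoDB Java driver. The database name, collection name and field names (company, actionsType) are assumptions about how the rows were imported, not part of the original answer:

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MapReduceOutput;
import com.mongodb.MongoClient;

public class ActionMapReduce {

    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost", 27017);
        DB db = client.getDB("logs");                       // assumed database name
        DBCollection actions = db.getCollection("actions"); // assumed collection name

        // Map: emit one counter document per action, keyed by company
        String map = "function() {"
                + "  var v = { Downloaded: 0, Watched: 0, Subscribed: 0 };"
                + "  v[this.actionsType] = 1;"
                + "  emit(this.company, v);"
                + "}";

        // Reduce: sum the counters for each company
        String reduce = "function(key, values) {"
                + "  var r = { Downloaded: 0, Watched: 0, Subscribed: 0 };"
                + "  values.forEach(function(v) {"
                + "    r.Downloaded += v.Downloaded; r.Watched += v.Watched; r.Subscribed += v.Subscribed;"
                + "  });"
                + "  return r;"
                + "}";

        MapReduceOutput out = actions.mapReduce(map, reduce, "actionCounts", new BasicDBObject());
        for (DBObject doc : out.results()) {
            System.out.println(doc);
        }
        client.close();
    }
}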

Olivier Meurice

My suggestion is to use a batch processing framework like Spring Batch.

You can use a FieldSetMapper implementation to map the tokens of each line to a domain object.

Next you can implement a custom ItemWriter that either holds the items in a collection or keeps counts per company and writes them to the output, then repeats for the next company. Examples that write to a database work the same way; you only need to swap the reader and writer. A rough sketch of the custom pieces is below.
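A sketch of the two custom pieces under that approach. The class and field names are my own, and it assumes a FlatFileItemReader whose DelimitedLineTokenizer is configured with the column names "company", "actionsType", "action":

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.file.mapping.FieldSetMapper;
import org.springframework.batch.item.file.transform.FieldSet;

// Simple domain object for one CSV row
class ActionRecord {
    final String company;
    final String actionsType;

    ActionRecord(String company, String actionsType) {
        this.company = company;
        this.actionsType = actionsType;
    }
}

// Maps the tokens produced by the line tokenizer onto the domain object
class ActionFieldSetMapper implements FieldSetMapper<ActionRecord> {
    @Override
    public ActionRecord mapFieldSet(FieldSet fieldSet) {
        return new ActionRecord(fieldSet.readString("company").trim(),
                                fieldSet.readString("actionsType").trim());
    }
}

// Accumulates per-company counters instead of writing every item out
class AggregatingItemWriter implements ItemWriter<ActionRecord> {
    private final Map<String, int[]> counts = new ConcurrentHashMap<String, int[]>();

    @Override
    public void write(List<? extends ActionRecord> items) throws Exception {
        for (ActionRecord item : items) {
            int[] c = counts.get(item.company);
            if (c == null) {
                c = new int[3]; // Downloaded, Watched, Subscribed
                counts.put(item.company, c);
            }
            if ("Downloaded".equals(item.actionsType)) c[0]++;
            else if ("Watched".equals(item.actionsType)) c[1]++;
            else if ("Subscribed".equals(item.actionsType)) c[2]++;
        }
    }

    public Map<String, int[]> getCounts() {
        return counts;
    }
}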

hndanesh

For the example CSV in the question, a 4-million-row file of roughly 28 bytes per row is on the order of 100 MB; even with more columns it should still fit into the RAM of most modern servers/containers. I am also assuming you might not have access to a server-side SQL database such as Oracle, MySQL, or Postgres.

Given this, I would avoid hand-coding the aggregation with built-in data structures. Instead, I would load the CSV into an embedded SQL database such as H2 or Apache Derby and write SQL directly to slice and dice the data (GROUP BY, COUNT, etc.); a minimal sketch follows the links below.

http://h2database.com/html/main.html https://db.apache.org/derby/
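A minimal sketch using H2's built-in CSVREAD function over JDBC. The file name is an assumption, and it assumes the CSV header is exactly "Company,ActionsType,Action" (the quoted identifiers below must match the header names):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class H2CsvAggregator {

    public static void main(String[] args) throws Exception {
        // In-memory H2 database (H2 jar on the classpath); CSVREAD parses the file directly,
        // so there is no manual loading code
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:actions");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT \"Company\", "
               + "       SUM(CASE WHEN TRIM(\"ActionsType\") = 'Downloaded' THEN 1 ELSE 0 END) AS DOWNLOADED, "
               + "       SUM(CASE WHEN TRIM(\"ActionsType\") = 'Watched'    THEN 1 ELSE 0 END) AS WATCHED, "
               + "       SUM(CASE WHEN TRIM(\"ActionsType\") = 'Subscribed' THEN 1 ELSE 0 END) AS SUBSCRIBED "
               + "FROM CSVREAD('actions.csv') "
               + "GROUP BY \"Company\" ORDER BY \"Company\"")) {

            System.out.println("Company, Downloaded, Watched, Subscribed");
            while (rs.next()) {
                System.out.println(rs.getString(1) + ", " + rs.getInt(2)
                        + ", " + rs.getInt(3) + ", " + rs.getInt(4));
            }
        }
    }
}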

Another option is the Tablesaw library, which has dataframe-like capabilities similar to the Python Pandas library; a short sketch follows the link below.

https://jtablesaw.github.io/tablesaw/userguide/reducing.html
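A short sketch of what that might look like with Tablesaw's crosstab support; the file name is an assumption and the exact method names should be checked against the user guide linked above:

import tech.tablesaw.api.Table;

public class TablesawAggregator {

    public static void main(String[] args) throws Exception {
        // Load the CSV into a dataframe-like Table (assumed file name)
        Table actions = Table.read().csv("actions.csv");

        // Cross-tabulate companies against action types to get the pivoted counts
        Table counts = actions.xTabCounts("Company", "ActionsType");
        System.out.println(counts.print());
    }
}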

Always know your data size, current and future. At a billion rows, you will definitely need an external, indexed database.

Thyag