Input: thousands (>10,000) of CSV files, each containing >50,000 entries. Output: store this data in a MySQL database.

Approach taken: read each file and store its data into the database. Below is the code snippet. Please suggest whether this approach is OK or not.

    PreparedStatement pstmt2 = null;
    try {
        pstmt1 = con.prepareStatement(sqlQuery);
        result = pstmt1.executeUpdate();
        con.setAutoCommit(false);
        sqlQuery = "insert into " + tableName
                + " (x,y,z,a,b,c) values(?,?,?,?,?,?)";
        pstmt2 = con.prepareStatement(sqlQuery);
        Path file = Paths.get(filename);

        lines = Files.lines(file, StandardCharsets.UTF_8);
        final int batchsz = 5000;
        for (String line : (Iterable<String>) lines::iterator) {
            pstmt2.setString(1, "somevalue");
            pstmt2.setString(2, "somevalue");
            pstmt2.setString(3, "somevalue");
            pstmt2.setString(4, "somevalue");
            pstmt2.setString(5, "somevalue");
            pstmt2.setString(6, "somevalue");
            pstmt2.addBatch();
            if (++linecnt % batchsz == 0) {
                pstmt2.executeBatch(); // flush a full batch
            }
        }
        int[] batchResult = pstmt2.executeBatch(); // flush the final partial batch
        con.commit();
    } catch (BatchUpdateException e) {
        log.error(Utility.dumpExceptionMessage(e));
    } catch (IOException ioe) {
        log.error(Utility.dumpExceptionMessage(ioe));
    } catch (SQLException e) {
        log.error(Utility.dumpExceptionMessage(e));
    } finally {
        if (lines != null) {
            lines.close();
        }
        try {
            if (pstmt1 != null) {
                pstmt1.close();
            }
            if (pstmt2 != null) {
                pstmt2.close();
            }
        } catch (SQLException e) {
            log.error(Utility.dumpExceptionMessage(e));
        }
    }
  • I would try a bulk export to dump, or an import to insert them all. – Peter Lawrey Apr 19 '16 at 14:16
  • Shall we clarify the terminology? In the SQL world, a dump means data that has been exported from a database, either as a set of SQL queries, CSV, or a proprietary format. Getting data from a dump into a database is called importing or loading. Now if you could clarify your question ... – e4c5 Apr 19 '16 at 14:29
  • I recently handled a similar problem; in my case the data was 600,000+ rows. My solution was using multiple threads and a blocking queue to import the data. However, multithreading does not actually improve the performance; it only reduces the response time. If you really want to make it quicker, consider a parallel solution. But 50,000 rows is too small to parallelize; the complexity far outweighs the benefit. Therefore, I suggest you stick with a single thread with bulk insertion. – Junbang Huang Apr 19 '16 at 17:18
  • @e4c5 - thanks for the clarification. I've modified the title. The goal is: read the CSV files and store the data into the DB. – basu Apr 19 '16 at 18:59

2 Answers


I've used LOAD DATA INFILE in situations like this in the past.

The LOAD DATA INFILE statement reads rows from a text file into a table at a very high speed. LOAD DATA INFILE is the complement of SELECT ... INTO OUTFILE. (See Section 14.2.9.1, “SELECT ... INTO Syntax”.) To write data from a table to a file, use SELECT ... INTO OUTFILE. To read the file back into a table, use LOAD DATA INFILE. The syntax of the FIELDS and LINES clauses is the same for both statements. Both clauses are optional, but FIELDS must precede LINES if both are specified.

The IGNORE number LINES option can be used to ignore lines at the start of the file. For example, you can use IGNORE 1 LINES to skip over an initial header line containing column names:

LOAD DATA INFILE '/tmp/test.txt' INTO TABLE test IGNORE 1 LINES;

http://dev.mysql.com/doc/refman/5.7/en/load-data.html
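
Since the question is in Java, it may help to see that LOAD DATA can also be issued through JDBC rather than the mysql client. Below is a minimal sketch, not part of the answer above: the URL, credentials, and table name are assumptions, and it relies on Connector/J's allowLoadLocalInfile property plus local_infile being enabled on the server.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class LoadDataExample {
        public static void main(String[] args) throws SQLException {
            // Assumed URL/credentials; allowLoadLocalInfile=true lets Connector/J
            // send a client-side file (the server must also have local_infile=1).
            String url = "jdbc:mysql://localhost:3306/test?allowLoadLocalInfile=true";
            try (Connection con = DriverManager.getConnection(url, "user", "password");
                 Statement stmt = con.createStatement()) {
                // Same statement as in the quoted example: skip the header line.
                stmt.execute("LOAD DATA LOCAL INFILE '/tmp/test.txt' "
                        + "INTO TABLE test IGNORE 1 LINES");
            }
        }
    }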

  • This CSV file format is custom. It has some initial lines with device details, and after that the column names and their respective values follow. Can those initial lines be ignored? – basu Apr 19 '16 at 19:23
  • I've edited to add "IGNORE ... LINES"; hope it helps. – Rodrigo Apr 20 '16 at 12:27
  • Really helpful with your detailed description. – basu Apr 21 '16 at 09:17

As @Rodrigo has already pointed out, LOAD DATA INFILE is the way to go. Java is not really needed at all.

If the format of your CSV is not something that can be inserted directly into the database, your Java code can re-enter the picture. Use it to reorganize/transform the CSV and save it as another CSV file instead of writing it into the database.
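
For example, here is a minimal sketch of such a pre-processing pass; the file names, the number of header lines to skip, and the transform() method are all hypothetical placeholders to adapt to your format.

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class CsvCleaner {
        // Hypothetical per-line transformation; adapt to your column layout.
        static String transform(String line) {
            return line.trim();
        }

        public static void main(String[] args) throws IOException {
            Path in = Paths.get("input.csv");     // hypothetical source file
            Path out = Paths.get("cleaned.csv");  // file handed to LOAD DATA INFILE
            try (Stream<String> lines = Files.lines(in, StandardCharsets.UTF_8);
                 BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
                lines.skip(5)                     // skip the device-detail header lines
                     .map(CsvCleaner::transform)
                     .forEach(l -> {
                         try {
                             writer.write(l);
                             writer.newLine();
                         } catch (IOException e) {
                             throw new UncheckedIOException(e);
                         }
                     });
            }
        }
    }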

You can also use the Java code to iterate through the folder that contains the CSV files and then execute the system command for the import (a fuller sketch follows after the snippet):

    Runtime r = Runtime.getRuntime();
    Process p = r.exec("mysql -p password -u user database -e 'LOAD DATA INFILE ....'");

You will find that this is much, much faster than running an individual SQL query for each row of the CSV file.
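
Putting the two ideas together, here is a hedged sketch of that folder iteration; the directory, table name, and credentials are assumptions, and ProcessBuilder is used instead of Runtime.exec to avoid shell-quoting pitfalls.

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class FolderLoader {
        public static void main(String[] args) throws IOException, InterruptedException {
            Path csvDir = Paths.get("/data/csv"); // hypothetical folder of CSV files
            try (DirectoryStream<Path> files = Files.newDirectoryStream(csvDir, "*.csv")) {
                for (Path csv : files) {
                    // Table name and credentials are placeholders.
                    String sql = "LOAD DATA LOCAL INFILE '" + csv.toAbsolutePath()
                            + "' INTO TABLE mytable IGNORE 1 LINES";
                    Process p = new ProcessBuilder("mysql", "-u", "user",
                            "-ppassword", "database", "-e", sql)
                            .inheritIO()  // forward mysql's output to this console
                            .start();
                    if (p.waitFor() != 0) {
                        System.err.println("Import failed for " + csv);
                    }
                }
            }
        }
    }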
