0

I am making a simple program where I am using a sample of PDF files to build a full text indexing on my database. The idea is I read each PDF file, extract the words and store them in a hashset.

Then, add each word in a loop to the table in MySQL along with it's file path. So, each word is looped through to be stored in each column until it finishes. It works perfectly fine. However , when it comes to large PDF files which contains thousands and thousands of words, it might take some time to build the index table.In other words, it takes long time to save each word to the database as extraction of words is fast.

Code:

public class IndexTest {

public static void main(String[] args) throws Exception {
    // write your code here
    //String path ="D:\\Full Text Indexing\\testIndex\\bell2009a.pdf";
    // HashSet<String> uniqueWords = new HashSet<>();
    /*StopWatch stopwatch = new StopWatch();
    stopwatch.start();*/
    File folder = new File("D:\\PDF1");
    File[] listOfFiles = folder.listFiles();

    for (File file : listOfFiles) {
        if (file.isFile()) {
            HashSet<String> uniqueWords = new HashSet<>();
            String path = "D:\\PDF1\\" + file.getName();
            try (PDDocument document = PDDocument.load(new File(path))) {

                if (!document.isEncrypted()) {

                    PDFTextStripper tStripper = new PDFTextStripper();
                    String pdfFileInText = tStripper.getText(document);
                    String lines[] = pdfFileInText.split("\\r?\\n");
                    for (String line : lines) {
                        String[] words = line.split(" ");

                        for (String word : words) {
                            uniqueWords.add(word);

                        }

                    }
                    // System.out.println(uniqueWords);

                }
            } catch (IOException e) {
                System.err.println("Exception while trying to read pdf document - " + e);
            }
            Object[] words = uniqueWords.toArray();
            String unique = uniqueWords.toString();
            //  System.out.println(words[1].toString());



            for(int i = 1 ; i <= words.length - 1 ; i++ ) {
                MysqlAccessIndex connection = new MysqlAccessIndex();
                connection.readDataBase(path, words[i].toString());

            }

            System.out.println("Completed");

        }
    }

SQL connection code:

 public class MysqlAccessIndex {

      public MysqlAccessIndex() throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        connect = DriverManager
                .getConnection("jdbc:mysql://126.32.3.178/fulltext_ltat?"
                        + "user=root&password=root123");
      //  statement = connect.createStatement();
        System.out.print("Connected");
    }


    public void readDataBase(String path,String word) throws Exception {
        try {




            statement = connect.createStatement();
            System.out.print("Connected");


            preparedStatement = connect
                    .prepareStatement("insert IGNORE into  fulltext_ltat.test_text values (?, ?) ");

            preparedStatement.setString(1, path);
            preparedStatement.setString(2, word);
            preparedStatement.executeUpdate();
            // resultSet = statement
            //.executeQuery("select * from fulltext_ltat.index_detail");



            //  writeResultSet(resultSet);
        } catch (Exception e) {
            throw e;
        } finally {
            close();
        }

    }

Is there any suggestion to improve or optimize the performance issue?

Daredevil
  • 1,672
  • 3
  • 18
  • 47

1 Answers1

1

The issue lies in the following code:

// This will load the MySQL driver, each DB has its own driver
Class.forName("com.mysql.jdbc.Driver");
// Setup the connection with the DB
connect = DriverManager.getConnection(
        "jdbc:mysql://126.32.3.20/fulltext_ltat?" + "user=root&password=root");

You're recreating the connection for every word you're inserting into your database. A better way would be something like this:

public MysqlAccess() {
    connect = DriverManager
                .getConnection("jdbc:mysql://126.32.3.20/fulltext_ltat?"
                        + "user=root&password=root");
}

This way you're only creating the connect the first time an instance of that class is created. Inside your main method you have to create the MysqlAccess instance outside your for loop, so it only gets created once.

MysqlAccess will look something like this:

public class MysqlAccess {

    private Connection connect = null;
    private Statement statement = null;
    private PreparedStatement preparedStatement = null;
    private ResultSet resultSet = null;

    public MysqlAccess() {
        // Setup the connection with the DB
        connect = DriverManager.getConnection(
                "jdbc:mysql://126.32.3.20/fulltext_ltat?" + "user=root&password=root");
    }

    public void readDataBase(String path, String word) throws Exception {
        try {
            // Statements allow to issue SQL queries to the database
            statement = connect.createStatement();
            System.out.print("Connected");
            // Result set get the result of the SQL query

            preparedStatement = connect.prepareStatement(
                    "insert IGNORE into  fulltext_ltat.test_text values (default,?, ?) ");

            preparedStatement.setString(1, path);
            preparedStatement.setString(2, word);
            preparedStatement.executeUpdate();

        } catch (Exception e) {
            throw e;
        } finally {
            close();
        }

    }

    private void writeResultSet(ResultSet resultSet) throws SQLException {
        // ResultSet is initially before the first data set
        while (resultSet.next()) {
            // It is possible to get the columns via name
            // also possible to get the columns via the column number
            // which starts at 1
            // e.g. resultSet.getSTring(2);
            String path = resultSet.getString("path");
            String word = resultSet.getString("word");

            System.out.println();
            System.out.println("path: " + path);
            System.out.println("word: " + word);

        }
    }
}
Mark
  • 5,089
  • 2
  • 20
  • 31
  • So how does the Main method look like now? I can remove Class.forName? – Daredevil Oct 22 '18 at 08:57
  • Edited the answer, and be sure to put `MysqlAccess connection = new MysqlAccess();` at the top of your `main` function so it is outside the for loops and it only created once. – Mark Oct 22 '18 at 09:02
  • can you invite me to a discussion room real quick? I have some question regarding your answer – Daredevil Oct 22 '18 at 09:03
  • You should indeed remove it. It hasn't been needed since 2007. – user207421 Oct 22 '18 at 09:03
  • It doesn't let me add Class.forName claiming java.lang.ClassNotFoundException – Daredevil Oct 22 '18 at 09:05
  • 1
    I'm not sure how discussion rooms work. You don't need `Class.forName` anymore like @user207421 suggested – Mark Oct 22 '18 at 09:06
  • @Mark Also, in .GetConnection it says Java.SQL.sqlexception . Does it throw this error if my server is not connected? – Daredevil Oct 22 '18 at 09:07
  • 1
    `Class.forName` is only needed where Java SPI does not work, typically application server, OSGi modular software or such. – Joop Eggen Oct 22 '18 at 09:08
  • According to [DriverManager](https://docs.oracle.com/javase/7/docs/api/java/sql/DriverManager.html#getConnection(java.lang.String)) it will throw an `SQLException` if `a database access error occurs`, so it it's offline it could throw that. – Mark Oct 22 '18 at 09:08
  • @Mark Ok thank you mark, so I just call that method in my MAIN function right? – Daredevil Oct 22 '18 at 09:09
  • You only have to put `MysqlAccess connection = new MysqlAccess();` at the top of your `main` method. The answer connects to the database in the constructor of that class. See [Constructor](https://docs.oracle.com/javase/tutorial/java/javaOO/objectcreation.html) for more info if you're unsure what that is. – Mark Oct 22 '18 at 09:10
  • @Mark I'm gonna try now and get back to you. So this certainly helps improves the performance you reckon? – Daredevil Oct 22 '18 at 09:11
  • Yes it will help a lot – Mark Oct 22 '18 at 09:12
  • @Mark Seems like I couldn't establish a connection to my db through .getConnection – Daredevil Oct 22 '18 at 09:16
  • Any errors? You said it worked before, I have not changed anything about how the connection is created. – Mark Oct 22 '18 at 09:18
  • @Mark I think since you seperate the connection to a different method, you forgot to add a try catch block to catch the exception as what I did in my original code. – Daredevil Oct 22 '18 at 09:22
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/182253/discussion-between-daredevil-and-mark). – Daredevil Oct 22 '18 at 09:23
  • @Mark So I tried your solution, however right after it opens the connection , it only runs the loop for 1 time before closing it as there is no operations allowed after connection is closed. – Daredevil Oct 23 '18 at 06:44
  • Remove the part in `readDatabase` where it closes the connection. – Mark Oct 23 '18 at 07:13