
I am exploring Impala for a POC, but I can't get any significant insert performance. I can't insert 5,000 records/sec; at most I was able to insert a mere 200/sec. This is really slow compared to any database.

I tried two different methods but both are slow:

  1. Using Cloudera

    First, I installed Cloudera on my system and added the latest CDH 6.2 cluster. I wrote a Java client that inserts data using the ImpalaJDBC41 driver. I am able to insert records, but the speed is terrible. I tried tuning Impala by increasing the Impala Daemon Limit and my system RAM, but it didn't help. Finally, I suspected something was wrong with my installation, so I switched to another method.

  2. Using Cloudera VM

    Cloudera also ships a ready-made VM for test purposes. I tried it to see if it gives better performance, but there is no big improvement. I still can't insert data at 5k/sec.

I don't know where I need to improve. I have pasted my code below in case any improvement can be made.

What is the ideal Impala configuration to achieve a speed of 5k–10k inserts/sec? This is still far less than what Impala is capable of.

private static Connection connectViaDS() throws Exception {
    Class.forName("com.cloudera.impala.jdbc41.Driver");
    return DriverManager.getConnection(CONNECTION_URL);
}

private static void writeInABatchWithCompiledQuery(int records) {
    int protocol_no = 233,s_port=20,d_port=34,packet=46,volume=58,duration=39,pps=76,
            bps=65,bpp=89,i_vol=465,e_vol=345,i_pkt=5,e_pkt=54,s_i_ix=654,d_i_ix=444,_time=1000,flow=989;

    String s_city = "Mumbai",s_country = "India", s_latt = "12.165.34c", s_long = "39.56.32d",
            s_host="motadata",d_latt="29.25.43c",d_long="49.15.26c",d_city="Damouli",d_country="Nepal";

    long e_date= 1275822966, e_time= 1370517366;

    PreparedStatement preparedStatement;

    int total = 1000*1000;
    int counter =0;

    Connection connection = null;
    try {
        connection = connectViaDS();

        preparedStatement = connection.prepareStatement(sqlCompiledQuery);

        Timestamp ed = new Timestamp(e_date);
        Timestamp et = new Timestamp(e_time);

        while(counter <total) {
            for (int index = 1; index <= 5000; index++) {
                counter++;

                preparedStatement.setString(1, "s_ip" + String.valueOf(index));
                preparedStatement.setString(2, "d_ip" + String.valueOf(index));
                preparedStatement.setInt(3, protocol_no + index);
                preparedStatement.setInt(4, s_port + index);
                preparedStatement.setInt(5, d_port + index);
                preparedStatement.setInt(6, packet + index);
                preparedStatement.setInt(7, volume + index);
                preparedStatement.setInt(8, duration + index);
                preparedStatement.setInt(9, pps + index);
                preparedStatement.setInt(10, bps + index);
                preparedStatement.setInt(11, bpp + index);
                preparedStatement.setString(12, s_latt + String.valueOf(index));
                preparedStatement.setString(13, s_long + String.valueOf(index));
                preparedStatement.setString(14, s_city + String.valueOf(index));
                preparedStatement.setString(15, s_country + String.valueOf(index));
                preparedStatement.setString(16, d_latt + String.valueOf(index));
                preparedStatement.setString(17, d_long + String.valueOf(index));
                preparedStatement.setString(18, d_city + String.valueOf(index));
                preparedStatement.setString(19, d_country + String.valueOf(index));
                preparedStatement.setInt(20, i_vol + index);
                preparedStatement.setInt(21, e_vol + index);
                preparedStatement.setInt(22, i_pkt + index);
                preparedStatement.setInt(23, e_pkt + index);
                preparedStatement.setInt(24, s_i_ix + index);
                preparedStatement.setInt(25, d_i_ix + index);
                preparedStatement.setString(26, s_host + String.valueOf(index));
                preparedStatement.setTimestamp(27, ed);
                preparedStatement.setTimestamp(28, et);
                preparedStatement.setInt(29, _time);
                preparedStatement.setInt(30, flow + index);
                preparedStatement.addBatch();
            }
            preparedStatement.executeBatch();
            preparedStatement.clearBatch();
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            if (connection != null) {      // guard against NPE when connectViaDS() throws
                connection.close();
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}

Data is updating at a snail's pace. I tried increasing the batch size, but that decreases the speed. I don't know whether my code is wrong or whether I need to tune Impala for better performance. Please guide me.
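For reference, one variant I am considering is collapsing each batch into a single multi-row `INSERT ... VALUES` statement, since from what I have read, Impala may treat each JDBC-batched row as its own INSERT (and each INSERT produces a tiny HDFS file). This is only a sketch; the table and column names below are made up, not my real schema:

```java
import java.util.List;
import java.util.StringJoiner;

public class MultiRowInsertBuilder {

    /** Builds "INSERT INTO <table> (<cols>) VALUES (r1), (r2), ...". */
    public static String buildMultiRowInsert(String table, List<String> columns,
                                             List<List<String>> rows) {
        StringJoiner cols = new StringJoiner(", ", "(", ")");
        columns.forEach(cols::add);

        StringJoiner values = new StringJoiner(", ");
        for (List<String> row : rows) {
            StringJoiner tuple = new StringJoiner(", ", "(", ")");
            row.forEach(tuple::add);   // values are assumed pre-quoted/escaped
            values.add(tuple.toString());
        }
        return "INSERT INTO " + table + " " + cols + " VALUES " + values;
    }

    public static void main(String[] args) {
        String sql = buildMultiRowInsert(
                "netflow",                       // hypothetical table name
                List.of("s_ip", "d_port"),
                List.of(List.of("'10.0.0.1'", "34"),
                        List.of("'10.0.0.2'", "35")));
        System.out.println(sql);
        // One statement now carries both rows; it could be run with a plain
        // java.sql.Statement instead of addBatch()/executeBatch().
    }
}
```

I haven't confirmed whether this actually helps on Impala; for real data the values would need proper escaping or a parameterized form.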

I am using the VM for testing; here are the other details:

System:

  • OS - Ubuntu 16
  • RAM - 12 GB
  • Cloudera - CDH 6.2
  • Impala daemon limit - 2 GB
  • Impala daemon Java heap size - 500 MB
  • HDFS NameNode Java heap size - 500 MB

Please let me know if more details are required.

1 Answer


You can't benchmark on a VM with 12 GB. Look at Impala's hardware requirements and you'll see that you need a minimum of 128 GB of memory.

  • Memory

128 GB or more recommended, ideally 256 GB or more. If the intermediate results during query processing on a particular node exceed the amount of memory available to Impala on that node, the query writes temporary work data to disk, which can lead to long query times. Note that because the work is parallelized, and intermediate results for aggregate queries are typically smaller than the original data, Impala can query and join tables that are much larger than the memory available on an individual node.

Also, the VM is intended for familiarizing yourself with the toolset; it is not powerful enough even to serve as a development environment.
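If you do keep testing on the VM, it is worth measuring throughput explicitly rather than eyeballing it, so different configurations can be compared on equal terms. A minimal sketch (the class and method names here are illustrative, not from your code):

```java
public class ThroughputTimer {

    /** Returns rows per second given a row count and elapsed time in nanoseconds. */
    public static double rowsPerSecond(long rows, long elapsedNanos) {
        return rows / (elapsedNanos / 1_000_000_000.0);
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime();
        // ... run preparedStatement.executeBatch() here ...
        Thread.sleep(100);                        // stand-in for the real work
        long elapsed = System.nanoTime() - start;
        long rows = 5_000;                        // rows in the batch just flushed
        System.out.printf("%.0f rows/sec%n", rowsPerSecond(rows, elapsed));
    }
}
```

Wrapping each `executeBatch()` call like this would tell you exactly how far below the 5k/sec target each run lands.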

