0

I'm starting to work with big data. My input is a daily csv files with many columns and I've to import these three columns:

  • user_id
  • parent_id
  • grandparent_id
  • .....

what is my goals?

  • I have to count all the unique user_id values belonging to two grandparent_id (12345 and 67890) and group them by parent_ids
  • I have to count all the no exclusive unique user_id values belonging to two grandparent_id (12345 and 67890) and group them by parent_ids

I've designed this data model with three tables (thank to Rock that suggested me to use counter in my previous post)

CREATE TABLE container_u_id (
 parent_id int,
 user_id text,
 grandparent_a boolean,
 grandparent_b boolean,
 PRIMARY KEY (parent_id, user_id)
) 

CREATE TABLE unique_u_id (
  parent_id int,
  counter_uu counter,
  PRIMARY KEY (parent_id)
) 

CREATE TABLE no_exclusive_unique_u_id (
  parent_id int,
  counter_uu counter,
  PRIMARY KEY (parent_id)
)

For importing the data I'm using php and the pdo driver for Cassandra.
Here my script

$relpath = realpath(dirname(__FILE__)) . '/';
include_once($relpath."config/init.php");


$dsn = "cassandra:host=127.0.0.1;port=9160"; 
$db = new PDO($dsn);
$db->exec("USE phpcassa");


$array_grandparent_id = array(12345, 67890);


$file_csv = new CSVFile("file.csv");



foreach ($file_csv as $row) {


    $user_id = $row['USER_ID'];
    $parent_id = $row['PARENT_ID'];



    if(in_array($row['GRANDPARENT_ID'], $array_grandparent_id)){

        $is_grandparent_a  = ($row['GRANDPARENT_ID']==12345) ? 'TRUE' : 'FALSE';
        $is_grandparent_b = ($row['GRANDPARENT_ID']==67890) ? 'TRUE' : 'FALSE';


        $query= "INSERT INTO container_u_id (user_id, parent_id, grandparent_a, grandparent_b) VALUES('".$user_id."', ".$parent_id.", ".$is_grandparent_a.", ".$is_grandparent_b." ) IF NOT EXISTS;";
        $stmt = $db->prepare($query);
        $status_exec = $stmt->execute();
        $result_insert = $stmt->fetchAll();

        $is_uu = ($result_insert[0]['[applied]']) ? True : False;

        if($is_uu===False){

        $status_grandparent_a = $result_insert[0]['grandparent_a'];
        $status_grandparent_b = $result_insert[0]['grandparent_b'];


            //if user belong to both grandparents then it is already present onto both counter tables 
            if($status_grandparent_a===False || $status_grandparent_b===False){

                if($status_grandparent_a===False && $is_grandparent_a=='TRUE'){

                    $query = "UPDATE container_u_id SET grandparent_a = true WHERE user_id='".$user_id."' AND parent_id = ".$parent_id.";";
                    $stmt = $db->prepare($query);
                    $status_exec = $stmt->execute();

                    $query= "UPDATE no_exclusive_so SET counter_uu = counter_uu + 1 WHERE parent_id = ".$parent_id.";";
                    $stmt = $db->prepare($query);
                    $status_exec = $stmt->execute();


                }elseif($status_grandparent_b===False && $is_grandparent_b=='TRUE'){

                    $query = "UPDATE container_u_id SET grandparent_b = true WHERE user_id='".$user_id."' AND parent_id = ".$parent_id.";";
                    $stmt = $db->prepare($query);
                    $status_exec = $stmt->execute();

                    $query= "UPDATE no_exclusive_so SET counter_uu = counter_uu + 1 WHERE parent_id = ".$parent_id.";";
                    $stmt = $db->prepare($query);
                    $status_exec = $stmt->execute();

                }


            }

        }

        if($is_uu){

            $query= "UPDATE unique_u_id SET counter_uu = counter_uu + 1 WHERE parent_id = ".$parent_id.";";
            $stmt = $db->prepare($query);
            $status_exec = $stmt->execute();
        }            
    }  
}

Then starting from this csv sample

Yzc4Jknl0-3c, 34, 12345
Yzc4Jknl0-3c, 34, 67890 
Yzc4Jknl0-3c, 34, 12345
01w44Xz0w3c1, 13, 12345
01w44Xz0w3c1, 21, 12345
01w44Xz0w3c1, 21, 12345  
qFxg0023Exy4, 21, 67890

At the end of the code the table are showing these results

SELECT * FROM container_u_id;   
 parent_id | user_id      | grandparent_a | grandparent_b
-----------+--------------+---------------+--------------
     34    | Yzc4Jknl0-3c |     True      |    True
     13    | 01w44Xz0w3c1 |     True      |    False
     21    | 01w44Xz0w3c1 |     True      |    False
     21    | qFxg0023Exy4 |     False     |    True


SELECT * FROM unique_u_id;
 parent_id | counter_uu
-----------+------------
 34        |       1
 13        |       1   
 21        |       2 


SELECT * FROM no_exclusive_unique_u_id;
 parent_id | counter_uu
-----------+------------
 34        |       1

Now, let's start to talk my doubts. The script works but it's very slow. For fetching 50 million rows the script spends between 8 and 10 hours. The dev environment is:

  • ubuntu 12.04 64 bit
  • cqlsh 4.1.1
  • Cassandra 2.0.6
  • CQL spec 3.1.1
  • Thrift protocol 19.39.0
  • cpu: i3 M 370
  • storage: rotating disk 5400rpm
  • Ring: cluster with single node

Thus I'd like to improve the performance...my question is:

  1. is there anything wrong on my data modeling?
  2. Do I need to add more nodes and change the hardware(aws ec2)?
Cœur
  • 37,241
  • 25
  • 195
  • 267
PistolPete
  • 147
  • 2
  • 10
  • what is `$result_insert` why do fetch an INSERT query – meda Jun 30 '14 at 15:29
  • $result_insert is the response of the query "INSERT INTO container_u_id....IF NOT EXISTS". It will returns the statment of the current insert. The latest field is '[applied]' and value True or False. True the query was inserted, false no. – PistolPete Jun 30 '14 at 15:49
  • Try with just fetch() it will be faster – meda Jun 30 '14 at 15:53
  • @meda I've tested with fetch() but I don't noticed an improvement. I'm working with 50 millions rows per day – PistolPete Jul 01 '14 at 08:15

0 Answers0