I'm starting to work with big data. My input is a daily csv files with many columns and I've to import these three columns:
- user_id
- parent_id
- grandparent_id
- .....
what is my goals?
- I have to count all the unique user_id values belonging to two grandparent_id (12345 and 67890) and group them by parent_ids
- I have to count all the no exclusive unique user_id values belonging to two grandparent_id (12345 and 67890) and group them by parent_ids
I've designed this data model with three tables (thank to Rock that suggested me to use counter in my previous post)
CREATE TABLE container_u_id (
parent_id int,
user_id text,
grandparent_a boolean,
grandparent_b boolean,
PRIMARY KEY (parent_id, user_id)
)
CREATE TABLE unique_u_id (
parent_id int,
counter_uu counter,
PRIMARY KEY (parent_id)
)
CREATE TABLE no_exclusive_unique_u_id (
parent_id int,
counter_uu counter,
PRIMARY KEY (parent_id)
)
For importing the data I'm using php and the pdo driver for Cassandra.
Here my script
$relpath = realpath(dirname(__FILE__)) . '/';
include_once($relpath."config/init.php");
$dsn = "cassandra:host=127.0.0.1;port=9160";
$db = new PDO($dsn);
$db->exec("USE phpcassa");
$array_grandparent_id = array(12345, 67890);
$file_csv = new CSVFile("file.csv");
foreach ($file_csv as $row) {
$user_id = $row['USER_ID'];
$parent_id = $row['PARENT_ID'];
if(in_array($row['GRANDPARENT_ID'], $array_grandparent_id)){
$is_grandparent_a = ($row['GRANDPARENT_ID']==12345) ? 'TRUE' : 'FALSE';
$is_grandparent_b = ($row['GRANDPARENT_ID']==67890) ? 'TRUE' : 'FALSE';
$query= "INSERT INTO container_u_id (user_id, parent_id, grandparent_a, grandparent_b) VALUES('".$user_id."', ".$parent_id.", ".$is_grandparent_a.", ".$is_grandparent_b." ) IF NOT EXISTS;";
$stmt = $db->prepare($query);
$status_exec = $stmt->execute();
$result_insert = $stmt->fetchAll();
$is_uu = ($result_insert[0]['[applied]']) ? True : False;
if($is_uu===False){
$status_grandparent_a = $result_insert[0]['grandparent_a'];
$status_grandparent_b = $result_insert[0]['grandparent_b'];
//if user belong to both grandparents then it is already present onto both counter tables
if($status_grandparent_a===False || $status_grandparent_b===False){
if($status_grandparent_a===False && $is_grandparent_a=='TRUE'){
$query = "UPDATE container_u_id SET grandparent_a = true WHERE user_id='".$user_id."' AND parent_id = ".$parent_id.";";
$stmt = $db->prepare($query);
$status_exec = $stmt->execute();
$query= "UPDATE no_exclusive_so SET counter_uu = counter_uu + 1 WHERE parent_id = ".$parent_id.";";
$stmt = $db->prepare($query);
$status_exec = $stmt->execute();
}elseif($status_grandparent_b===False && $is_grandparent_b=='TRUE'){
$query = "UPDATE container_u_id SET grandparent_b = true WHERE user_id='".$user_id."' AND parent_id = ".$parent_id.";";
$stmt = $db->prepare($query);
$status_exec = $stmt->execute();
$query= "UPDATE no_exclusive_so SET counter_uu = counter_uu + 1 WHERE parent_id = ".$parent_id.";";
$stmt = $db->prepare($query);
$status_exec = $stmt->execute();
}
}
}
if($is_uu){
$query= "UPDATE unique_u_id SET counter_uu = counter_uu + 1 WHERE parent_id = ".$parent_id.";";
$stmt = $db->prepare($query);
$status_exec = $stmt->execute();
}
}
}
Then starting from this csv sample
Yzc4Jknl0-3c, 34, 12345
Yzc4Jknl0-3c, 34, 67890
Yzc4Jknl0-3c, 34, 12345
01w44Xz0w3c1, 13, 12345
01w44Xz0w3c1, 21, 12345
01w44Xz0w3c1, 21, 12345
qFxg0023Exy4, 21, 67890
At the end of the code the table are showing these results
SELECT * FROM container_u_id;
parent_id | user_id | grandparent_a | grandparent_b
-----------+--------------+---------------+--------------
34 | Yzc4Jknl0-3c | True | True
13 | 01w44Xz0w3c1 | True | False
21 | 01w44Xz0w3c1 | True | False
21 | qFxg0023Exy4 | False | True
SELECT * FROM unique_u_id;
parent_id | counter_uu
-----------+------------
34 | 1
13 | 1
21 | 2
SELECT * FROM no_exclusive_unique_u_id;
parent_id | counter_uu
-----------+------------
34 | 1
Now, let's start to talk my doubts. The script works but it's very slow. For fetching 50 million rows the script spends between 8 and 10 hours. The dev environment is:
- ubuntu 12.04 64 bit
- cqlsh 4.1.1
- Cassandra 2.0.6
- CQL spec 3.1.1
- Thrift protocol 19.39.0
- cpu: i3 M 370
- storage: rotating disk 5400rpm
- Ring: cluster with single node
Thus I'd like to improve the performance...my question is:
- is there anything wrong on my data modeling?
- Do I need to add more nodes and change the hardware(aws ec2)?