
My goal is to iterate over all rows of a specific ColumnFamily on a node.
Here is the PHP code (using my own wrapper over phpcassa):

$ring = $cass_db->describe_ring();

foreach ($ring as $ring_details)
{
    $start_token = $ring_details->start_token;
    $end_token   = $ring_details->end_token;

    if ($start_token != null && $end_token != null)
    {
        $i = 0;
        $batch_size = 10;

        $params = array(
            'token_start' => $start_token,
            'token_finish' => $end_token,
            'row_count'     => $batch_size,
            'buffer_size'   => 1000
        );

        while ($batch = $cass_db->get_range_by_token('myColumnFamily', $params))
        {
            var_dump('Batch# '.$i);

            foreach ($batch as $row)
            {
                $row_key     = $row[0];
                $row_values  = $row[1];
                var_dump($row_key);                 
            }

            $i++;

            //Just to stop infinite loop
            if ($i > 14)
            {
                die(); 
            }

        }
    }
}
  • get_range_by_token() starts from default parameters, which are overridden by the values in $params.

Every batch returns the same 10 row keys.
How can I iterate over all existing rows in a large Cassandra DB?

lvil

1 Answer


I am not a PHP developer, so I may be misunderstanding something in your code. Also, you did not specify which Cassandra version you are using.

Iterating over all rows is generally done by starting and ending with an empty token and redefining the start token on each iteration. In your code I can't see where you redefine token_start inside the loop. If you don't redefine it, you query Cassandra for the same token range every time, so you always get the same result set.

Your code should do something like this ...

start_token = '';
end_token = '';
page_size = 100;
while (rows = get_range_by_token('cf', start_token, end_token, page_size)) {
   // rows holds page_size rows (fewer on the last page, or if the table has fewer than page_size rows)
   start_token = rows[rows.size() - 1].getKey();   // continue from the last row returned
   if (rows.size() < page_size) break;             // short page: no rows left
}
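
To make the paging idea concrete, here is a minimal PHP sketch that runs without a cluster. Everything in it is an assumption for illustration: fake_get_range() is a hypothetical in-memory stand-in for a range query (it is not the phpcassa API and not your wrapper), it orders rows by md5 of the key to mimic a hash-based partitioner, and it treats the start key as inclusive, which is why the first row of every page after the first is skipped.

```php
<?php
// Hypothetical in-memory stand-in for a column family: row key => columns.
// Rows come back in md5(key) order to mimic a hash-ordered partitioner.
function fake_get_range(array $store, string $startKey, int $count): array
{
    $keys = array_map('strval', array_keys($store));
    usort($keys, function ($a, $b) {
        return strcmp(md5($a), md5($b));
    });

    // Empty start key means "from the beginning"; otherwise start AT that key.
    $offset = 0;
    if ($startKey !== '') {
        $pos = array_search($startKey, $keys, true);
        $offset = ($pos === false) ? count($keys) : $pos;
    }

    $page = [];
    foreach (array_slice($keys, $offset, $count) as $key) {
        $page[] = [$key, $store[$key]]; // same [key, columns] shape as in the question
    }
    return $page;
}

// Pages through the whole store and returns every row key exactly once.
function iterate_all_rows(array $store, int $pageSize): array
{
    if ($pageSize < 2) {
        // With an inclusive start key, a page of 1 would never advance.
        throw new InvalidArgumentException('pageSize must be at least 2');
    }

    $seen  = [];
    $start = '';
    while (true) {
        $page = fake_get_range($store, $start, $pageSize);
        foreach ($page as $row) {
            // Skip the row that was already handled at the end of the previous page.
            if ($start !== '' && $row[0] === $start) {
                continue;
            }
            $seen[] = $row[0];
        }
        if (count($page) < $pageSize) {
            break; // short page: nothing left to fetch
        }
        $start = $page[count($page) - 1][0]; // move the window forward
    }
    return $seen;
}
```

The part that matters is the last line of the loop: without moving $start forward, every call asks for the same range and returns the same rows, which is the symptom in the question.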

HTH, Carlo

Carlo Bertuccini
  • I am not sure I can get tokens from something like getKey(). I can get the row key, but I don't know how to generate a token from it. Does this work with the random partitioner? – lvil Jul 28 '14 at 08:49
  • Yes, it works with the random partitioner. Try setting the start key to the last row key you retrieved – Carlo Bertuccini Jul 28 '14 at 09:22
  • I can't, because key !== token. I don't have a getKey() function, but I can get the row key. I couldn't find a way to convert a key to a token. – lvil Jul 28 '14 at 09:46
  • I think get_range is the function you should use: http://thobbs.github.io/phpcassa/api/class-phpcassa.AbstractColumnFamily.html#_get_range Follow my example, but use keys instead of tokens – Carlo Bertuccini Jul 28 '14 at 10:03
  • get_range() receives keys, not tokens. As I use the random partitioner, there is a chance that some rows will be missed – lvil Jul 28 '14 at 10:06
  • Why would you miss rows? I also use the random partitioner; it only means that keys are returned in hash order rather than in key order, as they would be with the order-preserving partitioner. In any case, phpcassa has a package for this: https://github.com/thobbs/phpcassa/tree/master/lib/phpcassa/Iterator Take a look – Carlo Bertuccini Jul 28 '14 at 10:10