GREAT question!
1) What are the differences between these commands?
Running nodetool snapshot creates hard links to the SSTable files of the requested keyspace. For each file, it's the equivalent of running this from the (Linux) command line:
ln {source} {link}
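To see why a hard link makes such a cheap backup, here is a small sketch that uses a scratch directory instead of a real Cassandra data directory. The file names are made up for illustration; the point is only that both names share one inode, so nothing is copied and the linked name survives deletion of the original.

```shell
#!/bin/sh
# Stand-in for an SSTable file in a scratch directory (hypothetical names).
workdir=$(mktemp -d)
echo "pretend sstable contents" > "$workdir/data.db"

# Hard-link it into a "snapshot" directory: no bytes are copied.
mkdir "$workdir/snapshot"
ln "$workdir/data.db" "$workdir/snapshot/data.db"

# Helper: print the inode number of a file.
inode_of() { ls -i "$1" | awk '{print $1}'; }

# Both names point at the same inode.
inode_orig=$(inode_of "$workdir/data.db")
inode_snap=$(inode_of "$workdir/snapshot/data.db")
echo "original inode: $inode_orig, snapshot inode: $inode_snap"

# Removing the original name (as compaction eventually does with old SSTables)
# leaves the data reachable through the snapshot name.
rm "$workdir/data.db"
cat "$workdir/snapshot/data.db"
```

This is also why snapshots take up almost no extra disk space at first, and only start to cost space once the original SSTables are compacted away.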
A cqlsh COPY is essentially the same as doing a SELECT * FROM on a table. It'll create a text file with the table's data in whichever delimited format you have specified.
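As a rough illustration, the export could be driven from the shell like this. The keyspace, table, and output path are placeholders, and the cqlsh call is left commented out because it needs a reachable node.

```shell
#!/bin/sh
# Build a COPY ... TO statement for a table and an output file.
# Both arguments are hypothetical examples; HEADER and DELIMITER are COPY options.
copy_to_stmt() {
    printf "COPY %s TO '%s' WITH HEADER = TRUE AND DELIMITER = ','" "$1" "$2"
}

stmt=$(copy_to_stmt my_keyspace.my_table /tmp/my_table.csv)
echo "$stmt"

# To actually run it (requires cqlsh on PATH and a reachable node):
# cqlsh -e "$stmt"
```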
In terms of their difference from a backup context, a file created using cqlsh COPY will contain data from all nodes, whereas nodetool snapshot needs to be run on each node in the cluster. In clusters where the number of nodes is greater than the replication factor, each snapshot will only be valid for the node on which it was taken.
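Because each node has to be snapshotted separately, people usually wrap the command in a loop. The following is only a sketch: the host list, keyspace, and tag are invented, and it assumes you can SSH to each node and that nodetool is on the remote PATH. It defaults to a dry run that just prints the commands.

```shell
#!/bin/sh
# Hypothetical values; replace with your own nodes, keyspace, and tag.
NODES="10.0.0.1 10.0.0.2 10.0.0.3"
KEYSPACE="my_keyspace"
TAG="backup_20240101"

# With DRY_RUN=1 (the default here) the commands are printed, not executed.
snapshot_all_nodes() {
    for host in $NODES; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "ssh $host nodetool snapshot -t $TAG $KEYSPACE"
        else
            ssh "$host" nodetool snapshot -t "$TAG" "$KEYSPACE"
        fi
    done
}

snapshot_all_nodes
```

Using the same tag on every node makes it easier to find and restore the matching set of snapshots later.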
2) Which one is most appropriate?
It depends on what you're trying to do. If you simply need backups for a node/cluster, then nodetool snapshot is the way to go. If you're trying to export/import data into a new table or cluster, then COPY is the better approach.
Also worth noting: cqlsh COPY takes a while to run (depending on the amount of data in the table) and can be subject to timeouts if not properly configured. nodetool snapshot is nearly instantaneous, although compressing the snapshot files and SCPing them to an off-cluster instance will take some time.
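The compress-and-ship step might look something like the sketch below. It assumes the usual on-disk layout, where snapshot files sit under <data_dir>/<keyspace>/<table>-<id>/snapshots/<tag>/; the function name, paths, and backup host are hypothetical.

```shell
#!/bin/sh
# Bundle every table's snapshot directory for one keyspace and tag into a tarball.
# $1 = data directory, $2 = keyspace, $3 = snapshot tag, $4 = output file (absolute path)
archive_snapshot() {
    data_dir=$1
    keyspace=$2
    tag=$3
    out=$4
    # cd first so the archive holds relative paths like ks/table-id/snapshots/tag/...
    (cd "$data_dir" && tar -czf "$out" "$keyspace"/*/snapshots/"$tag")
}

# Example usage on a node (paths and host are placeholders):
# archive_snapshot /var/lib/cassandra/data my_keyspace backup_20240101 /tmp/my_keyspace.tar.gz
# scp /tmp/my_keyspace.tar.gz backup-host:/backups/
```

This is where the time goes: the tar and scp steps scale with the amount of data, while the snapshot itself does not.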
3) Should we employ the same technique of flushing the data if we use the cqlsh COPY command?
No, that's not necessary. Because cqlsh COPY works just like a SELECT, it follows the normal Cassandra read path, which checks structures both in RAM (memtables) and on disk (SSTables).