
I'm looking for a solution that will allow me to colocate code with data. My database is Cassandra, and I would like to be able to get the data that is stored on a specific node.

The important point is that I want to achieve this from my own code, without using frameworks such as Hadoop or Spark.

I wonder if someone could explain this or provide a link, as I have not yet found a solution. The question is: how can this be achieved with Cassandra?

Thanks in advance

Dr.Khu
  • Why do you want to do this? – Don Branson Sep 18 '14 at 15:02
  • I would like to process big data in real time. I hope to keep the data in memory, since there is such a feature, but I still need distributed computation and, because of that, data locality to achieve the real-time goal. – Dr.Khu Sep 18 '14 at 15:08
  • Is it write-heavy or read-heavy? – Don Branson Sep 18 '14 at 15:18
  • It's read-heavy (but Cassandra is something I have to live with) – Dr.Khu Sep 18 '14 at 15:24
  • 1
    I'd say make it an in memory table and set set the replication so it lives on every node. It may also be helpful not to denigrate a tool which is a favorite of the people that are trying to help you. – Don Branson Sep 18 '14 at 15:27
  • Thanks, Don, I don't have anything against Spark and Hadoop. They are great tools, but in this case they may not be suitable. Regarding replication to all the nodes, that is not an option here, because my data is terabytes in size and can't be kept in the RAM of a single machine. – Dr.Khu Sep 18 '14 at 15:31
  • That's why I'm looking for a data-locality solution, if it is at all possible with Cassandra. – Dr.Khu Sep 18 '14 at 15:31

1 Answer


Warning: this is most likely not what you should be doing.

The easiest way to do this would be to use the ByteOrderedPartitioner. It places data on nodes based on the actual byte ordering of the primary keys rather than on a hash. This technique is for experts only: it removes many of the benefits of Cassandra and should only be used by those who truly understand the trade-offs. The ByteOrderedPartitioner also places significantly more burden on the application designers and the ops team, because the cluster will no longer be expandable in an easy-to-understand way.
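For reference, the partitioner is a cluster-wide setting in cassandra.yaml; it must be chosen when the cluster is first created and cannot be changed on a cluster that already holds data. A minimal fragment (the surrounding settings are omitted):

```yaml
# cassandra.yaml -- cluster-wide partitioner setting.
# ByteOrderedPartitioner keys nodes by the raw byte order of partition keys,
# which makes data placement predictable but prone to hot spots.
partitioner: org.apache.cassandra.dht.ByteOrderedPartitioner
```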

Using Spark or Hadoop is the correct way to deal with this:

The real solution is to use Hadoop or Spark. Alternatively, you could get the data locality you are looking for by reading the SSTables directly from disk. An example of this: http://www.fullcontact.com/blog/cassandra-sstables-offline/
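To see why locality is even possible to reason about: with the default hashing partitioner, Cassandra maps each partition key to a token on a ring, and the node owning that token range holds the data. Below is a minimal, self-contained sketch of that idea in pure Python. The node names are hypothetical, and MD5 stands in for Cassandra's Murmur3 hash; this is an illustration of the ring lookup, not Cassandra's actual implementation.

```python
import bisect
import hashlib


def token(partition_key: str) -> int:
    """Map a partition key to a signed 64-bit token.

    Simplified stand-in for a partitioner hash; real Cassandra
    uses Murmur3 for this.
    """
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


class TokenRing:
    """Toy token ring: each node owns the range up to its token."""

    def __init__(self, node_tokens):
        # node_tokens: {node_name: token}
        self.ring = sorted((t, n) for n, t in node_tokens.items())
        self.tokens = [t for t, _ in self.ring]

    def owner(self, partition_key: str) -> str:
        t = token(partition_key)
        # First node whose token is >= t; wrap around the ring.
        i = bisect.bisect_left(self.tokens, t) % len(self.ring)
        return self.ring[i][1]


# Hypothetical 3-node cluster with evenly spaced tokens.
ring = TokenRing({
    "node-a": -6_000_000_000_000_000_000,
    "node-b": 0,
    "node-c": 6_000_000_000_000_000_000,
})

print(ring.owner("user:42"))  # one of node-a / node-b / node-c
```

A client that knows the ring (the DataStax drivers expose this as token metadata) can use exactly this kind of lookup to route a request to a node that holds the data, which is how the spark-cassandra-connector attempts locality.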

RussS
  • Thanks, RussS, for the answer. I would keep the first variant as an option. As for the second one, what I'm trying to achieve is processing big data in real time. I'm looking for a solution based on Cassandra whose performance is comparable to an IMDG. I'm trying to use the in-memory option of Cassandra (provided by DataStax), so I don't need an IMDG, but I still need distributed computation and, because of that, colocation of data and code. So I believe I won't use Spark's cache functionality, and for that reason Spark looks like the wrong tool here, where I could have an Akka cluster instead. – Dr.Khu Sep 18 '14 at 14:02
  • The only question is how to get data locality when Cassandra stores the data in memory. Obviously, there has to be functionality in Cassandra that provides that feature. I think it is reasonable to look at how locality is achieved in the spark-cassandra-connector, but perhaps the community could help with that... – Dr.Khu Sep 18 '14 at 14:06
  • The Spark OSS connector makes a best attempt at locality but does not actually guarantee it, since under the hood it ends up using the same DataStax Java driver as any other client would. The DataStax in-memory option is also limited to the JVM heap at the moment, so it won't allow a significant amount of data until the 2.1 release. – RussS Sep 18 '14 at 17:51