0

I've a running apache-cassandra-2.2.1 with enable_user_defined_functions set to true in cassandra.yml. I've defined a custom aggregation based on this article as follows:

CREATE FUNCTION sumFunc(current double, candidate double) CALLED ON NULL INPUT RETURNS double LANGUAGE java AS 'if(current == null) return candidate; return current + candidate;'
CREATE AGGREGATE sum(double) SFUNC sumFunc STYPE double INITCOND null;

When I call this from CQLSH console I see a timeout:

cqlsh:test> SELECT word, sum(frequency) FROM words;
OperationTimedOut: errors={}, last_host=127.0.0.1

I can successful run any other query, I can also run the query (but I don't get full result set) from scala:

CassandraConnector(conf).withSessionDo { session =>
  val result: ResultSet = session.execute("SELECT word, SUM(frequency) FROM test.words;")
  while(result.isExhausted == false) {
    println(result.one)
  }
}
bachr
  • 5,780
  • 12
  • 57
  • 92

1 Answers1

1

First off your query may not do what you expect since it won't group by each word in the table. You would get a sum of all the frequencies in the table. To get the sum of the frequencies for a word you'd need to do this:

SELECT word, sum(frequency) FROM words WHERE word='someword';

Second, I have seen time out errors when trying to aggregate large partitions over about 300,000 rows (see this). So it's possible your table of words is too large to be aggregated before the timeout error kicks in. I wish Cassandra wouldn't time out on queries that are making progress, but it seems to have some hardcoded timeouts that will abort tasks regardless of if they are progressing or are actually stuck.

Since your query doesn't have a WHERE clause, you are trying to aggregate a whole table rather than just a single partition. That would be a lot more likely to cause timeout errors since the aggregation will take place on data spread on multiple nodes instead of a single node, so you should try to restrict aggregation to a single partition.

I would think for INITCOND you'd want to use 0 rather than null.

The name of your aggregate might conflict with the built in system sum function, so you might want to pick a different name. But offhand it looks like you could use the built in sum function instead of defining one (Cassandra 2.2 has undocumented built in functions for sum(), avg(), min(), and max()).

Community
  • 1
  • 1
Jim Meyer
  • 9,275
  • 1
  • 24
  • 49
  • Indeed, I've a large table (more than 0.3M rows) but why the same query return results when run with the scala driver? (I though the aggregates where only available for 3.0, will check this thnx) – bachr Sep 14 '15 at 15:23
  • There are probably some performance and functional differences between cqlsh and using a client application. The cqlsh interface may be more restrictive in some ways. – Jim Meyer Sep 14 '15 at 15:37