Questions tagged [apache-kudu]

For questions related to Apache Kudu

From https://kudu.apache.org/docs/

About Kudu

Kudu is a columnar storage manager developed for the Hadoop platform. Kudu shares the common technical properties of Hadoop ecosystem applications: it runs on commodity hardware, is horizontally scalable, and supports highly available operation.

Kudu's design sets it apart. Some of Kudu's benefits include:

  • Fast processing of OLAP workloads.
  • Integration with MapReduce, Spark and other Hadoop ecosystem components.
  • Tight integration with Impala, making it a good, mutable alternative to using HDFS with Parquet (see the sketch after this list).
  • Strong but flexible consistency model, allowing you to choose consistency requirements on a per-request basis, including the option for strict serialized consistency.
  • Strong performance for running sequential and random workloads simultaneously.
  • Easy to administer and manage with Cloudera Manager.
  • High availability. Tablet Servers and the Master use the Raft consensus algorithm: with a replication factor of 2f+1, a tablet remains available even if f replicas fail. Reads can be serviced by read-only follower tablets, even in the event of a leader tablet failure.
  • Structured data model.
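
As a concrete illustration of the Impala integration mentioned above, here is a minimal sketch that creates a mutable, Kudu-backed table and upserts a row through the impyla client; the host, table, and column names are assumptions made for illustration, and the same statements can be run verbatim from impala-shell.

    from impala.dbapi import connect

    # Assumed Impala endpoint; adjust host/port for your cluster.
    conn = connect(host="impala-host", port=21050)
    cur = conn.cursor()

    # Impala can create a Kudu-backed table directly.
    cur.execute("""
        CREATE TABLE events (
            event_id BIGINT,
            event_time TIMESTAMP,
            payload STRING,
            PRIMARY KEY (event_id)
        )
        PARTITION BY HASH (event_id) PARTITIONS 4
        STORED AS KUDU
    """)

    # Rows can then be modified in place, unlike with HDFS + Parquet.
    cur.execute("UPSERT INTO events VALUES (1, now(), 'clicked')")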

By combining all of these properties, Kudu targets support for families of applications that are difficult or impossible to implement on current generation Hadoop storage technologies. A few examples of applications for which Kudu is a great solution are:

  • Reporting applications where newly-arrived data needs to be immediately available for end users
  • Time-series applications that must simultaneously support:
    • queries across large amounts of historic data
    • granular queries about an individual entity that must return very quickly
  • Applications that use predictive models to make real-time decisions with periodic refreshes of the predictive model based on all historic data
134 questions
1 vote, 4 answers

Best practice for high-volume transactions with real-time balance updates

I currently have a MySQL database which deals with a very large number of transactions. To keep it simple, it's a data stream of actions (clicks and other events) coming in real time. The structure is such that users belong to sub-affiliates and…
Rogexx • 13 • 4
1 vote, 1 answer

Too much disk space used by Apache Kudu for WALs

I have a Hive table of 2.7 MB (stored in Parquet format). When I use impala-shell to convert this Hive table to Kudu, I notice that the /tserver/ folder size increases by around 300 MB. Upon exploring further, I see it is the…
Zzrot • 304 • 2 • 4 • 20
1 vote, 1 answer

Filtering a specific row in kudu using kudu scanner

The target table in Kudu is huge. I have the following in Scala, and I would like to check if the row exists in Kudu. These four columns are the primary key in the Kudu table, but when I define an upper bound I seem to get all the rows. How do I select a…
user3897533 • 417 • 1 • 8 • 24
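
For the scanner question above, a rough pyspark equivalent of a point lookup is sketched below using the kudu-spark connector (assumed to be on the classpath); the master address, table name, and key columns k1 and k2 are hypothetical. Equality predicates on the primary key columns can be pushed down to the tablet servers, so such a filter should not scan the whole table.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("kudu-point-lookup-sketch").getOrCreate()

    df = (spark.read
          .format("org.apache.kudu.spark.kudu")
          .option("kudu.master", "kudu-master:7051")          # assumed address
          .option("kudu.table", "impala::default.big_table")  # assumed name
          .load())

    # Equality predicates on all primary key columns; kudu-spark can push
    # these down so only the matching row is read.
    match = df.filter((F.col("k1") == "a") & (F.col("k2") == 7))
    row_exists = match.limit(1).count() > 0
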
1 vote, 1 answer

Long range rolling window aggregations - time series kudu vs influxdb vs opentsdb

I'm looking to do some analysis on a large set of customer transaction data. We have millions of transaction events coming in with a quantity and timestamp value for various entities; { "txId": "tx123" "item": "i87" "qty": 3 "time":…
NightWolf • 7,694 • 9 • 74 • 121
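
Independent of the storage engine comparison above, a rolling aggregation over events shaped like the ones in the question can be expressed with a Spark range window; the rows and the one-hour window below are made up purely for illustration.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("rolling-window-sketch").getOrCreate()

    # Placeholder events mirroring the shape described in the question.
    events = spark.createDataFrame(
        [("tx123", "i87", 3, "2023-01-01 00:00:05"),
         ("tx124", "i87", 1, "2023-01-01 00:40:00")],
        ["txId", "item", "qty", "time"],
    ).withColumn("ts", F.col("time").cast("timestamp"))

    # One-hour rolling sum of qty per item, ordered by event time (seconds).
    w = (Window.partitionBy("item")
         .orderBy(F.col("ts").cast("long"))
         .rangeBetween(-3600, 0))

    events.withColumn("qty_1h", F.sum("qty").over(w)).show()
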
1 vote, 1 answer

ERROR: AnalysisException: A data distribution must be specified using a DISTRIBUTE BY clause

While following the Kudu quickstart at http://kudu.apache.org/docs/quickstart.html, I encountered the error "ERROR: AnalysisException: A data distribution must be specified using a DISTRIBUTE BY clause." while trying to create the Kudu table…
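
The clause name depends on the Impala build: the older Impala_Kudu packages used by some quickstart versions expect a DISTRIBUTE BY ... INTO N BUCKETS clause (as the error demands), while current Impala releases spell the same concept PARTITION BY. Below is a hedged sketch of the modern form via impyla, with placeholder host, table, and column names.

    from impala.dbapi import connect

    conn = connect(host="impala-host", port=21050)  # assumed endpoint
    cur = conn.cursor()

    # Current Impala syntax; older Impala_Kudu builds used a different DDL
    # form built around DISTRIBUTE BY HASH (...) INTO N BUCKETS instead.
    cur.execute("""
        CREATE TABLE sfmta_kudu (
            report_time BIGINT,
            vehicle_tag INT,
            PRIMARY KEY (report_time, vehicle_tag)
        )
        PARTITION BY HASH (vehicle_tag) PARTITIONS 8
        STORED AS KUDU
    """)
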
0 votes, 0 answers

How can I continuously read data from Apache Kudu in real-time using Apache Flink?

I need to read data with Apache Flink from an Apache Kudu database in real time. My use case is: I receive a message from Kafka, deserialize that message, and get an ID. If the ID exists in the database, I ignore it; if it doesn't, I need to add it in…
0 votes, 0 answers

KuduContext with pyspark

I am trying to upsert rows using pyspark with KuduContext. I can do it successfully with "append" mode, but I couldn't use KuduContext methods such as upsertRows...
yorrbo • 1 • 1
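
There is no first-class KuduContext wrapper in pyspark; one workaround that has been suggested is to reach the Scala KuduContext through the py4j gateway, assuming the kudu-spark jar is on the classpath. The untested sketch below uses an assumed master address and table name, and the KuduContext constructor arguments vary between kudu-spark releases, so check them against the version in use.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kudu-upsert-sketch").getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])  # placeholder rows

    # Instantiate the Scala KuduContext via the JVM gateway and hand it the
    # underlying Java DataFrame for the upsert.
    kudu_context = spark._jvm.org.apache.kudu.spark.kudu.KuduContext(
        "kudu-master:7051", spark.sparkContext._jsc.sc())
    kudu_context.upsertRows(df._jdf, "impala::default.my_table")
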
0 votes, 1 answer

Deleting kudu range partitions less than the given string

I want to delete all Kudu RANGE partitions from a Kudu table that have a partition value less than a given date string. I am using the following query but it's not working. Can someone please suggest a workaround? alter table test_table drop…
Akanksha_p • 916 • 12 • 20
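
A single ALTER statement cannot drop "everything below X": Kudu range partitions are dropped one at a time by their exact bounds. Below is a hedged sketch that loops over known partition bounds via impyla; the host, table name, bounds, and cutoff are all placeholders.

    from impala.dbapi import connect

    cutoff = "2021-06-01"
    # (lower, upper) bounds of the table's existing range partitions
    # (placeholders -- in practice read them from SHOW RANGE PARTITIONS).
    existing = [("2021-01-01", "2021-02-01"), ("2021-02-01", "2021-03-01")]

    conn = connect(host="impala-host", port=21050)
    cur = conn.cursor()
    for lower, upper in existing:
        if upper <= cutoff:
            cur.execute(
                "ALTER TABLE test_table DROP IF EXISTS RANGE PARTITION "
                f"'{lower}' <= VALUES < '{upper}'"
            )
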
0 votes, 0 answers

Two UUIDs for a tablet server after restart where the WAL directory was lost

We faced a problem on our production Kudu cluster. The hard disk holding the WAL directory failed on a tablet server. We installed a new disk and cleared the data directory according to the Kudu documentation…
0 votes, 0 answers

How to create Kudu table from pyspark dataframe

I am trying a simple approach to write a dataframe from pyspark into a non-existing Kudu table: df.write.format('org.apache.kudu.spark.kudu') \ .option('kudu.master', kudu_master) \ .option('kudu.table', kudu_table) \ …
Exorcismus • 2,243 • 1 • 35 • 68
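
One note on the question above: historically the kudu-spark data source writes into an existing table but does not create one, so a common pattern is to create the table first (for example through Impala, or through KuduContext.createTable on the JVM side) and then append. A sketch reusing the question's own df, kudu_master, and kudu_table names:

    # Assumes the target Kudu table already exists with a matching schema,
    # and that df, kudu_master, and kudu_table are defined as in the question.
    (df.write
       .format("org.apache.kudu.spark.kudu")
       .option("kudu.master", kudu_master)
       .option("kudu.table", kudu_table)
       .mode("append")
       .save())
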
0 votes, 2 answers

Impala Delta Lake Integration

I have set up Delta Lake in Cloudera. It works fine with Spark and Hive. I have searched the internet for ways to integrate Delta Lake with Impala but did not find much information. Can someone please answer if you have done the same? Update: Do not…
vijayinani • 2,548 • 2 • 26 • 48
0 votes, 1 answer

Check if table has RANGE partition

Is it possible to list the types of partitions (HASH, RANGE, etc.) applied to a given Kudu table through a query? I need to check whether that table contains a RANGE partition or not.
Akanksha_p • 916 • 12 • 20
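
One pragmatic check, sketched below with impyla and a placeholder host and table name, is to fetch the table's DDL and look for a RANGE clause; Impala also offers SHOW RANGE PARTITIONS for Kudu tables, which can serve the same purpose.

    from impala.dbapi import connect

    conn = connect(host="impala-host", port=21050)
    cur = conn.cursor()

    # SHOW CREATE TABLE returns the DDL, including the PARTITION BY clause.
    cur.execute("SHOW CREATE TABLE my_kudu_table")
    ddl = cur.fetchall()[0][0]
    has_range_partition = "RANGE (" in ddl.upper()
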
0 votes, 0 answers

Spark Job slowness

Whenever I run a Spark job with the parameters below, it slows down. spark-submit --conf spark.sql.shuffle.partitions=100 --master yarn --deploy-mode cluster --conf spark.dynamicAllocation.enabled=true --conf…
0 votes, 0 answers

How to write string literals in a SQL query in a Spring Boot application.yaml file?

I am trying to place a SQL query that reads data from a Kudu table into an application.yaml file where a string literal is used. But while running the program, it gives the parsing error below: EL1043E: Unexpected token. Expected 'rcurly(})' but was…
Tinku • 53 • 7
0 votes, 1 answer

Query result as a variable in another query using JDBC

Because I want to optimize a query, I want to do without a join. Because of that, I need to declare a variable before the main query, but I can't find a way to use it in a JDBC statement. Original query: SELECT d.orders SUM(price * qty) /…
AlleXyS
  • 2,476
  • 2
  • 17
  • 37