Questions tagged [apache-kudu]

For questions related to Apache Kudu

From https://kudu.apache.org/docs/

About Kudu

Kudu is a columnar storage manager developed for the Hadoop platform. Kudu shares the common technical properties of Hadoop ecosystem applications: it runs on commodity hardware, is horizontally scalable, and supports highly available operation.

Kudu's design sets it apart. Some of Kudu's benefits include:

  • Fast processing of OLAP workloads.
  • Integration with MapReduce, Spark and other Hadoop ecosystem components.
  • Tight integration with Impala, making it a good, mutable alternative to using HDFS with Parquet (see the sketch after this list).
  • Strong but flexible consistency model, allowing you to choose consistency requirements on a per-request basis, including the option for strict serialized consistency.
  • Strong performance for running sequential and random workloads simultaneously.
  • Easy to administer and manage with Cloudera Manager.
  • High availability. Tablet Servers and the Master use the Raft consensus algorithm: with a replication factor of 2f+1, a tablet remains available even if f replicas fail. Reads can be serviced by read-only follower tablets, even in the event of a leader tablet failure.
  • Structured data model.
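
As a concrete illustration of the Impala integration mentioned above, here is a minimal sketch that creates a mutable, Kudu-backed table and upserts a row through the impyla client; the host, table, and column names are assumptions made for illustration, and the same statements can be run verbatim from impala-shell.

    from impala.dbapi import connect

    # Assumed Impala endpoint; adjust host/port for your cluster.
    conn = connect(host="impala-host", port=21050)
    cur = conn.cursor()

    # Impala can create a Kudu-backed table directly.
    cur.execute("""
        CREATE TABLE events (
            event_id BIGINT,
            event_time TIMESTAMP,
            payload STRING,
            PRIMARY KEY (event_id)
        )
        PARTITION BY HASH (event_id) PARTITIONS 4
        STORED AS KUDU
    """)

    # Rows can then be modified in place, unlike with HDFS + Parquet.
    cur.execute("UPSERT INTO events VALUES (1, now(), 'clicked')")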

By combining all of these properties, Kudu targets support for families of applications that are difficult or impossible to implement on current generation Hadoop storage technologies. A few examples of applications for which Kudu is a great solution are:

  • Reporting applications where newly-arrived data needs to be immediately available for end users
  • Time-series applications that must simultaneously support:
    • queries across large amounts of historic data
    • granular queries about an individual entity that must return very quickly
  • Applications that use predictive models to make real-time decisions with periodic refreshes of the predictive model based on all historic data
134 questions
1 vote, 4 answers

Best practice for high-volume transactions with real-time balance updates

I currently have a MySQL database which deals with a very large number of transactions. To keep it simple, it's a data stream of actions (clicks and other events) coming in real time. The structure is such that users belong to sub-affiliates and…
Rogexx • 13 • 4
1 vote, 1 answer

Too much disk space used by Apache Kudu for WALs

I have a Hive table of 2.7 MB (stored in Parquet format). When I use impala-shell to convert this Hive table to Kudu, I notice that the /tserver/ folder size increases by around 300 MB. Upon exploring further, I see it is the…
Zzrot • 304 • 2 • 4 • 20
1 vote, 1 answer

Filtering a specific row in kudu using kudu scanner

The target table in Kudu is huge. I have the following in Scala, and I would like to check if the row exists in Kudu. These four columns are the primary key in the Kudu table, but when I define an upper bound I seem to get all the rows. How do I select a…
user3897533 • 417 • 1 • 8 • 24
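
For the scanner question above, a rough pyspark equivalent of a point lookup is sketched below using the kudu-spark connector (assumed to be on the classpath); the master address, table name, and key columns k1 and k2 are hypothetical. Equality predicates on the primary key columns can be pushed down to the tablet servers, so such a filter should not scan the whole table.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("kudu-point-lookup-sketch").getOrCreate()

    df = (spark.read
          .format("org.apache.kudu.spark.kudu")
          .option("kudu.master", "kudu-master:7051")          # assumed address
          .option("kudu.table", "impala::default.big_table")  # assumed name
          .load())

    # Equality predicates on all primary key columns; kudu-spark can push
    # these down so only the matching row is read.
    match = df.filter((F.col("k1") == "a") & (F.col("k2") == 7))
    row_exists = match.limit(1).count() > 0
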
1 vote, 1 answer

Long range rolling window aggregations - time series kudu vs influxdb vs opentsdb

I'm looking to do some analysis on a large set of customer transaction data. We have millions of transaction events coming in with a quantity and timestamp value for various entities; { "txId": "tx123" "item": "i87" "qty": 3 "time":…
NightWolf • 7,694 • 9 • 74 • 121
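
Independent of the storage engine comparison above, a rolling aggregation over events shaped like the ones in the question can be expressed with a Spark range window; the rows and the one-hour window below are made up purely for illustration.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("rolling-window-sketch").getOrCreate()

    # Placeholder events mirroring the shape described in the question.
    events = spark.createDataFrame(
        [("tx123", "i87", 3, "2023-01-01 00:00:05"),
         ("tx124", "i87", 1, "2023-01-01 00:40:00")],
        ["txId", "item", "qty", "time"],
    ).withColumn("ts", F.col("time").cast("timestamp"))

    # One-hour rolling sum of qty per item, ordered by event time (seconds).
    w = (Window.partitionBy("item")
         .orderBy(F.col("ts").cast("long"))
         .rangeBetween(-3600, 0))

    events.withColumn("qty_1h", F.sum("qty").over(w)).show()
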
1 vote, 1 answer

ERROR: AnalysisException: A data distribution must be specified using a DISTRIBUTE BY clause

While following the Kudu quickstart at http://kudu.apache.org/docs/quickstart.html, I encountered the error "ERROR: AnalysisException: A data distribution must be specified using a DISTRIBUTE BY clause." while trying to create the Kudu table…
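
The clause name depends on the Impala build: the older Impala_Kudu packages used by some quickstart versions expect a DISTRIBUTE BY ... INTO N BUCKETS clause (as the error demands), while current Impala releases spell the same concept PARTITION BY. Below is a hedged sketch of the modern form via impyla, with placeholder host, table, and column names.

    from impala.dbapi import connect

    conn = connect(host="impala-host", port=21050)  # assumed endpoint
    cur = conn.cursor()

    # Current Impala syntax; older Impala_Kudu builds used a different DDL
    # form built around DISTRIBUTE BY HASH (...) INTO N BUCKETS instead.
    cur.execute("""
        CREATE TABLE sfmta_kudu (
            report_time BIGINT,
            vehicle_tag INT,
            PRIMARY KEY (report_time, vehicle_tag)
        )
        PARTITION BY HASH (vehicle_tag) PARTITIONS 8
        STORED AS KUDU
    """)
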
0 votes, 0 answers

How can I continuously read data from Apache Kudu in real-time using Apache Flink?

I need to read data with Apache Flink from an Apache Kudu database in real time. My use case is: I receive a message from Kafka, deserialize that message, and get an ID. If the ID exists in the database, I ignore it; if it doesn't, I need to add it in…
0 votes, 0 answers

KuduContext with pyspark

I am trying to upsert rows using pyspark with KuduContext. I can do it successfully with "append" mode, but I couldn't use KuduContext methods such as upsertRows...
yorrbo • 1 • 1
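
There is no first-class KuduContext wrapper in pyspark; one workaround that has been suggested is to reach the Scala KuduContext through the py4j gateway, assuming the kudu-spark jar is on the classpath. The untested sketch below uses an assumed master address and table name, and the KuduContext constructor arguments vary between kudu-spark releases, so check them against the version in use.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kudu-upsert-sketch").getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])  # placeholder rows

    # Instantiate the Scala KuduContext via the JVM gateway and hand it the
    # underlying Java DataFrame for the upsert.
    kudu_context = spark._jvm.org.apache.kudu.spark.kudu.KuduContext(
        "kudu-master:7051", spark.sparkContext._jsc.sc())
    kudu_context.upsertRows(df._jdf, "impala::default.my_table")
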
0 votes, 1 answer

Deleting kudu range partitions less than the given string

I want to delete all Kudu RANGE partitions from a Kudu table that have a partition value less than a given date string. I am using the following query but it's not working. Can someone please suggest a workaround? alter table test_table drop…
Akanksha_p • 916 • 12 • 20
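
A single ALTER statement cannot drop "everything below X": Kudu range partitions are dropped one at a time by their exact bounds. Below is a hedged sketch that loops over known partition bounds via impyla; the host, table name, bounds, and cutoff are all placeholders.

    from impala.dbapi import connect

    cutoff = "2021-06-01"
    # (lower, upper) bounds of the table's existing range partitions
    # (placeholders -- in practice read them from SHOW RANGE PARTITIONS).
    existing = [("2021-01-01", "2021-02-01"), ("2021-02-01", "2021-03-01")]

    conn = connect(host="impala-host", port=21050)
    cur = conn.cursor()
    for lower, upper in existing:
        if upper <= cutoff:
            cur.execute(
                "ALTER TABLE test_table DROP IF EXISTS RANGE PARTITION "
                f"'{lower}' <= VALUES < '{upper}'"
            )
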
0 votes, 0 answers

Two UUIDs for a tablet server after restart where the WAL directory was lost

We faced a problem on our production Kudu cluster. The hard disk holding the WAL directory failed on a tablet server. We installed a new disk and cleared the data directory according to the Kudu documentation…
0 votes, 0 answers

How to create Kudu table from pyspark dataframe

I am trying a simple approach to write a dataframe from pyspark into a non-existing Kudu table: df.write.format('org.apache.kudu.spark.kudu') \ .option('kudu.master', kudu_master) \ .option('kudu.table', kudu_table) \ …
Exorcismus • 2,243 • 1 • 35 • 68
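
One note on the question above: historically the kudu-spark data source writes into an existing table but does not create one, so a common pattern is to create the table first (for example through Impala, or through KuduContext.createTable on the JVM side) and then append. A sketch reusing the question's own df, kudu_master, and kudu_table names:

    # Assumes the target Kudu table already exists with a matching schema,
    # and that df, kudu_master, and kudu_table are defined as in the question.
    (df.write
       .format("org.apache.kudu.spark.kudu")
       .option("kudu.master", kudu_master)
       .option("kudu.table", kudu_table)
       .mode("append")
       .save())
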
0 votes, 2 answers

Impala Delta Lake Integration

I have set up Delta Lake in Cloudera. It works fine with Spark and Hive. I have searched the internet for ways to integrate Delta Lake with Impala but did not find much information. Can someone please answer if you have done the same? Update: Do not…
vijayinani • 2,548 • 2 • 26 • 48
0 votes, 1 answer

Check if table has RANGE partition

Is it possible to list the types of partitions (HASH, RANGE, etc.) applied to a given Kudu table through a query? I need to check whether that table contains a RANGE partition or not.
Akanksha_p • 916 • 12 • 20
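
One pragmatic check, sketched below with impyla and a placeholder host and table name, is to fetch the table's DDL and look for a RANGE clause; Impala also offers SHOW RANGE PARTITIONS for Kudu tables, which can serve the same purpose.

    from impala.dbapi import connect

    conn = connect(host="impala-host", port=21050)
    cur = conn.cursor()

    # SHOW CREATE TABLE returns the DDL, including the PARTITION BY clause.
    cur.execute("SHOW CREATE TABLE my_kudu_table")
    ddl = cur.fetchall()[0][0]
    has_range_partition = "RANGE (" in ddl.upper()
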
0 votes, 0 answers

Spark Job slowness

Whenever I run a Spark job with the parameters below, it slows down. spark-submit --conf spark.sql.shuffle.partitions=100 --master yarn --deploy-mode cluster --conf spark.dynamicAllocation.enabled=true --conf…
0 votes, 0 answers

How to write string literals in a SQL query in a Spring Boot application.yaml file?

I am trying to place a SQL query that reads data from a Kudu table into an application.yaml file where a string literal is used. But while running the program, it gives the parsing error below: EL1043E: Unexpected token. Expected 'rcurly(})' but was…
Tinku • 53 • 7
0 votes, 1 answer

Query result as a variable in another query using JDBC

Because I want to optimize a query, I want to do without a join. Because of that, I need to declare a variable before the main query, but I can't find a way to use it in a JDBC statement. Original query: SELECT d.orders SUM(price * qty) /…
AlleXyS
  • 2,476
  • 2
  • 17
  • 37