
In my case, I need to load Impala data into Spark (pyspark), because I want to use FPGrowth from Spark MLlib.

The data lives in Kudu and was created by Impala. Connecting to Kudu directly from Spark was rejected by the relevant department, and I also failed to connect with Cloudera's Impala JDBC driver.
So my last option is:

  1. Load the data with ibis (https://github.com/ibis-project/ibis)
  2. Convert the resulting ImpalaTable to a Spark DataFrame

But I couldn't find a way to do the conversion.
Am I thinking about this the wrong way?

  • Hi. Have you checked this approach => https://medium.com/@sciencecommitter/how-to-read-from-and-write-to-kudu-tables-in-pyspark-via-impala-c4334b98cf05 But you first need access to Kudu through Impala. – airliquide Oct 26 '21 at 08:25
  • 1
    @airliquide, I've seen that post, and I retried it. I finally found I had a firewall problem on the data nodes. That's why I couldn't query the Kudu tables (timeout error) even though I could get table info. Thanks a lot!!!!! – JEONGHYEON OH Oct 27 '21 at 12:41

1 Answer


Previously, this approach did not work for me:
I could get the schema of the tables, but queries failed with a timeout.

I finally found the problem: it was the firewall.
I had opened the ports on the master nodes only, but I also needed to open the ports on the data nodes.
Now everything works.
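A quick way to verify this kind of firewall fix is a plain TCP reachability check against every node, not just the masters. The hostnames and the port below are placeholders (7050 is the default Kudu tablet-server RPC port; adjust to your deployment):

```python
# Plain-socket reachability check: returns True if host:port accepts a TCP
# connection within the timeout, which is what the client needs to query.
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (hosts are placeholders for your data nodes):
# for host in ["datanode1", "datanode2"]:
#     print(host, port_open(host, 7050))
```

If the masters answer but the data nodes don't, you get exactly the symptom above: metadata works, queries time out.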