How do I read a large table from HDFS into a pandas DataFrame in a Jupyter notebook? The script runs inside a Docker image.

Libraries:

  • sasl==0.2.1
  • thrift==0.11.0
  • thrift-sasl==0.4a1
  • impyla==0.16.2

from impala.dbapi import connect
from impala.util import as_pandas

impala_conn = connect(host='hostname', port=21050,
                      auth_mechanism='GSSAPI',
                      timeout=100000, use_ssl=True, ca_cert=None,
                      ldap_user=None, ldap_password=None,
                      kerberos_service_name='impala')

This works.


import pandas as pd
df = pd.read_sql("select id, crt_mnemo from demo_db.stg_deals_opn LIMIT 100", impala_conn)
print(df)

This also works.


import pandas as pd
df = pd.read_sql("select id, crt_mnemo from demo_db.stg_deals_opn LIMIT 1000", impala_conn)
print(df)

This does not work. The operation hangs and gives no errors.

  • Your code is the exact same? And why Impala + pandas instead of PySpark? – OneCricketeer Mar 02 '20 at 15:02
  • @cricket_007 why would I lie about that? The `toPandas` method in PySpark outputs only 10,000 rows, and throws an error. – MacJei Mar 04 '20 at 10:08
  • Oh, sorry, I missed that the limit was different. Uhm, well, it could be a memory problem. What exactly is the error? What does your Spark code look like? – OneCricketeer Mar 04 '20 at 13:40
  • @cricket_007 thank you for wanting to help. Spark code: `pdf = sqlContext.sql("""table name""").toPandas() print(pdf)`. Error: `Py4JJavaError: An error occurred while calling o121.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 5.0 failed 4 times, most recent failure: Lost task 2.3 in stage 5.0 (TID 20, hadoop server, executor 1): TaskResultLost (result lost from block manager) Driver stacktrace` – MacJei Mar 10 '20 at 12:22
  • Well, clearly as soon as you use `toPandas` you're bringing all the data to a single machine, defeating the purpose of Spark... If you just want to print the DataFrame, it's `sqlContext.table("""table name""").show()`. Any pandas operation should be translated to Spark instead. – OneCricketeer Mar 10 '20 at 13:01
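
For reference, a minimal sketch of the Spark-side approach described in the comments above, using the SparkSession API rather than the older sqlContext (the table name is taken from the question; the session setup is an assumption and may already exist as `spark` in a notebook):

from pyspark.sql import SparkSession

# Assumed session setup; many notebook environments provide `spark` already.
spark = SparkSession.builder.appName("demo").getOrCreate()

# show() prints rows without collecting the whole table to the driver.
spark.table("demo_db.stg_deals_opn").show(100)

# If a pandas DataFrame is genuinely needed, cap the rows first so the
# collect to the driver stays small.
pdf = spark.table("demo_db.stg_deals_opn").limit(1000).toPandas()
print(pdf)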

1 Answer


This seems to be a problem with the number of rows you can move from Impala using pandas' read_sql function. I ran into the same problem, but at lower limits than yours. You may want to contact the database administrator to check the result size limits. In the meantime, fetching the result in chunks can work around the hang; see the sketch below. Other options are documented here: https://docs.cloudera.com/machine-learning/cloud/import-data/topics/ml-running-queries-on-impala-tables.html
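
A minimal sketch of the chunked approach, assuming the `impala_conn` from the question (the chunk size of 500 is an arbitrary placeholder); with `chunksize` set, pandas pulls rows through the DBAPI cursor in batches instead of one large fetch:

import pandas as pd

# Query from the question; adjust to your table.
query = "select id, crt_mnemo from demo_db.stg_deals_opn LIMIT 1000"

# With chunksize, read_sql returns an iterator of small DataFrames,
# each fetched via cursor.fetchmany() rather than one big fetchall().
chunks = pd.read_sql(query, impala_conn, chunksize=500)

# Reassemble the pieces into a single DataFrame.
df = pd.concat(chunks, ignore_index=True)
print(df)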

– Jorge