
I'm having some trouble getting a row count from a temporary hive table. I'm not sure what is actually causing this error because when I run the identical set of queries against smaller test clusters, I get back the expected results. I only see this when running against a large hive cluster.

The code is something like

with hive.connect() as conn:
    conn.execute("CREATE TEMPORARY TABLE new_users (uuid String)")
    conn.execute("""INSERT INTO new_users (uuid)
                    SELECT uuid FROM big_user_table WHERE <some conditions>""")
    resp = conn.execute("""SELECT COUNT(*) FROM
                        (SELECT DISTINCT uuid FROM new_users) new_usrs""").fetchone()

I've tried a few variations to get the count but it's really the .fetchone() that is throwing the error.
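For concreteness, the variations were along these lines, sketched here as plain query strings so they can be checked without a cluster (the `dedup` alias is arbitrary, and whether the single-pass form behaves any differently at fetch time is exactly what I'm unsure about):

```python
# Hypothetical sketch of the two phrasings of the count; built as plain
# strings so the query text itself can be inspected without a Hive cluster.
def subquery_count(table):
    """Count via a DISTINCT subquery (the form that fails for me)."""
    return (f"SELECT COUNT(*) FROM "
            f"(SELECT DISTINCT uuid FROM {table}) dedup")

def direct_count(table):
    """Equivalent single-pass form using COUNT(DISTINCT ...)."""
    return f"SELECT COUNT(DISTINCT uuid) FROM {table}"
```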

If someone wants the entire Hive stack trace I can add it, but for now here's just the Python side:

File "/home/ec2-user/myproject/report.py", line 88, in run_metrics
    (SELECT DISTINCT uuid FROM new_users) new_usrs""").fetchone()
  File "/home/ec2-user/.local/lib/python3.7/site-packages/sqlalchemy/engine/result.py", line 1276, in fetchone
    e, None, None, self.cursor, self.context
  File "/home/ec2-user/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1466, in _handle_dbapi_exception
    util.raise_from_cause(sqlalchemy_exception, exc_info)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 383, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 128, in reraise
    raise value.with_traceback(tb)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/sqlalchemy/engine/result.py", line 1268, in fetchone
    row = self._fetchone_impl()
  File "/home/ec2-user/.local/lib/python3.7/site-packages/sqlalchemy/engine/result.py", line 1148, in _fetchone_impl
    return self.cursor.fetchone()
  File "/home/ec2-user/.local/lib/python3.7/site-packages/pyhive/common.py", line 105, in fetchone
    self._fetch_while(lambda: not self._data and self._state != self._STATE_FINISHED)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/pyhive/common.py", line 45, in _fetch_while
    self._fetch_more()
  File "/home/ec2-user/.local/lib/python3.7/site-packages/pyhive/hive.py", line 387, in _fetch_more
    _check_status(response)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/pyhive/hive.py", line 495, in _check_status
    raise OperationalError(response)

where the final Hive error mentions a premature EOF:

'org.apache.hadoop.hive.ql.exec.FetchOperator:getNextRow:FetchOperator.java:459'], sqlState=None, errorCode=0, errorMessage='java.io.IOException: java.io.EOFException: Premature EOF from inputStream'), hasMoreRows=None, results=None)

Considering the number of large SELECT/INSERT queries that precede this COUNT, I have trouble believing it's a memory issue, but I also have no other ideas at the moment.

Thanks.

Lucian Thorr
  • Oh, thanks! That was just an editing mistake when copying over the code. I'll fix now. – Lucian Thorr Aug 24 '20 at 15:04
  • 1
    The query and traceback included the closing `)`, I had just deleted it from the traceback by accident when editing for the stackoverflow question. The actual table-names are unnecessarily long and I didn't want that to detract from the question. – Lucian Thorr Aug 24 '20 at 15:35

0 Answers