
Impala currently writes query profile logs to /var/log/impala/profiles, one query per line, in the format

<Epoch-Timestamp> <QueryID> <zlib-compressed-data>

As mentioned in the documentation at https://impala.apache.org/docs/build/html/topics/impala_logging.html

"To save space, those query profiles are now stored in zlib-compressed files in /var/log/impala/profiles."

I want to decode the zlib-compressed data into a human-readable format with a utility, instead of using the web UI exposed on port 25000.

From the logs and documentation I figured out that the zlib-compressed data is additionally base64-encoded. I was able to write Python code that decodes the base64 and decompresses the zlib data:

import base64
import datetime
import zlib

profile_data = "1587093056765 c94ef1f2e35015a2:feb1867165d545a7 eJyVVE1P1EAYLkhkXRBcI7ANMRlvkMhm+rXd3ehh2S1xI4LS5eNgYqadtzChtMt0yoe/gHg1MR78Df4AD540MSYevHry6MEf4NEpFVwImJgmk74znfd5n+d536r3ynefpMAP0Qyj9/26CYEW6GBYWLOI3gjA02pVW6ta1DItYs9ODKl3y6VORHzB9qAbCxJ22Q5MFCcVpXztbDw5UJpW1MK0PBl2050dwg8nlP+7Xjoaa8VRBL4AilYT4EM8jsVIK445ZRERMVd3U+ZvJ4JwUfHDOKXASUPXMcaFdnsRdQ97MNpacZpd51m3Ob/oFNsQkDQUqO2NwEEPuISLRDLWhhA2yQmMUnAiirJSbutYx3PYnNNshI2GhhtWtWJXzVrNkBhjnZ0eCQlaA56wOFpnxyFFe3mM9IpVwXM+3bIqdgWjFWfRaboOmvFSFlJEqBFYhBIPaGBVA6h6tA5AjLpm2j6u2r5vWbqNjdnxJRD7Md9GTUo5JMm4pst08tEahqFZ+nRu4XJPSNAEzUQSmuY8Z5WR/NAVkl1hobPUcR847dG/m2kyuPywmMeZXFekbkVXomQEOu07dcvEZl2npmlbdc8LGnUc4JqODewR3dAIjJ58nN0ennccd725cd3dDXPMTN8pn4N8RYJ4oVw1NHOAWCRmi25m3OVCm5ptWpmZQ6fmq79KfdWdwe7LdupfH7F+Ic7wP+fiMda5vjvXH+cN6euqs8T7W/VfLp0263Q2IYp6VS0o5bE/tUsaIYtAXVN+vvy++vrNx0+Db799+Zwv6sZ4ThsOwE+z1KXHIYkiFm2igEUs2QJ6YwV2U0jE6UZpgXEZ8ngfBSD87JPViMMmSwRwtJvByoEczXVxgct+lqP7tHyzReSIxpvLPUeiZYVxOasfvr77MaCUb7VCJikvZAnXCRMnx6/eHw0ql0791Eq8/0iqxRkJ2XOSETi5ePkvZeCFUruglgsruAxAUX4DOJOTpw=="

pdata = profile_data.split(" ")

ts = datetime.datetime.fromtimestamp(int(pdata[0]) / 1000.0).isoformat()
queryID = pdata[1]

encodedData = base64.b64decode(pdata[2])
zlib_data = zlib.decompress(encodedData)

print(zlib_data)

The Python snippet above gives the following output, which contains some meaningful information but is not fully decoded:

b'\x19<\x18,Query (id=c94ef1f2e35015a2:feb1867165d545a7)\x15\x04\x19,\x18\x11InactiveTotalTime\x15\n\x16\x00\x00\x18\tTotalTime\x15\n\x16\x00\x00\x16\x01\x11\x1b\x00\x19\x08\x1b\x00\x00\x18\x07Summary\x15\x00\x19,\x18\x11InactiveTotalTime\x15\n\x16\x00\x00\x18\tTotalTime\x15\n\x16\x00\x00\x16\x01\x11\x1b\x11\x88\x0eConnected User\x04root\x0bCoordinator\x19quickstart.cloudera:22000\x08DDL Type\x0cCREATE_TABLE\nDefault Db\x0bexperiments\x0eDelegated User\x00\x08End Time\x1d2020-04-17 03:10:56.764883000\x0eImpala VersionWimpalad version 2.5.0-cdh5.7.0 RELEASE (build ad3f5adabedf56fe6bd9eea39147c067cc552703)\x0fNetwork Address\x0f127.0.0.1:33152\x1bQuery Options (non default)\x00\x0bQuery State\x08FINISHED\x0cQuery Status\x02OK\nQuery Type\x03DDL\nSession ID!9540492d44759bbf:90f082030ba231ae\x0cSession Type\x07BEESWAX\rSql Statement\x17create table t1 (x int)\nStart Time\x1d2020-04-17 03:10:56.417452000\x04User\x04root\x19\xf8\x11\nSession ID\x0cSession Type\nStart Time\x08End Time\nQuery Type\x0bQuery State\x0cQuery Status\x0eImpala Version\x04User\x0eConnected User\x0eDelegated User\x0fNetwork Address\nDefault Db\rSql Statement\x0bCoordinator\x1bQuery Options (non default)\x08DDL Type\x1b\x00\x19,\x18\x00\x19\x06\x19\x08\x00\x18\x0eQuery Timeline\x19V\x00\xec\x93\xe0U\x98\x9c\xc5\xc8\x02\xae\xda\xcd\xca\x02\xae\xda\xcd\xca\x02\x19X\x0fStart execution\x11Planning finished\x10Request finished\x11First row fetched\x10Unregister query\x00\x00\x18\x0cImpalaServer\x15\x00\x19\\\x18\x12CatalogOpExecTimer\x15\n\x16\xc4\xd1\xba\xe8\x01\x00\x18\x14ClientFetchWaitTimer\x15\n\x16\x96\xbe\x88\x02\x00\x18\x11InactiveTotalTime\x15\n\x16\x00\x00\x18\x17RowMaterializationTimer\x15\n\x16\x00\x00\x18\tTotalTime\x15\n\x16\x00\x00\x16\x01\x11\x1b\x00\x19\x08\x1b\x01\x8a\x008\x12CatalogOpExecTimer\x14ClientFetchWaitTimer\x17RowMaterializationTimer\x00\x00'

Any pointers on how to understand and parse the Impala profile log programmatically would be much appreciated.

sumit kumar
  • Just a pointer: look at the Impala source on Git, https://github.com/apache/impala/blob/master/be/src/kudu/util/zlib.cc#L76, and try to follow the compression logic. – mazaneicha Apr 22 '20 at 15:34
  • But wouldn't it be easier to get whatever details you're looking for through REST from the impalad/coordinator that ran the query? https://:25000/query_profile?query_id= – mazaneicha Apr 22 '20 at 15:56
  • @mazaneicha REST would be easy, but the API ports have been blocked, so parsing the file is my only option. – sumit kumar May 03 '20 at 21:12

2 Answers


Update: edited to include source code and some comments.

I guess this is the script that you're looking for (taken from Impala Git here):

    import base64
    import datetime
    import zlib
    from thrift.protocol import TCompactProtocol
    from thrift.TSerialization import deserialize
    from RuntimeProfile.ttypes import TRuntimeProfileTree

    def decode_profile_line(line):
      space_separated = line.split(" ")
      if len(space_separated) == 3:
        ts = int(space_separated[0])
        # Formatting into a single string keeps this line valid on Python 2 and 3
        print("%s %s" % (datetime.datetime.fromtimestamp(ts / 1000.0).isoformat(), space_separated[1]))
        base64_encoded = space_separated[2]
      elif len(space_separated) == 1:
        base64_encoded = space_separated[0]
      else:
        raise Exception("Unexpected line: " + line)
      possibly_compressed = base64.b64decode(base64_encoded)
      # Handle both compressed and uncompressed Thrift profiles
      try:
        thrift = zlib.decompress(possibly_compressed)
      except zlib.error:
        thrift = possibly_compressed

      tree = TRuntimeProfileTree()
      deserialize(tree, thrift, protocol_factory=TCompactProtocol.TCompactProtocolFactory())
      tree.validate()

      return tree

    encoded_line = "timestamp queryid encoded_profile_string"
    decoded_tree = decode_profile_line(encoded_line)

Basically, it does the same thing as your snippet, but it then also decodes the Thrift serialization. "thrift" is the Python library from Apache Thrift itself (more info in the Apache Thrift docs and Git), and RuntimeProfile is the module generated from Impala's structure definition (you can view it in Impala Git), which contains the nodes and a summary of query execution:

    // A flattened tree of runtime profiles, obtained by an
    // pre-order traversal
    struct TRuntimeProfileTree {
      1: required list<TRuntimeProfileNode> nodes
      2: optional ExecStats.TExecSummary exec_summary
    }
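Because the nodes arrive as a flat pre-order list, the nesting has to be rebuilt from each node's child count. Here is a minimal sketch of that step. It assumes each node exposes `name` and `num_children` attributes, as the nodes in Impala's RuntimeProfile.thrift appear to (check the definition for your version). The `Node` namedtuple and `format_tree` are hypothetical names used only for illustration, and the sample values mirror the output shown in the question.

```python
from collections import namedtuple

# Hypothetical stand-in for the generated TRuntimeProfileNode class;
# only the two attributes this sketch relies on are modelled.
Node = namedtuple("Node", ["name", "num_children"])


def format_tree(nodes):
    """Rebuild nesting from a pre-order node list; return indented lines."""
    if not nodes:
        return []
    lines = []
    remaining = iter(nodes)

    def visit(depth):
        # In pre-order, a node is followed directly by its subtrees.
        node = next(remaining)
        lines.append("  " * depth + node.name)
        for _ in range(node.num_children):
            visit(depth + 1)

    visit(0)
    return lines


sample = [
    Node("Query (id=c94ef1f2e35015a2:feb1867165d545a7)", 2),
    Node("Summary", 0),
    Node("ImpalaServer", 0),
]
print("\n".join(format_tree(sample)))
```

With the real generated classes, `tree.nodes` from `decode_profile_line` could be passed in place of `sample`, provided the attribute names match.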

Let's take a closer look at the "deserialize" method in Thrift:

    def deserialize(base,
            buf,
            protocol_factory=TBinaryProtocol.TBinaryProtocolFactory()):
      transport = TTransport.TMemoryBuffer(buf)
      protocol = protocol_factory.getProtocol(transport)
      base.read(protocol)
      return base

"base" is the object that gets populated and returned to you: the tree, with all of its nodes and subnodes. So there are two ways to dig further from here:

  1. Browse through the Impala code; it seems that all of the necessary definitions are stored in this subfolder of their Git, so it's possible to replicate their logic.
  2. Write a custom "deserialize" and debug the output of "protocol_factory.getProtocol(transport)" to work out the structure of the "base" tree (again, it's probably a bit easier to look up the necessary nodes in the Impala source code).
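For the second option, a schema-less reader can be enough to see what is in the payload without the generated RuntimeProfile classes. The sketch below is a rough illustration based on my reading of the Thrift compact protocol, not Impala's or Thrift's own code, and it needs Python 3. `CompactReader` is a hypothetical name, structs come back as plain `{field_id: value}` dicts, and the sample bytes are hand-built to resemble the start of the output in the question.

```python
import struct

# Wire type ids of the Thrift compact protocol.
(STOP, TRUE, FALSE, BYTE, I16, I32, I64,
 DOUBLE, BINARY, LIST, SET, MAP, STRUCT) = range(13)


class CompactReader:
    """Schema-less reader: structs become {field_id: value} dicts."""

    def __init__(self, buf):
        self.buf = buf
        self.pos = 0

    def read_byte(self):
        b = self.buf[self.pos]
        self.pos += 1
        return b

    def read_varint(self):
        result, shift = 0, 0
        while True:
            b = self.read_byte()
            result |= (b & 0x7F) << shift
            if not b & 0x80:
                return result
            shift += 7

    def read_zigzag(self):
        n = self.read_varint()
        return (n >> 1) ^ -(n & 1)

    def read_value(self, ttype):
        if ttype in (TRUE, FALSE):          # bool inside a collection
            return self.read_byte() == TRUE
        if ttype == BYTE:                   # signed 8-bit integer
            b = self.read_byte()
            return b - 256 if b > 127 else b
        if ttype in (I16, I32, I64):
            return self.read_zigzag()
        if ttype == DOUBLE:                 # 8 bytes, little-endian
            (value,) = struct.unpack_from("<d", self.buf, self.pos)
            self.pos += 8
            return value
        if ttype == BINARY:                 # varint length, then bytes
            size = self.read_varint()
            data = self.buf[self.pos:self.pos + size]
            self.pos += size
            return data.decode("utf-8", "replace")
        if ttype in (LIST, SET):
            header = self.read_byte()
            size, etype = header >> 4, header & 0x0F
            if size == 15:                  # long form: size follows as varint
                size = self.read_varint()
            return [self.read_value(etype) for _ in range(size)]
        if ttype == MAP:                    # assumes hashable (primitive) keys
            size = self.read_varint()
            if size == 0:
                return {}
            types = self.read_byte()
            ktype, vtype = types >> 4, types & 0x0F
            return {self.read_value(ktype): self.read_value(vtype)
                    for _ in range(size)}
        if ttype == STRUCT:
            return self.read_struct()
        raise ValueError("unknown compact type %d" % ttype)

    def read_struct(self):
        fields, last_id = {}, 0
        while True:
            header = self.read_byte()
            if header == STOP:
                return fields
            delta, ttype = header >> 4, header & 0x0F
            # A zero delta means the field id follows explicitly.
            field_id = last_id + delta if delta else self.read_zigzag()
            if ttype in (TRUE, FALSE):      # bool fields live in the header
                fields[field_id] = ttype == TRUE
            else:
                fields[field_id] = self.read_value(ttype)
            last_id = field_id


sample = (b"\x19\x1c\x18\x05Query\x15\x04\x19\x0c\x16\x01\x11"
          b"\x1b\x01\x88\x04User\x04root\x00\x00")
decoded = CompactReader(sample).read_struct()
print(decoded)
```

For a real log line, the bytes passed in would be the result of the base64 and zlib steps. The numeric field ids then have to be matched by hand against the field numbers in RuntimeProfile.thrift, for example field 1 of the top-level struct being `nodes`.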
  • Please post the script code directly in your answer – Russ J May 03 '20 at 02:03
  • @Vladimir-yurev, thanks for the suggestion. The repository uses the 'RuntimeProfile' library, which I am somehow unable to find or build. I am still trying to find out how to get 'RuntimeProfile'; if you are familiar with the project, your help would be highly appreciated. https://github.com/apache/impala/blob/53ff6f9bf5ca907f15ec1187eb5d4007d46eb61e/lib/python/impala_py_lib/profiles.py – sumit kumar May 03 '20 at 21:10
  • Edited original post; hope it helps. – Vladimir Yurev May 04 '20 at 09:40

This might help other people who come across this. The Apache Impala repository has an example script for parsing runtime profile logs that is a good starting point for building your own tool: https://github.com/apache/impala/blob/master/bin/parse-thrift-profile.py. It is tied to the Impala development environment: it needs the Thrift definitions compiled into generated Python code, which requires a working development environment. At the time of writing, you can get this working by checking out Impala and setting up part of the environment:

git clone https://github.com/apache/impala.git
cd impala
. bin/impala-config.sh
./bin/bootstrap_toolchain.py
./buildall.sh -cmake_only -noclean
make thrift-deps
./bin/parse-thrift-profile.py ./impala_profile_log_1.1-1594189561854
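To check which queries a log file contains before setting up the toolchain, the outer layer of each line can be unpacked with the standard library alone. This is a minimal sketch, assuming the `<timestamp> <query id> <base64 payload>` line format described in the question; `unpack_profile_line` and `iter_profile_log` are hypothetical helper names, not part of Impala. The returned bytes are still Thrift-serialized and need the generated classes, or some other Thrift decoder, to become readable.

```python
import base64
import datetime
import zlib


def unpack_profile_line(line):
    """Return (ISO timestamp, query id, serialized profile bytes)."""
    ts_ms, query_id, payload = line.strip().split(" ", 2)
    raw = base64.b64decode(payload)
    try:
        raw = zlib.decompress(raw)
    except zlib.error:
        pass  # payload was not zlib-compressed; keep the decoded bytes
    # Timestamp is epoch milliseconds; shown in the local time zone.
    ts = datetime.datetime.fromtimestamp(int(ts_ms) / 1000.0).isoformat()
    return ts, query_id, raw


def iter_profile_log(path):
    """Yield one unpacked tuple per non-empty line of a profile log."""
    with open(path) as log:
        for line in log:
            if line.strip():
                yield unpack_profile_line(line)
```

Calling `iter_profile_log("./impala_profile_log_1.1-1594189561854")` would then give one tuple per logged query.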