Using the spark-submit command (Spark 2 on CDH 5.9) to run a Python script, I am getting the following JSON decoding error, but only in cluster mode (client mode is fine):

Traceback (most recent call last):
  File "dummy.py", line 115, in <module>
    cfg = json.loads(args.params)
  File "/opt/cloudera/parcels/Anaconda/envs/py35/lib/python3.5/json/__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "/opt/cloudera/parcels/Anaconda/envs/py35/lib/python3.5/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/cloudera/parcels/Anaconda/envs/py35/lib/python3.5/json/decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 863 (char 862)

I understand the above error is due to invalid JSON. However, the JSON passed to the script is valid (explained below). It appears that spark-submit in cluster mode modifies the JSON argument before it reaches the Python script: comparing the driver logs in both "client" and "cluster" mode, the JSON argument stays intact in client mode, whereas in cluster mode it arrives modified.
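
For reference, this is roughly how the argument is passed and read. The command, paths, and the --params flag below are illustrative placeholders, not my exact production invocation:

import argparse
import json

# Launched roughly like (illustrative command, not verbatim):
# spark-submit --master yarn --deploy-mode cluster dummy.py \
#     --params '{"X": {"A": {"a": "b", "c": "d"}}, "Y": ["e", "f"], "Z": "g"}'

parser = argparse.ArgumentParser()
parser.add_argument("--params", help="job configuration as a JSON string")
args = parser.parse_args()

cfg = json.loads(args.params)  # this is the line that fails in cluster mode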

The json that I am passing has a structure like this:

{
    "X": {
        "A": {
            "a": "b",
            "c": "d"
        }
    },
    "Y": ["e", "f"],
    "Z": "g"
}

Client mode receives it as is, whereas cluster mode receives the following:

{
    "X": {
        "A": {
            "a": "b",
            "c": "d",
    "Y": ["e", "f"],
    "Z": "g"
}

This seems like very odd behavior. Any insights would be really helpful.
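
As a possible workaround (untested on my side; the encoding step and flag usage below are just a sketch), I am considering base64-encoding the JSON before submission, so that whatever rewrites the argument in cluster mode only ever sees an opaque ASCII token:

import argparse
import base64
import json

# On the submitting side (hypothetical helper), encode the config first:
#   encoded = base64.b64encode(json.dumps(cfg).encode("utf-8")).decode("ascii")
# and pass <encoded> as the --params value to spark-submit.

parser = argparse.ArgumentParser()
parser.add_argument("--params", help="base64-encoded JSON configuration")
args = parser.parse_args()

# Decode back to the original JSON text before parsing
cfg = json.loads(base64.b64decode(args.params).decode("utf-8"))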

  • How do you pass the argument (JSON) to spark-submit? – Mariusz Sep 02 '17 at 13:46
  • @Mariusz as a JSON blob argument to the Python script, read via argparse – trailblazer Sep 06 '17 at 19:49
  • Could you please update the question with the commands you use to run spark-submit with the argument passed? Did you try looking at what is inside `sys.argv` before argparse parses it? – Mariusz Sep 07 '17 at 07:19

0 Answers