
I have a piece of code that creates several threads on a Glue job like this:

    from threading import Thread

    threads = []
    for data_chunk in data_chunks:
        json_data = get_bulk_upload_json(data_chunk)
        # Pass json_data to the worker function (in the original
        # snippet it was computed but never used)
        threads.append(Thread(target=my_func, args=(json_data, arg1, arg2)))

    for thread in threads:
        thread.start()

    for thread in threads:
        thread.join()

Where data_chunks is a list of dictionaries. Due to the nature of the data, this consumes a lot of memory. The Glue job keeps failing with a memory error, but after further debugging it turns out it crashes as soon as it reaches the memory limit of just one of the workers. In other words, it is not using the memory of the other workers at all. Further evidence: no matter how many workers I add, the same error happens at the same point in the process.

How can I use threads and distribute them across the workers?

rodrigocf

1 Answer


It seems that you are misusing AWS Glue.

You shouldn't use Python threads here: Glue is a managed version of Spark, and Spark does the parallelization for you. Threads started in your job script all run inside the single driver process, which is why only one worker's memory ever gets used. Instead, express the work as Spark / Glue operations, which are then executed on the workers.
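
As a rough illustration, here is a minimal sketch of the same loop expressed as distributed Spark work rather than driver-side threads. It reuses data_chunks, get_bulk_upload_json, my_func, arg1 and arg2 from your question; the functions and arguments must be serializable so Spark can ship them to the executors:

    from pyspark.context import SparkContext

    sc = SparkContext.getOrCreate()

    def process_chunk(data_chunk):
        # Runs on an executor, so each chunk uses that worker's memory
        json_data = get_bulk_upload_json(data_chunk)
        return my_func(json_data, arg1, arg2)

    # Spread the chunks across the cluster instead of threading
    # inside the single driver process
    results = sc.parallelize(data_chunks).map(process_chunk).collect()

If my_func just performs the upload and you don't need the return values on the driver, use foreach(process_chunk) instead of map(...).collect() so the results are not all pulled back into driver memory.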

Robert Kossendey