
Memory leakage is detected via memory_profiler. Since such a big file will be uploaded from a 128 MB GCF (Cloud Function) or an f1-micro GCE instance, how can I prevent this memory leakage?

✗ python -m memory_profiler tests/test_gcp_storage.py
67108864

Filename: tests/test_gcp_storage.py

Line #    Mem usage    Increment   Line Contents
================================================
    48   35.586 MiB   35.586 MiB   @profile
    49                             def test_upload_big_file():
    50   35.586 MiB    0.000 MiB     from google.cloud import storage
    51   35.609 MiB    0.023 MiB     client = storage.Client()
    52                             
    53   35.609 MiB    0.000 MiB     m_bytes = 64
    54   35.609 MiB    0.000 MiB     filename = int(datetime.utcnow().timestamp())
    55   35.609 MiB    0.000 MiB     blob_name = f'test/{filename}'
    56   35.609 MiB    0.000 MiB     bucket_name = 'my_bucket'
    57   38.613 MiB    3.004 MiB     bucket = client.get_bucket(bucket_name)
    58                             
    59   38.613 MiB    0.000 MiB     with open(f'/tmp/{filename}', 'wb+') as file_obj:
    60   38.613 MiB    0.000 MiB       file_obj.seek(m_bytes * 1024 * 1024 - 1)
    61   38.613 MiB    0.000 MiB       file_obj.write(b'\0')
    62   38.613 MiB    0.000 MiB       file_obj.seek(0)
    63                             
    64   38.613 MiB    0.000 MiB       blob = bucket.blob(blob_name)
    65  102.707 MiB   64.094 MiB       blob.upload_from_file(file_obj)
    66                             
    67  102.715 MiB    0.008 MiB     blob = bucket.get_blob(blob_name)
    68  102.719 MiB    0.004 MiB     print(blob.size)
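
(For reference, the same test as plain, runnable code, reconstructed from the annotated listing above; the module-level imports and the placeholder bucket name are my additions.)

from datetime import datetime

from google.cloud import storage


def test_upload_big_file():
    client = storage.Client()

    m_bytes = 64  # target file size in MiB
    filename = int(datetime.utcnow().timestamp())
    blob_name = f'test/{filename}'
    bucket_name = 'my_bucket'  # placeholder
    bucket = client.get_bucket(bucket_name)

    # Create a sparse 64 MiB file and upload it while it is still open.
    with open(f'/tmp/{filename}', 'wb+') as file_obj:
        file_obj.seek(m_bytes * 1024 * 1024 - 1)
        file_obj.write(b'\0')
        file_obj.seek(0)

        blob = bucket.blob(blob_name)
        blob.upload_from_file(file_obj)

    blob = bucket.get_blob(blob_name)
    print(blob.size)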

Moreover, if the file is not opened in binary mode, the memory leakage is roughly twice the file size.

67108864
Filename: tests/test_gcp_storage.py

Line #    Mem usage    Increment   Line Contents
================================================
    48   35.410 MiB   35.410 MiB   @profile
    49                             def test_upload_big_file():
    50   35.410 MiB    0.000 MiB     from google.cloud import storage
    51   35.441 MiB    0.031 MiB     client = storage.Client()
    52                             
    53   35.441 MiB    0.000 MiB     m_bytes = 64
    54   35.441 MiB    0.000 MiB     filename = int(datetime.utcnow().timestamp())
    55   35.441 MiB    0.000 MiB     blob_name = f'test/{filename}'
    56   35.441 MiB    0.000 MiB     bucket_name = 'my_bucket'
    57   38.512 MiB    3.070 MiB     bucket = client.get_bucket(bucket_name)
    58                             
    59   38.512 MiB    0.000 MiB     with open(f'/tmp/{filename}', 'w+') as file_obj:
    60   38.512 MiB    0.000 MiB       file_obj.seek(m_bytes * 1024 * 1024 - 1)
    61   38.512 MiB    0.000 MiB       file_obj.write('\0')
    62   38.512 MiB    0.000 MiB       file_obj.seek(0)
    63                             
    64   38.512 MiB    0.000 MiB       blob = bucket.blob(blob_name)
    65  152.250 MiB  113.738 MiB       blob.upload_from_file(file_obj)
    66                             
    67  152.699 MiB    0.449 MiB     blob = bucket.get_blob(blob_name)
    68  152.703 MiB    0.004 MiB     print(blob.size)
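
My guess for the text-mode case (an assumption; I have not traced it through the library source): reading a text-mode file yields a str, which has to be encoded back into bytes before it can be sent, so a decoded and an encoded copy of the data are alive at the same time, i.e. roughly twice the file size:

# Rough illustration of the doubling (hypothetical mechanism, sizes only)
text = '\0' * (64 * 1024 * 1024)    # what a text-mode read of the 64 MiB file returns (str)
payload = text.encode('utf-8')      # bytes copy needed for the HTTP request body
# While both objects are referenced, peak memory is roughly 2 x 64 MiB.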

GIST: https://gist.github.com/northtree/8b560a6b552a975640ec406c9f701731

northtree
  • Once `blob` goes out of scope, is the memory still in use? – Maximilian Jun 27 '19 at 14:14
  • I have tried your code (both the binary and non-binary way) and both gave me the same file size. Using memory_profiler I didn't get any memory increment when uploading the blob in either version. Try deleting the blob after uploading it (`del blob`), or try the "upload_from_filename" method to see if you face the same issue -> https://googleapis.github.io/google-cloud-python/latest/storage/blobs.html#google.cloud.storage.blob.Blob.upload_from_filename. Let me know. – Mayeru Jun 27 '19 at 15:35
  • @Maximilian I suppose the `blob` should be auto-released outside the `with` block. – northtree Jun 28 '19 at 01:06
  • @Mayeru I have run it multiple times with Python 3.7 and `google-cloud-storage==1.16.1` on OS X. Are you running in a different env? Thanks. – northtree Jun 28 '19 at 01:09
  • @Mayeru I have also tried the `upload_from_filename` method, which calls `upload_from_file` under the hood, and it still shows the same memory leakage on my local machine. – northtree Jun 28 '19 at 01:17
  • @Mayeru There is no memory leakage on `Linux`. The issue seems to be related to `memory_profiler` or `google-cloud-storage` under `OS X`. – northtree Jun 28 '19 at 11:01
  • I was about to confirm that, but I was looking for access to an OS X env; I'm using Linux. Thank you for confirming it. In that case, the best thing you can do is try to clear the memory afterwards. Have you tried using the "gc" module (see the sketch after these comments)? -> https://docs.python.org/3/library/gc.html. Also, if you are running the script on the Cloud, just specify a Linux OS on the VM and you won't face the issue. – Mayeru Jun 28 '19 at 11:29
  • Some advice on how to write code for the cloud: 1) You do not have a memory leak unless you have code that is not displayed. 2) You do not want to allocate large blocks of memory to read a file into. 128 MB is big - too big. 3) Internet connections fail, time out, drop packets, and have errors, so you want to upload in smaller blocks like 64 KB or 1 MB per I/O with retry logic. 4) Performance is increased by multi-part uploads. Typically, two to four threads will double the performance. I realize that your question is about "memory leaks", but write good code and then quality-check the good code. – John Hanley Jul 03 '19 at 01:52
  • @JohnHanley Thanks for your suggestion. That is exactly what `google-cloud-storage` provides, e.g. `_do_multipart_upload` -> https://github.com/googleapis/google-cloud-python/blob/fa9ae9861a8881840e267a691eac2c7d18d42ebd/storage/google/cloud/storage/blob.py#L786. The default `chunk_size` is 256KB. -> https://github.com/googleapis/google-cloud-python/blob/fa9ae9861a8881840e267a691eac2c7d18d42ebd/storage/google/cloud/storage/blob.py#L136 – northtree Jul 04 '19 at 00:58
  • @northtree Did you find a solution to this? – vinayhudli Sep 18 '21 at 05:51
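
Following the `gc` suggestion in the comments, a quick diagnostic sketch (not a fix): drop the reference and force a collection, then re-profile. If usage falls back afterwards, the growth was uncollected garbage rather than a true leak. The bucket name and file path are placeholders.

import gc

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('my_bucket')          # placeholder bucket

with open('/tmp/bigfile', 'rb') as file_obj:     # placeholder path
    blob = bucket.blob('test/bigfile')
    blob.upload_from_file(file_obj)

del blob          # drop the last reference to the blob
gc.collect()      # force a collection before re-checking memory usage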

1 Answer


To limit the amount of memory used during an upload, you need to explicitly configure a chunk size on the blob before you call upload_from_file():

blob = bucket.blob(blob_name, chunk_size=10*1024*1024)
blob.upload_from_file(file_obj)

I agree this is bad default behaviour of the Google client SDK, and the workaround is badly documented as well.
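
For example, on a 128 MB Cloud Function a smaller chunk should keep peak memory at roughly one chunk plus overhead. Note that `chunk_size` must be a multiple of 256 KB, and as far as I can tell setting it switches the client from buffering the whole file into a single request to a resumable upload in chunk-sized pieces. A sketch along those lines (bucket name and path are placeholders):

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('my_bucket')          # placeholder bucket

# chunk_size must be a multiple of 256 KB; 5 MiB keeps peak memory small.
blob = bucket.blob('test/bigfile', chunk_size=5 * 1024 * 1024)

with open('/tmp/bigfile', 'rb') as file_obj:     # placeholder path
    blob.upload_from_file(file_obj)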

Pieter Ennes