I mounted my Google Drive in my Colab notebook. I have a fairly big pandas DataFrame and I am trying to save it with mydf.to_feather(path), where path is on my Google Drive. The file is expected to be about 100 MB, and the write is taking forever.

Is this to be expected? It seems the network link between Colab and Google Drive is not great. Does anyone know whether the servers are in the same region/zone?

I may need to change my workflow to avoid this. If you have any best practices or suggestions, please let me know: anything short of going all-in on GCP (which I expect doesn't have this kind of latency).

kawingkelvin
  • This appears very sporadic. I saved another bigger dataframe to_feather(...) and this time it is much much faster. – kawingkelvin Jun 05 '19 at 21:57
  • Without having a look at your code it's anyone's guess... – Adonis Jun 06 '19 at 22:18
  • I am doing the absolutely "default" thing. If you have seen this happen with .to_feather(...), you probably don't need to see my code to reproduce it. I have posted a workaround below, along with a guess as to what's going on. – kawingkelvin Jun 07 '19 at 20:13

1 Answer

If you call df.to_feather("somewhere on your gdrive") from Google Colab and the file is on the order of a few hundred MB, you may see sporadic performance. Saving a file can take anywhere from a few minutes to a whole hour. I can't explain this behavior.

Workaround: first save to /content/, the local directory on the Colab host machine, then copy the file from /content to your Google Drive mount directory. This works much more consistently and much faster for me. I just can't explain why calling .to_feather directly on a Drive path suffers so much.
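The workaround can be sketched roughly as below. This is only a minimal illustration, not a tested recipe for every setup: save_via_local is a hypothetical helper name, and the Drive path in the usage comment assumes the mount point is /content/drive, which may differ in your notebook.

```python
import os
import shutil
import tempfile


def save_via_local(write_fn, dest_path, local_dir=None):
    """Call write_fn(path) against local disk, then copy the result to dest_path.

    write_fn is any callable that takes a file path, e.g. mydf.to_feather.
    local_dir is where the staging directory is created (e.g. "/content").
    """
    # Stage the file in a fresh temporary directory on local disk.
    staging_dir = tempfile.mkdtemp(dir=local_dir)
    local_path = os.path.join(staging_dir, os.path.basename(dest_path))
    try:
        # The writer only ever touches the local path.
        write_fn(local_path)
        # The finished file goes to the destination in a single copy.
        shutil.copyfile(local_path, dest_path)
    finally:
        # Remove the staging directory whether or not the copy succeeded.
        shutil.rmtree(staging_dir, ignore_errors=True)
    return dest_path


# Hypothetical usage in Colab, assuming Drive is mounted at /content/drive:
# save_via_local(mydf.to_feather,
#                "/content/drive/My Drive/mydf.feather",
#                local_dir="/content")
```

Passing the writer as a callable keeps the helper independent of the file format, so the same pattern should apply to other pandas writers if they show the same slowdown.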

kawingkelvin
  • I believe this behavior may be specific to the pandas feather format. It seems to have a lot of write overhead. Saving to CSV does not seem to have any issues. – kawingkelvin Jun 07 '19 at 20:11
  • I don't have enough evidence, but I have a hunch that to_feather(....) somehow provokes a lot of network overhead. So saving the file locally first and then doing the usual cp may avoid that overhead. – kawingkelvin Jun 11 '19 at 21:17