
I am using Google Colab for the following tasks, but they didn't work. My scripts worked well when I tested them on small folders with fewer than 10 files; however, they didn't work on larger folders containing thousands of files. On a side note, I can't tell the size of my folders because Google Drive doesn't offer such an option.

Hope to know the reason why and how to fix it. Thank you so much!

Task #1: Moving all json files from one folder to another folder on Google Drive. When I tested on smaller folders, all files were moved as expected. However, when I ran it on the "real folders", which are much larger, it looked as if it worked. No timeout. But when I looked at the folders on Google Drive, the files were still there. Nothing had changed.

source = glob.glob('/path_to_source_folder/*.json')
destination = '/path_to_destination_folder/'

for json_file in source:
  # Avoid shadowing the built-in id(); keep only the file name
  file_name = os.path.basename(json_file)
  target = os.path.join(destination, file_name)
  if os.path.exists(target):
    # Already present in the destination: drop the source copy
    print('The file {} already exists'.format(file_name))
    os.remove(json_file)
  else:
    shutil.move(json_file, target)
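
A minimal sketch of the same move with some bookkeeping, so that a silent "nothing happened" run shows up in the output. The function name `move_json_files` and the returned dictionary are hypothetical, introduced only for illustration; the duplicate handling mirrors the script above.

```python
import glob
import os
import shutil


def move_json_files(src_dir, dst_dir):
    """Move *.json files from src_dir to dst_dir and report what happened."""
    matched = glob.glob(os.path.join(src_dir, '*.json'))
    print('glob matched {} files in {}'.format(len(matched), src_dir))

    moved, duplicates = 0, 0
    for src_path in matched:
        name = os.path.basename(src_path)
        dst_path = os.path.join(dst_dir, name)
        if os.path.exists(dst_path):
            # Same behaviour as the original script: drop the source copy.
            os.remove(src_path)
            duplicates += 1
        else:
            shutil.move(src_path, dst_path)
            moved += 1

    # Re-list the source folder to confirm the files are really gone.
    remaining = len(glob.glob(os.path.join(src_dir, '*.json')))
    print('moved={} duplicates={} remaining={}'.format(
        moved, duplicates, remaining))
    return {'matched': len(matched), 'moved': moved,
            'duplicates': duplicates, 'remaining': remaining}
```

If `matched` is 0 on a large folder, the listing of the mounted folder is the thing to investigate rather than the move itself. On Colab it may also be worth calling `drive.flush_and_unmount()` after a large batch so pending changes are written back to Drive; whether that explains the behaviour described here is an assumption, not something confirmed in this thread.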

Task #2: Obtaining statistics about folders and json files. I tested on smaller folders and it worked well. Side note: the json files in the smaller folders have the same structure as the json files in the larger folders. On the larger folders it didn't time out, but it reported "0" for everything, like "0 users", "0 posts", etc. These are definitely wrong.

files = glob.glob('/path_to_reference_folder/*.json')

total_users = 0
not_empty_users = 0
total_posts_by_users = []

for file in files:
  total_users += 1
  with open(file, 'r') as f:
    tmp = f.readlines()
    if len(tmp) > 0:
      not_empty_users += 1
    total_posts_by_users.append(len(tmp))

print("total {} users".format(total_users))

print("total {} posts by users".format(np.sum(total_posts_by_users)))
print("total {} users not empty".format(not_empty_users))
print("total {} average posts per users".format(np.mean(total_posts_by_users)))
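
A result of "0 users" means the loop never ran, i.e. `glob.glob` returned an empty list. Below is a rough sketch of the same statistics that makes this case explicit. It uses plain Python instead of NumPy so it runs anywhere; the function name `summarize_json_folder` and the dictionary keys are hypothetical.

```python
import glob
import os


def summarize_json_folder(folder):
    """Count files ("users") and lines ("posts") for *.json files in folder."""
    files = glob.glob(os.path.join(folder, '*.json'))
    posts_per_user = []
    for path in files:
        with open(path, 'r') as f:
            # As in the original script, one line is treated as one post.
            posts_per_user.append(sum(1 for _ in f))

    total_users = len(posts_per_user)
    total_posts = sum(posts_per_user)
    not_empty_users = sum(1 for n in posts_per_user if n > 0)
    # Avoid dividing by zero when glob matched nothing.
    average = total_posts / total_users if total_users else None
    return {'total_users': total_users,
            'total_posts': total_posts,
            'not_empty_users': not_empty_users,
            'average_posts': average}
```

An `average_posts` of `None` then signals "no files were found" instead of a misleading 0 or a NumPy warning about the mean of an empty list.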

Notes: Early steps - Mounting Drive and importing libraries

# Mounting Drive
from google.colab import drive
drive_mounting = drive.mount('/content/drive')

# Importing libraries
import numpy as np
import os
import glob
import json
import shutil
LucyP
  • My first thought would be permission issues. – Frank Merrow Aug 16 '20 at 01:49
  • @Frank Merrow: Hi Frank, can you help clarify your thought? Thank you. I am running the codes on Google Colab and already mounted drive. Also, the folders are stored in my google drive. Especially, when I ran the codes for small folders, it ran well. However, when I ran on larger folders, it just didn't work as expected. I'm wondering if it has anything to do with memory or something... – LucyP Aug 18 '20 at 02:13
  • I guess my suggestion was a little cryptic. So you run on two test folders, probably folders you created. However, then you try to run on production folders . . . you call them larger folders . . . but my thought is that perhaps you don't have full permissions to either the input or output folder. I know nothing about Google Colab, but the other "next step" is to debug and then single-step through your code. As for memory, how many files are there in the production folder? If all those files were glob'bed as you show above; how many MegaBytes would that list of paths be? – Frank Merrow Aug 18 '20 at 02:56
  • How big is the whole script? Can you post it all? If not, can you make a minimal working example and post just that? – Frank Merrow Aug 18 '20 at 03:01
  • @Frank Merrow: Thank you for your sharing! I believe I have full permission to all the folders because I am the one who created them. Regarding the folders' size, I can't figure it out. You know Google Drive does not have the option to tell that. Since yesterday, I have been syncing all the folders needed for this from Google Drive to my computer but it hasn't been half way yet... – LucyP Aug 18 '20 at 05:00
  • @Frank Merrow: The main part of my scripts are already posted as above. The remaining are just the steps of importing libraries and mounting drive which is needed if you want to reference to a file or folder on drive. I've just added these parts into my post. – LucyP Aug 18 '20 at 05:09
  • @Frank Merrow: Hopefully after finishing the syncing, I can see the folders' size. Currently I can only tell that there are more than 3000 files in each folders. I'm not sure about the upper point. – LucyP Aug 18 '20 at 05:16
  • @LucyP: So you ignored my suggestion to use the debugger on this . . . so I'll make that suggestion again. For instance, what are the results of glob(), etc? A debug session would likely give you lots of answers to questions you don't even know you have. You might also try another routine like os.walk(). However, the bottom line here, since you aren't getting errors, is to debug this and see what is happening. – Frank Merrow Aug 18 '20 at 18:33
  • @Frank Merrow: Hi Frank, I'm sorry, I missed the "debugger" part. I'm a newbie in coding, so can you recommend how to debug? Thank you so much. Also, I did try the os.walk() routine, but things were the same. – LucyP Aug 22 '20 at 04:47
  • Nah, too much to learn. I've never used Google Colab, but look in the Google Colab docs/help . . . there has to be a section on debugging. It might be different for Google Colab, but the point of debugging is to put Python in a mode where the statements execute one at a time under your control and you can examine the results of each statement after it is executed. – Frank Merrow Aug 22 '20 at 04:49
  • @Frank Merrow: On a side note, I think the problem might have something to do with Google Drive; therefore, I have been downloading all the working folders to my computer. The code worked well when I used Jupyter Notebook to read the folders on my computer. So far I have only succeeded in downloading a few folders, I guess because some folders are very big; but I'm glad because now I know the solution, it's just a matter of time. – LucyP Aug 22 '20 at 04:54
  • @Frank Merrow: Anyway I will try debugging on Google Colab and keep you updated of the result. Thank you so much for your help so far! – LucyP Aug 22 '20 at 04:55
  • @Frank Merrow: Is there a way to upvote the comments? I'm new to stack overflow. I would like to vote for your comments. Thank you. – LucyP Aug 22 '20 at 04:56
  • @Frank Merrow: Looks like I don't have enough rep to upvote a comment. I will come back here to upvote you when I have enough rep. Have a nice weekend! – LucyP Aug 22 '20 at 05:09
  • @Frank Merrow: Just an update that with the time pressure, I ended up migrating all the files to my computer and ran the code in there. Everything is good now. Thanks. – LucyP Sep 11 '20 at 21:32
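
For readers who land here with the same symptom: the check suggested in the comments ("what are the results of glob()?") can be done without a debugger. The sketch below is one possible way to do it; `inspect_folder` is a hypothetical helper, not part of any library, and the folder path you pass in is your own.

```python
import glob
import os


def inspect_folder(folder, pattern='*.json'):
    """Report whether folder is visible and how many entries match pattern."""
    report = {
        'exists': os.path.isdir(folder),
        'entries': 0,   # everything os.listdir can see
        'matches': 0,   # what glob returns for the pattern
    }
    if report['exists']:
        report['entries'] = len(os.listdir(folder))
        report['matches'] = len(glob.glob(os.path.join(folder, pattern)))
    return report
```

Running this on a small folder and on a large one and comparing the two reports shows whether the large folder is being listed at all. If `exists` is true but `entries` is 0 for a folder known to hold thousands of files, the problem lies in how the mounted folder is listed, not in the loop that follows.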

0 Answers