
I'm trying to use groupby on a pandas DataFrame containing pathlib paths and file sizes from a particular drive. I want to total the storage used at a particular depth of the directory tree to see which directories are the fullest.

I tried a groupby sum on each file's `Path.parent`, but that only totals the files directly inside each directory, not everything below a particular depth. `Path.parents` looked promising, but it starts with the immediate parent and works back toward the root, so I tried indexing it from the end with a negative index, which doesn't seem to work.

From what I read in the documentation, `Path.parents` is an immutable sequence, and sequences normally support negative indexes, but the IndexError below implies this one doesn't accept them.
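For reference, a negative index can be emulated by converting it to a positive one with `len()`. The following is only a minimal sketch: `parent_from_root` is a hypothetical helper name, not part of pathlib, and `PureWindowsPath` is used so the example behaves the same on any OS. Python 3.10 and later accept negative indexes on `parents` directly, so this should only be needed on older versions.

```python
from pathlib import PureWindowsPath


def parent_from_root(path, n):
    """Emulate path.parents[-n]: n=1 is the root end, n=2 one level below it.

    Hypothetical helper for Python versions where parents rejects
    negative indexes.
    """
    parents = path.parents
    if not 1 <= n <= len(parents):
        raise IndexError(n)
    # parents[0] is the immediate parent and parents[len - 1] is the root,
    # so counting n steps from the end means index len - n.
    return parents[len(parents) - n]


example = PureWindowsPath("c:/Program Files/Git/bin/git.exe")
root = parent_from_root(example, 1)
top_level = parent_from_root(example, 2)
```

Here `root` is the drive root and `top_level` is the `Program Files` directory.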

Here is the code I've been using (with help from http://pbpython.com/pathlib-intro.html)

import pandas as pd
from pathlib import Path

dir_to_scan = "c:/Program Files"
p = Path(dir_to_scan)

all_files = []
for i in p.rglob('*.*'):
    all_files.append((i.name, i.parent,i.stat().st_size))

columns = ["File_Name", "Parent", "Size"]
df = pd.DataFrame.from_records(all_files, columns=columns)

# Negative index into parents: this is the line that raises IndexError
df["path_stem"] = df["Parent"].apply(lambda x: x.parent if len(x.parents) < 3 else x.parents[-2])

The error trace is as follows:

IndexError                                Traceback (most recent call last)
<ipython-input-3-5748b1f0a9ee> in <module>()
      1 #df.groupby('Parent')['Size'].sum()
      2 
----> 3 df["path_stem"]=df['Parent'].apply(lambda x: x.parent if len(x.parents)<3 else x.parents[-1] )
      4 
      5 #df([apps])=df([Parent]).parents

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   2549             else:
   2550                 values = self.asobject
-> 2551                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   2552 
   2553         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()

<ipython-input-3-5748b1f0a9ee> in <lambda>(x)
      1 #df.groupby('Parent')['Size'].sum()
      2 
----> 3 df["path_stem"]=df['Parent'].apply(lambda x: x.parent if len(x.parents)<3 else x.parents[-1] )
      4 
      5 #df([apps])=df([Parent]).parents

C:\ProgramData\Anaconda3\lib\pathlib.py in __getitem__(self, idx)
    592     def __getitem__(self, idx):
    593         if idx < 0 or idx >= len(self):
--> 594             raise IndexError(idx)
    595         return self._pathcls._from_parsed_parts(self._drv, self._root,
    596                                                 self._parts[:-idx - 1])

IndexError: -1
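One possible way around the negative index is to build the grouping key from `parts`, which is an ordinary tuple and can be sliced from the root end. The sketch below is an illustration under assumptions, not a definitive fix: `truncate_to_depth` is a hypothetical helper, the paths are assumed to be absolute (so `parts[0]` is the drive anchor), and the sample sizes are made up. It uses a plain dict instead of pandas so it runs with only the standard library.

```python
from collections import defaultdict
from pathlib import PureWindowsPath


def truncate_to_depth(path, depth):
    """Keep the anchor plus at most `depth` directory levels below it.

    Hypothetical helper; assumes `path` is absolute so parts[0] is the anchor.
    """
    # parts[0] is the anchor (e.g. "c:\\"), so `depth` levels need depth + 1 parts.
    return type(path)(*path.parts[:depth + 1])


# Made-up (parent directory, size in bytes) records for illustration only.
files = [
    (PureWindowsPath("c:/Program Files/Git/bin"), 100),
    (PureWindowsPath("c:/Program Files/Git/cmd"), 50),
    (PureWindowsPath("c:/Program Files/Java/jre/lib"), 30),
    (PureWindowsPath("c:/Program Files"), 7),
]

# Sum sizes per directory two levels below the drive root.
totals = defaultdict(int)
for parent, size in files:
    totals[truncate_to_depth(parent, 2)] += size
```

With the DataFrame from the question, the same helper could presumably be applied as `df["path_stem"] = df["Parent"].apply(lambda x: truncate_to_depth(x, 2))`, followed by `df.groupby("path_stem")["Size"].sum()`. Paths shallower than the requested depth are left unchanged, so no length check is needed.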
  • Could you post your error trace? – PMende Aug 16 '18 at 04:34
  • Would you also mind limiting your code to a minimal example? `all_files.append((i.name, i.parent,time.ctime(i.stat().st_ctime), i.stat().st_size))` is a bit hard to go through and mostly irrelevant to what you are asking about – Evgeny Aug 16 '18 at 11:40
  • Added the error trace and removed the file "created date" from the dataframe to simplify. – Derek Plansky Aug 16 '18 at 15:19
