I'm trying to use groupby on a pandas DataFrame that holds pathlib paths and file sizes for a particular drive. I want to total the storage used at a given depth of the directory tree to see which directories are the fullest.
I tried a groupby sum on each file's pathlib parent, but that only gives totals per immediate directory, not the total storage at a given depth. Pathlib's parents looked promising, but it starts at the immediate parent and works back toward the root, so I tried indexing it from the end with negative indexes, which doesn't seem to work.
From what I read in the documentation, pathlib's parents is supposed to be a sequence, and sequences are supposed to support negative indexes, but the error message seems to imply that negative indexes are not accepted.
Here is the code I've been using (with help from http://pbpython.com/pathlib-intro.html):
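To make the indexing problem concrete: on Python versions before 3.10, parents rejects negative indexes, while 3.10 and later accept them. A minimal sketch of one possible workaround is to convert "k-th from the root" into a non-negative index by hand. The helper name ancestor_from_root and the sample path are made up for illustration, and PureWindowsPath is used only so the example behaves the same on any operating system.

```python
from pathlib import PureWindowsPath


def ancestor_from_root(path, k):
    """Return the k-th ancestor counted from the root (k=1 is the drive root).

    Mirrors what path.parents[-k] would return where negative indexes are
    supported. Hypothetical helper: falls back to `path` itself when it has
    fewer than k ancestors.
    """
    parents = path.parents
    if len(parents) < k:
        return path
    # parents[0] is the immediate parent and parents[len - 1] is the root,
    # so counting k steps from the root end gives index len - k.
    return parents[len(parents) - k]


# Made-up sample path; its parents are Git, Program Files, and the drive root.
sample = PureWindowsPath("c:/Program Files/Git/bin")
top_level = ancestor_from_root(sample, 2)
print(top_level)  # c:\Program Files
```

With a helper like this, the lambda passed to apply would call ancestor_from_root(x, 2) instead of indexing parents with a negative number.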
import pandas as pd
from pathlib import Path
import time
dir_to_scan = "c:/Program Files"
p = Path(dir_to_scan)
all_files = []
for i in p.rglob('*.*'):
    all_files.append((i.name, i.parent, i.stat().st_size))
columns = ["File_Name", "Parent", "Size"]
df = pd.DataFrame.from_records(all_files, columns=columns)
df["path_stem"]=df['Parent'].apply(lambda x: x.parent if len(x.parents)<3 else x.parents[-2] )
The error trace is as follows:
IndexError Traceback (most recent call last)
<ipython-input-3-5748b1f0a9ee> in <module>()
1 #df.groupby('Parent')['Size'].sum()
2
----> 3 df["path_stem"]=df['Parent'].apply(lambda x: x.parent if len(x.parents)<3 else x.parents[-1] )
4
5 #df([apps])=df([Parent]).parents
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
2549 else:
2550 values = self.asobject
-> 2551 mapped = lib.map_infer(values, f, convert=convert_dtype)
2552
2553 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-3-5748b1f0a9ee> in <lambda>(x)
1 #df.groupby('Parent')['Size'].sum()
2
----> 3 df["path_stem"]=df['Parent'].apply(lambda x: x.parent if len(x.parents)<3 else x.parents[-1] )
4
5 #df([apps])=df([Parent]).parents
C:\ProgramData\Anaconda3\lib\pathlib.py in __getitem__(self, idx)
592 def __getitem__(self, idx):
593 if idx < 0 or idx >= len(self):
--> 594 raise IndexError(idx)
595 return self._pathcls._from_parsed_parts(self._drv, self._root,
596 self._parts[:-idx - 1])
IndexError: -1
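For the per-depth totals themselves, one possible sketch sidesteps parents altogether and truncates each path to its first few components using parts, which is indexed from the root. Everything here is an assumption for illustration: the helper name truncate_to_depth, the depth of 3, and the made-up records standing in for the real scan. A plain Counter is used for the totals so the example needs only the standard library.

```python
from collections import Counter
from pathlib import PureWindowsPath


def truncate_to_depth(path, depth):
    """Keep only the first `depth` components of `path`.

    Hypothetical helper: the drive/root anchor counts as one component,
    and paths shallower than `depth` are returned unchanged.
    """
    return type(path)(*path.parts[:depth])


# Made-up (parent directory, size in bytes) records, not real scan results.
records = [
    (PureWindowsPath("c:/Program Files/Git/bin"), 100),
    (PureWindowsPath("c:/Program Files/Git/cmd"), 50),
    (PureWindowsPath("c:/Program Files/Java/jre/lib"), 70),
    (PureWindowsPath("c:/Program Files"), 5),
]

# Sum sizes per directory at depth 3 (drive root, Program Files, one level below).
totals = Counter()
for parent, size in records:
    totals[truncate_to_depth(parent, 3)] += size

# Largest directories first.
for directory, size in totals.most_common():
    print(directory, size)
```

With the DataFrame from the question, the same helper could presumably fill the path_stem column via df["Parent"].apply(lambda x: truncate_to_depth(x, 3)), after which df.groupby("path_stem")["Size"].sum() would give the totals per directory at that depth.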