Locate only the variable numeric data in filename in pandas column and place numbers in new column

Question

I am using this code to pull in two CSVs of a similar naming convention, place their filenames in a "File" column and concatenate the dataframes into one dataframe called NatHrs.

import glob
from pathlib import Path

path = r'C:\Users\ThisUser\Desktop\AC Mbr Analysis'
all_files = glob.glob(path + '\\Natl_hours_YTD_OC_*.csv')

Nat_dfs = []
for file in all_files:
    df = pd.read_csv(file, index_col=None, encoding='windows-1252', header=1 )
    df['File'] = file
    Nat_dfs.append(df)

NatHrs = pd.concat(Nat_dfs)

Now, I want to take the "File" column, which returns a filename object with entries looking like "C:\Users\ThisUser\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2018-2019", extract ONLY the end of the filename--in this case "2018-2019"--and place those characters into a new colum "Program Year", reflecting the entry "2018-2019". I am not having success with manipulating strings or series--should I be using path.replace? I am lost. When I describe the column I'm looking to parse...

NatHrs['File'].describe

...I get this:

Name: File, dtype: object>

J.Doe · Answer 1 · 2020-02-04T15:35:44.530

You can use regular expressions for finding substrings in a string:

import re

string = r"C:\Users\ThisUser\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2018-2019"
short = string.split('\\')[-1]

substring = re.search('\d+[-]*\d+',short).group()
print(substring)

Might want to elaborate on how the pattern can vary. Will it always be "Year-Year"? Can it be just "Year"? Might have to change the regex then.

EDIT:

This was all rather unhandy so I made my own dummy files with which I can do what you want. Works fine for me, but look for yourself:

import glob
import pandas as pd
import re
import os

all_files = glob.glob('Natl_hours_YTD_OC_*.csv')
full_paths = [os.path.abspath(file) for file in all_files]

print(full_paths)

>>> Out:
['C:\\Users\\Chris\\Desktop\\Natl_hours_YTD_OC_2018-2019.csv',
'C:\\Users\\Chris\\Desktop\\Natl_hours_YTD_OC_2019-2020.csv',
'C:\\Users\\Chris\\Desktop\\Natl_hours_YTD_OC_2020.csv']

Nat_dfs = []

for file in all_files:
    df = pd.read_csv(file,delim_whitespace=True)
    print(df,'\n')
    df['File'] = file
    df['Year'] = re.search('\d+[-]*\d*',file).group()

    Nat_dfs.append(df)

NatHrs = pd.concat(Nat_dfs)
print(NatHrs)

>>> Out:
   A  B                             File       Year
0  7  7  Natl_hours_YTD_OC_2018-2019.csv  2018-2019
1  8  8  Natl_hours_YTD_OC_2018-2019.csv  2018-2019
2  9  9  Natl_hours_YTD_OC_2018-2019.csv  2018-2019
0  4  4  Natl_hours_YTD_OC_2019-2020.csv  2019-2020
1  5  5  Natl_hours_YTD_OC_2019-2020.csv  2019-2020
2  6  6  Natl_hours_YTD_OC_2019-2020.csv  2019-2020
0  1  1       Natl_hours_YTD_OC_2020.csv       2020
1  2  2       Natl_hours_YTD_OC_2020.csv       2020
2  3  3       Natl_hours_YTD_OC_2020.csv       2020

I don't know what you're doing wrong, but this definetly works. Hope that's what you need.

Thanks, your code works, but I need the substring to return the year associated with the varying filenames. This code only returns 2018-2019 for all rows. I tried this (see my edits above). — Hudson H., Feb 03 '20 at 16:43
hmmm... can you give me like a 10 rows excerpt from your dataframe so I can fiddle with it? — J.Doe, Feb 03 '20 at 19:20
Absolutely, see my edits above about the discrepancy (had to do a csv output due to length of File column). Please and thank you so much! — Hudson H., Feb 04 '20 at 01:41

Hudson H. · Answer 2 · 2020-02-04T01:40:36.750

I tried this:

import re

string = NatHrs['File']
short = string.split('\\')[-1]

substring = re.search('\d+[-]*\d+',short).group()
print(substring)

NatHrs['Program Year'] = substring
NatHrs

And I got this:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-257-f8c37bb604e2> in <module>
      3 
      4 string = NatHrs['File']
----> 5 short = string.split('\\')[-1]
      6 
      7 substring = re.search('\d+[-]*\d+',short).group()

~\anaconda3\envs\PythonData\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5177             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5178                 return self[name]
-> 5179             return object.__getattribute__(self, name)
   5180 
   5181     def __setattr__(self, name, value):

AttributeError: 'Series' object has no attribute 'split'

This returned a file that read this type of inconsistency between File and Program Year years:

File    Program Year
C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2018-2019.csv   2018-2019
C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2018-2019.csv   2018-2019
C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2018-2019.csv   2018-2019
C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2019-2020.csv   2018-2019
C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2019-2020.csv   2018-2019
C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2019-2020.csv   2018-2019
C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2019-2020.csv   2018-2019
C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2019-2020.csv   2018-2019
C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2019-2020.csv   2018-2019

I also tried this:

import re

string = file
short = string.split('\\')[-1]

substring = re.search('\d+[-]*\d+',short).group()
print(substring)

NatHrs['Program Year'] = substring
NatHrs

And got a column "Program Year" only reflecting 2019-2020, although I want both 2018-2019 and 2019-2020 to show up.

score 0 · Accepted Answer · answered Feb 14 '20 at 13:49

This ultimately ended up working. Thank you much for helping me along!

globbed_files = glob.glob("Natl_hours_YTD_OC_*.csv")
globbed_files

data = []
for csv in globbed_files:
    frame = pd.read_csv(csv, encoding='windows-1252', header=1)
    frame['filename'] = os.path.basename(csv)
    file = os.path.basename(csv)
#create a new column to store the portion of the file name that denotes the Program Year to which the data belongs
    frame['Program Year'] = re.search('\d+[-]*\d*',file).group()
    data.append(frame)

NatHrs = pd.concat(data, ignore_index=True) #dont want pandas to try an align row indexes
NatHrs.copy().head()

Locate only the variable numeric data in filename in pandas column and place numbers in new column

3 Answers3