I have a local directory on my machine with multiple html files, all named in the following format:
> XXXXXXXX_XXXX-XX-XX.html
where each X is a numeric character (the number of numeric characters before the _ varies). I walk through all the files in the folder and extract the text from two kinds of elements ('font' and 'p style') based on a regex match (looking for the sub-string 'segment').
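For illustration, the naming format above could be matched with a regex along these lines. This is only a sketch: the helper name `numeric_prefix`, the exact pattern, and the sample filenames are made up for the example and assume the date part is always `XXXX-XX-XX`.

```python
import re

# Assumed pattern: one or more digits, an underscore, a date-like part, then ".html".
FILENAME_RE = re.compile(r"^(\d+)_(\d{4}-\d{2}-\d{2})\.html$")


def numeric_prefix(filename):
    """Return the digits before the underscore, or None if the name does not match."""
    match = FILENAME_RE.match(filename)
    return match.group(1) if match else None


# Made-up filenames, only to show the behaviour.
print(numeric_prefix("12345678_2015-03-31.html"))
print(numeric_prefix("notes.txt"))
```

The prefix is kept as a string rather than converted to an integer, so any leading zeros in the filename are preserved.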
The output is a dataframe of all the extracted string content, e.g.:
 Like our Prescription Pharmaceuticals segment, the manufacturing of our Consumer Health products is competitive, with many established manufacturers engaged in all phases of the business. With the Company’s relatively small OTC [...]
I need help changing the output as follows:
- I would like to add another column to the dataframe that holds the numeric characters before the "_" in the filename. This way, I can match each extracted string back to its source html file.
- As shown in the output snippet above, I am getting garbled Unicode characters which I'd like to remove. In earlier versions of the code, I tried to encode with utf-8 (in line 14, soup=...) but had no luck.
Code below - any help would be appreciated, thanks.
import os
from bs4 import BeautifulSoup
from tqdm import tqdm
import re
import pandas as pd
import csv
rootdir = "C://directory//subdirectory"
segments_font=[]
segments_p_style=[]
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        filepath = subdir + os.sep + file
        soup = BeautifulSoup(open(filepath))
        for elem in tqdm(soup.find_all('font',text=re.compile(r'segment'))):
            segments_font.append(elem)
        for elem in tqdm(soup.find_all('p style',text=re.compile(r'segment'))):
            segments_p_style.append(elem)
combined_list=list(set().union(segments_font,segments_p_style))
df=pd.DataFrame(data=combined_list,columns=['segments'])
df.to_csv('output.csv')
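For context on the garbled characters: a sequence like `’` is what a UTF-8 encoded right single quote looks like when its bytes are decoded as cp1252. The sketch below reproduces that with a temporary file, using only the standard library. It assumes the html files are actually UTF-8, which I have not confirmed; `read_text` and the sample filename are hypothetical names for the example.

```python
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a file as text with an explicit encoding instead of the platform default."""
    with open(path, encoding=encoding) as handle:
        return handle.read()


# Sample string containing a right single quote (U+2019), as in the snippet above.
sample = "the Company\u2019s relatively small OTC"

with tempfile.TemporaryDirectory() as tmpdir:
    # Made-up filename following the naming format of the question.
    path = os.path.join(tmpdir, "1234_2015-03-31.html")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(sample)

    garbled = read_text(path, encoding="cp1252")  # wrong codec: produces mojibake
    clean = read_text(path, encoding="utf-8")     # matching codec: text is intact

print(garbled)
print(clean)
```

If that assumption holds, opening each file with an explicit encoding, e.g. `open(filepath, encoding="utf-8")`, before handing it to `BeautifulSoup` may be enough, since on Windows `open` otherwise falls back to the locale's code page.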