Read local HTML files and convert to a DataFrame with Python

Question

I have a local directory on my machine with multiple HTML files, all with the following naming format:

> XXXXXXXX_XXXX-XX-XX.html

where each X is a digit (the number of digits before the _ varies). I walk through all the files in the folder and extract the text of two kinds of elements ('font' and 'p style') whose text matches a regex (looking for the substring 'segment').

The output is a dataframe of all the extracted string content, e.g.:

Â Like our Prescription Pharmaceuticals segment, the manufacturing of our Consumer Health products is competitive, with many established manufacturers engaged in all phases of the business. With the Companyâ€™s relatively small OTC [...]

I need help changing the output in two ways:

1. I would like to add another column to the dataframe which holds the digits before the "_" in the filename. This way, I can match each extracted string back to the HTML file it came from.
2. As the output snippet above shows, I am getting garbled Unicode characters (e.g. "Â" and "â€™") which I'd like to remove. In earlier versions of the code I tried to encode with utf-8 (in line 14, soup=...), but no luck.

Code below - any help would be appreciated, thanks.

import os
from bs4 import BeautifulSoup
from tqdm import tqdm
import re
import pandas as pd
import csv

rootdir = "C://directory//subdirectory"

segments_font=[]
segments_p_style=[]

for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        filepath = subdir + os.sep + file
        soup = BeautifulSoup(open(filepath))
        for elem in tqdm(soup.find_all('font',text=re.compile(r'segment'))):
            segments_font.append(elem)
        for elem in tqdm(soup.find_all('p style',text=re.compile(r'segment'))):
            segments_p_style.append(elem)
    combined_list=list(set().union(segments_font,segments_p_style))

    df=pd.DataFrame(data=combined_list,columns=['segments'])
    df.to_csv('output.csv')
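
A possible way to get both changes, offered as a sketch under assumptions rather than a tested fix for these particular files. The garbled "â€™" is what the UTF-8 bytes of a right single quote (’) look like when decoded as Windows-1252, so the sketch assumes the files are UTF-8 and opens them with an explicit encoding instead of the platform default. The names file_id_from_name, read_html, collect_segments and the file_id column are made up for illustration. Since 'p style' is not a tag name, the sketch looks at <font> and <p> elements instead, and it filters on each element's full text with get_text() rather than the text= argument, which may match slightly different elements than the original code.

```python
import os
import re

# Assumed filename pattern from the question: digits, "_", then YYYY-MM-DD.html
FILENAME_RE = re.compile(r"^(\d+)_\d{4}-\d{2}-\d{2}\.html$")


def file_id_from_name(filename):
    """Return the digits before the underscore, or None if the name does not match."""
    match = FILENAME_RE.match(filename)
    return match.group(1) if match else None


def read_html(path, encoding="utf-8"):
    """Read a file with an explicit encoding (assumption: the files are UTF-8)."""
    with open(path, encoding=encoding) as handle:
        return handle.read()


def collect_segments(rootdir):
    """Build a DataFrame with one row per matching element plus its source file ID."""
    # Third-party imports are kept local so the two helpers above work without them.
    import pandas as pd
    from bs4 import BeautifulSoup

    pattern = re.compile(r"segment")
    rows = []
    for subdir, _dirs, files in os.walk(rootdir):
        for name in files:
            file_id = file_id_from_name(name)
            if file_id is None:
                continue  # skip files that do not follow the naming scheme
            html = read_html(os.path.join(subdir, name))
            soup = BeautifulSoup(html, "html.parser")
            for elem in soup.find_all(["font", "p"]):
                # Store plain text rather than the Tag object itself.
                text = elem.get_text(" ", strip=True)
                if pattern.search(text):
                    rows.append({"file_id": file_id, "segments": text})

    df = pd.DataFrame(rows, columns=["file_id", "segments"])
    # Drop exact duplicates, e.g. a <font> that is the only content of a matching <p>.
    return df.drop_duplicates().reset_index(drop=True)
```

Calling collect_segments(rootdir) once after defining it and then df.to_csv('output.csv', encoding='utf-8') would replace the loop above. If a file is not actually UTF-8, read_html raises UnicodeDecodeError, which at least shows which file needs a different encoding.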
