I have a list of XML strings and a for loop that flattens each XML document into a pandas dataframe.
The for loop works perfectly fine, but it is taking a very long time to flatten the XML, which is getting larger as time goes on.
How do I wrap the for loop below in executor.map so that the work is spread across multiple cores? I am following this article: https://medium.com/@ageitgey/quick-tip-speed-up-your-python-data-processing-scripts-with-process-pools-cf275350163a
The for loop that flattens the XML:
import pandas as pd
from bs4 import BeautifulSoup

df1 = pd.DataFrame()
for i in lst:
    print('i am working')
    soup = BeautifulSoup(i, "xml")
    # Get attributes from all nodes
    attrs = []
    for elm in soup():  # soup() is equivalent to soup.find_all()
        attrs.append(elm.attrs)
    # Each Field node (the ones with an 'Id' attribute) becomes a new row that also carries the attributes of every other node
    fields_attribute_list = [x for x in attrs if 'Id' in x.keys()]
    other_attribute_list = [x for x in attrs if 'Id' not in x.keys() and x != {}]
    # Make a single dictionary with the attributes of all nodes except the Field nodes
    attribute_dict = {}
    for d in other_attribute_list:
        for k, v in d.items():
            attribute_dict.setdefault(k, v)
    # Update each field row with the attributes from all the other nodes
    full_list = []
    for field in fields_attribute_list:
        field.update(attribute_dict)
        full_list.append(field)
    # Make a dataframe for this document and append it to the running result
    df = pd.DataFrame(full_list)
    df1 = df1.append(df)
Does the for loop need to be transformed into a function?
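Based on the article, is something like the sketch below what is needed? This is just my rough, untested guess: flatten_xml is a placeholder name I made up for the loop body, and I am assuming the per-document dataframes can simply be concatenated at the end.

import concurrent.futures

import pandas as pd
from bs4 import BeautifulSoup


def flatten_xml(xml_string):
    # Same logic as the body of the for loop above, applied to a single xml string
    soup = BeautifulSoup(xml_string, "xml")
    attrs = [elm.attrs for elm in soup()]
    fields_attribute_list = [x for x in attrs if 'Id' in x.keys()]
    other_attribute_list = [x for x in attrs if 'Id' not in x.keys() and x != {}]
    attribute_dict = {}
    for d in other_attribute_list:
        for k, v in d.items():
            attribute_dict.setdefault(k, v)
    full_list = []
    for field in fields_attribute_list:
        field.update(attribute_dict)
        full_list.append(field)
    return pd.DataFrame(full_list)


with concurrent.futures.ProcessPoolExecutor() as executor:
    # one flatten_xml call per xml string, spread across worker processes
    frames = list(executor.map(flatten_xml, lst))

df1 = pd.concat(frames, ignore_index=True)

If that is roughly the right shape, do I also need to put the executor part under an if __name__ == '__main__': guard?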