I have a list of XML documents and a for loop that flattens each one into a pandas DataFrame.

The for loop works fine, but it takes a long time to flatten the XML, and the list keeps growing over time.

How do I wrap the for loop below in executor.map to spread the workload across different cores? I am following this article: https://medium.com/@ageitgey/quick-tip-speed-up-your-python-data-processing-scripts-with-process-pools-cf275350163a

For loop to flatten the XML:

df1 = pd.DataFrame()
for i in lst:
    print('i am working')
    soup = BeautifulSoup(i, "xml")
    # Get Attributes from all nodes
    attrs = []
    for elm in soup():  # soup() is equivalent to soup.find_all()
        attrs.append(elm.attrs)

    # Since you want the data in a dataframe, it makes sense for each field to be a new row consisting of all the other node attributes
    fields_attribute_list= [x for x in attrs if 'Id' in x.keys()]
    other_attribute_list = [x for x in attrs if 'Id' not in x.keys() and x != {}]

    # Make a single dictionary with the attributes of all nodes except for the `Field` nodes.
    attribute_dict = {}
    for d in other_attribute_list:
        for k, v in d.items():  
            attribute_dict.setdefault(k, v)

    # Update each field row with attributes from all other nodes.
    full_list = []
    for field in fields_attribute_list:
        field.update(attribute_dict)
        full_list.append(field)

    # Make Dataframe
    df = pd.DataFrame(full_list)
    df1 = df1.append(df)

Does the for loop need to be transformed into a function?

RustyShackleford

1 Answer

Yes, you need to turn the loop body into a function. The simplest case is a function that takes a single argument, which can be anything: a list, a tuple, a dictionary and so on. Functions with multiple parameters are a little more awkward to use with the concurrent.futures executor methods.
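If a worker function does need extra parameters, one common workaround is to bind them up front with functools.partial so that executor.map still sees a one-argument callable. The following is only a sketch: tag_item, tag_all, prefix and suffix are made-up names for illustration and have nothing to do with the XML code.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def tag_item(item, prefix, suffix=""):
    # Hypothetical worker that needs more than one argument.
    return f"{prefix}{item}{suffix}"


def tag_all(items, prefix, suffix=""):
    # Bind the extra arguments so the executor only has to supply `item`.
    worker = partial(tag_item, prefix=prefix, suffix=suffix)
    with ThreadPoolExecutor() as executor:
        # map() yields results in the same order as `items`.
        return list(executor.map(worker, items))
```

executor.map also accepts several iterables, one per positional parameter, which can be enough when every argument varies per call.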

This example below should work for you.

from bs4 import BeautifulSoup
import pandas as pd
from concurrent import futures


def create_dataframe(xml):
    soup = BeautifulSoup(xml, "xml")
    # Get Attributes from all nodes
    attrs = []
    for elm in soup():  # soup() is equivalent to soup.find_all()
        attrs.append(elm.attrs)

    # Since you want the data in a dataframe, it makes sense for each field to be a new row consisting of all the other node attributes
    fields_attribute_list = [x for x in attrs if 'FieldId' in x.keys()]
    other_attribute_list = [x for x in attrs if 'FieldId' not in x.keys() and x != {}]

    # Make a single dictionary with the attributes of all nodes except for the `Field` nodes.
    attribute_dict = {}
    for d in other_attribute_list:
        for k, v in d.items():
            attribute_dict.setdefault(k, v)

    # Update each field row with attributes from all other nodes.
    full_list = []
    for field in fields_attribute_list:
        field.update(attribute_dict)
        full_list.append(field)
    print(len(full_list))
    # Make Dataframe
    df = pd.DataFrame(full_list)
    # print(df)
    return df


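# ThreadPoolExecutor keeps the example simple. Parsing is mostly CPU-bound,
# so swap in futures.ProcessPoolExecutor to spread the work across cores
# (with processes, run this block under `if __name__ == "__main__":`).
with futures.ThreadPoolExecutor() as executor:
    # map() yields results lazily, in the same order as lst
    df_list = list(executor.map(create_dataframe, lst))

full_df = pd.concat(df_list)
print(full_df)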
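Since the question is about using several cores, here is a rough sketch of the same pattern with a process pool as the default. To keep it self-contained it uses the standard library's xml.etree.ElementTree as a stand-in for BeautifulSoup and returns plain row dicts instead of a DataFrame. It also skips the step that merges in the attributes of the other nodes, so treat extract_rows and flatten_all as hypothetical helpers, not a drop-in replacement for create_dataframe.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import xml.etree.ElementTree as ET


def extract_rows(xml):
    """Return one dict per element that carries a 'FieldId' attribute."""
    root = ET.fromstring(xml)
    return [dict(elm.attrib) for elm in root.iter() if "FieldId" in elm.attrib]


def flatten_all(xml_list, executor_cls=ProcessPoolExecutor):
    # The worker must be a top-level function so a process pool can pickle it.
    with executor_cls() as executor:
        # map() yields results in the same order as xml_list.
        per_doc = list(executor.map(extract_rows, xml_list))
    # Flatten the per-document lists into a single list of rows.
    return [row for rows in per_doc for row in rows]
```

With the default ProcessPoolExecutor, call flatten_all from under an `if __name__ == "__main__":` guard, because worker processes may re-import the main module (this is the default start method on Windows and macOS). Passing executor_cls=ThreadPoolExecutor avoids that requirement but will not use extra cores for CPU-bound parsing.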
BoreBoar
  • Thank you for the answer. It works, but when I run the for loop by itself I get 5066 rows with 57 columns, whereas with your function I get 159 rows and 57 columns. I have 159 XML objects in the list to unpack. I can't figure out what is causing the difference. – RustyShackleford Aug 23 '18 at 12:47
  • Well, I went back and checked the old answer that I had given. It turns out that the question you posted has the line `fields_attribute_list= [x for x in attrs if 'Id' in x.keys()]`, with `Id` instead of `FieldId`. That makes each XML produce just 1 result. When you substitute `FieldId` for `Id` and run it on the XML you provided earlier, it seems to give the answer you need. I've updated the answer to reflect that. Can you check and tell me if it works? – BoreBoar Aug 23 '18 at 13:23
  • Beautiful, I forgot to change that as well. – RustyShackleford Aug 23 '18 at 13:24