1

Need some help with this. We have a large XML files from an API that I need to convert into a SQL Server table. Using pyodbc to load the table into SQL is no problem but I'm struggling to parse the 16mb XML file. The file is well formed - it is wrapped inside a single root. I've been looking for a solution for weeks and haven't been able to trial-and-error my way through it. Any assistance is appreciated.

In short, the goal is to output all values contained in the XML file in tabular format where each record is listed in its' own row. The Address and Space sections of the XML require proper handling for the sub elements. I've got code that works but I don't think this is the "proper" way to parse the values:

import xml.etree.ElementTree as ET
tree = ET.parse('Z:\\FileLoc\\DealData.xml')
root = tree.getroot()
    for prop in deals[0]:
    for addr1 in prop[4].iter(tag="Street"):
        for addr2 in prop[4].iter(tag="City"):
            for addr3 in prop[4].iter(tag="State"):
                for addr4 in prop[4].iter(tag="PostalCode"):
                    for addr5 in prop[4].iter(tag="Country"):
                        for spaces in prop[5]:
                                print(deals.attrib.get("DealID"), prop.attrib.get("ID"),
                                   prop[0].text[0:5], prop[0].text[6:],
                                   addr1.text, addr2.text, addr3.text, addr4.text, addr5.text,
                                   spaces.attrib.get("ID"))

Here's one example of the XML I'm working with:

    <?xml version="1.0" encoding="UTF-8"?>
<DealExport>
    <jobid>8558e207709c0730</jobid>
    <Deals>
        <Deal DealID="706880">
            <Properties>
                <Property ID="69320">
                    <Name>41381_BuildingName - Bldg 1</Name>
                    <CreatedAt>2015-10-28T04:09:07Z</CreatedAt>
                    <UpdatedAt>2017-12-20T01:09:51Z</UpdatedAt>
                    <Symbols>
                        <Type>SourceID</Type>
                        <Value>41381~41381</Value>
                    </Symbols>
                    <Address>
                        <City>Calgary</City>
                        <State>AB</State>
                        <Street>1234 Five Street</Street>
                        <PostalCode>H0H 0H0</PostalCode>
                        <Country>Canada</Country>
                    </Address>
                    <Spaces>
                        <Space ID="848294">
                            <Suite>800</Suite>
                            <SpaceAvailable>11225</SpaceAvailable>
                            <FloorName>8</FloorName>
                            <FloorPosition>7</FloorPosition>
                            <FloorRentableArea/>
                            <CreatedAt>2016-04-20T13:47:13Z</CreatedAt>
                            <UpdatedAt>2017-10-02T09:17:20Z</UpdatedAt>
                            <Symbols>
                                <Type>SourceID</Type>
                                <Value>800</Value>
                            </Symbols>
                        </Space>
                    </Spaces>
                </Property>
            </Properties>
            <DealType>new</DealType>
            <LastModified>2017-12-20T22:36:21Z</LastModified>
            <CreatedDate>2017-06-06T15:09:35Z</CreatedDate>
            <Stage>dead_deal</Stage>
            <Probability>0</Probability>
            <CompetitiveSet/>
            <MoveInDate>2017-12-01</MoveInDate>
            <Tenant>
                <Name>Current Occupant Ltd.</Name>
                <Industry>financial services</Industry>
                <Affiliation/>
                <TenantID>1013325</TenantID>
                <TenantWebsite/>
                <TenantCurrentLocation/>
                <TenantRSFFrom>6000</TenantRSFFrom>
                <TenantRSFTo>7000</TenantRSFTo>
            </Tenant>
            <Brokers></Brokers>
            <TenantContacts></TenantContacts>
            <DealStages>
                <DealStage ID="1533265">
                    <Stage>proposal</Stage>
                    <CreatedDate>2017-06-06T15:09:35Z</CreatedDate>
                    <LastModified>2017-06-06T15:09:22Z</LastModified>
                    <StartDate>2017-06-06T15:09:22Z</StartDate>
                    <DurationInDays>197</DurationInDays>
                    <DeadDealReasons></DeadDealReasons>
                    <User>
                        <UserId>17699</UserId>
                        <FirstName>Person</FirstName>
                        <LastName>Last-Name</LastName>
                        <Email>employee@email.com</Email>
                        <Phone></Phone>
                        <PhoneExtension></PhoneExtension>
                        <Title></Title>
                        <BrokerLicenseNumber></BrokerLicenseNumber>
                    </User>
                </DealStage>
                <DealStage ID="2174388">
                    <Stage>dead_deal</Stage>
                    <CreatedDate>2017-12-20T22:36:01Z</CreatedDate>
                    <LastModified>2017-12-20T22:36:01Z</LastModified>
                    <StartDate>2017-12-20T22:36:01Z</StartDate>
                    <DurationInDays>0</DurationInDays>
                    <DeadDealReasons></DeadDealReasons>
                    <User>
                        <UserId>24934</UserId>
                        <FirstName>Agent</FirstName>
                        <LastName>Lastname</LastName>
                        <Email>agent@broker.com</Email>
                        <Phone></Phone>
                        <PhoneExtension></PhoneExtension>
                        <Title></Title>
                        <BrokerLicenseNumber></BrokerLicenseNumber>
                    </User>
                </DealStage>
                <DealStage ID="2174390">
                    <Stage>dead_deal</Stage>
                    <CreatedDate>2017-12-20T22:36:21Z</CreatedDate>
                    <LastModified>2017-12-20T22:36:21Z</LastModified>
                    <StartDate>2017-12-20T22:36:21Z</StartDate>
                    <DurationInDays>99</DurationInDays>
                    <DeadDealReasons>
                        <DeadDealReason>price</DeadDealReason>
                    </DeadDealReasons>
                    <User>
                        <UserId>24934</UserId>
                        <FirstName>Agent</FirstName>
                        <LastName>Lastname</LastName>
                        <Email>agent@broker.com</Email>
                        <Phone></Phone>
                        <PhoneExtension></PhoneExtension>
                        <Title></Title>
                        <BrokerLicenseNumber></BrokerLicenseNumber>
                    </User>
                </DealStage>
            </DealStages>
            <DealComments></DealComments>
            <DealTerms>
                <DealTerm ID="4278580">
                    <ProposalType>landlord</ProposalType>
                    <DiscountRate>0.08</DiscountRate>
                    <RentableArea>6200</RentableArea>
                    <LeaseType>triple net</LeaseType>
                    <LeaseTerm>60</LeaseTerm>
                    <CommencementDate>2018-01-01</CommencementDate>
                    <TenantImprovements></TenantImprovements>
                    <BuildingImprovements></BuildingImprovements>
                    <FreeRents></FreeRents>
                    <CreatedDate>2017-06-06T15:16:23Z</CreatedDate>
                    <SecurityDeposit/>
                    <NER>20714.1144670763</NER>
                    <NEROverride/>
                    <DateEntered>2017-06-06</DateEntered>
                    <OpEx>
                        <BaseOpEx/>
                        <BaseYear/>
                        <YearType/>
                        <LeaseType/>
                        <Description/>
                    </OpEx>
                    <RealEstateTaxes>
                        <BaseRealEstateTaxes>7.45</BaseRealEstateTaxes>
                        <BaseYear/>
                        <YearType/>
                        <LeaseType/>
                        <Description/>
                    </RealEstateTaxes>
                    <NPVperSqFt>103367.15936951</NPVperSqFt>
                    <TotalNPV>640876388.09096</TotalNPV>
                    <MiscDescription/>
                    <NetCashFlow>642137720.0</NetCashFlow>
                    <TotalIncome>502200.0</TotalIncome>
                    <Concessions>64480.0</Concessions>
                    <Payback>1</Payback>
                    <IRR/>
                    <Latest>true</Latest>
                    <BaseRents>
                        <BaseRent ID="45937180">
                            <BeginIn>1</BeginIn>
                            <Rent>15.0</Rent>
                            <Period>rsf/year</Period>
                            <Duration>36</Duration>
                        </BaseRent>
                        <BaseRent ID="45937181">
                            <BeginIn>37</BeginIn>
                            <Rent>18.0</Rent>
                            <Period>rsf/year</Period>
                            <Duration>24</Duration>
                        </BaseRent>
                    </BaseRents>
                    <RentEscalations></RentEscalations>
                    <OtherCredits></OtherCredits>
                    <Commissions>
                        <Commission ID="260593">
                            <RepType>landlord</RepType>
                            <Firm>LandLord</Firm>
                            <Amount>2.9</Amount>
                            <Unit>rsf</Unit>
                        </Commission>
                        <Commission ID="260594">
                            <RepType>landlord</RepType>
                            <Firm>Payee</Firm>
                            <Amount>7.5</Amount>
                            <Unit>rsf</Unit>
                        </Commission>
                    </Commissions>
                    <Rights></Rights>
                    <ProposalID>186607</ProposalID>
                    <ProposalEnteredDate>2017-06-06</ProposalEnteredDate>
                    <RecoveryIncomes>
                        <RecoveryIncome>
                            <RecoveryType>ReimbursableExpenseCam</RecoveryType>
                            <ExpenseAmount>12.87</ExpenseAmount>
                            <RecoveryMethod>net</RecoveryMethod>
                            <RecoveryAmount>12.87</RecoveryAmount>
                        </RecoveryIncome>
                        <RecoveryIncome>
                            <RecoveryType>RealEstateTax</RecoveryType>
                            <ExpenseAmount>7.45</ExpenseAmount>
                            <RecoveryMethod>net</RecoveryMethod>
                            <RecoveryAmount>7.45</RecoveryAmount>
                        </RecoveryIncome>
                    </RecoveryIncomes>
                </DealTerm>
            </DealTerms>
            <Budgets></Budgets>
            <Appraisals></Appraisals>
            <Tours></Tours>
        </Deal>
Daryl Heppner
  • 43
  • 1
  • 8
  • 1
    What do you need parsed? What is the structure of table? What did you try? – Parfait Apr 01 '18 at 12:55
  • 1
    16Mb is not really that large. there's no point telling us you are struggling: you need to tell us what you are doing and what the failure symptoms are. – Michael Kay Apr 01 '18 at 16:34
  • _parsing_ a 16MB XML file shouldn't take that long. _printing_ a 16MB XML file may take a while. change that `print` to `pass` and you will get a better idea of the time for the parsing alone. using integer offsets (`prop[0]`) is less accurate than looking by tag (`prop.findall('Address')`), but otherwise it seems fine. – Corley Brigman Apr 02 '18 at 15:45

1 Answers1

4

For the best of my knowledge, there is no problem working with large files in XML. You can use xml.etree.ElementTree for your data. I tested the following code and it works (parsing and printing):

import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()

def deepParsing(node, pre=None):
    if pre is None:
        pre = '-'
    print(' ')
    print(pre + '> Tag:',node.tag)
    print(pre + '> Attributes:', node.attrib)
    print(pre + '> Text:', node.text)

    pre += '-'
    for key in node:
        deepParsing(key, pre)
deepParsing(root)

Your data was printed on console according to the given tree.

Your example is missing closing tags:

    </Deals>
</DealExport>

at the end of the xml.

Dharman
  • 30,962
  • 25
  • 85
  • 135
mr_mo
  • 1,401
  • 6
  • 10