
I am using ColdFusion (Open BlueDragon) to insert the data from a large (200 MB) XML file into a database without having to load the entire file into memory, which is how I would traditionally do it. I did find a very similar question here: Looping over a large XML file, which seems to be the answer I am looking for.

However, I'm not skilled enough in Java to understand and adapt the code to my needs. I found no way to reply to the expert (@orangepips) who posted the code, or else I would not have posted such a similar question.

My xml file looks like this:

<allItems>
    <item>
        <subject>The subject text</subject>
        <date>2007-05-21 04:03:00</date>
        <content>text content often contains many paragraphs of text</content>
        <author>JPass78</author>
    </item>
</allItems>

This is the code, courtesy orangepips, that I'm trying to adapt for my purpose. I've modified it a bit to include my own field names:

<cfset fis = createObject("java", "java.io.FileInputStream").init(
"#getDirectoryFromPath(getCurrentTemplatePath())#/file.xml")>
<cfset bis = createObject("java", "java.io.BufferedInputStream").init(fis)>
<cfset XMLInputFactory = createObject("java", "javax.xml.stream.XMLInputFactory").newInstance()>
<cfset reader = XMLInputFactory.createXMLStreamReader(bis)>

<cfloop condition="#reader.hasNext()#">
    <cfset event = reader.next()>
    <cfif event EQ reader.START_ELEMENT>
        <cfswitch expression="#reader.getLocalName()#">
            <cfcase value="allItems">
                <!--- root node, do nothing --->
            </cfcase>
            <cfcase value="item">
                <!--- set values used later on for inserts, selects, updates --->
            </cfcase>
            <cfcase value="subject">
                <!--- some selects and insert --->
            </cfcase>
            <cfcase value="contentdate">
                <!--- insert or update --->
            </cfcase>
            <cfcase value="content">
            </cfcase>
            <cfcase value="author">
            </cfcase>
        </cfswitch>
    </cfif>
</cfloop>
<cfset reader.close()>

I have a single table, and I am trying to figure out how to access the values from each XML element so I can insert them one row at a time, like this:

INSERT INTO content (subject, contentdate, content, author) VALUES ('The subject text', '2007-05-21 04:03:00', 'text content here', 'JPass78');

JPass

2 Answers

One possibility is to initialize a data structure each time you encounter the <item> element. As the child elements go by (<subject>, <date>, ...), extract their text and add them to your structure. Then, when you reach the closing </item> element, do your validation/insert. There may be better approaches, but that should give you something to work with.

Update: I had a hunch a database bulk loading tool would be a better option. Turns out it was ;) See JPass' answer for details.

<cfset fis = createObject("java", "java.io.FileInputStream").init(pathToYourFile)>
<cfset bis = createObject("java", "java.io.BufferedInputStream").init(fis)>
<cfset XMLInputFactory = createObject("java", "javax.xml.stream.XMLInputFactory").newInstance()>
<cfset reader = XMLInputFactory.createXMLStreamReader(bis)>

<cfloop condition="#reader.hasNext()#">
    <cfset event = reader.next()>
    <cfif event EQ reader.START_ELEMENT>
        <cfswitch expression="#reader.getLocalName()#">
            <cfcase value="item">
                <!--- start a new data row --->
                <cfset row = {}>
            </cfcase>
            <cfcase value="subject">
                <!--- extract the subject text --->
                <cfset row.subject = reader.getElementText()>
            </cfcase>
            <cfcase value="date">
                <!--- extract the date text --->
                <cfset row.date = reader.getElementText()>
            </cfcase>
            <cfcase value="content">
                <!--- extract the content text --->
                <cfset row.content = reader.getElementText()>
            </cfcase>
            <cfcase value="author">
                <!--- extract the author text --->
                <cfset row.author = reader.getElementText()>
            </cfcase>  
        </cfswitch>
    <cfelseif event EQ reader.END_ELEMENT>
        <!--- we have reached the end of the row. time to insert the data --->
        <cfif reader.getLocalName() eq "item">
            <cfdump var="#row#" label="Debug Row Data">    
            <!--- ... validate / insert "row" data into database --->
        </cfif>
    </cfif>
</cfloop>

<cfset fis.close()>
<cfset reader.close()>
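
To fill in the insert step, the `END_ELEMENT` branch might look something like the sketch below. It assumes a `content` table with the four columns from the question and a datasource named `myDSN`; both names are placeholders, so adjust them to your setup.

```cfm
<cfif reader.getLocalName() eq "item">
    <!--- insert the accumulated "row" struct; cfqueryparam parameterizes the values --->
    <cfquery datasource="myDSN">
        INSERT INTO content (subject, contentdate, content, author)
        VALUES (
            <cfqueryparam value="#row.subject#" cfsqltype="cf_sql_varchar">,
            <cfqueryparam value="#row.date#" cfsqltype="cf_sql_timestamp">,
            <cfqueryparam value="#row.content#" cfsqltype="cf_sql_longvarchar">,
            <cfqueryparam value="#row.author#" cfsqltype="cf_sql_varchar">
        )
    </cfquery>
</cfif>
```

If some elements may be missing from a given <item>, guard each value with structKeyExists() (or default the keys when the row struct is created) before referencing it.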
Leigh
  • Thanks Leigh, this worked well. It's not as fast as I'd hoped, but I am testing on my local computer, which I bet slows it down a bit. It's taking about 5 minutes for an XML file with 24 fields per item and 1,000 items. I think I will attempt to create a single query and see if that speeds things up. My XML file will probably have tens of thousands of records. Thanks! – JPass Oct 17 '11 at 15:17
  • Well, looping is time consuming any way you slice it. Also, databases have limits on how much sql they can process at one time. So you may not be able to do everything in one (1) query. But breaking it into chunks, to reduce the number of database hits, should help. Also does your database support bulk loading of `xml` instead? MS SQL does. Its bulk loading tool is a `COM` object. But it is pretty fast. I am also wondering if it would be better to transform the data into smaller flat files. As most databases have tools for bulk importing text files. That may be a crazy though.. – Leigh Oct 17 '11 at 16:35
  • Yes! My rationale for using CFML was that I could use logic to determine which fields were present in my export file before inserting. Although the field names are always the same, I can't know which fields will be in the XML file. The MySQL "LOAD XML INFILE" statement does what I need in 0.62 seconds for 1,000 rows. In case anyone else needs a similar solution, via the MySQL command line: LOAD XML INFILE 'pathtofile/file.xml' INTO TABLE table_name ROWS IDENTIFIED BY '<item>'; My field names between the tags match my database field names, and it won't matter which fields are in the export file. – JPass Oct 17 '11 at 17:53
  • Great! I figured database import tools would be a lot faster. But that is even better than I expected ;) You should write that up as an answer and accept it. – Leigh Oct 17 '11 at 18:48
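
Following up on the chunking idea from the comments above: instead of one INSERT per <item>, you could accumulate the parsed rows in an array and flush them as a single multi-row INSERT every N items. A rough sketch, again assuming a `content` table and a placeholder datasource `myDSN`:

```cfm
<!--- before the loop: collect rows and flush in batches to cut database round trips --->
<cfset batch = []>
<cfset batchSize = 100>

<!--- inside the loop, when the closing </item> is reached: --->
<cfset arrayAppend(batch, row)>
<cfif arrayLen(batch) gte batchSize>
    <cfquery datasource="myDSN">
        INSERT INTO content (subject, contentdate, content, author)
        VALUES
        <cfloop from="1" to="#arrayLen(batch)#" index="i">
            <cfif i gt 1>,</cfif>
            (<cfqueryparam value="#batch[i].subject#" cfsqltype="cf_sql_varchar">,
             <cfqueryparam value="#batch[i].date#" cfsqltype="cf_sql_timestamp">,
             <cfqueryparam value="#batch[i].content#" cfsqltype="cf_sql_longvarchar">,
             <cfqueryparam value="#batch[i].author#" cfsqltype="cf_sql_varchar">)
        </cfloop>
    </cfquery>
    <cfset batch = []>
</cfif>

<!--- after the loop: flush any remaining rows the same way --->
```

As the comments note, databases cap how much SQL they accept per statement, so keep the batch size modest rather than building one giant query.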

Instead of using ColdFusion to import large XML files into a MySQL database, use the MySQL command "LOAD XML INFILE".

Here's the simple, light and fast code that worked for me:

LOAD XML INFILE 'pathtofile/file.xml' INTO TABLE table_name ROWS IDENTIFIED BY '<item>';

My XML file uses the exact same field names as my database table. ROWS IDENTIFIED BY tells the command that the field names in my XML file correspond to the database fields in my table and will be found between the <item></item> tags.

FYI, <item> is my own naming format. Your file will likely have another tag name that relates to the data you're working with. For example, if your XML file holds employee data, you might use <employee> instead.
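
To illustrate the matching, a table for the sample XML in the question might be defined like this (a sketch; the column types are assumptions, only the column names need to mirror the child tags):

```sql
-- column names match the child tag names inside <item>
CREATE TABLE content (
    subject VARCHAR(255),
    date    DATETIME,
    content TEXT,
    author  VARCHAR(100)
);

LOAD XML INFILE 'pathtofile/file.xml'
INTO TABLE content
ROWS IDENTIFIED BY '<item>';
```

Any <item> child tag without a matching column is ignored, which is why it doesn't matter which fields appear in a given export file.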

Available as of MySQL 5.5. Reference for LOAD XML INFILE can be found at: http://dev.mysql.com/doc/refman/5.5/en/load-xml.html

JPass