Parsing SEC Edgar XML file using Ruby into Nokogiri

Question

I'm having problems parsing the SEC Edgar files.

In the end, I want the content between <XML> and </XML> in a format I can access.

Here is my code so far, which doesn't work:

scud = open("http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt")
full = scud.read
full.match(/<XML>(.*)<\/XML>/)

"Doesn't work" is not very helpful. What doesn't work? What did you want to happen, and what happens instead? — Phrogz, May 01 '11 at 03:19

score 3 · Accepted Answer · answered Apr 30 '11 at 04:32

OK, there are a couple of things wrong:

sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt is NOT XML, so Nokogiri's XML parser will be of no use to you unless you strip off all the garbage from the top of the file, down to where the true XML starts, and then trim off the trailing tags so that what remains is valid XML. So, you need to attack that problem first.
You don't say what you want from the file. Without that information we can't recommend a real solution. You need to take more time to define the question better.

Here's a quick piece of code to retrieve the page, strip the garbage, and parse the resulting content as XML:

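require 'nokogiri'
require 'open-uri'

url = 'http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt'

# open-uri no longer patches Kernel#open as of Ruby 3.0, so call URI.open.
raw = URI.open(url).read

# Drop everything up to and including the opening <XML> line,
# then everything from the closing </XML> tag to the end of the file.
xml = raw.gsub(/\A.+<xml>\n/im, '').gsub(/<\/xml>.+/mi, '')

doc = Nokogiri::XML(xml)
puts doc.at('//schemaVersion').text
# >> X0603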

I feel like I shouldn't have to do gsub and instead match but this does work. Thanks. — hadees, May 16 '11 at 06:29
You shouldn't have to, but they created a file type that isn't XML. Your choice is to try to parse correctly without cleaning it up, or to clean it and have more predictable results. And, what is `match` supposed to accomplish for you? It only does what the `gsub` does. You'll be left with something needing to be parsed. Or, perhaps you don't understand what `match` does? — the Tin Man, May 16 '11 at 07:18
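For anyone who prefers the match approach from the question, here is a minimal sketch of why the original pattern found nothing and how a capture group can pull out the same text that the gsub calls leave behind. SAMPLE_FILING is a made-up stand-in for the downloaded file, and extract_xml is a hypothetical helper name, not part of Nokogiri or any library. The sketch assumes the block starts with <XML> on its own line, as the regex in the accepted answer implies.

```ruby
# Hypothetical stand-in for the text of a downloaded filing.
SAMPLE_FILING = <<~FILING
  <SEC-DOCUMENT>0001475481-09-000001.txt
  <DOCUMENT>
  <TEXT>
  <XML>
  <?xml version="1.0"?>
  <edgarSubmission>
    <schemaVersion>X0603</schemaVersion>
  </edgarSubmission>
  </XML>
  </TEXT>
  </DOCUMENT>
FILING

# Returns the text between the first <XML> and </XML> pair, or nil if absent.
def extract_xml(text)
  # /m lets "." match newlines; ".*?" stops at the first closing tag;
  # /i accepts <xml> as well as <XML>.
  text[/<XML>\s*(.*?)\s*<\/XML>/mi, 1]
end

# The pattern from the question has no /m flag, so "." cannot cross
# a line break and nothing matches on a multi-line block.
p SAMPLE_FILING.match(/<XML>(.*)<\/XML>/)

# The captured string is what you would hand to Nokogiri::XML.
puts extract_xml(SAMPLE_FILING)
```

Either way you end up with a string that still has to be parsed. The result of extract_xml can be passed to Nokogiri::XML exactly like the gsub output in the answer.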

score 1 · Answer 2 · answered Apr 16 '12 at 20:34

Given this was asked a year back, the answer is probably OBE, but what the fellow should do is examine all of the documents that are on the site, and notice the actual filing details can be found at:

http://sec.gov/Archives/edgar/data/1475481/000147548109000001/0001475481-09-000001-index.htm

On that page, you will see that the XML document he is after has already been extracted, ready for further manipulation, at:

http://sec.gov/Archives/edgar/data/1475481/000147548109000001/primary_doc.xml

Be warned, however, the actual file name at the end is determined by the submitter of the document, not by the SEC. Therefore, you cannot depend on the document always being 'primary_doc.xml'.
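A rough sketch of that approach: build the index URL from the CIK and accession number, then look for links ending in .xml rather than hard-coding 'primary_doc.xml'. The URL layout here is inferred only from the two URLs quoted above, not from SEC documentation. filing_index_url and xml_links are hypothetical helper names, and SAMPLE_INDEX_HTML is invented markup; the real index page may be structured differently.

```ruby
# Base path taken from the URLs quoted in this thread.
EDGAR_BASE = 'http://sec.gov/Archives/edgar/data'

# Builds the filing index URL from a CIK and a dashed accession number.
# Assumption: the folder name is the accession number with dashes removed,
# as seen in the URLs in this answer.
def filing_index_url(cik, accession)
  folder = accession.delete('-')
  "#{EDGAR_BASE}/#{cik}/#{folder}/#{accession}-index.htm"
end

# Collects every href that ends in .xml, so the submitter's choice of
# file name does not matter.
def xml_links(html)
  html.scan(/href="([^"]+\.xml)"/i).flatten
end

# Invented markup for illustration only; not copied from the real page.
SAMPLE_INDEX_HTML = <<~HTML
  <a href="/Archives/edgar/data/1475481/000147548109000001/primary_doc.xml">primary_doc.xml</a>
  <a href="/Archives/edgar/data/1475481/0001475481-09-000001.txt">0001475481-09-000001.txt</a>
HTML

puts filing_index_url(1475481, '0001475481-09-000001')
puts xml_links(SAMPLE_INDEX_HTML)
```

In real use you would fetch the index page first and pass its body to xml_links. Parsing the page with Nokogiri::HTML and selecting the anchors would be sturdier than a regex.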

score 1 · Answer 3 · answered Apr 30 '11 at 02:56

I recommend practicing in IRB and reading the docs for Nokogiri:

> require 'nokogiri'
=> true
> require 'open-uri'
=> true
> doc = Nokogiri::HTML(URI.open('http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt'))
> doc.xpath('//firstname')
=> [#<Nokogiri::XML::Element:0x80c18290 name="firstname" children=[#<Nokogiri::XML::Text:0x80c18010 "Joshua">]>, #<Nokogiri::XML::Element:0x80c14d48 name="firstname" children=[#<Nokogiri::XML::Text:0x80c14ac8 "Patrick">]>, #<Nokogiri::XML::Element:0x80c11fd0 name="firstname" children=[#<Nokogiri::XML::Text:0x80c11d50 "Brian">]>]

That should get you going.
