1

I'm having problems parsing the SEC EDGAR files.

Here is an example of this file.

In the end, I want the content between <XML> and </XML> in a format I can access.

Here is my code so far that doesn't work:

scud = open("http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt")
full = scud.read
full.match(/<XML>(.*)<\/XML>/)
the Tin Man
hadees
  • "Doesn't work" is not very helpful. What doesn't work? What did you want to happen, and what happens instead? – Phrogz May 01 '11 at 03:19

3 Answers

3

Ok, there are a couple of things wrong:

  1. The file at sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt is NOT XML, so Nokogiri will be of no use to you until you strip everything from the top of the file down to where the true XML starts, and then trim off the trailing tags so that what remains is well-formed XML. You need to attack that problem first.
  2. You don't say what you want from the file. Without that information we can't recommend a real solution. You need to take more time to define the question better.

Here's a quick piece of code to retrieve the page, strip the garbage, and parse the resulting content as XML:

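require 'nokogiri'
require 'open-uri'

# Kernel#open no longer accepts URLs as of Ruby 3.0; use URI.open from open-uri.
raw = URI.open('http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt').read

# Drop everything up to and including the opening <XML> line, then everything from </XML> on.
xml = raw.gsub(/\A.+<xml>\n/im, '').gsub(/<\/xml>.+/mi, '')

doc = Nokogiri::XML(xml)
puts doc.at('//schemaVersion').text
# >> X0603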
the Tin Man
  • I feel like I shouldn't have to do gsub and instead match but this does work. Thanks. – hadees May 16 '11 at 06:29
  • You shouldn't have to, but they created a file type that isn't XML. Your choice is to try to parse correctly without cleaning it up, or to clean it and have more predictable results. And, what is `match` supposed to accomplish for you? It only does what the `gsub` does. You'll be left with something needing to be parsed. Or, perhaps you don't understand what `match` does? – the Tin Man May 16 '11 at 07:18
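For what it's worth, `match` can pull out the same block, provided the pattern carries the `/m` flag so that `.` also matches newlines. Without it, the one-line pattern from the question returns nil on multi-line content. A minimal sketch, using a made-up, shortened stand-in for the filing rather than the real file (`SAMPLE_FILING` and `extract_xml` are hypothetical names, and the stand-in's layout is an assumption):

```ruby
# Hypothetical, shortened stand-in for an EDGAR .txt filing (not the real file).
SAMPLE_FILING = <<~TEXT
  <SEC-DOCUMENT>0001475481-09-000001.txt
  <TEXT>
  <XML>
  <?xml version="1.0"?>
  <edgarSubmission>
    <schemaVersion>X0603</schemaVersion>
  </edgarSubmission>
  </XML>
  </TEXT>
  </SEC-DOCUMENT>
TEXT

# Returns the text between <XML> and </XML>, or nil when no such block exists.
# The /m flag lets "." match newlines; the lazy ".*?" stops at the first </XML>.
def extract_xml(filing_text)
  match = filing_text.match(%r{<XML>\s*(.*?)\s*</XML>}m)
  match && match[1]
end

xml = extract_xml(SAMPLE_FILING)
puts xml
```

The returned string is still plain text; it has to be handed to an XML parser such as Nokogiri::XML before you can query it.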
1

Given this was asked a year back, the answer is probably OBE (overtaken by events), but the asker should examine all of the documents on the site and notice that the actual filing details can be found at:

http://sec.gov/Archives/edgar/data/1475481/000147548109000001/0001475481-09-000001-index.htm

Within this, you will see that the XML document he is after is already parsed out, ready for further manipulation, at:

http://sec.gov/Archives/edgar/data/1475481/000147548109000001/primary_doc.xml

Be warned, however, that the actual file name at the end is determined by the submitter of the document, not by the SEC. Therefore, you cannot depend on the document always being named 'primary_doc.xml'.
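Since the file name can vary, one option is to build the index URL and read the .xml link from that page instead of hard-coding it. A rough sketch, assuming the URL pattern seen in the two links above holds for other filings; `filing_index_url`, `xml_links`, and the sample HTML fragment are hypothetical, and the real index page markup may differ:

```ruby
# Builds the filing index URL from a CIK and an accession number, following
# the pattern visible in the URLs above (dashes dropped for the folder name).
def filing_index_url(cik, accession_number)
  folder = accession_number.delete('-')
  "http://sec.gov/Archives/edgar/data/#{cik}/#{folder}/#{accession_number}-index.htm"
end

# Collects href values ending in .xml from index-page HTML.
# A regex scan keeps this sketch dependency-free; a real HTML parser is sturdier.
def xml_links(index_html)
  index_html.scan(/href="([^"]+\.xml)"/i).flatten
end

# Hypothetical fragment standing in for the real index page markup.
SAMPLE_INDEX_HTML = <<~HTML
  <a href="/Archives/edgar/data/1475481/000147548109000001/primary_doc.xml">primary_doc.xml</a>
  <a href="/Archives/edgar/data/1475481/0001475481-09-000001.txt">0001475481-09-000001.txt</a>
HTML

puts filing_index_url('1475481', '0001475481-09-000001')
puts xml_links(SAMPLE_INDEX_HTML)
```

In real use you would fetch the index page and pass its body to `xml_links`, then download whichever .xml file it lists.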

Mark
1

I recommend practicing in IRB and reading the docs for Nokogiri.

> require 'nokogiri'
=> true
> require 'open-uri'
=> true
> doc = Nokogiri::HTML(URI.open('http://sec.gov/Archives/edgar/data/1475481/0001475481-09-000001.txt'))
> doc.xpath('//firstname')
=> [#<Nokogiri::XML::Element:0x80c18290 name="firstname" children=[#<Nokogiri::XML::Text:0x80c18010 "Joshua">]>, #<Nokogiri::XML::Element:0x80c14d48 name="firstname" children=[#<Nokogiri::XML::Text:0x80c14ac8 "Patrick">]>, #<Nokogiri::XML::Element:0x80c11fd0 name="firstname" children=[#<Nokogiri::XML::Text:0x80c11d50 "Brian">]>] 

That should get you going.

radixhound