I'm trying to extract the CUSIP NO. and STATUS fields from this PDF. I only want the rows where the STATUS field is present ("ADDED" or "DELETED").
The problem is that I don't know how to pair the two fields: STATUS is missing most of the time, and I can't find a way to tell where one row ends and the next begins.
This is my code:
import scraperwiki
import urllib2
import lxml.etree

TEST_URL = 'http://www.sec.gov/divisions/investment/13f/13flist2013q4.pdf'

def parse(pdf_url):
    # open the URL for the PDF and convert its contents to XML
    u = urllib2.urlopen(pdf_url)
    x = scraperwiki.pdftoxml(u.read())
    r = lxml.etree.fromstring(x)
    cusips = r.xpath('//text[@left="110"]/text()')
    issuer_desc = r.xpath('//text[@left="526"]/text()')
    statuses = r.xpath('//text[@left="703"]')
This is a slice of the xml:
x[10000:12000]
'" font="6">D18190 95 8</text>\n<text top="258" left="226" width="9" height="12"
font="6"> </text>\n<text top="258" left="251" width="144" height="12" font="6">DEUTSCHE BANK
AG</text>\n<text top="258" left="526" width="27" height="12" font="6">PUT</text>\n<text
top="283" left="110" width="102" height="12" font="6">G0083B 10 8</text>\n<text top="283"
left="226" width="9" height="12" font="6">*</text>\n<text top="283" left="251" width="99"
height="12" font="6">ACTAVIS PLC</text>\n<text top="283" left="526" width="27" height="12"
font="6">SHS</text>\n<text top="283" left="703" width="45" height="12"
font="6">ADDED</text>\n<text top="309" left="110" width="102" height="12" font="6">G0083B
90 8</text>\n<text top="309" left="226" width="115" height="12" font="6"> ACTAVIS
</text>\n<text top="309" left="323" width="27" height="12" font="6">PLC</text>\n<text
top="309" left="526" width="36" height="12" font="6">CALL</text>\n<text top="309"
left="703" width="45" height="12" font="6">ADDED</text>\n<text top="335" left="110"
width="102" height="12" font="6">G0083B 95 8</text>\n<text top="335" left="226" width="115"
height="12" font="6"> ACTAVIS </text>\n<text top="335" left="323" width="27" height="12"
font="6">PLC</text>\n<text top="335" left="526" width="27" height="12"
font="6">PUT</text>\n<text top="335" left="703" width="45" height="12"
font="6">ADDED</text>\n<text top="361" left="110" width="102" height="12"
font="6">G0129K 10 4</text>\n<text top="361" left="226" width="9" height="12" font="6">*
</text>\n<text top="361" left="251" width="117" height="12" font="6">AIRCASTLE
LTD</text>\n<text top="361" left="526" width="27" height="12" font="6">COM</text>\n<text
top="387" left="110" width="102" height="12" font="6">G0129K 90 4</text>\n<text top="387"
left="226" width="133" height="12" font="6"> AIRCASTLE </text>\n<text top="387" left="341"
width="27" height="12" font="6">LTD</text>\n<text top="387" left="526" width="36"
height="12" font="6">CALL</text>\n<text top="413" left="110" width="102" height="12"'
Edit: A hackish way could be to use XPath to get the CUSIP element and then call getnext() several times until reaching the position where the STATUS field should be; if the text there is not 'ADDED' or 'DELETED', that CUSIP has no STATUS present.
Hackish solution function:
def parse(pdf_url):
    # open the URL for the PDF and convert its contents to XML
    u = urllib2.urlopen(pdf_url)
    x = scraperwiki.pdftoxml(u.read())
    xl = lxml.etree.fromstring(x)
    cusips = xl.xpath('//text[@left="110"]/text()')
    issuer_desc = xl.xpath('//text[@left="526"]/text()')
    statuses = xl.xpath('//text[@left="703"]')
    cusips_elements = xl.xpath('//text[@left="110"]')
    complete_cusips = []
    for ce in cusips_elements:
        try:
            # four siblings ahead is where a STATUS element would sit
            possible_status = ce.getnext().getnext().getnext().getnext().text
        except AttributeError:
            # getnext() returned None: header or non-CUSIP related line
            continue
        if possible_status in ['ADDED', 'DELETED']:
            complete_cusips.append([ce.text, possible_status])
    return complete_cusips
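A less fragile variant of the same hack, sketched here with the stdlib parser on an illustrative `SAMPLE` (lxml's `.xpath()` accepts the same predicates): instead of chaining getnext() calls from the CUSIP, start from each STATUS element (`left="703"`) and look up the CUSIP element that shares its `top` value, so the varying number of in-between fields never matters.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for the pdftoxml output: one row with a STATUS
# (top=283) and one row without (top=361).
SAMPLE = """<pdf2xml>
<text top="283" left="110">G0083B 10 8</text>
<text top="283" left="703">ADDED</text>
<text top="361" left="110">G0129K 10 4</text>
</pdf2xml>"""

root = ET.fromstring(SAMPLE)
complete_cusips = []
for status in root.findall(".//text[@left='703']"):
    # find the CUSIP element on the same visual row (same "top" value)
    match = root.findall(".//text[@left='110'][@top='%s']" % status.get('top'))
    if match:
        complete_cusips.append([match[0].text, status.text])
# complete_cusips == [['G0083B 10 8', 'ADDED']]
```

The same caveat as before applies: on a real multi-page PDF, `top` repeats per page, so the lookup should be scoped to one page element at a time.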