<DIV align="center" style="margin-left: 0%; margin-right: 0%; font-size: 10pt; font-family: Arial, Helvetica; color: #000000; background: #FFFFFF">

<A name='123'></A><B><FONT style="font-family: 'Times New Roman', Times">DIRECTOR
COMPENSATION</FONT></B>  </DIV>

Hi, I am scraping information from proxy statements downloaded from SEC EDGAR. How can I locate the tag above by the string it contains, "DIRECTOR COMPENSATION", with Beautiful Soup? I am trying to write generalized code that also works on other pages like this one, so I have to rely on the keyword.

Many thanks!

1 Answer

This should get all tags that contain 'DIRECTOR COMPENSATION':

tags = [ tag for tag in soup.find_all() if 'DIRECTOR COMPENSATION' in tag.text ]
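One caveat worth illustrating: in the snippet from the question, the heading text is split across a line break, so a plain substring test can miss it, and when a match does succeed every ancestor of the matching tag matches too. The sketch below is one possible way to handle both, assuming the sample HTML from the question; `normalized_text` is a hypothetical helper, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the snippet in the question.
html = """
<div align="center">
<a name='123'></a><b><font style="font-family: 'Times New Roman', Times">DIRECTOR
COMPENSATION</font></b>  </div>
"""

soup = BeautifulSoup(html, "html.parser")


def normalized_text(tag):
    # Hypothetical helper: collapse whitespace runs (including newlines)
    # into single spaces before comparing.
    return " ".join(tag.get_text().split())


# Plain substring test: the text is 'DIRECTOR\nCOMPENSATION', so nothing matches.
plain = [tag for tag in soup.find_all() if "DIRECTOR COMPENSATION" in tag.text]

# Normalized test: matches the innermost tag and all of its ancestors.
matches = [
    tag for tag in soup.find_all()
    if "DIRECTOR COMPENSATION" in normalized_text(tag)
]

# Keep only the innermost matches: tags with no matching descendant.
matched_ids = {id(tag) for tag in matches}
innermost = [
    tag for tag in matches
    if not any(id(child) in matched_ids for child in tag.find_all())
]
```

Filtering down to the innermost match is a design choice; if the surrounding `div` is what you need, you can walk up from the innermost tag with `.parent` instead.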

If you want to use regular expressions for the tag.text :

import re

# Raw string so that \s reaches the regex engine unchanged.
tags = [ 
    tag for tag in soup.find_all('div') 
    if re.search(r'DIRECTOR\s+COMPENSATION', tag.text, flags=re.IGNORECASE) 
]

If you want to use a list of keywords :

tags = [ 
    tag for tag in soup.find_all('table') 
    if any( re.search(k, tag.text, flags=re.IGNORECASE) for k in ('regex 1', 'regex 2' ) ) 
]
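To make the difference between matching at least one keyword and matching every keyword concrete, here is a small sketch. The HTML, the table ids, and the keyword patterns are all made up for illustration; only the `any` / `all` switch is the point.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical tables: only the first contains every keyword.
html = """
<table id="t1"><tr><td>Name</td><td>Fees Earned</td></tr>
<tr><td>Total</td><td>100</td></tr></table>
<table id="t2"><tr><td>Name</td><td>Address</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Hypothetical keyword patterns; replace with your own.
keywords = (r"name", r"fees\s+earned", r"total")


def has_keyword(table, pattern):
    # Join cell texts with spaces so words from adjacent cells don't run together.
    return re.search(pattern, table.get_text(" "), flags=re.IGNORECASE) is not None


# Tables containing at least one keyword.
any_match = [
    t["id"] for t in soup.find_all("table")
    if any(has_keyword(t, k) for k in keywords)
]

# Tables containing every keyword.
all_match = [
    t["id"] for t in soup.find_all("table")
    if all(has_keyword(t, k) for k in keywords)
]
```

Because each keyword is searched in the whole table's text, the keywords may sit in different cells.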
t.m.adam
  • Great, thank you, this is very helpful! Two further questions: (1) I realize tag.text only works for direct children; is there any way to find all tags whose descendants contain these keywords? (2) What should I do if I want to allow variations in the keywords? For example, the following doesn't work: tags = [ tag for tag in soup.find_all('b') if re.compile("director[\n\s]* compensation table", re.IGNORECASE)) in tag.text ]. I really appreciate your help!
    – GreenOnion May 01 '17 at 04:26
  • It should work for tags in 'div', e.g. 'DIRECTOR COMPENSATION'
    – t.m.adam May 01 '17 at 04:34
  • Thanks for the quick reply! My problem is that the words are in the , that doesn't work well... – GreenOnion May 01 '17 at 04:38
  • Why not? Can you give me an example HTML? – t.m.adam May 01 '17 at 04:44
  • Sure, I really appreciate your time in helping! Here is the link. https://www.sec.gov/Archives/edgar/data/881890/000095013407020029/f33356dedef14a.htm#123 My ultimate goal is to use "DIRECTOR COMPENSATION" to locate this table. – GreenOnion May 01 '17 at 04:47
  • The table is not in the same div as 'DIRECTOR COMPENSATION'; it's 2 divs below. Can't you use any other keywords ('style', 'align', etc.)? – t.m.adam May 01 '17 at 05:02
  • Sorry, I may have too many questions. Is there any way to locate a table by simply searching for keywords? Say I want to find a table that has "a", "b", "c", "d", where these words may be in different cells of the table. Thanks a billion again for your help! – GreenOnion May 01 '17 at 05:03
  • Yes, you can: use the same formula, but change 'div' to 'table'. I've added a simple regex example to my post that you could build on. I'll update the code to add an example with multiple keywords. – t.m.adam May 01 '17 at 05:09
  • Actually, I have many pages like this one, and all of them have tables similar to the one I showed you. My final goal is generalized code that extracts information from these tables. Every firm reports in its own way, so the table format won't be reliable. My basic strategy is to use a keyword ('director compensation' here; the phrase almost always appears right before this kind of table) to locate the general area, and then use findNext('table') to locate the table. – GreenOnion May 01 '17 at 05:11
  • Thank you so much for your help, you are indeed soooooooo helpful!!! Other similar pages may look like https://www.sec.gov/Archives/edgar/data/1002910/000119312510052422/ddef14a.htm – GreenOnion May 01 '17 at 05:12
  • Updated the code to match multiple keywords. I hope this gives you some ideas. – t.m.adam May 01 '17 at 05:26
  • It looks great! My understanding is that the output will be the tables that contain at least one of these keywords, right? What if I want the tables that contain all of them? Thanks again! – GreenOnion May 01 '17 at 05:39
  • Yes , glad i could help – t.m.adam May 01 '17 at 06:06
  • Sorry, one last question. Is there any way to identify the table that includes all of the information I listed (all of "a", "b", "c", "d"), rather than at least one piece of it? Thanks again! – GreenOnion May 01 '17 at 13:47
  • I figured the last question out, thank you! Basically, replace 'any' with 'all'. Thanks a billion for your help! You saved me so much time! – GreenOnion May 01 '17 at 14:31
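
The strategy described in the comments (find the heading by keyword, then take the next table) could be sketched roughly as follows. This is only an illustration under assumptions: the HTML and table ids are invented, the heading is assumed to sit in a single text node, and `find_next` is the current spelling of the older `findNext`.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page: an unrelated table, the heading, then the target table.
html = """
<div><table id="summary"><tr><td>Summary</td></tr></table></div>
<div align="center"><b><font>DIRECTOR
COMPENSATION</font></b></div>
<div>Some introductory paragraph.</div>
<div><table id="directors"><tr><td>Name</td><td>Total</td></tr></table></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Anchored so that only a text node consisting of the heading alone matches;
# \s+ tolerates the line break inside the heading.
heading_re = re.compile(r"^\s*DIRECTOR\s+COMPENSATION\s*$", re.IGNORECASE)

# Locate the text node that holds the heading.
heading = soup.find(string=heading_re)

# Take the first table that appears after the heading in document order.
table = heading.find_next("table") if heading is not None else None
```

If the heading is split across several tags, or the phrase also appears alone in a table of contents, this simple version will need extra checks.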