I am trying to parse this site https://www.dibbs.bsm.dla.mil/RFQ/RfqRecs.aspx?category=issue&TypeSrch=dt&Value=09-07-2017

using the following code:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import ssl
context = ssl._create_unverified_context()
dibbsurl = 'https://www.dibbs.bsm.dla.mil/RFQ/RfqRecs.aspx?category=issue&TypeSrch=dt&Value=09-07-2017'
uClient = uReq(dibbsurl, context=context)
dibbshtml = uClient.read()
uClient.close()

#html parser
dibbssoup = soup(dibbshtml, "html.parser")

#grabs each rfq
containers = dibbssoup.findAll("tr",{"Class":"Bgwhite"})

I want to grab the National Stock Numbers, the Nomenclature and QTY from the table for research purposes.

containers = dibbssoup.findAll("tr",{"Class":"Bgwhite"})

I was trying to grab each row of the table, but containers does not seem to be grabbing any of them. When I type len(containers) it shows 0. Why is the table not being grabbed, and how can I fix it?

Update: this is sample HTML from the site:

<tr class="BgWhite">
    <td headers="th0" valign="top">
        1
    </td>
    <td headers="th1" style="width: 125px;" valign="top">
        <a href="https://www.dibbs.bsm.dla.mil/RFQ/RFQNsn.aspx?value=8465015550093&amp;category=issue&amp;Scope=" title="go to NSN view">8465-01-555-0093</a>
    </td>
    <td headers="th2" valign="top">
        SNAP LINK, RAPPELLER
    </td>
    <td headers="th3" valign="top">
        None
    </td>
    <td headers="th4" style="width: 150px;" valign="top">
        <a href="https://dibbs2.bsm.dla.mil/Downloads/RFQ/8/SPE1C117T2608.PDF" title="RFQ document" target="DIBBSDocuments">SPE1C1-17-T-2608</a><br>&nbsp;&nbsp;<span style="font-size: 9px; color: #505050;">» <a href="https://www.dibbs.bsm.dla.mil/rfq/rfqrec.aspx?sn=SPE1C117T2608" title="Package View" class="SubMenuLink">Package View</a></span><a href="https://www.dibbs.bsm.dla.mil/RFQ/RFQQHlp.aspx?ht=fi"><img src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/iconFastPace.gif" alt="Fast Award Candidate.  Micro-purchase quotes may be awarded prior to the solicitation return date.  See Master Solicitation for Additional Info" width="14" height="11" hspace="0" border="0" align="middle"></a><br><img src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/iconEproc.gif" width="36" height="16" hspace="1" border="0" alt="DLA E-Procurement" style="border-width:0px;  vertical-align: bottom;">
    </td>
    <td headers="th5" valign="top">
        <span style="color:#000099">Open</span><br><a href="https://www.dibbs.bsm.dla.mil/RA/Quote/QuoteFrm.aspx?sn=SPE1C117T2608"><img src="https://www.dibbs.bsm.dla.mil/app_themes/images/buttons/btnQ.gif" width="18" height="18" border="0" alt="Click to submit Quote" hspace="1" align="bottom"></a><a href="https://www.dibbs.bsm.dla.mil/RA/Quote/QuoteFrm.aspx?sn=SPE1C117T2608"><span style="font-size: 9px;">uote</span></a>&nbsp;&nbsp;<img src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/iconSpace1010.gif" alt=" " width="18" height="16" hspace="0" border="0">
    </td>
    <td headers="th6" valign="top">
        0070631319<br>QTY: 400
    </td>
    <td headers="th7" valign="top">
        09-07-2017
    </td>
    <td headers="th8" valign="top">
        09-18-2017
    </td>
</tr>
e.iluf
  • I would love to help you, but since my browser is giving me a warning about an untrusted connection, I am not going to open the page you mentioned. You should not expect anybody here to do so. Just post an example of the HTML content here on SO and don't expect us to visit external web pages. – dtell Sep 07 '17 at 14:47
  • @datell thanks. I have uploaded sample HTML from the site. – e.iluf Sep 07 '17 at 14:58

1 Answer

I looked at the site you want to scrape and found that it shows a warning page, similar to a Terms and Conditions notice, that you must agree to before you can view the content. Agreeing means submitting a form, so the solution needs three requests made within one session: fetch the warning page, submit its form, then fetch the page that contains the table.

I used requests and html5lib in this example because they are easy to use. You can install both with pip (along with beautifulsoup4, if you do not already have it).

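The last part, parsing the table, is similar to what you did. Note that the attribute name must be the lowercase class, and the value must match the case used in the HTML, BgWhite.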

import requests
from bs4 import BeautifulSoup
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

request_headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
}

req = requests.Session()
warning_url = 'https://www.dibbs.bsm.dla.mil/dodwarning.aspx'

# get initial warning page
get_warning_page = req.get(warning_url, headers=request_headers, verify=False)
warning_soup = BeautifulSoup(get_warning_page.content, 'html5lib')

# collect the form fields to submit later (the site's T&C form that you must agree to before proceeding)
payload = {}
for inp in warning_soup.find('form').find_all('input'):
    # skip inputs without a name attribute; browsers do not submit them
    if inp.get('name'):
        payload[inp.get('name')] = inp.get('value')

# submit the warning form (this is the step that agrees to the T&C)
submit_warning_form = req.post(warning_url, headers=request_headers, data=payload, verify=False)

# lastly, navigate to the main page that contains the table
main_page = req.post('https://www.dibbs.bsm.dla.mil/RFQ/RfqRecs.aspx?category=issue&TypeSrch=dt&Value=09-07-2017', headers=request_headers, verify=False)

# parse the table
dibbssoup = BeautifulSoup(main_page.content, 'html5lib')
# grab each RFQ row
containers = dibbssoup.find_all("tr", {"class": "BgWhite"})

print(containers)
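
Since the question asks for the National Stock Number, the nomenclature and the quantity, here is a minimal sketch of how those fields might be pulled out of each row. It assumes every data row looks like the sample row posted in the question (NSN in the th1 cell, nomenclature in th2, and a "QTY: n" line in th6). SAMPLE_HTML is a trimmed copy of that sample, and parse_rfq_rows is a hypothetical helper name, not something from the site or from BeautifulSoup. It also shows that the original selector {"Class": "Bgwhite"} matches nothing, because the parser lowercases attribute names and compares class values case-sensitively.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the sample row from the question (links and images removed).
SAMPLE_HTML = """
<table>
<tr class="BgWhite">
    <td headers="th0" valign="top">1</td>
    <td headers="th1" valign="top">
        <a href="#" title="go to NSN view">8465-01-555-0093</a>
    </td>
    <td headers="th2" valign="top">
        SNAP LINK, RAPPELLER
    </td>
    <td headers="th6" valign="top">
        0070631319<br>QTY: 400
    </td>
</tr>
</table>
"""


def parse_rfq_rows(html):
    """Return one dict (nsn, nomenclature, qty) per row with class BgWhite."""
    page = BeautifulSoup(html, "html.parser")
    records = []
    for row in page.find_all("tr", {"class": "BgWhite"}):
        # Assumption: the cells are identified by their headers attribute.
        cells = [row.find("td", {"headers": h}) for h in ("th1", "th2", "th6")]
        if any(cell is None for cell in cells):
            continue  # skip rows that do not have the expected layout
        nsn_cell, name_cell, qty_cell = cells
        # The quantity cell holds a purchase request number, a <br>, then "QTY: n".
        qty_match = re.search(r"QTY:\s*(\d+)", qty_cell.get_text(" ", strip=True))
        records.append({
            "nsn": nsn_cell.get_text(strip=True),
            "nomenclature": name_cell.get_text(strip=True),
            "qty": int(qty_match.group(1)) if qty_match else None,
        })
    return records


if __name__ == "__main__":
    page = BeautifulSoup(SAMPLE_HTML, "html.parser")
    # The selector from the question finds no rows; the corrected one finds one.
    print(len(page.find_all("tr", {"Class": "Bgwhite"})))
    print(len(page.find_all("tr", {"class": "BgWhite"})))
    print(parse_rfq_rows(SAMPLE_HTML))
```

With the script above, the same helper could be called as parse_rfq_rows(main_page.content) once the warning form has been submitted.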

If you have any questions or run into errors, just let me know. If this solved your issue, please mark it as the answer. Thanks!

chad
  • Wow, thank you. In the end, though, I found it is still not capturing the data. When I try to get the length of containers, or even print it, nothing shows up: >>> len(containers) 0 >>> print(containers) [] – e.iluf Sep 07 '17 at 17:07
  • Are there any errors? Did you modify the code I sent? How do you run the code? – chad Sep 07 '17 at 17:10
  • That is so weird, because it works fine for me. Perhaps you can print main_page.text and check whether `class="BgWhite"` is there – chad Sep 07 '17 at 17:29
  • I am surprised too, there was nothing in the container. Please look at the screen shot of the terminal that I uploaded – e.iluf Sep 07 '17 at 17:33
  • I see you typed it in one line at a time. You can save it in a .py file and run it all at once. – chad Sep 07 '17 at 17:36
  • It works when I run the file. Thank you. Just curious, why do you think it didn't work line by line? – e.iluf Sep 07 '17 at 17:51
  • The code uses a requests session, which needs to be preserved until the end of the code. When you type it line by line, there is a good chance that the session will break somewhere in the middle. – chad Sep 07 '17 at 17:54
  • Oh, OK. I wish I could thank you more than by accepting the answer, but thank you, my friend. – e.iluf Sep 07 '17 at 17:55
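
Following the suggestion in the comments to check whether `class="BgWhite"` appears in the response, a small helper can make that check explicit before any parsing. This is only a sketch: looks_like_results_page is a hypothetical name, and the marker string is taken from the sample HTML in the question, so it assumes the site keeps using that class for data rows.

```python
def looks_like_results_page(html):
    """Return True if the markup contains at least one RFQ data row marker."""
    if isinstance(html, bytes):
        # Response bodies may arrive as bytes; decode leniently for the check.
        html = html.decode("utf-8", errors="replace")
    # Assumption: data rows carry class="BgWhite", as in the sample HTML.
    return 'class="BgWhite"' in html


if __name__ == "__main__":
    print(looks_like_results_page('<tr class="BgWhite"><td>1</td></tr>'))
    print(looks_like_results_page('<form action="dodwarning.aspx"></form>'))
```

If looks_like_results_page(main_page.text) is False, the response is probably not the table page (for example, the warning page was returned again), and find_all will return an empty list no matter which selector is used.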