
Below is the HTML from which I am trying to extract the report number MIBLAKD02129 and write out the inspection row followed by each of its violation rows. This is the output I need:

12/18/2019 MIBLAKD02129 MI NONE IL TRUCK TRACTOR 3 HOS Compliance Violation: 395.8A-ELD ELD - No record of duty status (ELD Required) (OOS) 5 + 2 (OOS)
Vehicle Maint. Violation: 393.9TS Inoperative turn signal 6
Vehicle Maint. Violation: 393.75(c) Tire-other tread depth less than 2/32 of inch measured in 2 adjacent major tread grooves 8

MY CODE

for rows in insp_tbl.find('tr', {'class' : 'inspection'}):
    print(rows)

<td>12/18/2019</td>
<td>
<a class="modalLink"href="/SMS/Event/Inspection/68897149.aspx">MIBLAKD02129</a>
</td>
<td>MI</td>
<td>NONE</td>
<td>IL</td>
<td>TRUCK TRACTOR</td>
<td> </td>
<td>3</td>

for rows in insp_tbl.find('tr', {'class' : 'inspection'}):
    for cols in rows.find('td'):
        print(cols)

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
TypeError: 'int' object is not iterable


<tr class="inspection">
    <td>12/18/2019</td>
    <td>
       <a class="modalLink" href="/SMS/Event/Inspection/68897149.aspx">MIBLAKD02129</a>
    </td>
    <td>MI</td>
    <td>NONE</td>
    <td>IL</td>
    <td>TRUCK TRACTOR</td>
    <td>&nbsp;</td>
    <td>3</td>
</tr>

    <tr class="viol oos">
      <td colspan="6" class="viol">
        <label>HOS Compliance Violation:&nbsp;</label>
           <span class="violCodeDesc">395.8A-ELD ELD - No record of duty status (ELD Required)(OOS)</span>
     </td>
     <td class="weight">5 + 2 (OOS)</td>
     <td colspan="1">&nbsp;</td>
   </tr>

   <tr class="viol ">
     <td colspan="6" class="viol">
       <label>Vehicle Maint.Violation:&nbsp;</label>
         <span class="violCodeDesc">393.9TS Inoperative turn signal</span>
     </td>
     <td class="weight">6</td>
     <td colspan="1">&nbsp;</td>
  </tr>

  <tr class="viol ">
    <td colspan="6" class="viol">
       <label>Vehicle Maint. Violation:&nbsp;</label>
          <span class="violCodeDesc">393.75(c) Tire-other tread depth less than 2/32 of inch measured in 2 adjacent major tread grooves</span>
   </td>
   <td class="weight">8</td>
   <td colspan="1">&nbsp;</td>
 </tr>
Nguyễn Văn Phong
Justice
  • check the output of `insp_tbl.find('tr', {'class' : 'inspection'})` – sahasrara62 Jan 25 '20 at 04:19
  • `find()` returns a single element, not a list of elements. Iterating over that single element walks its children, not the matching rows. You need `find_all()` to get a list of all matching elements – furas Jan 25 '20 at 04:33
  • Always share the entire error message, and a [mcve]. This issue is trivial, have you not read the BeautifulSoup docs? – AMC Jan 25 '20 at 04:40
  • Does this answer your question? [Python - TypeError: 'int' object is not iterable](https://stackoverflow.com/questions/19523563/python-typeerror-int-object-is-not-iterable) – AMC Jan 25 '20 at 04:41
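The comments above point at `find()` versus `find_all()`. The following is a minimal sketch of why the second loop in the question raises `'int' object is not iterable`. It uses a shortened copy of the inspection row wrapped in a `<table>`, which is an assumption made here for illustration, and the variable names are hypothetical.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Shortened copy of the inspection row (illustrative, not the full page)
html = """<table><tr class="inspection">
    <td>12/18/2019</td>
    <td>MI</td>
</tr></table>"""

soup = BeautifulSoup(html, "html.parser")

# find() returns one Tag (or None), not a list of rows
row = soup.find("tr", {"class": "inspection"})

# Iterating a Tag walks its direct children, including whitespace text nodes
children = list(row)
first = children[0]
is_text = isinstance(first, NavigableString)

# A NavigableString is a str, so .find("td") is str.find and returns an int
str_find_result = first.find("td")

# find_all() returns a list of matching Tags, which is safe to loop over
cells = [td.get_text(strip=True) for td in row.find_all("td")]
print(is_text, str_find_result, cells)
```

In the question, `rows.find('td')` was called on such a whitespace child, so the inner `for` loop tried to iterate over an integer.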

2 Answers


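You have to use find_all() instead of find(). find() returns only the first matching element, while find_all() returns a list of all matching elements that you can loop over.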

from bs4 import BeautifulSoup as BS


html = '''<tr class="inspection">
    <td>12/18/2019</td>
    <td>
       <a class="modalLink" href="/SMS/Event/Inspection/68897149.aspx">MIBLAKD02129</a>
    </td>
    <td>MI</td>
    <td>NONE</td>
    <td>IL</td>
    <td>TRUCK TRACTOR</td>
    <td>&nbsp;</td>
    <td>3</td>
</tr>

    <tr class="viol oos">
      <td colspan="6" class="viol">
        <label>HOS Compliance Violation:&nbsp;</label>
           <span class="violCodeDesc">395.8A-ELD ELD - No record of duty status (ELD Required)(OOS)</span>
     </td>
     <td class="weight">5 + 2 (OOS)</td>
     <td colspan="1">&nbsp;</td>
   </tr>

   <tr class="viol ">
     <td colspan="6" class="viol">
       <label>Vehicle Maint.Violation:&nbsp;</label>
         <span class="violCodeDesc">393.9TS Inoperative turn signal</span>
     </td>
     <td class="weight">6</td>
     <td colspan="1">&nbsp;</td>
  </tr>

  <tr class="viol ">
    <td colspan="6" class="viol">
       <label>Vehicle Maint. Violation:&nbsp;</label>
          <span class="violCodeDesc">393.75(c) Tire-other tread depth less than 2/32 of inch measured in 2 adjacent major tread grooves</span>
   </td>
   <td class="weight">8</td>
   <td colspan="1">&nbsp;</td>
 </tr>'''

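# name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BS(html, 'html.parser')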

for row in soup.find_all('tr'):
    print('---')
    for col in row.find_all('td'):
        print('col:', col.text.strip())

Result:

---
col: 12/18/2019
col: MIBLAKD02129
col: MI
col: NONE
col: IL
col: TRUCK TRACTOR
col: 
col: 3
---
col: HOS Compliance Violation: 
395.8A-ELD ELD - No record of duty status (ELD Required)(OOS)
col: 5 + 2 (OOS)
col: 
---
col: Vehicle Maint.Violation: 
393.9TS Inoperative turn signal
col: 6
col: 
---
col: Vehicle Maint. Violation: 
393.75(c) Tire-other tread depth less than 2/32 of inch measured in 2 adjacent major tread grooves
col: 8
col: 

Or use get_text() to get all the text in a row as a single string:

for row in soup.find_all('tr'):
    print('---')
    print('row:', row.get_text(strip=True, separator=' '))

Result:

---
row: 12/18/2019 MIBLAKD02129 MI NONE IL TRUCK TRACTOR 3
---
row: HOS Compliance Violation: 395.8A-ELD ELD - No record of duty status (ELD Required)(OOS) 5 + 2 (OOS)
---
row: Vehicle Maint.Violation: 393.9TS Inoperative turn signal 6
---
row: Vehicle Maint. Violation: 393.75(c) Tire-other tread depth less than 2/32 of inch measured in 2 adjacent major tread grooves 8
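To get closer to the output requested in the question, the rows can also be grouped so that each inspection carries its own violations. This is only a sketch under assumptions: the rows are wrapped in a `<table>`, the HTML is shortened, and the dictionary keys (`report`, `summary`, `violations`) are hypothetical names chosen for this example.

```python
from bs4 import BeautifulSoup

# Shortened sample of the page (illustrative only)
html = """<table>
<tr class="inspection">
  <td>12/18/2019</td>
  <td><a class="modalLink" href="/SMS/Event/Inspection/68897149.aspx">MIBLAKD02129</a></td>
  <td>MI</td>
</tr>
<tr class="viol oos">
  <td colspan="6" class="viol"><label>HOS Compliance Violation:&nbsp;</label>
    <span class="violCodeDesc">395.8A-ELD ELD - No record of duty status (ELD Required)(OOS)</span></td>
  <td class="weight">5 + 2 (OOS)</td>
</tr>
<tr class="viol ">
  <td colspan="6" class="viol"><label>Vehicle Maint. Violation:&nbsp;</label>
    <span class="violCodeDesc">393.9TS Inoperative turn signal</span></td>
  <td class="weight">6</td>
</tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

inspections = []
for row in soup.find_all("tr"):
    classes = row.get("class", [])  # class is parsed into a list of names
    if "inspection" in classes:
        # The report number is the text of the modalLink anchor
        link = row.find("a", class_="modalLink")
        inspections.append({
            "report": link.get_text(strip=True) if link else None,
            "summary": row.get_text(" ", strip=True),
            "violations": [],
        })
    elif "viol" in classes and inspections:
        # Attach each violation row to the most recent inspection
        inspections[-1]["violations"].append(row.get_text(" ", strip=True))

for insp in inspections:
    print(insp["summary"])
    for violation in insp["violations"]:
        print(violation)
```

Keeping the violations in a list per inspection makes it straightforward to write them to a file later, one inspection at a time.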
furas
  • There is an enormous amount of space between HOS Compliance and Violation when I run the code. How do I remove it? It looks as if there is an extra column with no data. This is how the output looks: col: HOS Compliance Violation: 395.8A-ELD ELD - No record of duty status (ELD Required) (OOS) col: 5 + 2 (OOS) – Justice Jan 27 '20 at 16:43
  • I use `.get_text(strip=True, separator=' ')` to remove spaces, tabs, and newlines. If you use `.text`, you have to remove them on your own. You could use `.replace(" ", "").replace("\t", "").replace("\n", "")`, but that also removes the spaces between words and may not catch every whitespace character, so a regex may be the only solution: `re.sub('[ \t\n]+', " ", text)` – furas Jan 27 '20 at 17:02
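As a small illustration of the whitespace cleanup discussed in the comments, the sketch below compares the character-class regex from the comment with the broader `\s+` pattern. The sample string is an assumption about what `.text` returns for the first violation cell: BeautifulSoup decodes `&nbsp;` into the non-breaking space `\xa0`, which `[ \t\n]` does not match.

```python
import re

# Assumed shape of `.text` for the label + span in the first violation cell
raw = ("HOS Compliance Violation:\xa0\n           "
       "395.8A-ELD ELD - No record of duty status (ELD Required)(OOS)")

# Pattern from the comment: collapses spaces, tabs and newlines only
narrow = re.sub("[ \t\n]+", " ", raw)

# \s also matches the non-breaking space in Python 3 str patterns
broad = re.sub(r"\s+", " ", raw).strip()

print(repr(narrow))
print(repr(broad))
```

With `\s+`, the label and the violation code end up separated by a single ordinary space.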

Try this: first find a root element that contains the rows, then find all the tr tags inside it, and for each tr find all its td tags and print the text within each one, one by one.

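import requests
from bs4 import BeautifulSoup

# `url` is the address of the page being scraped (not given in the question)
data = requests.get(url).content
bs = BeautifulSoup(data, 'html.parser')

# Assumption: the rows sit inside a <table>; a single <tr> cannot be the
# root, because it does not contain any other <tr> tags
root = bs.find('table')

for tr in root.find_all('tr'):
    for td in tr.find_all('td'):
        # strip=True drops the padding whitespace around each cell's text
        print(td.get_text(strip=True))
    print()  # blank line between rows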
Raguram Gopi