I am parsing the text in the cells of an HTML table.

Here’s the content and code.

txt = '''
<head><META http-equiv="Content-Type" content="text/html; charset=UTF-8"></head><table><tr><th filter=all>Employee Name</th><th filter=all>Project Name</th><th filter=all>Area</th><th filter=all>Date</th><th filter=all>Employee Manager</th></tr>
<tr><td style="vnd.ms-excel.numberformat:@">David</td><td style="vnd.ms-excel.numberformat:@">Review-2016</td><td style="vnd.ms-excel.numberformat:@">US</td><td align=right>17/03/2016</td><td style="vnd.ms-excel.numberformat:@">Andrew</td></tr>
<tr><td style="vnd.ms-excel.numberformat:@">Kate</td><td style="vnd.ms-excel.numberformat:@">Review 2016</td><td style="vnd.ms-excel.numberformat:@">UK</td><td align=right>21/03/2016</td><td style="vnd.ms-excel.numberformat:@">Liz</td></tr>

'''

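from bs4 import BeautifulSoup

soup = BeautifulSoup(txt, "lxml")
soup.prettify()  # returns a formatted string; the result is not used here

# All rows of the first table, header row included
list_5 = soup.find_all('table')[0].find_all("tr")

# Print every <td> cell, row by row
for row in list_5:
    for nn in row.find_all("td"):
        print(nn.text)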

So far I get the text, but it is all printed together in one run, i.e.:

David
Review-2016
US
17/03/2016
Andrew
Kate
Review 2016
UK
21/03/2016
Liz 

What I need is the output grouped by column, e.g. David, Kate or US, UK, etc.

Can you show me the right way to do this? Thank you.

Mark K

1 Answer


If you want to print David and Kate, the code below will work:

# Skip the header row (it has <th> cells, no <td> cells)
for row in list_5[1:]:
    print(row.find_all('td')[0].text)
# Changing find_all('td')[0] to find_all('td')[2] prints US and UK instead
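
To get every column at once rather than one index at a time, the rows can be transposed with `zip`. The following is a minimal sketch, not the only way to do it: the `html` string is a shortened stand-in for the table in the question, the names `headers`, `data` and `columns` are made up for illustration, and `"html.parser"` is used only because it ships with Python (the `"lxml"` parser from the question should behave the same on this input).

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the table in the question (same cell values).
html = """
<table>
<tr><th>Employee Name</th><th>Project Name</th><th>Area</th><th>Date</th><th>Employee Manager</th></tr>
<tr><td>David</td><td>Review-2016</td><td>US</td><td>17/03/2016</td><td>Andrew</td></tr>
<tr><td>Kate</td><td>Review 2016</td><td>UK</td><td>21/03/2016</td><td>Liz</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table").find_all("tr")

# Column names come from the <th> cells of the first row.
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]

# One list of cell texts per data row.
data = [[td.get_text(strip=True) for td in row.find_all("td")] for row in rows[1:]]

# zip(*data) turns the list of rows into a sequence of columns.
columns = dict(zip(headers, zip(*data)))

for name, values in columns.items():
    print(name, list(values))
```

With this layout, `columns["Employee Name"]` holds David and Kate, and `columns["Area"]` holds US and UK. Note that `zip` stops at the shortest row, so a table with ragged rows would need extra handling.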
nick
  • Can you please help me solve a similar problem of mine? Here: http://stackoverflow.com/questions/43033378/web-scraping-with-selenium-python-twitter-instagram – Sitz Blogz Apr 17 '17 at 03:41
  • A solution was only partially provided: I got part of it working, but the other part, parsing the output into a dataframe, is my big challenge. – Sitz Blogz Apr 17 '17 at 03:46
  • I will try later! – nick Apr 17 '17 at 03:47
  • @nick, by the way, could you please show me how to find out how many columns the table has? i.e. in print(row.find_all('td')[X].text), how do I know the X value? Thank you. – Mark K Apr 17 '17 at 04:01
  • In `David... `, `David` is the first child element, so its index in the list returned by `row.find_all('td')` is `0`. The `td` that contains `US` is the 3rd child element, so its index is `2`. – nick Apr 17 '17 at 04:06
  • @nick, thank you again. And how do I know how many child elements there are in total? – Mark K Apr 17 '17 at 04:20
  • There are many `xpath` tools. In Chrome, after you open a web page: right-click -> Inspect -> Elements -> find the element you need -> right-click -> hover over Copy -> click `Copy XPath`, and you will get the path of the element. – nick Apr 17 '17 at 04:51
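
On the question of how many child elements a row has: `find_all` returns a list-like result, so `len()` gives the count directly in Python. The sketch below assumes a cut-down three-column version of the table, and the names `cell_counts` and `n_columns` are hypothetical. It also does not account for `colspan`, which would make the cell count differ from the visual column count.

```python
from bs4 import BeautifulSoup

# Cut-down, three-column stand-in for the table in the question.
html = (
    "<table>"
    "<tr><th>Employee Name</th><th>Area</th><th>Date</th></tr>"
    "<tr><td>David</td><td>US</td><td>17/03/2016</td></tr>"
    "<tr><td>Kate</td><td>UK</td><td>21/03/2016</td></tr>"
    "</table>"
)

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table").find_all("tr")

# Count the cells in each row; passing a list matches both <td> and <th>.
cell_counts = [len(row.find_all(["td", "th"])) for row in rows]

# Valid values for X in row.find_all('td')[X] run from 0 to n_columns - 1.
n_columns = max(cell_counts)
print(cell_counts, n_columns)
```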