
I am trying to extract data from this table on ESPNcricinfo, the website being scraped.

Each row has the following format (data replaced by headers):

<tr class="data1"> <td class="left" nowrap="nowrap"><a>Player Name</a> (Country)</td> <td>Score</td> <td>Minutes Played</td> <td nowrap="nowrap">Balls Faced</td> <td etc... </tr>

I have used the following code in a Python script to capture the values in the table:

bats    = content.xpath('//tr[@class="data1"]/td[1]/a')
cntry   = content.xpath('//tr[@class="data1"]/td[1]/*')
run     = content.xpath('//tr[@class="data1"]/td[2]')
mins    = content.xpath('//tr[@class="data1"]/td[3]')
bf      = content.xpath('//tr[@class="data1"]/td[4]')

The data is then written to a CSV file for storage.

All of the data is captured successfully apart from the player's country. The player name and country are stored inside the same <td> tag; however, the player name is also inside an <a> tag, which makes it easy to capture. My problem is that the value captured for the player's country (the cntry variable above) is the player's name. I am sure that the code is incorrect, but I am not sure why.


1 Answer


Where you have:

cntry = content.xpath('//tr[@class="data1"]/td[1]/*')

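The '*' matches only the child elements of the <td> (here, the <a> tag) and skips text nodes, so it returns the player name link rather than the country text that follows it.

You can replace your line of code with this to grab the text nodes instead of the child tags: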

cntry = content.xpath('//tr[@class="data1"]/td[1]/text()')

See if that works for you.
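To see why '*' misses the country, here is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of lxml, so it runs without extra installs. The sample row is hypothetical and only mirrors the structure shown in the question. In both libraries, text that follows a child element is stored as that child's tail; this is the text node that td[1]/text() selects in lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample row mirroring the structure described in the question.
row_html = (
    '<table><tr class="data1">'
    '<td class="left" nowrap="nowrap"><a>Player Name</a> (Country)</td>'
    '<td>Score</td><td>Minutes Played</td>'
    '<td nowrap="nowrap">Balls Faced</td>'
    '</tr></table>'
)

root = ET.fromstring(row_html)
first_cell = root.find(".//tr[@class='data1']/td[1]")

# '*' selects child elements only, which here is just the <a> tag.
children = first_cell.findall('*')
child_tags = [child.tag for child in children]
child_text = [child.text for child in children]

# The country is plain text after </a>; it is stored as the tail of <a>.
country_raw = children[0].tail

print(child_tags, child_text, repr(country_raw))
```

Note the leading space in the raw country value: it is the space between the closing </a> and the opening parenthesis.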

EDIT


To remove the whitespace at the beginning of each item, strip each value:

cntry = content.xpath('//tr[@class="data1"]/td[1]/text()')
cntry = [str(x).strip() for x in cntry]
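As a small illustration of the cleanup step, the sketch below applies strip() to hypothetical values shaped like the ones td[1]/text() would yield for the row format in the question. Removing the parentheses is an extra step assumed here for convenience; it is not part of the original answer.

```python
# Hypothetical raw values; real ones come from the td[1]/text() query.
raw_countries = [" (Country A)", " (Country B)\n"]

# strip() removes surrounding whitespace, including the space after </a>.
countries = [text.strip() for text in raw_countries]

# Optional (an assumption, not from the answer): drop the parentheses as well.
countries_bare = [text.strip("()") for text in countries]

print(countries)
print(countries_bare)
```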
Wondercricket
  • Thank you, this worked perfectly. There is, however, a blank space before the country name. Do you know why that might be? – Padraig Scott Aug 19 '14 at 21:40
  • @PadraigScott it is probably due to the spacing between the player name and the country. I'll update my answer to remove the white spacing – Wondercricket Aug 19 '14 at 22:04