1

I'm scraping this site, specifically the content of the tables inside the div tags whose class contains 'ranking-data'. For the first td, the XPath is: //div[contains(@class, 'ranking-data')]//tr[th//text()[contains(., 'TIN')]]/td[1]/text()

This works fine for all columns in all tables (with the needed modifications), except for a cell in column 2 that contains an i tag: in Google Sheets, the result gets an extra blank cell below the cell with the text itself. I first tried to scrape it with: //div[contains(@class, 'ranking-data')]//tr[th//text()[contains(., 'TIN')]]/td[2]/text()

Then I tried something like *[not(i[contains(@class,'info-circle')])]/text() after the td[2], and some other variants, but none of them worked.

How can I avoid this i tag?

player0
curropar

2 Answers

1

Try filtering out the blank cell with QUERY, outside the IMPORTXML:

=QUERY(IMPORTXML(A1, "//div[contains(@class, 'ranking-data')]//tr[th//text()[contains(., 'TIN')]]/td[2]/text()"), "where Col1 <>' '", )

player0
  • 1
    That was it. I was hitting my head against a wall with the XPath, and I forgot to try things outside the `IMPORTXML`. Well, thanks!! – curropar Nov 16 '22 at 18:26
1

The answer given by @player0 works for my case, and since it was the first answer I won't remove the "accepted" mark from it; but I'm stubborn and I've found an alternative with just XPath (which may be useful for other cases). It was as simple as adding a [1] at the end of my first query:

//div[contains(@class, 'ranking-data')]//tr[th//text()[contains(., 'TIN')]]/td[2]/text()[1]
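A likely reason this works is that the td holds two text nodes: the value before the i tag and the whitespace after it, so text() returns both and text()[1] keeps only the first. Below is a minimal Python sketch of that idea. The markup is hypothetical (the real page's HTML is not shown here), and because the standard-library ElementTree does not support text() in its XPath subset, `.text` and `.tail` stand in for text()[1] and text()[2].

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, assumed to resemble the cell described in the question.
SAMPLE = """<div class="ranking-data">
  <table>
    <tr>
      <th>TIN</th>
      <td>1</td>
      <td>42 <i class="fa fa-info-circle"></i>
      </td>
    </tr>
  </table>
</div>"""

root = ET.fromstring(SAMPLE)
cell = root.findall(".//tr/td")[1]   # second td of the row
icon = cell.find("i")                # the i tag inside that cell

# Text before the first child element: what text()[1] would select.
first_text_node = cell.text
# Text after the i tag: what text()[2] would select (whitespace only here).
second_text_node = icon.tail

# Keeping only non-blank text mirrors the QUERY filter from the other answer.
kept = [t.strip() for t in (first_text_node, second_text_node) if t and t.strip()]

print(repr(first_text_node))
print(repr(second_text_node))
print(kept)
```

If the real cell had its value after the i tag instead of before it, the index would need to change, so it is worth checking the page source first.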

curropar