2

Using the IMPORTXML function, is it possible to construct an XPATH query that pulls the Industry value for a given Wikipedia page?

For example, the value I want to pull from this page - https://en.wikipedia.org/wiki/Target_Corporation - is "Retail" whereas on this page - https://en.wikipedia.org/wiki/Boohoo.com - it would be "Fashion".

player0
  • 124,011
  • 12
  • 67
  • 124
zgall1
  • 2,865
  • 5
  • 23
  • 39

2 Answers2

2
  • You want to create the xpath for retrieving the Industry value for a given Wikipedia page.

If my understanding is correct, as other pattern, how about the formula with this xpath? Please think of this as just one of several answers.

Sample formula:

=IMPORTXML(A1,"//th[text()='Industry']/following-sibling::td")
  • The xpath is //th[text()='Industry']/following-sibling::td.
  • In this case, the URL of https://en.wikipedia.org/wiki/Target_Corporation or https://en.wikipedia.org/wiki/Boohoo.com is put in the cell "A1".

Result:

enter image description here

Reference:

Added:

From your replying, I knew that you want to add 2 more URLs. So all URLs are as follows.

Issue and workaround:

For above URLs, when the formula of =IMPORTXML(A1,"//th[text()='Industry']/following-sibling::td") is used, Retail, Fashion, Retail and Travel, services are returned.

When the xpath is modified to //th[text()='Industry']/following-sibling::td/a, Retail, #N/A, #N/A and Travel are returned.

The reason of this is due to the following difference.

<tr>
  <th scope="row">Industry</th>
  <td class="category"><a href="/wiki/Travel" title="Travel">Travel</a> services</td>
</tr>

and

<tr>
  <th scope="row" style="padding-right:0.5em;">Industry</th>
  <td class="category" style="line-height:1.35em;"><a href="/wiki/Retail" title="Retail">Retail</a></td>
</tr>

and

<tr>
  <th scope="row" style="padding-right:0.5em;">Industry</th>
  <td class="category" style="line-height:1.35em;">Fashion</td>
</tr>

By this, I think that unfortunately, in order to retrieve Travel, Retail and Fashion from above, those cannot be directly retrieved with only one xpath. So I used a built-in function for this situation.

Workaround:

In this workaround, I used INDEX. Please think of this as just one of several answers.

=INDEX(IMPORTXML(A1,"//th[text()='Industry']/following-sibling::td"),1,1)
  • The xpath is //th[text()='Industry']/following-sibling::td. This is not modified.
  • In this case, the URL is put in the cell "A1".
  • When 2 values are retrieved, the 1st one is retrieved. By this, I used INDEX.
Result:

enter image description here

Tanaike
  • 181,128
  • 11
  • 97
  • 165
  • This is very helpful. Thanks! It is leading to odd results in some cases.For this page - https://en.wikipedia.org/wiki/Woot - it returns a "Resource at url not found" error even though the link works and the HTML appears to have the same structure as the other pages. The bigger problem is that for industries that are described by multiple words (example - https://en.wikipedia.org/wiki/TripAdvisor), the formula puts the all the words after the second word in neighbouring cells (rather than concatenating everything into a single cell). I wonder if either of these problems can be addressed. – zgall1 Oct 31 '19 at 13:01
  • @zgall1 Thank you for replying. I couldn't notice that you want to also use this formula to other URLs. This is due to my poor skill. I deeply apologize for this. I added one more formula for 4 URLs you provided. Could you please confirm it? Unfortunately, I think that in order to retrieve values from 4 URLs, those cannot be directly retrieved with only one xpath. So I used a built-in function for this situation. If that was not the direction you want, I apologize. – Tanaike Nov 01 '19 at 00:29
  • I apologize for the late reply. This did work for me. – zgall1 Jan 17 '20 at 16:13
  • @zgall1 Thank you for replying. I'm glad your issue was resolved. Thank you, too. – Tanaike Jan 17 '20 at 23:34
0

try:

=INDEX(IMPORTXML("https://en.wikipedia.org/wiki/Boohoo.com", 
 "//td[@class='category']"), 2, 1)

=INDEX(IMPORTXML("https://en.wikipedia.org/wiki/Target_Corporation", 
 "//td[@class='category']"),2,1)

0

player0
  • 124,011
  • 12
  • 67
  • 124