3

Hello dear internauts!

I want to extract the country distribution from: https://www.etf.com/URTH - "URTH Top 10 Countries" halfway down the page. I intend to expand to other ETFs thereafter: So far, the only things I can imagine are ugly:

  1. Searching for the label name. Ugly because it would require searching for every country in the world, just to find the ten that are actually on there.

=IF(NOT(ISBLANK($A$1)),(importxml(concatenate("https://www.etf.com/",$A$1), "(//label[normalize-space()='United States']//following::span[@id=''][1])")),"")

  1. Extracting the entire HTML and then beginning some series of string operations. Ugly because seems inherently instable.

=IF(NOT(ISBLANK($A$1)),index(IMPORTXML(concatenate("https://www.etf.com/",$A$1,"#overview"), "html"),1,2))

Other things I've tried: Copying the xpath / full xpath (with and without javascript disabled). As well as trying to create my own xpath (been a few hours :D)

One added hurdled I see is that, with javascript off, there aren't many IDs, and there are other lists nearby with the same @class structure.

Desired result: United States 68.34% Japan 6.63% United Kingdom 4.29% etc. Each country in a single or two cells (country name / percentage).

Some insight in how you proceeded to build the functioning query would be really welcome. If there is a completely different approach, I'm all ears!

Note: A1=URTH

player0
  • 124,011
  • 12
  • 67
  • 124
Melvin
  • 33
  • 5

1 Answers1

2

use:

=ARRAYFORMULA(TRIM(QUERY(IMPORTXML("https://www.etf.com/"&A1, 
 "//div[@class='rowText col-md-12 col-sm-12 col-xs-12']"), 
 "limit 10", 0)))

enter image description here


if you want to work with those numbers:

=INDEX(QUERY(IMPORTXML("https://www.etf.com/"&A1, 
 "//div[@class='rowText col-md-12 col-sm-12 col-xs-12']"), 
 "limit 10", 0))
player0
  • 124,011
  • 12
  • 67
  • 124