I have to parse an HTML page organized this way:

<li id="ctl00_EFG" class="current">
  <a id="ctl00_SGB" href="http://SGI/EFG">EFG</a>
  <ul style="width:535px;">
    <li class="top_border">
      <a style='color: #d94129; font-weight: bold;' href="http://SGI/EFG/regione-abruzzo" title="EFGAbruzzo">Abruzzo</a>
      <ul style="width:100%;">
        <li>
          <a href="http://SGI/EFG/chieti" title="EFG chieti" rel="nofollow">Chieti</a>
        </li>
        <li>
          <a href="http://SGI/EFG/pescara" title="EFG pescara" rel="nofollow">Pescara</a>
        </li>
      </ul>
    </li>
    <li class="top_border"><a style='color: #d94129; font-weight: bold;' href="http://SGI/EFG/regione-valdaosta" title="EFGValDAosta">Val d'Aosta</a>
      <ul style="width:100%;">
        <li>
          <a href="http://SGI/EFG/aosta" title="EFG aosta" rel="nofollow">Aosta</a>
        </li>
      </ul>
    </li>
  </ul>
</li>

I need to extract an object with the regions and the cities, like this:

{
  "Abruzzo": [
    "Chieti" , "Pescara",
  ],
  "Val d'Aosta": [
    "Aosta",
  ],
};

I am using cheerio from Node.js, but I added the jquery tag since cheerio uses jQuery-style selectors (AFAIK).

I have come up with this partial solution, which is not working:

$('a[id="ctl00_SGB"]').next().find('ul li').each(function(i, elem) {
  var $categoryTop = $(this);
  var region = $categoryTop.find('a').first().attr('rel', ':not(nofollow)').text();
  console.log('region:', region);
  $(elem).find('ul li a').each(function(i, elem2) {
    console.log('elem2:', $(elem2).text());
});

Any clue?

P.S.: This is a revised version of a question I posted yesterday, which was answered correctly. Unfortunately, I had simplified it a bit too much, so I couldn't apply that answer to my use case.

MarcoS

2 Answers


This is fairly straightforward: start with an empty object, loop over the a elements matched by #ctl00_EFG>ul>li>a, and for each one build an array from the text of the a elements inside the ul that follows it.

var result = {};

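// Child combinators (>) keep the match to the region links only.
$('#ctl00_EFG>ul>li>a').each(function(){
  var n = $(this).text();                  // region name
  // City links live in the <ul> that immediately follows the region link.
  var a = $(this).next('ul').find('li a').map(function(){
    return $(this).text();
  }).get();                                // unwrap to a plain array
  result[n] = a;
});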

console.log(result);
Jamiec

I would define and initialize an object, then use .each() on the regions, with each region's name as the key. To get the value, I would use .map() to collect an array of all the cities in that region. Something like this:

var obj = {};
$('li.top_border > a').each(function() {
    obj[ this.textContent ] = $(this).next().find('a').map(function() {
        return this.textContent;
    })
    .get();
});

console.log( JSON.stringify(obj) );
//Output: {"Abruzzo":["Chieti","Pescara"],"Val d'Aosta":["Aosta"]}
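
For readers who want to check the shape of that loop without jQuery or cheerio, here is a minimal sketch on plain arrays. The `regions` array and its `name` and `cityLinks` fields are hypothetical stand-ins for the parsed markup; they are not something either library provides.

```javascript
// Hypothetical stand-in for the parsed markup: each region link paired with
// the city links found in the <ul> that follows it.
var regions = [
  { name: 'Abruzzo', cityLinks: [{ text: 'Chieti' }, { text: 'Pescara' }] },
  { name: "Val d'Aosta", cityLinks: [{ text: 'Aosta' }] }
];

var obj = {};
regions.forEach(function (region) {                 // plays the role of .each()
  obj[region.name] = region.cityLinks.map(function (link) {  // role of .map()
    return link.text;
  });
});

console.log(JSON.stringify(obj));
// prints {"Abruzzo":["Chieti","Pescara"],"Val d'Aosta":["Aosta"]}
```

In the jQuery version the trailing .get() matters because jQuery's .map() returns a jQuery collection rather than a plain array; Array.prototype.map in the sketch already returns an array, so no unwrapping step is needed there.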

PeterKA
  • You've made one mistake (one I also made originally) - by using `.current li a` instead of `.current>li>a` you pick up all descendants, not just the direct children. The output is not what the OP asked for. – Jamiec Sep 11 '15 at 14:45
  • Thanks! Yes, it was a stupid mistake: I kept getting the cities repeated many times in my output... :-( Oh, however, all regions do have at least one city... :-) – MarcoS Sep 11 '15 at 14:49
  • Those are not "Regions without cities" - they are the cities! – Jamiec Sep 11 '15 at 14:50
  • Thanks guys; looks like using `'li.top_border > a'` yields the expected results. Maybe adding classes such as `region` and `city` would help you or someone else looking through the code in the future. – PeterKA Sep 11 '15 at 15:06
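
The child-versus-descendant point raised in these comments can be sketched without any library. The snippet below is only an illustration under stated assumptions: `node`, `directChildren` and `descendants` are hypothetical helpers working on plain objects that mimic the question's markup, not jQuery or cheerio functions.

```javascript
// Hypothetical helper: build a plain-object node (not a real DOM element).
function node(tag, text, children) {
  return { tag: tag, text: text || '', children: children || [] };
}

// Plain-object model of the top-level <ul> from the question.
var topList = node('ul', '', [
  node('li', '', [
    node('a', 'Abruzzo'),
    node('ul', '', [
      node('li', '', [node('a', 'Chieti')]),
      node('li', '', [node('a', 'Pescara')])
    ])
  ]),
  node('li', '', [
    node('a', "Val d'Aosta"),
    node('ul', '', [
      node('li', '', [node('a', 'Aosta')])
    ])
  ])
]);

// Mimics the child combinator ("parent > tag"): one level down only.
function directChildren(parent, tag) {
  return parent.children.filter(function (c) { return c.tag === tag; });
}

// Mimics the descendant combinator ("parent tag"): any depth, document order.
function descendants(parent, tag) {
  var found = [];
  parent.children.forEach(function (c) {
    if (c.tag === tag) found.push(c);
    found = found.concat(descendants(c, tag));
  });
  return found;
}

function texts(nodes) {
  return nodes.map(function (n) { return n.text; });
}

// Equivalent of "ul > li > a": only the region links.
var regionLinks = [];
directChildren(topList, 'li').forEach(function (li) {
  regionLinks = regionLinks.concat(directChildren(li, 'a'));
});
var viaChild = texts(regionLinks);

// Equivalent of "ul li a": region links and city links mixed together.
var viaDescendant = texts(descendants(topList, 'a'));

console.log(viaChild);      // [ 'Abruzzo', "Val d'Aosta" ]
console.log(viaDescendant); // [ 'Abruzzo', 'Chieti', 'Pescara', "Val d'Aosta", 'Aosta' ]
```

With the descendant walk, the cities show up in the list of would-be regions, which matches the symptom described above; restricting each step to direct children keeps the two levels apart.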