I am trying to extract the date and text from HTML with the following structure. I am using goquery to do this.

<body>
<div class="wrap">
    <div class="cont">
        <div class="cont_block">
            <p class="date">
                <font>
                    <font>Saturday, Apr 16,2016</font>
                </font>
            </p>
            <div class="block_table">
                <table class="left" width="auto" height="auto" border="0" cellpadding="0" cellspacing="0">
                    <tbody>
                        <tr>
                            <td class="left_top2"></td>
                            <td width="auto" class="bg_color2" height="12"></td>
                            <td class="right_top2"></td>
                        </tr>
                        <tr>
                            <td height="auto" class="right_mid2"></td>
                            <td class="bg_color2">
                                <font>
                                    <font>Loerem ipsum dolor sit amet</font>
                                </font>
                            </td>
                            <td class="bg_color2" width="14"></td>
                        </tr>
                        <tr>
                            <td class="left_bottom2"></td>
                            <td class="bg_color2"></td>
                            <td class="right_bottom2"></td>
                        </tr>
                    </tbody>
                </table>
            </div>
        </div>

        <div class="cont_block">
            <p class="date">Friday,Dec 18,2015</p>
            <div class="block_table">
                <table class="right" width="auto" height="auto" border="0" cellpadding="0" cellspacing="0">
                    <tbody>
                        <tr>
                            <td class="left_top3"></td>
                            <td width="auto" class="bg_color3" height="12"></td>
                            <td class="right_top3"></td>
                        </tr>
                        <tr>
                            <td height="auto" class="bg_color3" width="14">&nbsp;</td>
                            <td class="bg_color3">Loerem ipsum dolor sit amet</td>
                            <td class="right_mid3"></td>
                        </tr>
                        <tr>
                            <td class="left_bottom3"></td>
                            <td class="bg_color3"></td>
                            <td class="right_bottom3"></td>
                        </tr>
                    </tbody>
                </table>
            </div>
            <div class="block_table">
                <table class="right" width="auto" height="auto" border="0" cellpadding="0" cellspacing="0">
                    <tbody>
                        <tr>
                            <td class="left_top3"></td>
                            <td width="auto" class="bg_color3" height="12"></td>
                            <td class="right_top3"></td>
                        </tr>
                        <tr>
                            <td height="auto" class="bg_color3" width="14">&nbsp;</td>
                            <td class="bg_color3">Loerem ipsum dolor sit amet</td>
                            <td class="right_mid3"></td>
                        </tr>
                        <tr>
                            <td class="left_bottom3"></td>
                            <td class="bg_color3"></td>
                            <td class="right_bottom3"></td>
                        </tr>
                    </tbody>
                </table>
            </div>
        </div>
    </div>
</div>
</body>

I have tried many ways of doing this, for example:

doc.Find(".wrap .cont .cont_block").Each(func(i int, s *goquery.Selection) {
    fmt.Println(s.Find(".date").Text())
    s.Find(".block_table td").Each(func(j int, c *goquery.Selection) {
        if c.Text() != "" {
            fmt.Println(c.Text())
        }
    })
})

The problem is that the Find calls for the dates and the text return results from outside the scope of the current .cont_block. On each iteration, they return every date and td in the document that comes after the start of the currently selected .cont_block, not just the ones inside it.

What am I missing?

Jet Basrawi
  • I don’t understand your problem. I ran your code, and it looks like it is doing what you ask for. Could you give an example of what you get and how it differs from what you expect? – Zoyd Mar 26 '17 at 07:54
  • You are quite correct. The HTML files that I am processing are quite large, and it turns out they are not valid. They have unclosed elements, which are causing the problems. – Jet Basrawi Mar 26 '17 at 21:22

1 Answer

The problem was unclosed elements in the HTML files that I am processing.

Jet Basrawi