I have an issue with InputStreamReader and HTML

Question

I am trying to scrape table content from a URL using Java, but the scraper is not working correctly. I went through the Java docs on InputStreamReader and other online examples but could not figure out what the problem is. The InputStreamReader appears to skip two columns of every even row in the table while still reading the last column. Every odd row produces the desired results. My code and output: [image: code and output]

The source table looks like this: [image: source table]

Lastly, the output looks like this: [image: program output]

In HTML terms, each column in a row is a <td> tag, which is read in as a line. Since two columns are skipped, does that mean the InputStreamReader is skipping two lines? I suspected a regex problem at first, but ruled it out because the rest of the output is correct. I want to read in all rows and columns correctly so that I can proceed.

Double-check your regular expressions. Be sure they account for variations in syntax for each table entry (e.g., possible spaces). – copeg, Jun 05 '15 at 23:49

score 0 · Answer 1 · 2015-06-06T03:31:30.240


Price patterns are different in the odd and even rows.

Odd rows:

    <tr>
        <td>16:00:52</td>
        <td>$&nbsp;82.14&nbsp; </td>
        <td>763</td>
    </tr>

Even rows:

    <tr>
        <td>16:00:52 </td>
        <td>$&nbsp;82.14 &nbsp;</td>
        <td>8,116</td>
    </tr>

The pattern that matches both cases is:

    String preicePattern = "<td>\\$&.+;(\\d{1,4}\\.\\d{1,4}) *&";
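As a quick check, the pattern can be run against the two price cells shown above. This is only a minimal sketch, not the asker's code: the class name `PricePatternDemo` and the helper `extractPrice` are made up for illustration, and it assumes each `<td>` cell is read as its own line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PricePatternDemo {
    // Pattern from the answer: "<td>$", an entity such as &nbsp;, the price,
    // optional spaces, then the "&" that starts the trailing entity.
    static final Pattern PRICE =
            Pattern.compile("<td>\\$&.+;(\\d{1,4}\\.\\d{1,4}) *&");

    // Hypothetical helper: returns the captured price, or null if the line has no price cell.
    static String extractPrice(String line) {
        Matcher m = PRICE.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String oddRow = "<td>$&nbsp;82.14&nbsp; </td>";   // no space after the price
        String evenRow = "<td>$&nbsp;82.14 &nbsp;</td>";  // one space after the price
        System.out.println(extractPrice(oddRow));
        System.out.println(extractPrice(evenRow));
    }
}
```

The ` *` before the final `&` is what absorbs the extra space in the even rows. A pattern that expects `&nbsp;` immediately after the digits would match only the odd rows.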

edited Jun 06 '15 at 03:31

answered Jun 06 '15 at 00:15

Hi Saka1029, your example didn't work for me, but I was able to solve the problem by using: String preicePattern = "\\$&.+;(\\d{1,4}\\.\\d{1,4}) *&"; – user3422517 Jun 06 '15 at 03:21
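The thread does not say why the `<td>`-anchored version failed for the asker. One possible reason, and this is purely an assumption, is that the real cells do not start with a bare `<td>` (for example, the tag carries attributes). The sketch below uses the pattern from the comment; the class name, the helper, and the `class="price"` cell are hypothetical and not taken from the actual page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnanchoredPricePatternDemo {
    // Pattern from the comment: same as the answer's, but without the leading "<td>".
    static final Pattern PRICE =
            Pattern.compile("\\$&.+;(\\d{1,4}\\.\\d{1,4}) *&");

    // Hypothetical helper: returns the captured price, or null if nothing matches.
    static String extractPrice(String line) {
        Matcher m = PRICE.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The two rows quoted in the answer.
        System.out.println(extractPrice("<td>$&nbsp;82.14&nbsp; </td>"));
        System.out.println(extractPrice("<td>$&nbsp;82.14 &nbsp;</td>"));
        // Hypothetical cell with an attribute; a pattern anchored on "<td>$" would miss it.
        System.out.println(extractPrice("<td class=\"price\">$&nbsp;82.14 &nbsp;</td>"));
    }
}
```

Because `find()` searches anywhere in the line, dropping the `<td>` prefix makes the match independent of how the opening tag is written.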
