3

I have a response string that contains both HTML and RSS content. I would like to separate the two and store each in its own string, so I am trying to extract each part by its opening and closing tags, for example everything from <rss> to </rss>.

This works fine for <html> and </html>. However, I am getting an exception for <rss> and </rss>.

Below is my code snippet.

// parse the responseStr to html
html = responseStr.substring(responseStr.indexOf("<html>"),
responseStr.lastIndexOf("</html>") + 7);
System.out.println("html string"+html );

Can someone please tell me what is wrong with the code below?

// parse the responseStr to rss
rss = responseStr.substring(responseStr.indexOf("<rss version=\"2.0\">"),
responseStr.lastIndexOf("</rss>") + 6);
System.out.println("rss string = "+rss );

I get the below exception:

  java.lang.StringIndexOutOfBoundsException
    at java.lang.String.substring(String.java:1093)
smiley
  • 1
    What do you mean by *I am seeing errors* - Also, can you post the text you're trying to parse? – Mike Christensen Aug 26 '13 at 18:13
  • why not use a library? an xml parser at the very least would allow you to use xpath – radai Aug 26 '13 at 18:13
  • What errors do you see? Add them in the question please – c.s. Aug 26 '13 at 18:14
  • The above code works for me if your input string is `<rss> ... </rss>`. Please post your input string. – Sotirios Delimanolis Aug 26 '13 at 18:18
  • chances are `responseStr.lastIndexOf("</rss>") + 6` doesn't exist – 0x6C38 Aug 26 '13 at 18:20
  • I have posted the exception above. The rss content(input string) I am trying to parse is very huge and it is in the rss standard format and coming as a string next to html content – smiley Aug 26 '13 at 18:20
  • Just to take a stab in the dark, does the input start with `<rss>`? Or does the rss tag, perhaps, have attributes, say something like: `<rss version="2.0">`. – femtoRgon Aug 26 '13 at 18:23
  • The input has content starting with <html>, followed by <rss>, and ends with </rss>. So, I am trying to parse out this content to get the content between and including <rss> and </rss> – smiley Aug 26 '13 at 18:29
  • Check if your text starts with `<rss>`; some tags end with `/>`. Also, some tags in HTML do not have end tags. Check the case as well. Better to use some HTML/RSS parsers IMHO – Prashant Bhate Aug 26 '13 at 19:39

3 Answers

4

It is likely that your call to substring is being passed invalid indexes for your responseStr. You need to verify that your string actually contains the <rss> and </rss> tags before you call substring.

Try this:

String result;
int start = responseStr.indexOf("<rss>");
int end = responseStr.lastIndexOf("</rss>");

if (start != -1 && end > start) // both tags found, and the closing tag comes after the opening tag
{
  result = "rss string = " + responseStr.substring(start, end + 6);
}
else
{
  result = "rss string not found";
}

System.out.println(result);

From the JavaDocs for String.indexOf, we know that if the string does not occur, -1 will be returned.
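
As it turns out in the comments, the opening tag in the input carries attributes (`<rss version="2.0">`), so a literal search for `<rss>` returns -1. Below is a minimal sketch of one way to tolerate attributes and to report which tag is missing. The class name `RssExtractor`, the method `extractRss`, and the sample input are hypothetical and only for illustration. It uses `java.util.regex` for the opening tag and `lastIndexOf` for the closing tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original answer.
final class RssExtractor {

    // Matches "<rss>" as well as "<rss version=\"2.0\">" and other attribute variants.
    private static final Pattern RSS_OPEN = Pattern.compile("<rss\\b[^>]*>");
    private static final String RSS_CLOSE = "</rss>";

    /** Returns the RSS block including both tags, or null if either tag is missing. */
    static String extractRss(String responseStr) {
        Matcher open = RSS_OPEN.matcher(responseStr);
        if (!open.find()) {
            System.err.println("opening <rss ...> tag not found");
            return null;
        }
        // lastIndexOf returns -1 when the closing tag is absent, which also fails this check.
        int end = responseStr.lastIndexOf(RSS_CLOSE);
        if (end < open.end()) {
            System.err.println("closing </rss> tag not found after the opening tag");
            return null;
        }
        return responseStr.substring(open.start(), end + RSS_CLOSE.length());
    }

    public static void main(String[] args) {
        // Assumed sample input: HTML immediately followed by an RSS document.
        String response = "<html><body>page</body></html>"
                + "<rss version=\"2.0\"><channel><title>Feed</title></channel></rss>";
        System.out.println(extractRss(response));
    }
}
```

Returning null instead of calling `substring` with a -1 index avoids the `StringIndexOutOfBoundsException` from the question, and the two error messages make it clear which tag was not found.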

Luke Willis
  • I do have rss string, however, when I use your code it prints rss string not found. – smiley Aug 26 '13 at 18:34
  • @smiley If "rss string not found" is being printed, then one or both of the rss tags are missing. You need to inspect your string. You can also alter the above code to tell you which of the tags (opening or closing) is missing specifically. – Luke Willis Aug 26 '13 at 18:42
  • ah.. I just noticed I am getting rss tag as and not as as someone mentioned above. How do I specify it in the code? I think I might need escape characters since I am not able to use version="2.0" directly. – smiley Aug 26 '13 at 18:48
  • @smiley Take a look at [this answer](http://stackoverflow.com/a/8938549/2479481) and use the pattern `"<rss[^>]*>"` – Luke Willis Aug 26 '13 at 18:56
  • 1
    Thank you very much Luke! I tried this, I know its pretty crude -- responseStr.indexOf(" – smiley Aug 26 '13 at 19:51
  • 1
    If you're going to just use `indexOf`, I would recommend using `" – Luke Willis Aug 26 '13 at 19:54
  • I also have tag before . Thats why when I tried " I do not want to keep this – smiley Aug 26 '13 at 19:58
3

I think it would be easier by using

StringUtils.substringsBetween(String str,String open,String close)

javadoc

apache commons

Example:

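String[] rss = StringUtils.substringsBetween(testHtml, "<rss>", "</rss>");
if (rss != null) { // substringsBetween returns null when there is no match
    for (String s : rss) {
        System.out.println("rss: " + s); // print each match, not the array itself
    }
}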

For reference, the single-match variant substringBetween is implemented in Commons Lang like this:

public static String substringBetween(String str, String open, String close) {
    if (str == null || open == null || close == null) {
        return null;
    }
    int start = str.indexOf(open);
    if (start != INDEX_NOT_FOUND) {
        int end = str.indexOf(close, start + open.length());
        if (end != INDEX_NOT_FOUND) {
            return str.substring(start + open.length(), end);
        }
    }
    return null;
}
Khinsu
2

I would recommend using an XML parser, but if you want to stick with plain string operations, the code below works even when the rss tag has attributes:

public static void main(String[] args) {
    String responseStr = "<rss ...>------content-----</rss>";
    int start = responseStr.indexOf("<rss");
    String content = null;
    if (start != -1) {
        start = responseStr.indexOf(">", start);
        if (start != -1) {
            int end = responseStr.lastIndexOf("</rss>");
            if (end != -1) {
                content = responseStr.substring(start + 1, end);
            }
        }
    }
    if (content != null)
        System.out.println(content);
    else
        System.err.println("Content not found");

}

Output

------content-----
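
To illustrate the XML parser recommendation, here is a rough sketch using the JDK's built-in DOM parser (`javax.xml.parsers`). It assumes the RSS portion has already been isolated from the HTML, since the combined response is not a single well-formed XML document. The class name `RssTitleReader`, the method `firstTitle`, and the sample feed are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class; assumes rssXml holds only the <rss>...</rss> document.
final class RssTitleReader {

    /** Returns the text of the first <title> element in document order, or null if there is none. */
    static String firstTitle(String rssXml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(rssXml)));
            NodeList titles = doc.getElementsByTagName("title");
            return titles.getLength() > 0 ? titles.item(0).getTextContent() : null;
        } catch (Exception e) {
            // Malformed XML or parser configuration problems end up here.
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample feed, not taken from the question.
        String rss = "<rss version=\"2.0\"><channel><title>Example feed</title>"
                + "<item><title>First post</title></item></channel></rss>";
        System.out.println(firstTitle(rss));
    }
}
```

With a parsed document, attributes on the rss tag no longer matter, and a malformed feed is reported as an error instead of producing a wrong substring.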
Prashant Bhate