
I'm quite new to Java. How would I search a file for a tag and assign everything between the opening and closing tags, such as a string of text, to a variable?

For example, given <title>THE TITLE</title>, I want to save the string "THE TITLE" to a variable called title1, or something similar.

How should I go about doing that? Thank you.

amit
Ben

2 Answers


If you use regular expressions, then you just use a capture group:

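import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Group 1 captures the text between the tags (any run of characters other than '<')
Pattern p = Pattern.compile("<title>([^<]*)</title>", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(theText);  // theText holds the text being searched
if (m.find()) {
    String thisIsTheTextYouWant = m.group(1);
    // ....
}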
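Since the question is about searching a file, here is a minimal sketch of how the snippet above might be wrapped up to read a file first. The class name TitleExtractor, the method names, the placeholder path "page.html", and the choice of UTF-8 are all assumptions made for illustration; none of them come from the original answer.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
class TitleExtractor {
    // Same pattern as in the answer; compiled once and reused.
    private static final Pattern TITLE =
            Pattern.compile("<title>([^<]*)</title>", Pattern.CASE_INSENSITIVE);

    // Returns the text of the first <title>...</title> match, or empty if there is none.
    static Optional<String> extractTitle(String text) {
        Matcher m = TITLE.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Reads the whole file into memory, assuming it is UTF-8 encoded.
    static Optional<String> extractTitleFromFile(Path file) throws IOException {
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return extractTitle(text);
    }

    public static void main(String[] args) throws IOException {
        // "page.html" is a placeholder; pass a real path as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "page.html");
        String title1 = extractTitleFromFile(file).orElse("");
        System.out.println(title1);
    }
}
```

Returning an Optional makes the "no title found" case explicit, so the caller decides what title1 should be when nothing matches.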
Ernest Friedman-Hill

You should not use regex to parse HTML; see the question "RegEx match open tags except XHTML self-contained tags".

Try jsoup: http://jsoup.org/cookbook/extracting-data/attributes-text-html

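import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

String html = "<title>THE TITLE</title>";
Document doc = Jsoup.parse(html);
Element title = doc.select("title").first();  // null if there is no <title> element
String result = title.text();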
bpgergo
  • Note that he's not parsing the whole document; he's grabbing the text of specific elements. Using a regex is going to be way more efficient if he's, say, indexing web pages by their titles. If he's writing a web browser, then yeah, he needs a parser. But people are way too quick to introduce dependencies like this when they're not necessary. – Ernest Friedman-Hill Aug 17 '11 at 14:04
  • @Ernest, I partly agree: a regex is going to be way more efficient in special cases, e.g. if the OP wants to process HTML files from one specific source at one specific time. But if the OP will process HTML files from many different sources, or over a longer period of time, then a regex solution will fail sooner or later; there is so much tumbler out there. It is not just my _opinion_, it is my experience: I have done a lot of screen scraping. You want something quick and dirty? Go for a regex. Want something robust and long-lasting? Go for an HTML parser. – bpgergo Aug 17 '11 at 14:14
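To make the trade-off discussed in these comments concrete, here is a small sketch that runs the regex from the first answer against a few variations of the markup. The class name TitleRegexLimits, the LENIENT pattern, and the sample inputs are hypothetical and were written for this illustration; they are not from either answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, named here only for illustration.
class TitleRegexLimits {
    // The pattern from the regex answer.
    static final Pattern STRICT =
            Pattern.compile("<title>([^<]*)</title>", Pattern.CASE_INSENSITIVE);

    // An assumed looser variant: tolerates attributes on the tag and nested markup.
    static final Pattern LENIENT =
            Pattern.compile("<title\\b[^>]*>(.*?)</title\\s*>",
                    Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns capture group 1 of the first match, or null if nothing matches.
    static String firstGroup(Pattern p, String html) {
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "<title>THE TITLE</title>",                       // plain case
            "<title lang=\"en\">THE TITLE</title>",           // attribute on the tag
            "<title>THE <b>TITLE</b></title>",                // nested markup
            "<!-- <title>OLD</title> --><title>NEW</title>"   // title inside a comment
        };
        for (String s : samples) {
            System.out.println(firstGroup(STRICT, s) + " | " + firstGroup(LENIENT, s));
        }
    }
}
```

The strict pattern finds nothing for the attribute and nested-markup samples, and the looser one handles both. Neither pattern knows about HTML comments, though, so both pick up the commented-out title in the last sample. That is the kind of case where a parser earns its dependency.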