
Here is my regex-matching code, which works when I run it against a web page saved to a file:

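import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTestHarness {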

    public static void main(String[] args) {

        File aFile = new File("/home/darshan/Desktop/test.txt");
        FileInputStream inFile = null;
        try {
            inFile = new FileInputStream(aFile);
        } catch (FileNotFoundException e) {
            e.printStackTrace(System.err);
            System.exit(1);
        }

        BufferedInputStream in = new BufferedInputStream(inFile);
        DataInputStream data = new DataInputStream(in);
        String string = new String();
        try {
            while (data.read() != -1) {
                string += data.readLine();
            }

        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

        Pattern pattern = Pattern
                .compile("<div class=\"rest_title\">.*?<h1>(.*?)</h1>");
        Matcher matcher = pattern.matcher(string);
        boolean found = false;
        while (matcher.find()) {
            System.out.println("Name: " + matcher.group(1) );
            found = true;
        }
        if(!found){
            System.out.println("Pattern Not found");
        }
    }
}

But the same code doesn't work in the crawler for which I'm testing the regex. My crawler code (I'm using WebSphinx) is:

// Our own Crawler class extends the WebSphinx Crawler
public class MyCrawler extends Crawler {

    MyCrawler() {
        super(); // Do what the parent crawler would do
    }

    // We could choose not to visit a link based on certain circumstances
    // For now we never follow the links found on a page
    public boolean shouldVisit(Link l) {
        // String host = l.getHost();
        return false; // returning false tells the crawler to skip the link
    }

    // What to do when we visit the page
    public void visit(Page page) {
        System.out.println("Visiting: " + page.getTitle());
        String content = page.getContent();

        System.out.println(content);

        Pattern pattern = Pattern.compile("<div class=\"rest_title\">.*?<h1>(.*?)</h1>");
        Matcher matcher = pattern.matcher(content);
        boolean found = false;
        while (matcher.find()) {
            System.out.println("Name: " + matcher.group(1) );
            found = true;
        }
        if(!found){
            System.out.println("Pattern Not found");
        }
    }
}

This is my code for running the crawler:

public class WebSphinxTest {

    public static void main(String[] args) throws MalformedURLException, InterruptedException {

        System.out.println("Testing Websphinx. . .");

        // Make an instance of our own crawler
        Crawler crawler = new MyCrawler();
        // Create a "Link" object and set it as the crawler's root
        Link link = new Link("http://justeat.in/restaurant/spices/5633/indian-tandoor-chinese-and-seafood/sarjapur-road/bangalore");
        crawler.setRoot(link);

        // Start running the crawler!
        System.out.println("Starting crawler. . .");
        crawler.run(); // Blocking function, could implement a thread, etc.

    }

}

A little detail about the crawler code: shouldVisit(Link l) decides whether a link should be visited, and visit(Page page) decides what to do with a page once we get it.

In the above example, test.txt and content contain the same string.

Charles
darshan

1 Answer


In your RegexTestHarness you read the file line by line and concatenate the lines without line breaks before doing your matching (readLine() returns the contents of a line without its line break!).

The content your MyCrawler class receives, on the other hand, probably does contain line break characters. And since the regex meta-char . does not match line break chars by default, the pattern fails in MyCrawler.
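The difference can be reproduced without a file or a crawler. The following is only a minimal sketch: the class name LineBreakDemo, the helper joinLines and the sample markup are made up for illustration, and BufferedReader.readLine() stands in for the deprecated DataInputStream.readLine() used in the question (both drop the line terminator).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.regex.Pattern;

public class LineBreakDemo {

    // Same pattern as in the question, without any flags.
    static final Pattern TITLE =
            Pattern.compile("<div class=\"rest_title\">.*?<h1>(.*?)</h1>");

    // Hypothetical helper mimicking the file-reading loop: readLine() drops
    // the line terminator, so the concatenated result is one long line.
    static String joinLines(String text) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample markup, not the real page.
        String html = "<div class=\"rest_title\">\n<h1>Spices</h1>\n</div>";

        // With the line breaks intact, '.' cannot cross the '\n' after the div tag.
        System.out.println(TITLE.matcher(html).find());            // false
        // After joining the lines there is nothing left for '.' to trip over.
        System.out.println(TITLE.matcher(joinLines(html)).find()); // true
    }
}
```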

To fix this, put (?s) in front of all your patterns that contain a . meta char. So:

Pattern.compile("<div class=\"rest_title\">.*?<h1>(.*?)</h1>")

would become:

Pattern.compile("(?s)<div class=\"rest_title\">.*?<h1>(.*?)</h1>")

The DOTALL flag, (?s), will cause the . to match any character, including line break chars.
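As a quick check, here is a small sketch comparing the pattern with and without the flag. The class name DotAllDemo, the helper firstTitle and the sample content are assumptions for illustration only. It also shows that passing Pattern.DOTALL as the second argument to Pattern.compile has the same effect as the inline (?s).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotAllDemo {

    static final String REGEX = "<div class=\"rest_title\">.*?<h1>(.*?)</h1>";

    // Hypothetical helper: returns the first captured title,
    // or null when the pattern does not match.
    static String firstTitle(Pattern pattern, String content) {
        Matcher m = pattern.matcher(content);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up sample content containing line breaks.
        String content = "<div class=\"rest_title\">\n  <h1>Spices</h1>\n</div>";

        Pattern plain = Pattern.compile(REGEX);                   // no flags
        Pattern inline = Pattern.compile("(?s)" + REGEX);         // inline DOTALL
        Pattern flagged = Pattern.compile(REGEX, Pattern.DOTALL); // flag argument

        System.out.println(firstTitle(plain, content));   // null
        System.out.println(firstTitle(inline, content));  // Spices
        System.out.println(firstTitle(flagged, content)); // Spices
    }
}
```

The inline form is handy when the regex is kept as a plain string (for example in a config file), while the flag argument keeps the regex itself unchanged.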

Bart Kiers