I need to write a Java program that prints certain things from a website (like the title), but with the tags removed

Question

The main problem I'm having is parsing content from the website in my program. I got it to print out the page's source code. Also, if the URL doesn't start with 'http://', I need to add it. I really don't understand how to parse strings.

import java.net.*;
import java.io.*;
import java.util.Scanner;

public class Project6 {
  public static void main(String[] args) throws Exception {

    Scanner sc = new Scanner(System.in);
    System.out.print("Please enter the URL. ");
    String web = sc.nextLine();
    String foo = "http://allrecipes.com/";

    // does "web" have an allrecipes.com url?
    // if it doesn't, then exit
    if (web.equals(foo)) {
      StringBuilder s = new StringBuilder();
      URL recipes = new URL(web);
      BufferedReader in = new BufferedReader(new InputStreamReader(recipes.openStream()));

      String inputLine;

      // print the page source line by line
      while ((inputLine = in.readLine()) != null)
        System.out.println(inputLine);
      in.close();
    } else {
      System.out.println("I'm sorry, but that is not a valid allrecipes.com URL.");
      System.exit(0);
      // does "web" start with "http://"
      // if it doesn't, add it
    }
  }
}

Look here, it is much the same question: http://stackoverflow.com/questions/9580684/how-to-retrieve-title-of-a-html-with-the-help-of-htmleditorkit — Balaji Krishnan, Oct 18 '13 at 07:48
You shouldn't be using `web.equals(foo)`, because you need to handle the cases where the user forgot http:// or entered a subdomain. A better check would be `web.indexOf("allrecipes") != -1`, which makes sure at least the domain is there. — William Gaul, Oct 18 '13 at 07:50
Use pattern matching, i.e. regular expressions, to print out certain things from the website. — Prateek, Oct 18 '13 at 07:59
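
For what it's worth, here is a minimal sketch of the regular-expression route suggested in the last comment, using only the standard library. The class name `TitleExtractor` and the methods `extractTitle` and `stripTags` are made up for illustration, and the sample HTML is hard-coded. Regexes are fragile on real-world HTML, so treat this as a starting point rather than a robust parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractor {
    // Matches <title ...>...</title>, case-insensitively and across line breaks.
    private static final Pattern TITLE =
        Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the text inside the first title element, or null if none is found. */
    public static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    /** Replaces anything that looks like a tag with a space, then collapses whitespace. */
    public static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Hypothetical sample page; in the question this would be the text read from the URL.
        String html = "<html><head><title>Allrecipes</title></head>"
                    + "<body><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractTitle(html));
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));
    }
}
```

To use it with the program in the question, the lines read from the `BufferedReader` would first have to be appended to a single string (for example with the `StringBuilder` that is already declared there) before being passed to `extractTitle`.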

score 1 · Answer 1 · answered Oct 18 '13 at 08:02

Parsing HTML on your own is not a good idea. I would suggest using the jsoup library, which really helps with parsing and selecting elements.

With jsoup, your code could look something like this:

// needs org.jsoup.Jsoup, org.jsoup.nodes.Document and org.jsoup.select.Elements
Document doc = Jsoup.connect(web).get();
Elements title = doc.select("title");
System.out.println(title.text()); // text content only, without the tags

It is concise and readable, and you can easily parse or select other elements if you need to (e.g. with more complex CSS selectors like #recipes > div #recipe-title).

score 0 · Answer 2 · answered Oct 18 '13 at 07:50

You are looking for a web crawler. Just a couple of options: jsoup and Selenium (both let you retrieve elements with CSS selectors), and crawler4j (which I haven't used).

Silviu Burcea


score 0 · Answer 3 · answered Oct 18 '13 at 07:51

Then your if condition should be

if (web.equals(foo) || web.equals(foo.replace("http://", ""))) {

}

The above test passes if web is equal to either

http://allrecipes.com/

or

allrecipes.com/

As a side note, I guess there is no need for the trailing / in http://allrecipes.com/.

Suresh Atta


Actually, the ending / does matter sometimes! – Silviu Burcea Oct 18 '13 at 07:55
@SilviuBurcea Agreed. But when dealing with `String`s it matters :) – Suresh Atta Oct 18 '13 at 07:56

score 0 · Answer 4 · answered Oct 18 '13 at 08:07

Match the input against `foo`:

Scanner sc = new Scanner(System.in);
System.out.print("Please enter the URL. ");
String web = sc.nextLine(); // suppose "allrecipes.com"
String foo = "http://allrecipes.com"; // no need for the trailing /, as in http://allrecipes.com/

// does "web" have an allrecipes.com url?
// if it doesn't, then exit
// equals() compares literally; matches() would treat the user's input as a regex
if (foo.equals(web) || foo.equals("http://" + web)) {
 ..........
}

In the above case, the user can proceed only if they entered allrecipes.com or http://allrecipes.com.
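
The checks discussed in these answers (add `http://` when it is missing, then confirm the address points at allrecipes.com) could also be sketched with `java.net.URI`, which compares the host rather than the whole string. `UrlCheck`, `addScheme` and `isAllrecipes` are hypothetical names, and accepting subdomains and `https://` is an assumption, not something the question states.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UrlCheck {
    /** Prepends "http://" when the input has no http:// or https:// scheme. */
    public static String addScheme(String input) {
        String s = input.trim();
        String lower = s.toLowerCase();
        if (lower.startsWith("http://") || lower.startsWith("https://")) {
            return s;
        }
        return "http://" + s;
    }

    /** True if the URL's host is allrecipes.com or one of its subdomains (assumed acceptable). */
    public static boolean isAllrecipes(String url) {
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return false;
            }
            host = host.toLowerCase();
            return host.equals("allrecipes.com") || host.endsWith(".allrecipes.com");
        } catch (URISyntaxException e) {
            // Not a syntactically valid URL, so it cannot be an allrecipes.com one.
            return false;
        }
    }

    public static void main(String[] args) {
        String[] inputs = {
            "allrecipes.com", "http://allrecipes.com/",
            "www.allrecipes.com/recipes", "example.com/allrecipes"
        };
        for (String input : inputs) {
            String url = addScheme(input);
            System.out.println(url + " -> " + isAllrecipes(url));
        }
    }
}
```

Checking the host means the trailing / no longer matters, and an address such as example.com/allrecipes is rejected even though it contains the word "allrecipes".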
