
I am creating a simple program that takes a search input from the user and the number of links that they want to receive. However, my code has 2 problems.

  1. When I print out the links, it also includes links for images, news, etc... I was wondering how I could only keep links related to the search.

     String searchURL = GOOGLE_SEARCH_URL + "?q="+searchTerm+"&num="+num;
     Document doc = Jsoup.connect(searchURL).userAgent("Mozilla/5.0").get();     
     Elements results = doc.select("a[href]");
    

    However, filtering the document using `a[href]` includes all links on the page. I also tried `h3.r > a`, which returned no results at all, and `a > h3`, which fixed my current problem but only displays the title and not the actual link. Here is part of the output that I want to get rid of:

Title: Google
Link: https://www.google.com/?sa=X&ved=0ahUKEwifoOa-p7bvAhUDqlkKHS8fCsIQOwgC

Title: Google
Link: https://www.google.com/?num=5&output=search&sa=X&ved=0ahUKEwifoOa-p7bvAhUDqlkKHS8fCsIQPAgE

Title: News
Link: https://www.google.com/search?q=java&num=5&source=lnms&tbm=nws&sa=X&ved=0ahUKEwifoOa-p7bvAhUDqlkKHS8fCsIQ_AUICCgB

Title: Images
Link: https://www.google.com/search?q=java&num=5&source=lnms&tbm=isch&sa=X&ved=0ahUKEwifoOa-p7bvAhUDqlkKHS8fCsIQ_AUICSgC

Title: Books
Link: https://www.google.com/search?q=java&num=5&source=lnms&tbm=bks&sa=X&ved=0ahUKEwifoOa-p7bvAhUDqlkKHS8fCsIQ_AUICigD

Title: Maps
Link: https://maps.google.com/maps?q=java&num=5&um=1&ie=UTF-8&sa=X&ved=0ahUKEwifoOa-p7bvAhUDqlkKHS8fCsIQ_AUICygE

Title: Videos
Link: https://www.google.com/search?q=java&num=5&source=lnms&tbm=vid&sa=X&ved=0ahUKEwifoOa-p7bvAhUDqlkKHS8fCsIQ_AUIDCgF
...
  2. Because of this, I end up with many more links than the requested amount, and I think it's also including the sub-links to pages (e.g., when you search for Java, the first link is java.com, followed by its sub-links, which are the download pages).

In short, I want to filter out Google links such as images, news, maps, shopping, etc., and only include the main links to pages.

marc_s

2 Answers

1

You cannot achieve your goal with jsoup, because the HTML content returned by Google doesn't contain the search results. Everything is built by JavaScript. You can check it by running:

curl https://www.google.com/\?q\=stackoverflow\&num\=3 | grep stackoverflow

You will not find any stackoverflow URL. You can try to run the JavaScript yourself, but it will be much easier to use the Google Programmable Search Engine.
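
As a rough illustration of the Programmable Search Engine route, here is a minimal sketch that builds a request URL for the Custom Search JSON API. The class name `CustomSearchUrl` and the method `build` are made up for this example, and `apiKey` and `engineId` are placeholders for credentials you would have to create yourself. The API accepts `num` values from 1 to 10 per request and returns JSON whose `items` entries carry a `title` and a `link`.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper (not part of any Google library): it only builds the
// request URL; fetching and parsing the JSON response is left out.
final class CustomSearchUrl {
    // Endpoint of the Custom Search JSON API.
    static final String ENDPOINT = "https://www.googleapis.com/customsearch/v1";

    static String build(String apiKey, String engineId, String query, int num) {
        // The API returns at most 10 results per request.
        if (num < 1 || num > 10) {
            throw new IllegalArgumentException("num must be between 1 and 10");
        }
        return ENDPOINT
                + "?key=" + encode(apiKey)
                + "&cx=" + encode(engineId)
                + "&q=" + encode(query)
                + "&num=" + num;
    }

    // URL-encodes a parameter value so spaces and symbols are safe in a query string.
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }
}
```

The resulting URL can be fetched with any HTTP client and the response parsed with a JSON library of your choice, so only real result pages come back and no HTML filtering is needed.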

Ortomala Lokni
  • Thanks for your help. What I ended up doing was creating two documents. The first one used the filter of a > h3 to get the titles, and the other one was a[href] to get all titles and their respective links. I then put all the info from the second document into a hashmap, and then I used the titles from the first document and checked if a key in the hashmap contained my title, and used the value to get its respective link. Probably not the most efficient solution, but it worked for me. – Shrey Varma Mar 18 '21 at 00:07
0

My own solution:

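        // Title elements: every <h3> that is a direct child of an <a>.
        Elements title = doc.select("a > h3");
        // Every anchor on the page that has an href attribute.
        Elements links = doc.select("a[href]");
        
        // Map each anchor's visible text to its absolute URL.
        Map<String, String> combo = new HashMap<>();

        for (Element result : links) {
            combo.put(result.text(),result.attr("abs:href"));
        }
        for (Element result : title) {
            System.out.println("Title:"+result.text());
            String temp = getSimilar(combo, result.text());
            if (temp != null) {
                // The target URL starts at the second "http" and ends before the next "&".
                int start = temp.indexOf("http", temp.indexOf("http") + 1);
                int end = start < 0 ? -1 : temp.indexOf("&", start);
                // Fall back to the raw link when it does not have that shape.
                System.out.println("Link:" + (end < 0 ? temp : temp.substring(start, end)));
            } else {
                System.out.println("No link found");
            }
            System.out.println();
        }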
        
    }

    // Returns the link whose anchor text contains the given title, or null if none does.
    public static String getSimilar(Map<String, String> combo, String sim){
        for (Map.Entry<String,String> entry : combo.entrySet()){
            if(entry.getKey().contains(sim))
                return entry.getValue();
        }
        return null;
    }
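
The substring step above relies on the result links being redirect URLs that embed the real target. A slightly more defensive way to pull that target out might look like the sketch below. It assumes (this is inferred from the substring logic, not confirmed anywhere) that such links have the shape `https://www.google.com/url?q=<target>&sa=...`; the class `RedirectLinks` and the method `extractTarget` are names invented for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper: extracts the target page from a redirect-style link.
final class RedirectLinks {
    // Returns the decoded "q" parameter when href looks like ".../url?q=<target>&...",
    // or null when the link does not have that (assumed) shape.
    static String extractTarget(String href) {
        int queryStart = href.indexOf('?');
        if (queryStart < 0 || !href.substring(0, queryStart).endsWith("/url")) {
            return null; // not a redirect link, e.g. the News or Images tabs
        }
        for (String pair : href.substring(queryStart + 1).split("&")) {
            if (pair.startsWith("q=")) {
                try {
                    return URLDecoder.decode(pair.substring(2), "UTF-8");
                } catch (UnsupportedEncodingException | IllegalArgumentException e) {
                    return null; // malformed percent-encoding
                }
            }
        }
        return null;
    }
}
```

Links for which `extractTarget` returns null (the News, Images, and Maps tabs, for instance) can simply be skipped, which also removes the need to match titles against link text.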
  • The only remaining problem is that the number of links is always slightly off. – Shrey Varma Mar 18 '21 at 00:17
  • Once you've got your links you could always store them in a Collection, like a `List<>`, instead of directly printing them out; then filter the collection to exclude some kinds of links (images, like you mention?) and limit the total number that are output. – Stephen P Mar 18 '21 at 00:30
  • Yup, I thought about doing that as well, but there were a bunch of different things that I wanted to filter out, so I just stuck with this approach. To fix this problem, I multiply the number of links the user asks for by 2 (because I always get fewer links than I ask for). This gives me a few extra links instead of too few, so I just manually break out of my second loop when it goes over the requested amount. – Shrey Varma Mar 18 '21 at 00:52
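
The collection-based idea from the comments could be sketched roughly as follows: gather every link into a `List`, drop the ones that point back at Google, remove duplicates, and cap the output at the requested count. The class `LinkFilter` and its methods are hypothetical names for this illustration, and the sketch assumes the list already holds final, absolute URLs rather than redirect links.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: filters collected links and limits how many are returned.
final class LinkFilter {
    // Keeps the first `limit` distinct links that point outside google.com.
    static List<String> keepExternal(List<String> links, int limit) {
        return links.stream()
                .filter(LinkFilter::isExternalPage)
                .distinct()
                .limit(limit)
                .collect(Collectors.toList());
    }

    // True when the link parses as an absolute URL whose host is not google.com
    // or one of its subdomains (www.google.com, maps.google.com, ...).
    static boolean isExternalPage(String link) {
        try {
            String host = new URI(link).getHost();
            if (host == null) {
                return false; // relative or otherwise unusable link
            }
            return !(host.equals("google.com") || host.endsWith(".google.com"));
        } catch (URISyntaxException e) {
            return false; // unparsable links are dropped
        }
    }
}
```

Because `limit` is applied after filtering, over-fetching (for example, requesting twice as many results) still works, but the manual `break` in the print loop is no longer needed.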