I am creating a simple program that takes a search input from the user and the number of links that they want to receive. However, my code has 2 problems.
When I print out the links, the output also includes links for images, news, etc. How can I keep only the links related to the search?
String searchURL = GOOGLE_SEARCH_URL + "?q=" + searchTerm + "&num=" + num;
Document doc = Jsoup.connect(searchURL).userAgent("Mozilla/5.0").get();
Elements results = doc.select("a[href]");
However, filtering the document using "a[href]" includes all links on the page. I also tried using "h3.r > a", which returned no results at all, and also "a > h3", which fixed my current problem, but it would only display the title and not the actual link. Here is part of the output that I want to get rid of:
Title: Google
Link: https://www.google.com/?sa=X&ved=0ahUKEwifoOa-p7bvAhUDqlkKHS8fCsIQOwgC
Title: Google
Link: https://www.google.com/?num=5&output=search&sa=X&ved=0ahUKEwifoOa-p7bvAhUDqlkKHS8fCsIQPAgE
Title: News
Link: https://www.google.com/search?q=java&num=5&source=lnms&tbm=nws&sa=X&ved=0ahUKEwifoOa-p7bvAhUDqlkKHS8fCsIQ_AUICCgB
Title: Images
Link: https://www.google.com/search?q=java&num=5&source=lnms&tbm=isch&sa=X&ved=0ahUKEwifoOa-p7bvAhUDqlkKHS8fCsIQ_AUICSgC
Title: Books
Link: https://www.google.com/search?q=java&num=5&source=lnms&tbm=bks&sa=X&ved=0ahUKEwifoOa-p7bvAhUDqlkKHS8fCsIQ_AUICigD
Title: Maps
Link: https://maps.google.com/maps?q=java&num=5&um=1&ie=UTF-8&sa=X&ved=0ahUKEwifoOa-p7bvAhUDqlkKHS8fCsIQ_AUICygE
Title: Videos
Link: https://www.google.com/search?q=java&num=5&source=lnms&tbm=vid&sa=X&ved=0ahUKEwifoOa-p7bvAhUDqlkKHS8fCsIQ_AUIDCgF
...
Because of this, I also end up with many more links than the requested amount, and I think the output includes the sub-links of pages too (e.g. when you search for Java, the first link is java.com, followed by its sub-links, which are the download pages).
In short, I want to be able to filter out Google's own links, such as images, news, maps, shopping, etc., and only include the main links to pages.
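
To make the desired behaviour concrete, here is a minimal sketch of that kind of post-filtering, using only the standard library on hrefs that have already been extracted. The class ResultLinkFilter and its methods are hypothetical names for illustration, not part of jsoup. It rests on a few assumptions: each href is an absolute URL pointing at the target site (if Google wraps results in a redirect link on its own domain, those would need unwrapping first), anything hosted on google.com or a subdomain is navigation rather than a result, and a second link to a host that was already kept is a sub-link.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not a jsoup API): filters hrefs that were already extracted.
final class ResultLinkFilter {

    // Returns the lower-cased host of an absolute URL, or null if it has none or is malformed.
    static String hostOf(String link) {
        try {
            String host = new URI(link).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // Assumption: google.com and its subdomains (www, maps, ...) are never real results.
    static boolean isGoogleHost(String host) {
        return host.equals("google.com") || host.endsWith(".google.com");
    }

    // Keeps at most `limit` links, skipping Google links and repeat hosts (treated as sub-links).
    static List<String> filterResults(List<String> links, int limit) {
        List<String> kept = new ArrayList<>();
        Set<String> seenHosts = new HashSet<>();
        for (String link : links) {
            if (kept.size() >= limit) {
                break;
            }
            String host = hostOf(link);
            if (host == null || isGoogleHost(host)) {
                continue; // relative, malformed, or Google navigation link
            }
            if (!seenHosts.add(host)) {
                continue; // host already kept, so treat this as a sub-link
            }
            kept.add(link);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
                "https://www.google.com/search?q=java&num=5&source=lnms&tbm=nws",
                "https://maps.google.com/maps?q=java&num=5",
                "https://www.java.com/",
                "https://www.java.com/download/",
                "https://www.oracle.com/java/",
                "https://en.wikipedia.org/wiki/Java");
        // prints [https://www.java.com/, https://www.oracle.com/java/]
        System.out.println(filterResults(links, 2));
    }
}
```

Deduplicating by host is a blunt rule: it would also drop a second genuine result from the same site, so it only fits if one link per site is what is wanted.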