I'm implementing a web scraper in Java. After playing around a little with the websites I'm going to crawl, I want to follow best practices for concurrent HTTP connections in Java. I'm currently using Jsoup's connection method. I'd like to know if it's possible to create threads and make connections inside those threads, similar to what HttpAsyncClient does.
1 Answer
Jsoup does not use HttpAsyncClient. Jsoup's Jsoup.connect(String url) method uses the blocking URL.openConnection() method under the hood.
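Concretely, this is roughly the blocking pattern such a fetch follows: the calling thread stalls while the connection is opened and the body is read. A minimal sketch (the URL is just a placeholder):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

public class BlockingFetchSketch {

    public static void main(String[] args) throws IOException {
        final URLConnection connection = new URL("https://example.com").openConnection();
        // getInputStream() blocks the calling thread until the server responds,
        // and reading the body blocks it further until the response is consumed
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            final String body = reader.lines().collect(Collectors.joining("\n"));
            System.out.println(body.length() + " characters fetched");
        }
    }
}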
If you want to use Jsoup asynchronously, you can parallelize all Jsoup.connect() executions. In Java 8 you can use a parallel stream to do so. Let's say you have a list of URLs you want to scrape in parallel. Take a look at the following example:
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

public class ConcurrentJsoupExample {

    public static void main(String[] args) {
        final List<String> urls = Arrays.asList(
                "https://google.com",
                "https://stackoverflow.com/questions/48298219/is-there-a-difference-between-httpasyncclient-and-multithreaded-jsoup-connection",
                "https://mvnrepository.com/artifact/org.jsoup/jsoup",
                "https://docs.oracle.com/javase/7/docs/api/java/net/URL.html#openConnection()",
                "https://docs.oracle.com/javase/7/docs/api/java/net/URLConnection.html"
        );

        final List<String> titles = urls.parallelStream()
                .map(url -> {
                    try {
                        // Blocking fetch; the parallel stream runs these concurrently
                        return Jsoup.connect(url).get();
                    } catch (IOException e) {
                        // Skip URLs that failed to load
                        return null;
                    }
                })
                .filter(Objects::nonNull)
                .map(doc -> doc.select("title"))
                .map(Elements::text)
                .peek(it -> System.out.println(Thread.currentThread().getName() + ": " + it))
                .collect(Collectors.toList());
    }
}
Here we have 5 URLs defined, and the goal of this simple application is to get the text value of the <title> HTML tag from each of these websites. We create a parallel stream from the list of URLs and map each URL to Jsoup's Document object. The .get() method throws a checked IOException, so we have to try-catch it and return null if an exception occurs. All null values get filtered out by .filter(Objects::nonNull), and after that we can extract the elements we need: the text value of the <title> tag in this case. I also added a .peek() call that prints the extracted value and the name of the thread it runs on. Example output may look like this:
ForkJoinPool.commonPool-worker-1: java - Is there a difference between HttpAsyncClient and multithreaded Jsoup connection class? - Stack Overflow
main: Maven Repository: org.jsoup » jsoup
ForkJoinPool.commonPool-worker-4: URL (Java Platform SE 7 )
ForkJoinPool.commonPool-worker-2: URLConnection (Java Platform SE 7 )
ForkJoinPool.commonPool-worker-3: Google
In the end we call .collect(Collectors.toList()) to terminate the stream, execute all transformations, and return a list of titles.
It is just a simple example, but it should give you a hint of how to use Jsoup in parallel.
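One caveat: a parallel stream runs on the shared ForkJoinPool.commonPool (you can see its worker names in the output above), so the degree of parallelism is not under your direct control. If you want an explicit cap on the number of concurrent connections, a plain ExecutorService is one option. A minimal sketch, where the pool size of 4 is an arbitrary choice:

import org.jsoup.Jsoup;

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

public class FixedPoolJsoupExample {

    public static void main(String[] args) throws Exception {
        final List<String> urls = Arrays.asList(
                "https://google.com",
                "https://mvnrepository.com/artifact/org.jsoup/jsoup"
        );

        // Fixed pool: at most 4 connections in flight at any moment
        final ExecutorService pool = Executors.newFixedThreadPool(4);
        final List<Callable<String>> tasks = urls.stream()
                .map(url -> (Callable<String>) () -> Jsoup.connect(url).get().select("title").text())
                .collect(Collectors.toList());
        try {
            // invokeAll() blocks until every task has completed
            for (Future<String> result : pool.invokeAll(tasks)) {
                System.out.println(result.get());
            }
        } finally {
            pool.shutdown();
        }
    }
}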
Alternatively, you can use urls.parallelStream().forEach() if the functional approach does not convince you:
urls.parallelStream().forEach(url -> {
    try {
        final Document doc = Jsoup.connect(url).get();
        final String title = doc.select("title").text();
        System.out.println(Thread.currentThread().getName() + ": " + title);
        // do something with extracted title...
    } catch (IOException e) {
        e.printStackTrace();
    }
});
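If you would rather have a callback style closer to what HttpAsyncClient offers, Java 8's CompletableFuture can wrap the same blocking call and invoke a callback when it completes. A minimal sketch; note that the fetch itself still blocks a worker thread, only the coordination is asynchronous:

import org.jsoup.Jsoup;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class CompletableFutureJsoupExample {

    public static void main(String[] args) {
        final List<String> urls = Arrays.asList(
                "https://google.com",
                "https://mvnrepository.com/artifact/org.jsoup/jsoup"
        );

        final List<CompletableFuture<Void>> futures = urls.stream()
                .map(url -> CompletableFuture
                        // supplyAsync() runs the blocking fetch on the common ForkJoinPool
                        .supplyAsync(() -> {
                            try {
                                return Jsoup.connect(url).get().select("title").text();
                            } catch (IOException e) {
                                throw new UncheckedIOException(e);
                            }
                        })
                        // thenAccept() is invoked as a callback when the fetch completes
                        .thenAccept(title ->
                                System.out.println(Thread.currentThread().getName() + ": " + title)))
                .collect(Collectors.toList());

        // Wait for all callbacks to finish before the JVM exits
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
    }
}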

- Thanks for the implementation! From a broader perspective, you're saying that HttpAsyncClient isn't doing anything different than regular multithreading, right? I asked this question mainly because I thought HttpAsyncClient could be doing something to modify particular network settings, the packets sent/received, etc. – Mert Akozcan Jan 17 '18 at 11:13
- Well, the main purpose of HttpAsyncClient is to execute non-blocking requests (in a multi-threaded manner in this case), but you can also e.g. pipeline multiple requests. Jsoup's implementation does not use any asynchronous execution methods, so the only way is to parallelize execution to get this async feeling. And that's very useful when doing scraping, because you don't want to limit yourself to a single thread when multiple executions can happen in parallel. – Szymon Stepniak Jan 17 '18 at 12:02
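For comparison, a truly non-blocking fetch with Apache HttpAsyncClient looks roughly like the sketch below, with Jsoup used purely as an HTML parser for the response body. A minimal sketch, assuming the 4.x API (httpasyncclient on the classpath):

import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.concurrent.FutureCallback;
import org.apache.http.impl.nio.client.CloseableHttpAsyncClient;
import org.apache.http.impl.nio.client.HttpAsyncClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;

import java.io.IOException;
import java.util.concurrent.CountDownLatch;

public class HttpAsyncClientSketch {

    public static void main(String[] args) throws Exception {
        final CountDownLatch latch = new CountDownLatch(1);
        try (CloseableHttpAsyncClient client = HttpAsyncClients.createDefault()) {
            client.start();
            // The request is dispatched on the client's NIO reactor; no thread blocks waiting
            client.execute(new HttpGet("https://google.com"), new FutureCallback<HttpResponse>() {
                @Override
                public void completed(HttpResponse response) {
                    try {
                        final String html = EntityUtils.toString(response.getEntity());
                        // Jsoup still does the parsing, just not the fetching
                        System.out.println(Jsoup.parse(html).select("title").text());
                    } catch (IOException e) {
                        e.printStackTrace();
                    } finally {
                        latch.countDown();
                    }
                }

                @Override
                public void failed(Exception ex) {
                    ex.printStackTrace();
                    latch.countDown();
                }

                @Override
                public void cancelled() {
                    latch.countDown();
                }
            });
            latch.await(); // keep main alive until the callback fires
        }
    }
}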