There is some debate about whether this sort of information can be parsed efficiently; see "Getting directory listing over http".
But if we examine your concrete example, we observe the following:
- your file/folder metadata are stored as `TextNode`s inside the `pre` element,
- every relevant file/folder link (`a` element) has a direct sibling `br` that precedes it. Well, except for the root directory: https://download.bls.gov/. You have to treat that case separately.
This constitutes enough information for efficient queries:
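import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;
// ...
Document doc = Jsoup.connect("https://download.bls.gov/pub/time.series/").get();
Elements links = doc.select("pre br + a");               // links whose previous sibling element is a br
List<TextNode> metaData = doc.select("pre").textNodes(); // text nodes directly inside pre
for (int i = 0; i < links.size(); i++) {
    // getWholeText() returns the raw text; toString() would HTML-escape "<dir>"
    String metaDataRow = metaData.get(i).getWholeText();
    System.out.println(metaDataRow + " | " + links.get(i));
}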
You can further split up the `metaDataRow` to extract timestamps like so:
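import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("M/d/yyyy pph:m a", Locale.ENGLISH);
// ...
// Split on runs of 3+ spaces (assumes the columns are padded at least that far apart); date and time stay together.
String[] metaColumns = metaDataRow.strip().split("\\s{3,}");
LocalDate lastUpdated = LocalDate.parse(metaColumns[0], formatter);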
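If you need more than the date, the same formatter can be pointed at the start of the row and told to stop once the timestamp has been read. The following is only a sketch under assumptions: the class `ListingRow`, its fields, and the sample rows are made up for illustration, and it assumes each row consists of the timestamp followed by either `<dir>` or a plain byte count. It uses only `java.time`, so it does not depend on how you obtained the row text.

```java
import java.text.ParsePosition;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

/** Hypothetical holder for one parsed listing row (not part of jsoup). */
final class ListingRow {
    // Same pattern as above; "pp" pads the hour to width 2 with spaces.
    private static final DateTimeFormatter FORMATTER =
            DateTimeFormatter.ofPattern("M/d/yyyy pph:m a", Locale.ENGLISH);

    final LocalDateTime lastModified;
    final boolean directory;
    final long size; // -1 for directories

    private ListingRow(LocalDateTime lastModified, boolean directory, long size) {
        this.lastModified = lastModified;
        this.directory = directory;
        this.size = size;
    }

    /** Parses a row such as " 9/4/2019  8:30 AM        <dir> " (assumed layout). */
    static ListingRow parse(String metaDataRow) {
        String row = metaDataRow.strip();
        // Parse the timestamp from index 0 without requiring the whole row to match.
        ParsePosition pos = new ParsePosition(0);
        LocalDateTime timestamp = LocalDateTime.from(FORMATTER.parse(row, pos));
        // Whatever follows the timestamp is either the directory marker or the size.
        String rest = row.substring(pos.getIndex()).strip();
        // Accept the escaped form too, in case the row came from TextNode.toString().
        boolean dir = rest.equalsIgnoreCase("<dir>") || rest.equalsIgnoreCase("&lt;dir&gt;");
        long size = dir ? -1 : Long.parseLong(rest);
        return new ListingRow(timestamp, dir, size);
    }

    public static void main(String[] args) {
        // Made-up sample rows in the assumed format.
        String[] samples = {
            " 9/4/2019  8:30 AM        <dir> ",
            "3/12/2020 10:05 PM      1234567 "
        };
        for (String sample : samples) {
            ListingRow r = ListingRow.parse(sample);
            System.out.println(r.lastModified + " | dir=" + r.directory + " | size=" + r.size);
        }
    }
}
```

Inside the loop from the first snippet you would call `ListingRow.parse(metaDataRow)`. A row that does not match the assumed layout throws a `DateTimeParseException` or `NumberFormatException`, so you may want to catch those and skip the row.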