
For a paper I am writing for my university project, I need to analyze Maven packages. I need the POM files of these packages so that I can parse them, extract the necessary data, build a graph, and analyze it. My supervisor told me to download the latest index published by Maven Central, which he said should contain around 10k packages, making it a good dataset to work with. I followed the steps at https://maven.apache.org/repository/central-index.html, but frankly I am quite lost on how to use this index to get the POM files. After finishing the steps from that page, the screen I end up with is the Luke index browser.

I searched all over the internet but found very little about Luke (the Lucene index browser), and the documentation seems extremely old. Is there some other way to download the POM files of the packages listed in this index? https://repo.maven.apache.org/maven2/.index/

Denxah129
  • Note that downloading all POM files from Maven Central is not possible because there are far too many of them. – J Fabian Meier Jun 04 '22 at 15:35
  • This appears to be very similar to a question you have already asked: [How to get all the dependencies of Maven packages](https://stackoverflow.com/q/72151893/12567365). – andrewJames Jun 04 '22 at 16:37
  • Also, this new question appears to be mixing up two completely different steps: (1) Downloading Maven index data; (2) Using Luke to explore Indexed Lucene data (the index data after completion of step 1, as described in the page you link to). Luke is a tool which is bundled in all recent binary distributions of Lucene - it has nothing to do with downloading data (including POMs) from Maven. – andrewJames Jun 04 '22 at 16:37
  • First, I do not need to download all POM files, just some of them, from an incremental index update, which I believe is posted weekly. Second, to respond to @andrewJames: I thought those steps guide you through downloading the data that can be found in the index at https://repo.maven.apache.org/maven2/.index/. That is why I wondered how to use Luke. But if Luke has nothing to do with my goal of downloading the POM files of the latest incremental, then how should I proceed? Thanks! – Denxah129 Jun 04 '22 at 18:41
  • Initially, I thought I needed to analyze everything. That is why this question is very similar to one I asked in the past, but since then I have advanced quite a bit in my research and I only need to analyze a dataset of 10k or more packages, not the whole 400K+ or however many exist in Maven. If downloading POM files from the latest incremental is not achievable, maybe there is some other dataset I can download from somewhere? – Denxah129 Jun 04 '22 at 18:51
  • The approach in your question allows you to use Lucene (e.g. via Luke) to _search_ for terms in a downloaded copy of the central index data. None of that will give you the POMs for packages. I think instead you want to trawl through a _listing_ of packages and get the POMs from that listing. Maybe you can start [here](https://repo1.maven.org/maven2/) and drill down into a subset of these top-level entries. – andrewJames Jun 04 '22 at 20:19
  • If you do this manually (just click the links) for the first entry (HTTPClient), you will eventually see the link for [HTTPClient-0.3-3.pom](https://repo1.maven.org/maven2/HTTPClient/HTTPClient/0.3-3/HTTPClient-0.3-3.pom). Now you can automate this link crawling (you can do some research for how), and you can stop the crawl when you have downloaded 10k POM files. Whether this is allowed/supported by Maven, I have no idea. – andrewJames Jun 04 '22 at 20:19
  • I think I will try this then; hopefully they won't come knocking on the door saying this is not allowed. Thanks a lot! – Denxah129 Jun 04 '22 at 20:40
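
The link crawl suggested in the comments above could be sketched roughly as follows. This is a minimal illustration, not a tested or endorsed way of harvesting Maven Central: it assumes the repository serves plain HTML directory listings in which subdirectories are `<a>` links ending in `/` and POM files are links ending in `.pom` (as the HTTPClient example suggests). The names `LinkParser`, `extract_links`, `fetch_text` and `crawl_poms`, the `delay` value, and the depth-first order are all choices made for this sketch. Whether crawling at this scale is permitted is not established here, so check the repository's terms before running it against the real site.

```python
import time
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

# Starting point taken from the comment above; must end with "/".
BASE = "https://repo1.maven.org/maven2/"


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, page_url):
    """Split a listing page's links into subdirectory URLs and POM URLs."""
    parser = LinkParser()
    parser.feed(html)
    dirs, poms = [], []
    for href in parser.hrefs:
        url = urljoin(page_url, href)
        # Ignore "../" and any link that leaves the current directory.
        if url == page_url or not url.startswith(page_url):
            continue
        if url.endswith("/"):
            dirs.append(url)
        elif url.endswith(".pom"):
            poms.append(url)
    return dirs, poms


def fetch_text(url):
    """Download a URL and return its body as text."""
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl_poms(start_url, limit, fetch=fetch_text, delay=0.5):
    """Depth-first walk of the listing; stops once `limit` POM URLs are found."""
    stack = [start_url]
    found = []
    while stack and len(found) < limit:
        page_url = stack.pop()
        dirs, poms = extract_links(fetch(page_url), page_url)
        found.extend(poms[: limit - len(found)])
        # Reverse so that directories are visited in listing order.
        stack.extend(reversed(dirs))
        time.sleep(delay)  # assumed politeness pause between requests
    return found
```

Under these assumptions, `crawl_poms(BASE, 10_000)` would return up to 10k POM URLs, each of which could then be downloaded with `fetch_text` and saved to disk for parsing. Because the walk is depth-first from the top of the listing, the resulting sample is biased toward the first groups in the listing; restricting `start_url` to a few chosen group directories may give a more useful dataset.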
