I'm working with Heritrix and I'm a bit stuck with managing its output.
I'm studying PageRank and I need Heritrix to generate a file that I can run the ranking algorithm on. The file should contain only the links and outlinks for each visited page.
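To make the target concrete, here is a minimal sketch of the kind of file I have in mind and how I would read it. It assumes a tab-separated edge list with one `source<TAB>target` pair per line. That format, the sample URLs, and the `parse_edge_list` helper are my own assumptions for illustration, not something Heritrix produces out of the box.

```python
from collections import defaultdict


def parse_edge_list(lines):
    """Build {page: set of outlinks} from 'source<TAB>target' lines.

    The tab-separated layout is an assumed format, not a Heritrix output.
    """
    graph = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        source, target = line.split("\t", 1)
        graph[source].add(target)
        graph[target]  # register pages that only appear as link targets
    return dict(graph)


# Hypothetical sample of the desired output: one link per line.
sample = [
    "http://a.example/\thttp://b.example/",
    "http://a.example/\thttp://c.example/",
    "http://b.example/\thttp://c.example/",
]
graph = parse_edge_list(sample)
```

A mapping like this (each page to its set of outlinks) is all the ranking step needs as input, which is why I would like the crawler to emit nothing else.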
I would like to avoid postprocessing as much as I can. Is it possible to customize Heritrix's output by specifying what should be included and what should not? I have already tried modifying the cxml file, but there is still a lot of unhelpful information in the output (like the page content).