
I am crawling using Heritrix 3.1.0. I am trying to save the files using the MirrorWriterProcessor. However, this option is not available in the crawler-beans.cxml.

What I did was replace the class of the "warcWriter" bean, "org.archive.modules.writer.WARCWriterProcessor", with "org.archive.modules.writer.MirrorWriterProcessor".

However, this processor writes the mirror content to $HERITRIX_HOME/mirror.

I set "path" to "${launchId}/mirror", hoping Heritrix would write the mirror directory under the job directory.

What should I do to make MirrorWriterProcessor write under the job directory?

fanchyna

1 Answer


You cannot, at the moment, use tags like the ones warcWriter accepts in the MirrorWriterProcessor path. You can, however, use a bit of Spring configuration to create your own date-stamped folders. The beans below use the format method of SimpleDateFormat as a factory method, producing a date string that you can use to build the name of a stamped folder.

<bean id="dateFormat" class="java.text.SimpleDateFormat">
  <constructor-arg value="ddMMyyyy" />
</bean>
<bean id="formatedDate" factory-bean="dateFormat" factory-method="format">
  <constructor-arg>
    <bean class="java.util.Date" />
  </constructor-arg>
</bean>
<bean id="mirrorWriter" class="org.archive.modules.writer.MirrorWriterProcessor">
  <property name="path">
    <bean class="java.lang.String">
      <constructor-arg value="#{formatedDate + '/mirror'}" />
    </bean>
  </property>
...
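
To see what string the beans above end up passing as "path", here is a minimal plain-Java sketch of the same computation. The class name MirrorPathSketch and the method stampedMirrorPath are hypothetical names used only for illustration; they are not part of Heritrix or Spring. It assumes the default locale and time zone, as the bean definitions do.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical illustration class; not part of Heritrix or Spring.
public class MirrorPathSketch {

    // Mirrors the bean wiring: the "dateFormat" bean formats a java.util.Date,
    // and the SpEL expression appends '/mirror' to the result.
    static String stampedMirrorPath(Date when) {
        SimpleDateFormat dateFormat = new SimpleDateFormat("ddMMyyyy");
        String formatedDate = dateFormat.format(when); // same spelling as the bean id
        return formatedDate + "/mirror";
    }

    public static void main(String[] args) {
        // Prints the day-month-year stamp for the current date, followed by "/mirror".
        System.out.println(stampedMirrorPath(new Date()));
    }
}
```

The value is a relative path built from the date, so this only gives you one stamped folder per day; it does not by itself place the folder under the job directory.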
Nielsvh