Heritrix: how to exclude everything but pdf from mirroring?

Question

I found this topic: "How do I exclude everything but text/html from a Heritrix crawl?"

I have changed the bean to this:

<property name="shouldProcessRule">
  <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
    <property name="decision" value="ACCEPT" />
    <property name="regex" value="^application/pdf.*"/>
  </bean>
</property>

</bean>

But Heritrix still saves every file to the mirror directory.

score 0 · Answer 1 · answered Jul 22 '13 at 22:05

I believe you are missing a reject rule above your accept rule. I have the following that works:

<property name="shouldProcessRule">
  <bean class="org.archive.modules.deciderules.DecideRuleSequence">
    <property name="rules">
      <list>
        <bean class="org.archive.modules.deciderules.RejectDecideRule">
        </bean>
        <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
          <property name="decision" value="ACCEPT" />
          <property name="regex" value="^application/pdf.*"/>
        </bean>
      </list>
    </property>
  </bean>
</property>

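This rejects everything first, then accepts whatever matches the rules that follow. The rules in a DecideRuleSequence are evaluated in order, and the last rule that returns an actual decision wins. A ContentTypeMatchesRegexDecideRule on its own only says ACCEPT for matching URIs; for everything else it returns no decision rather than a rejection, which would explain why your version still writes every file.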
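
For context, here is a rough sketch of where that property would sit, assuming your mirror is written by a MirrorWriterProcessor bean. The bean id `mirrorWriter` is a hypothetical name I made up for illustration; keep whatever id and other properties your crawler-beans.cxml already has.

```xml
<!-- Sketch only: the id "mirrorWriter" is an assumed name, not taken from the question -->
<bean id="mirrorWriter" class="org.archive.modules.writer.MirrorWriterProcessor">
  <property name="shouldProcessRule">
    <bean class="org.archive.modules.deciderules.DecideRuleSequence">
      <property name="rules">
        <list>
          <!-- Default decision: reject every URI -->
          <bean class="org.archive.modules.deciderules.RejectDecideRule" />
          <!-- Override the default for responses whose content type is PDF -->
          <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
            <property name="decision" value="ACCEPT" />
            <property name="regex" value="^application/pdf.*" />
          </bean>
        </list>
      </property>
    </bean>
  </property>
</bean>
```

Note that this only controls what the writer saves. Other pages are still fetched, which is usually what you want, since the crawler has to follow HTML links to find the PDFs.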
