
I found this topic: "How do I exclude everything but text/html from a Heritrix crawl?"

I have changed the bean to this:

<property name="shouldProcessRule">
  <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
    <property name="decision" value="ACCEPT" />
    <property name="regex" value="^application/pdf.*"/>
  </bean>
</property>

</bean>

But Heritrix still saves every file to the mirror directory.

hudvin

1 Answer


I believe you are missing a reject rule above your accept rule. I have the following that works:

<property name="shouldProcessRule">
  <bean class="org.archive.modules.deciderules.DecideRuleSequence">
    <property name="rules">
      <list>
        <bean class="org.archive.modules.deciderules.RejectDecideRule">
        </bean>
        <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
          <property name="decision" value="ACCEPT" />
          <property name="regex" value="^application/pdf.*"/>
        </bean>
      </list>
    </property>
  </bean>
</property>

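A DecideRuleSequence applies its rules in order, and the last rule that returns a decision wins. So this rejects everything first, then accepts any URI whose content type matches one of the rules that follow.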
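
If more than one content type should be kept, the same pattern can probably be extended with regex alternation. This is only a sketch: the text/html alternative is an illustrative assumption of mine and not part of the question, and I have not run this exact variant.

```xml
<property name="shouldProcessRule">
  <bean class="org.archive.modules.deciderules.DecideRuleSequence">
    <property name="rules">
      <list>
        <!-- Start by rejecting everything -->
        <bean class="org.archive.modules.deciderules.RejectDecideRule">
        </bean>
        <!-- Then accept only the listed content types
             (text/html is added here purely as an example) -->
        <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
          <property name="decision" value="ACCEPT" />
          <property name="regex" value="^(?:application/pdf|text/html).*"/>
        </bean>
      </list>
    </property>
  </bean>
</property>
```

A single rule with alternation keeps the sequence short. Adding a second ContentTypeMatchesRegexDecideRule after the first should behave the same way, since a later ACCEPT overrides the initial reject.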

Nielsvh