I indexed a Wikipedia dump file into Solr. Each page in the dump has this format:

<page>
    <title>Bruce Willis</title>
    <ns>0</ns>
    <id>64673</id>
    <revision>
      <id>789709463</id>
      <parentid>789690745</parentid>
      <timestamp>2017-07-09T02:27:39Z</timestamp>
      <contributor>
        <username>Materialscientist</username>
        <id>7852030</id>
      </contributor>
      <comment>imdb is not a reliable source</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve" bytes="57375">{{Use mdy dates|date=March 2012}}
{{Infobox person
 | name = Bruce Willis
 | image = Bruce Willis by Gage Skidmore.jpg
 | caption = Willis at the 2010 [[San Diego Comic-Con]].
 | birth_name = Walter Bruce Willis
 | birth_date = {{Birth date and age|1955|3|19}} 
| 
 | birth_place = [[Idar-Oberstein]], West Germany
 | nationality = [[American people|American]]
 | residence = [[Los Angeles]], [[California]], U.S.

And the schema file of the core:

<fieldType name="string" class="solr.StrField"/>
    <fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>

    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="_version_" type="long" indexed="true" stored="true"/>
    <field name="TITLE" type="text_wiki" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
    <field name="REVISION_TEXT" type="text_wiki" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
    <field name="REVISION_TIMESTAMP" type="date" indexed="true" stored="true" multiValued="true" />
    <field name="CONTRIBUTOR_ID" type="int" indexed="true" stored="true" multiValued="true" />
    <field name="CONTRIBUTOR_USERNAME" type="string" indexed="true" docValues="true" stored="true" multiValued="true" />

    <dynamicField name="*" type="string" indexed="true" stored="true" multiValued="true"/>
    <uniqueKey>id</uniqueKey>

This is not the full content of schema.xml. I know Solr can return a relevance score (similarity) for each document. With BM25, the term-frequency part of that score is calculated as `(freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))`. As I understand it, PageRank is instead based on the incoming and outgoing links between pages. But with the fields in this schema I cannot retrieve the incoming and outgoing links of a page.

So I don't know how to calculate PageRank using Solr. Have I misunderstood something? Any advice on how to do this would be appreciated. Thanks.

Cocoa3338

1 Answer

It depends on how advanced you want the PageRank to be. If you only want to consider the number of inbound links, you can calculate it by extracting, at indexing time, the list of pages that each page links to. You then iterate over your stored pages and, for each one, count the documents that link to it, storing that count in a new field. Sort by this field (or use it for boosting, etc.) to affect the order of the results returned.
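
As a rough illustration of the counting step, here is a minimal Python sketch. It assumes the outbound links of every page have already been extracted into a dictionary; the function name `count_inbound_links` and the sample data are made up for the example, and nothing here talks to Solr.

```python
from collections import Counter

def count_inbound_links(outbound):
    """Return {title: number of distinct other pages linking to it}.

    `outbound` maps each indexed page title to the titles it links to.
    """
    inbound = Counter()
    for source, targets in outbound.items():
        # set(): a page that links to the same target twice counts once
        for target in set(targets):
            # Ignore self-links and links to pages that are not indexed
            if target != source and target in outbound:
                inbound[target] += 1
    # Pages that nobody links to get an explicit 0
    return {title: inbound[title] for title in outbound}

# Hypothetical sample data, loosely based on the infobox in the question
outbound_links = {
    "Bruce Willis": ["Idar-Oberstein", "Los Angeles", "California"],
    "Los Angeles": ["California"],
    "California": ["Los Angeles"],
    "Idar-Oberstein": [],
}
inbound_counts = count_inbound_links(outbound_links)
```

Each count could then be written back to its document in a new numeric field (a name such as `INBOUND_LINKS` would be your own choice) and used for sorting or boosting. Note that this is only an inbound-link count, not the full iterative PageRank algorithm.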

MatsLindh
  • Thanks. But could you tell me how to extract the list of pages that a page links to when indexing? I only need links to other Wikipedia pages, while a page also contains many other links, for example to articles or news. – Cocoa3338 Jul 28 '17 at 10:05
  • For example, on [this page](https://en.wikipedia.org/wiki/Bruce_Willis) I need the links that jump to another Wikipedia page, such as "Emma Heming", but not "Bruce Willis Emmy Award Winner" (at the bottom of the page). – Cocoa3338 Jul 28 '17 at 10:10
  • That will depend on the markup of Wikipedia, but IIRC, anything inside `[[]]` denotes a link to the page with that name. `[[Idar-Oberstein]]` links to the "Idar-Oberstein" page. The Wikipedia markup also uses `|` to provide a readable name after the page name, but it's the part in front of `|` that is interesting for detecting links. – MatsLindh Jul 28 '17 at 10:13
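
The `[[...]]` rule from the last comment can be sketched in Python roughly as follows. This is a simplified illustration, not a complete wikitext parser: the regular expression, the helper name `extract_links`, and the choice to skip every target containing `:` (to drop `File:` or `Category:` style links, at the cost of also dropping article titles that contain a colon) are all assumptions made for the example.

```python
import re

# [[Target]], [[Target|label]] or [[Target#section|label]]; group 1 is the target title
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

def extract_links(wikitext):
    """Return the distinct page titles linked from `wikitext`, in order of appearance."""
    titles = []
    for match in WIKILINK.finditer(wikitext):
        title = match.group(1).strip()
        # Simplification: treat any target with ':' as namespaced and skip it
        if ":" in title:
            continue
        if title and title not in titles:
            titles.append(title)
    return titles

# Lines taken from the infobox shown in the question
sample = (
    " | birth_place = [[Idar-Oberstein]], West Germany\n"
    " | nationality = [[American people|American]]\n"
    " | residence = [[Los Angeles]], [[California]], U.S.\n"
)
links = extract_links(sample)
```

External links in wikitext use single square brackets, so a pattern that requires double brackets leaves them out.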