Questions tagged [behemoth]

Behemoth is an open source platform for large scale document processing based on Apache Hadoop.

Architecture

Behemoth consists of a simple annotation-based implementation of a document and a number of modules operating on these documents. Its main aims are to simplify the deployment of document analysers on a large scale and to provide reusable modules for:
- ingesting from common data sources (Warc, Nutch, etc.)
- text processing (Tika, UIMA, GATE, Language Identification)
- generating output for external tools (SOLR, Mahout)

Its modular architecture simplifies the development of custom annotators based on MapReduce.
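
To make the "annotation-based document plus modules" idea concrete, here is a minimal Java sketch. It is an illustration only: the names `Document`, `Annotation`, `Module` and `TokenAnnotator` are hypothetical and are not taken from Behemoth's actual API, and the Hadoop wiring is left out. In a MapReduce job, a module like this would presumably be invoked once per document inside a map task.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: class names are illustrative, NOT Behemoth's real API.
public class AnnotatedDocumentSketch {

    /** A standoff annotation: a typed span over the text plus key/value features. */
    static final class Annotation {
        final String type;
        final int start; // inclusive character offset
        final int end;   // exclusive character offset
        final Map<String, String> features = new LinkedHashMap<>();

        Annotation(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** A document: source URL, extracted text, and the annotations added so far. */
    static final class Document {
        final String url;
        final String text;
        final List<Annotation> annotations = new ArrayList<>();

        Document(String url, String text) {
            this.url = url;
            this.text = text;
        }

        /** Returns the text span covered by an annotation. */
        String coveredText(Annotation a) {
            return text.substring(a.start, a.end);
        }
    }

    /** A processing module reads a document and enriches it with annotations. */
    interface Module {
        void process(Document doc);
    }

    /** Toy module: adds one "Token" annotation per whitespace-separated token. */
    static final class TokenAnnotator implements Module {
        @Override
        public void process(Document doc) {
            int i = 0;
            int n = doc.text.length();
            while (i < n) {
                // Skip whitespace before the next token.
                while (i < n && Character.isWhitespace(doc.text.charAt(i))) {
                    i++;
                }
                int start = i;
                // Consume the token itself.
                while (i < n && !Character.isWhitespace(doc.text.charAt(i))) {
                    i++;
                }
                if (i > start) {
                    Annotation token = new Annotation("Token", start, i);
                    token.features.put("length", Integer.toString(i - start));
                    doc.annotations.add(token);
                }
            }
        }
    }

    public static void main(String[] args) {
        Document doc = new Document("file:///example.txt", "Behemoth runs on Hadoop");
        new TokenAnnotator().process(doc);
        for (Annotation a : doc.annotations) {
            System.out.println(a.type + " [" + a.start + "," + a.end + ") " + doc.coveredText(a));
        }
    }
}
```

Because the annotations only reference offsets into the text, several modules can be chained over the same document without modifying its content, which is what makes this kind of design easy to parallelise per document.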

Note that Behemoth does not implement any NLP or Machine Learning components itself; it serves as 'large-scale glueware' for existing resources. Being Hadoop-based, it benefits from Hadoop's features, namely scalability, fault tolerance and, most notably, the backing of a thriving open source community.

5 questions
1 vote, 1 answer

Index GATE annotations in SOLR

I need to index all the annotations and features generated after GATE processing into SOLR. I need to search upon annotations as well as features. What is the best way to do this? I would prefer moving processing to hadoop. I am using behemoth at…
madzie
  • 47
  • 1
  • 9
1 vote, 1 answer

Slf4j compatibility issues between solr and hadoop

I am using behemoth solr on hadoop, and I am getting a conflict in the slf4j versions. Solr 3.6.2 uses slf4j-api-1.6.1 and hadoop 1.0.4 has libraries for slf4j-api-1.4.3. Due to this, I am unable to run the behemoth solr jar file on hadoop. What is…
madzie
  • 47
  • 1
  • 9
0 votes, 1 answer

impl.ConcurrentUpdateSolrServer: Status for: {file-path}is 404

I want to index a corpus using solr. To create a sequence file, I used the following command: ./behemoth -i file://path/to/my/file/where/the corpus/is/located -o /user/user-name/file-to-which-the-putput-is-stored After this I gave the…
0 votes, 2 answers

Error in generating Behemoth corpus

I am new to hadoop and behemoth and I followed the tutorial on https://github.com/DigitalPebble/behemoth/wiki/tutorial to generate a behemoth corpus for a text document, using the following command: sudo bin/hadoop jar…
madzie
  • 47
  • 1
  • 9
0 votes, 2 answers

Error in configuring object when converting into Tika using Behemoth and map reduce

I am running the command to convert a behemoth corpus to tika using map reduce, as given in this tutorial, and I am getting the following error: 13/02/25 14:44:00 INFO mapred.FileInputFormat: Total input paths to process : 1 13/02/25 14:44:01…
Shrey Shivam
  • 1,107
  • 1
  • 7
  • 16