Hadoop: Reading only the "English" pages

Question

I am trying to read only the "English" web pages from Common Crawl. I am running these Hadoop jobs through the Amazon interface. Please have a look at the following code, which is the mapper. There is no reducer.

#!/usr/bin/php
<?php

// Echo the first 100 input lines, lowercased and trimmed, then stop.
$counter = 0;

while ($counter < 100 && ($line = fgets(STDIN)) !== false) {
    $counter++;
    $line = strtolower(trim($line));
    echo "$line\n";
}

echo "mapper1\n";

?>

This code reads the first 100 lines of the article. How can I change it so that it reads only the "English" articles? Also, which data set should I use?

Please help.

score 0 · Answer 1 · answered Jan 08 '14 at 17:42

You can run a language detector after reading a line or a few lines. Here is some PHP code showing how to do it: http://phpir.com/language-detection-with-n-grams (it is already configured to detect certain languages, including English).
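
To make the idea concrete, here is a minimal sketch of a mapper-side filter. Note that it does not use the n-gram method from the linked article: it swaps in a much cruder stopword-ratio heuristic, which is only enough to show where the check would sit. The function names englishScore and filterEnglishLines, the stopword list, and the 0.2 threshold are all illustrative assumptions, and the sketch treats every input line as one unit of text. It does not know where one Common Crawl page ends and the next begins, so grouping lines per page is left out.

```php
<?php

// Hypothetical helper: fraction of tokens that are common English stopwords.
// The word list is a small illustrative sample, not a tuned resource.
function englishScore(string $text): float
{
    static $stopwords = null;
    if ($stopwords === null) {
        $stopwords = array_flip([
            'the', 'and', 'of', 'to', 'in', 'is', 'that', 'it', 'for', 'with',
            'as', 'was', 'on', 'are', 'this', 'be', 'by', 'at', 'from', 'or',
            'have', 'not', 'you', 'an',
        ]);
    }

    // ASCII-only tokenizer: split on anything that is not a-z.
    // Text in non-Latin scripts yields no tokens and scores 0.0.
    $tokens = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    if ($tokens === false || count($tokens) === 0) {
        return 0.0;
    }

    $hits = 0;
    foreach ($tokens as $token) {
        if (isset($stopwords[$token])) {
            $hits++;
        }
    }

    return $hits / count($tokens);
}

// Hypothetical helper: copy lines that look English from $in to $out.
// The 0.2 threshold is an assumption; returns the number of lines kept.
function filterEnglishLines($in, $out, float $threshold = 0.2): int
{
    $kept = 0;
    while (($line = fgets($in)) !== false) {
        $line = trim($line);
        if ($line !== '' && englishScore($line) >= $threshold) {
            fwrite($out, $line . "\n");
            $kept++;
        }
    }
    return $kept;
}
```

In a streaming mapper you would call filterEnglishLines(STDIN, STDOUT); at the bottom of the script. Short lines carry few stopwords, so scoring a few lines together, as suggested above, should be more reliable than scoring each line alone.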

ADR
