I am trying to read the "English" web pages from Common Crawl. I am running these Hadoop jobs through the Amazon interface. Please have a look at the following code. It is the Mapper part; I have no Reducer.
#!/usr/bin/php
<?php
$word2count = array(); // unused so far; left over from a word-count template
$counter = 0;
$closeit = false;

// Read from STDIN and stop after 100 lines or at end of input.
while (($closeit == false) && (($line = fgets(STDIN)) !== false)) {
    $counter++;
    $line = strtolower(trim($line));
    echo "$line\n";
    if ($counter >= 100) { // was "> 100", which let a 101st line through
        $closeit = true;
    }
}
echo "mapper1\n"; // debug marker so I can see the mapper finished
?>
As it stands, this code reads the first 100 lines of the input. How can I change it so that it reads only the "English" articles? Apart from that, which data set should I use?
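One idea I experimented with, purely my own guess and not anything from the Common Crawl documentation, is a crude stopword-ratio check: count how many words in a text block are common English stopwords, and keep the block only if the ratio crosses a threshold. The function name, stopword list, and threshold below are all my own choices:

#!/usr/bin/php
<?php
// Crude English heuristic (my own assumption): a block of text is treated
// as English if at least 10% of its words are common English stopwords.
function looks_english($text, $threshold = 0.1) {
    $stopwords = array('the','and','of','to','in','is','that','it','for','was');
    // Split on non-word characters and drop empty tokens.
    $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) == 0) {
        return false;
    }
    $hits = 0;
    foreach ($words as $w) {
        if (in_array($w, $stopwords)) {
            $hits++;
        }
    }
    return ($hits / count($words)) >= $threshold;
}
?>

I could call looks_english() on each buffered record inside the mapper loop and only echo the ones that pass, but I am not sure how reliable this is compared to whatever language metadata the data set itself provides.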
Please help.