I have a script that counts word frequencies and saves the results to a JSON file. On each run it reads the existing JSON file, combines the old and new counts, and then rewrites the file. This can happen repeatedly within a request, and there can be many simultaneous requests, so I used flock() to try to prevent errors. I let it run for a while yesterday and got good data, but when I checked this morning the file was corrupt (still a readable text file, but the JSON was invalid).
Here are the relevant parts of my code:
$previous_counts=array();
if(is_file('/home/myuser/public_html/word_counts.json'))
{
    $previous_counts=json_decode(file_get_contents('/home/myuser/public_html/word_counts.json'),true);
}
if(!$previous_counts)
{
    $previous_counts=array();
}
$new_counts=count_words($item->Description,$item->IDENTIFIER); //Creates an array like: array('the'=>20,'it'=>15,'spectacular'=>1);
$combined_counts=sum_associatve(array($new_counts,$previous_counts)); //like array_merge, but sums duplicate keys instead of overwriting.
$fh=fopen('/home/myuser/public_html/word_counts.json','c'); //this will always be over-written with new data, but the "c" prevents it from being truncated to 0 bytes
if (flock($fh, LOCK_EX))
{
    fwrite($fh, json_encode($combined_counts));
    flock($fh, LOCK_UN); // release the lock
}
fclose($fh);
function count_words($description,$unique=null){
    // Earlier versions of the split pattern, kept for reference:
    // /([\s_;?!\/\(\)\[\]{}<>\r\n"]|\.$|(?<=\D)[:,.\-]|[:,.\-](?=\D))/
    // /([\s\-_,:;?!\/\(\)\[\]{}<>\r\n"]|(?<!\d)\.(?!\d))/
    // http://rick.measham.id.au/paste/explain.pl?regex=
    // http://stackoverflow.com/questions/20006448
    $to_be_counted=strtolower($description);
    $to_be_counted.=' BLOCKS '.$unique;
    // -1 means "no limit"; passing null for the limit is deprecated in newer PHP versions
    $words=preg_split('/([\s_;?!\/\(\)\[\]{}<>\r\n"]|\.$|(?<=\D)[:,.]|[:,.](?=\D))/', $to_be_counted, -1, PREG_SPLIT_NO_EMPTY);
    return array_count_values($words);
}
function sum_associatve($arrays){
    $sum = array();
    foreach ($arrays as $array) {
        foreach ($array as $key => $value) {
            if (isset($sum[$key])) {
                $sum[$key] += $value;
            } else {
                $sum[$key] = $value;
            }
        }
    }
    return $sum;
}
Since it works for a while but eventually writes bad JSON, I don't know whether it's a file-locking problem or whether json_encode() is returning bad data. Which is more likely?
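For comparison, here is a minimal sketch of a variant I could test, assuming the corruption comes from two things in my code: the file is read before the lock is taken, and the "c" mode never truncates, so a shorter write would leave old bytes at the end of the file. The function name update_word_counts and the temp-file path are hypothetical and not part of my script. The sketch holds one exclusive lock across the whole read, merge, and write.

```php
<?php
// Hypothetical helper: merges $new_counts into the JSON file at $path
// while holding a single exclusive lock for the read, merge, and write.
function update_word_counts($path, array $new_counts)
{
    // 'c+' opens for reading and writing, creates the file if it is
    // missing, and does not truncate it on open.
    $fh = fopen($path, 'c+');
    if ($fh === false) {
        return false;
    }
    if (!flock($fh, LOCK_EX)) {
        fclose($fh);
        return false;
    }

    // Read the previous counts only after the lock is held.
    $counts = json_decode(stream_get_contents($fh), true);
    if (!is_array($counts)) {
        $counts = array(); // empty or unreadable file: start from scratch
    }

    // Sum duplicate keys instead of overwriting them.
    foreach ($new_counts as $word => $n) {
        $counts[$word] = (isset($counts[$word]) ? $counts[$word] : 0) + $n;
    }

    // Drop the old contents so a shorter payload cannot leave stale bytes.
    ftruncate($fh, 0);
    rewind($fh);
    fwrite($fh, json_encode($counts));
    fflush($fh); // push the data out before releasing the lock
    flock($fh, LOCK_UN);
    fclose($fh);

    return $counts;
}

// Usage with a throwaway temp file (assumed path, for illustration only):
$path = tempnam(sys_get_temp_dir(), 'wc');
update_word_counts($path, array('the' => 20, 'it' => 15));
$result = update_word_counts($path, array('the' => 1, 'spectacular' => 1));
```

If the file stays valid with this version under the same load, that would point to the locking and truncation rather than json_encode().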