Possible Duplicate:
php: pdf to string

I am trying to save the text content of a PDF file to a DB. I found this link helpful, Converting PDF to string, and worked from it. But it converts only a small part of the data. Why is that?

Or is there any other way to convert a complex PDF file (containing headers, footers, tables, a two-column layout on some pages, etc.) to a string and save it to the DB?

atif
  • As far as I know there's no OCR in PHP, so how much text can be parsed pretty much depends on your PDF. – Tom Nov 07 '12 at 06:36

1 Answer

A long time ago I wrote a script which downloads a PDF and converts it into text. This function does the conversion:

function pdf2string($sourcefile) {

    // $sourcefile holds the raw PDF data (the file's bytes), not a path.
    $content = $sourcefile;

    $searchstart = 'stream';
    $searchend = 'endstream';
    $pdfText = '';
    $startpos = 0;

    while (($pos = strpos($content, $searchstart, $startpos)) !== false) {

        // The stream data begins after the line break that follows "stream".
        $pos += strlen($searchstart);
        if (substr($content, $pos, 2) === "\r\n") {
            $pos += 2;
        } else if (substr($content, $pos, 1) === "\n") {
            $pos++;
        }

        $pos2 = strpos($content, $searchend, $pos);
        if ($pos2 === false) {
            break;
        }
        $startpos = $pos2 + strlen($searchend);

        // Drop the line break that precedes "endstream".
        if ($pos2 - $pos >= 2 && substr($content, $pos2 - 2, 2) === "\r\n") {
            $pos2 -= 2;
        } else if ($pos2 > $pos && $content[$pos2 - 1] === "\n") {
            $pos2--;
        }

        $textsection = substr($content, $pos, $pos2 - $pos);
        // gzuncompress() only handles zlib (FlateDecode) streams; for
        // anything else it returns false, which pdfExtractText() ignores.
        $data = @gzuncompress($textsection);
        $pdfText .= pdfExtractText($data);

    }

    return preg_replace('/(\s)+/', ' ', $pdfText);

}
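For illustration, the stream handling can be tried on a tiny synthetic input. This is only a sketch: `$fakePdf` is not a real PDF, the helper name `inflateStreams()` is made up for this example, and it assumes every stream of interest is zlib-compressed (the `/FlateDecode` filter), which is the only case `gzuncompress()` can decode.

```php
<?php
// Hypothetical helper: returns the body of every stream that inflates as zlib data.
function inflateStreams(string $pdf): array
{
    $decoded = [];
    $offset = 0;
    while (($start = strpos($pdf, 'stream', $offset)) !== false) {
        // Skip the line break that follows the "stream" keyword.
        $dataStart = $start + strlen('stream');
        if (substr($pdf, $dataStart, 2) === "\r\n") {
            $dataStart += 2;
        } elseif (substr($pdf, $dataStart, 1) === "\n") {
            $dataStart += 1;
        }

        $end = strpos($pdf, 'endstream', $dataStart);
        if ($end === false) {
            break;
        }
        $offset = $end + strlen('endstream');

        // Drop the line break that precedes "endstream".
        $dataEnd = $end;
        if ($dataEnd - $dataStart >= 2 && substr($pdf, $dataEnd - 2, 2) === "\r\n") {
            $dataEnd -= 2;
        } elseif ($dataEnd > $dataStart && $pdf[$dataEnd - 1] === "\n") {
            $dataEnd -= 1;
        }

        // gzuncompress() returns false (with a warning) for non-zlib data.
        $plain = @gzuncompress(substr($pdf, $dataStart, $dataEnd - $dataStart));
        if ($plain !== false) {
            $decoded[] = $plain;
        }
    }
    return $decoded;
}

// Made-up object with one compressed content stream.
$fakePdf = "4 0 obj\r\n<< /Filter /FlateDecode >>\r\nstream\r\n"
    . gzcompress('BT /F1 12 Tf (Hello PDF) Tj ET')
    . "\r\nendstream\r\nendobj\r\n";

$streams = inflateStreams($fakePdf);
echo $streams[0], "\n";
```

If this returns an empty array for a real file, the streams are probably not plain FlateDecode data, and this approach will find little or no text in it.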

EDIT: pdf2string() calls pdfExtractText(), which is defined here:

function pdfExtractText($psData){

    if (!is_string($psData)) {
        return '';
    }

    $text = '';

    // Handle brackets in the text stream that could be mistaken for
    // the end of a text field. I'm sure you can do this as part of the
    // regular expression, but my skills aren't good enough yet.
    $psData = str_replace('\)', '##ENDBRACKET##', $psData);
    $psData = str_replace('\]', '##ENDSBRACKET##', $psData);

    preg_match_all(
        '/(T[wdcm*])[\s]*(\[([^\]]*)\]|\(([^\)]*)\))[\s]*Tj/si',
        $psData,
        $matches
    );
    for ($i = 0; $i < sizeof($matches[0]); $i++) {
        if ($matches[3][$i] != '') {
            // Run another match over the contents.
            preg_match_all('/\(([^)]*)\)/si', $matches[3][$i], $subMatches);
            foreach ($subMatches[1] as $subMatch) {
                $text .= $subMatch;
            }
        } else if ($matches[4][$i] != '') {
            $text .= ($matches[1][$i] == 'Tc' ? ' ' : '') . $matches[4][$i];
        }
    }

    // Translate special characters and put back brackets.
    $trans = array(
        '...' => '…',
        '\205' => '…',
        '\221' => chr(145),
        '\222' => chr(146),
        '\223' => chr(147),
        '\224' => chr(148),
        '\226' => '-',
        '\267' => '•',
        '\374' => 'ü',
        '\344' => 'ä',
        '\247' => '§',
        '\366' => 'ö',
        '\337' => 'ß',
        '\334' => 'Ü',
        '\326' => 'Ö',
        '\304' => 'Ä',
        '\(' => '(',
        '\[' => '[',
        '##ENDBRACKET##' => ')',
        '##ENDSBRACKET##' => ']',
        chr(133) => '-',
        chr(141) => chr(147),
        chr(142) => chr(148),
        chr(143) => chr(145),
        chr(144) => chr(146),
    );
    $text = strtr($text, $trans);

    return $text;
}
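As a point of comparison, here is a minimal sketch of pulling literal strings out of a decoded content stream. In PDF content streams, `Tj` shows a single string and `TJ` shows an array of strings mixed with spacing numbers. The helper name `showTextOperands()` is hypothetical, and the sketch makes simplifying assumptions: it ignores hex strings (`<...>`), unescaped nested parentheses, octal escapes, and font encodings, so it is not a general extractor.

```php
<?php
// Hypothetical helper: collects the literal strings shown by Tj and TJ operators.
function showTextOperands(string $stream): string
{
    // A literal string: "(" ... ")" with backslash escapes allowed inside.
    $literal = '\((?:\\\\.|[^\\\\()])*\)';
    // Either "(string) Tj" or "[ (str) number (str) ... ] TJ".
    $pattern = '/(' . $literal . ')\s*Tj|\[((?:\s*(?:' . $literal . '|[-+.\d]+))*)\s*\]\s*TJ/s';
    // Escapes handled here: \\ , \( and \) only.
    $unescape = ['\\\\' => '\\', '\\(' => '(', '\\)' => ')'];

    $chunks = [];
    preg_match_all($pattern, $stream, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        if ($m[1] !== '') {
            // Tj: one string operand.
            $strings = [$m[1]];
        } else {
            // TJ: keep the strings of the array, drop the spacing numbers.
            preg_match_all('/' . $literal . '/s', $m[2] ?? '', $parts);
            $strings = $parts[0];
        }
        $chunk = '';
        foreach ($strings as $s) {
            // Strip the surrounding parentheses, then undo the escapes.
            $chunk .= strtr(substr($s, 1, -1), $unescape);
        }
        $chunks[] = $chunk;
    }
    // Joining operations with a space is a guess; real spacing depends on positioning operators.
    return implode(' ', $chunks);
}

$sample = 'BT /F1 12 Tf (Hello) Tj [(Wor) -20 (ld \(ok\))] TJ ET';
echo showTextOperands($sample), "\n";
```

Text that is drawn with subset fonts and custom encodings will still come out as gibberish or not at all, which may be one reason only part of a complex PDF is recovered.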

EDIT2: To get the content of a local file, use:

$fp = fopen($sourcefile, 'rb');
$content = fread($fp, filesize($sourcefile));
fclose($fp);
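Since an empty or wrong input produces blank output without any error, it may help to check the bytes before parsing them. A minimal sketch, assuming a hypothetical helper `loadPdfBytes()` and a throwaway temp file so the example runs on its own:

```php
<?php
// Hypothetical helper: reads a local file and checks that it looks like a PDF.
function loadPdfBytes(string $path): string
{
    $bytes = @file_get_contents($path);
    if ($bytes === false) {
        throw new RuntimeException("Cannot read file: $path");
    }
    // A PDF file is expected to begin with a "%PDF-" header.
    if (strncmp($bytes, '%PDF-', 5) !== 0) {
        throw new RuntimeException("Not a PDF file: $path");
    }
    return $bytes;
}

// Usage with a placeholder file (not a complete PDF).
$tmp = tempnam(sys_get_temp_dir(), 'pdf');
file_put_contents($tmp, "%PDF-1.4\n% placeholder body\n");
$bytes = loadPdfBytes($tmp);
echo strlen($bytes), " bytes read\n";
unlink($tmp);
```

The returned string is what a function like pdf2string() expects as its argument, rather than the file name.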

EDIT3: Before saving the data to the DB I use an escape function:

function escape($str)
{
    $search=array("\\","\0","\n","\r","\x1a","'",'"');
    $replace=array("\\\\","\\0","\\n","\\r","\Z","\'",'\"');
    return str_replace($search,$replace,$str);
}
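An alternative to escaping by hand is a prepared statement with bound parameters, which leaves the quoting to the database driver. The sketch below is only an assumption about how that could look: the table `pdf_texts`, its columns, and the helper `savePdfText()` are hypothetical, and it uses an in-memory SQLite database through PDO (which needs the `pdo_sqlite` extension) purely so the example is self-contained.

```php
<?php
// Hypothetical helper: stores extracted text using bound parameters.
function savePdfText(PDO $db, string $filename, string $text): void
{
    $stmt = $db->prepare(
        'INSERT INTO pdf_texts (filename, content) VALUES (:filename, :content)'
    );
    $stmt->execute([':filename' => $filename, ':content' => $text]);
}

// In-memory database for demonstration only.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE pdf_texts (filename TEXT, content TEXT)');

// Quotes and line breaks need no manual escaping.
$text = "It's a \"quoted\" line\nwith a break";
savePdfText($db, 'CROI0311.pdf', $text);

$stored = $db->query('SELECT content FROM pdf_texts')->fetchColumn();
echo $stored === $text ? "stored unchanged\n" : "mismatch\n";
```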
Sentencio
  • Thanks for replying, but it didn't output anything when I used it: $result = pdf2string('CROI0311.pdf'); echo $result; – atif Nov 07 '12 at 06:56
  • Hi atif. Sorry about that. I have edited my post and added the missing function. In my case the variable `$sourcefile` is not a path to a PDF file; you have to pass in the PDF stream data. – Sentencio Nov 07 '12 at 07:03
  • I believe something is still missing, as I still get a blank page when I echo the result. – atif Nov 07 '12 at 07:07
  • Do you pass the stream data of the PDF to the function? – Sentencio Nov 07 '12 at 07:09
  • Yes, I used the same curl code you posted here: $ch = curl_init('CROI0311.pdf'); curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); curl_setopt($ch, CURLOPT_HEADER, 0); $res = curl_exec($ch); curl_close($ch); echo pdf2string($res); But again I am getting a blank page :( – atif Nov 07 '12 at 07:35
  • Do you have any warnings or errors? Before saving the data from the PDF to the DB I use an escape function: `function escape($str) { $search=array("\\","\0","\n","\r","\x1a","'",'"'); $replace=array("\\\\","\\0","\\n","\\r","\Z","\'",'\"'); return str_replace($search,$replace,$str); }` – Sentencio Nov 07 '12 at 07:39
  • Nope, no errors and no warnings, but I still get no output. I am just displaying the output with "echo" right now and will save it to the DB in a second step. – atif Nov 07 '12 at 07:41