
I have an ANSI-encoded CSV file that contains a number of 'problem' special characters. I'm looking for a script (preferably PHP or JavaScript) that I can use to check each record in the CSV and identify those that have problem characters.

I have no trouble looping through the CSV records, so I'm just looking for a good way to determine whether a single string contains any characters that would cause problems if the string were inserted directly into a UTF-8 encoded file.
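
For illustration, the kind of check I have in mind might look something like this (a rough sketch, assuming a 'problem character' simply means any byte outside the 7-bit ASCII range; the function name is just for illustration):

// Returns true if $string contains any byte outside the 0-127 ASCII range
function has_problem_chars($string)
{
    return preg_match('/[\x80-\xFF]/', $string) === 1;
}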

Background: I used a script to convert an ANSI CSV directly to UTF-8 XML without taking care to convert the CSV to UTF-8 first. Boneheaded move on my part. The script created XML entities for records with problem characters, but all textNodes into which the script tried to insert text with problem characters ended up empty. What I'm looking for now is a way to parse the original CSV file and identify all records containing problem characters. With ~18,000 records, it's not a job I'd like to do manually :-)

Clarification: I should have first converted the ANSI CSV to UTF-8, then run my 'convert to XML' script on the UTF-8 encoded CSV file. Instead, I skipped the first step and ran my 'convert to XML' script on the ANSI-encoded CSV file. XML entities were created for all cells, but the XML entities for cells with characters such as — (em dash) and ½ (one half) were all empty. The 'convert to XML' script silently failed to insert these strings into the UTF-8 encoded XML document (using DOMDocument in PHP).
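
For reference, a minimal sketch of the first step I skipped, assuming the ANSI file is actually Windows-1252 (the filenames here are placeholders):

// Convert the ANSI (assumed Windows-1252) CSV to UTF-8 before building the XML
$ansi = file_get_contents('input-ansi.csv');   // placeholder filename
$utf8 = iconv('Windows-1252', 'UTF-8//TRANSLIT', $ansi);
file_put_contents('input-utf8.csv', $utf8);    // placeholder filename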

  • What is a 'problem' special character? And as UTF-8 covers all of ANSI, how can any problem arise converting from ANSI to UTF-8 (using the word 'convert' very loosely)? –  Aug 28 '12 at 09:05
  • Examples of actual 'problem' characters in this file: — (em dash), ½ (one half). The original problem occurred because I tried to insert the ANSI-encoded text into a PHP XML DOMDocument that was UTF-8 encoded. The script that did this work 'silently' failed to insert text containing such characters into textNodes. – tex Aug 28 '12 at 09:08
  • Looking at the answer here: http://stackoverflow.com/questions/323741/what-could-go-wrong-if-i-convert-ansi-encoded-files-to-utf-8?rq=1 It looks like I may be able to simply use PHP's `ord()` to identify strings containing characters that fall outside of the 0-127 range. Will report back if that does the trick. – tex Aug 28 '12 at 09:27
  • @Tex Are you sure your CSV is in "true" ANSI? I just tried to insert the ½ (one half) character into an ANSI (936) text file in Notepad2 and it failed. – Passerby Aug 28 '12 at 09:29
  • @Passerby I'm going by what Notepad++ reports. It's set to automatically detect the encoding of opened text files, and it reports that the file is ANSI-encoded. I haven't dug deeper than that. – tex Aug 28 '12 at 09:38
  • @Tex Auto-detect doesn't always work; in fact, I have Notepad++ (UTF-8) and Notepad2 at hand, and I tried both of them to open a GB18030 file with a ½ (one half) character in it. Neither of them detected it correctly. Back to your topic, if you can find out the real encoding of your CSV, `iconv` may be really helpful here. – Passerby Aug 28 '12 at 09:47

1 Answer


Folks, this is quick and dirty, but that's the kind of solution I needed in this situation. I used the following code to scan through the original CSV, looking at each character in each row. Any row containing a character with ord() > 127 was written to a second CSV, so the new CSV file contained only the rows that had 'special' characters.

In this particular case, my original CSV was larger than 5MB, and the new CSV containing only rows with special characters was much smaller, on the order of a couple hundred KB, which made it much easier to work with.

$input_file  = fopen($input_filePath, 'rt');
$output_file = fopen($output_filePath, 'w');

// Get the column headers of the file
$headers = fgetcsv($input_file);

// Loop through each row
while (($row = fgetcsv($input_file)) !== FALSE)
{
  // Loop through each cell
  foreach ($headers as $i => $header)
  {
    // Guard against short rows that have fewer cells than headers
    $cell = isset($row[$i]) ? $row[$i] : '';
    // Loop through each char until we find a 'special' char
    // or reach the end of the cell, whichever comes first
    for ($j = 0; $j < strlen($cell); $j++) {
      if (ord($cell[$j]) > 127) {
        // If we find a special char, add this row to the new CSV file,
        // then skip the rest of the row ('break 2' exits both the for
        // and foreach loops, so the row is only written once)
        fputcsv($output_file, $row);
        break 2;
      }
    }
  }
}

fclose($input_file);
fclose($output_file);
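
As a side note, if you don't need to know which cell triggered the match, the inner character loop could probably be replaced with a single regex test per row, something like:

// Untested alternative: flag the row if any byte falls outside 0-127
if (preg_match('/[\x80-\xFF]/', implode(',', $row))) {
    fputcsv($output_file, $row);
}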