0

I am sorry if this question as been asked before, but I couldn't find anything similar.

I am receiving CSV output that uses " as a text qualifier around every field. I am looking for an elegant solution to reformat these so that only specific (alphanumeric fields) have these qualifiers.

An example of what I am receiving:

"TRI-MOUNTAIN/MOUNTAI","F258273","41016053","A","10/16/14",3,"1","Recruit-Navy,XL#28-75","13.25","13.25"

My desired output would be this:

"TRI-MOUNTAIN/MOUNTAI","F258273",41016053,"A",10/16/14,3,1,"Recruit-Navy,XL#28-75",13.25,13.25

Any suggestions or assistance are greatly appreciated!

Per request below find the first five lines of the example file:

"TRI-MOUNTAIN/MOUNTAI","F258273","41016053","","10/16/14","","1","Recruit-Navy,XL#28-75","13.25","13.25"
"TRI-MOUNTAIN/MOUNTAI","F258273","41016053","","10/16/14","","1","High Peak-Navy,XL#21-18","36.75","36.75"
"TRI-MOUNTAIN/MOUNTAI","F257186","Z1023384","","10/15/14","","1","Patriot-Red,L#26-35","25.50","25.50"
"TRI-MOUNTAIN/MOUNTAI","F260780","Z1023658","","10/20/14","","1","Exeter-Red/Gray,S#23-52","19.75","19.75"
"TRI-MOUNTAIN/MOUNTAI","F260780","Z1023658","","10/20/14","","1","Exeter-White/Gray,XL#23-56","19.75","19.75"

Note that this is only an example and not all files will be for Tri-Mountain.

Jeff
  • 3
  • 2

2 Answers2

0

This problem presents the difficulty of separating quotes from comma separated fields where the fields themselves contain embedded commas. (e.g.: "Recruit-Navy,XL#28-75") There are a number of ways to approach this from a shell standpoint (while read, awk, etc.) but most will ultimately stumble on the embedded comma.

One approach found to be successful was a brute force character-by-character parse of the line. (below) This isn't an elegant solution, but it will get you started. The other alternative to a shell program would be a compiled language, such as C, where character handling is a bit more robust. Leave a comment if you have questions.

#!/bin/bash

declare -a arr
declare -i ct=0

## fill array with separated fields (preserving comma in fields)
#  Note: the following is a single-line (w/continuations for readability)
arr=( $( line='"TRI-MOUNTAIN/MOUNTAI","F258273","41016053","A","10/16/14",3,"1","Recruit-Navy,XL#28-75","13.25","13.25"'; \
for ((i=0; i < ${#line}; i++)); do \
    if test "${line:i:1}" == ',' ; then \
        if test "${line:i+1:1}" == '"' -o "${line:i-1:1}" == '"' ; then \
            printf " "; \
        else \
            printf "%c" ${line:i:1}; \
        fi; \
    else \
        printf "%c" ${line:i:1}; \
    fi; \
done; \
printf "\n" ) )

## remove quotes from non-numeric fields
for i in "${arr[@]}"; do 
    if [[ "${i:0:1}" == '"' ]] && [[ ${i:1:1} == [0123456789] ]]; then
        arr[$ct]="${i//\"/}"
    else
        arr[$ct]="$i"
    fi
    if test "$ct" -eq 0 ; then
        printf "%s" "${arr[ct]}"
    else
        printf ",%s" "${arr[ct]}"
    fi
    ((ct++))
done

printf "\n"

exit 0

output

$ bash sepquoted.sh
"TRI-MOUNTAIN/MOUNTAI","F258273",41016053,"A",10/16/14,3,1,"Recruit-Navy,XL#28-75",13.25,13.25

original

"TRI-MOUNTAIN/MOUNTAI","F258273","41016053","A","10/16/14",3,"1","Recruit-Navy,XL#28-75","13.25","13.25"
David C. Rankin
  • 81,885
  • 6
  • 58
  • 85
0

Since you've not specified OS or language, here is the PowerShell version.

I've ditched my previous attempt to work with Import-CSV because of your non-standard CSV files and switched to raw file processing. Should be significantly faster too.

Regex to split CSV is from this question: How to split a string by comma ignoring comma in double quotes

Save this script as StripQuotes.ps1. It accepts following arguments:

  • InPath: folder to read CSVs from. If not specified, the current directory is used.
  • OutPath: folder to save processed CSVs to. Will be created, if not exist.
  • Encoding: If not specified, script will use system's current ANSI code page to read the files. You can get other valid encodings for your system in PowerShell console like this: [System.Text.Encoding]::GetEncodings()
  • Verbose: script will tell you what's going on via Write-Verbose messages.

Example (run from the PowerShell console).

Process all CSVs in the folder C:\CSVs_are_here, save processed CSVs to the folder C:\Processed_CSVs, be verbose:

.\StripQuotes.ps1 -InPath 'C:\CSVs_are_here' -OutPath 'C:\Processed_CSVs' -Verbose

StripQuotes.ps1 script:

Param
(
    [Parameter(ValueFromPipelineByPropertyName = $true)]
    [ValidateScript({
        if(!(Test-Path -LiteralPath $_ -PathType Container))
        {
            throw "Input folder doesn't exist: $_"
        }
        $true
    })]
    [ValidateNotNullOrEmpty()]
    [string]$InPath = (Get-Location -PSProvider FileSystem).Path,

    [Parameter(Mandatory = $true, ValueFromPipelineByPropertyName = $true)]
    [ValidateScript({
        if(!(Test-Path -LiteralPath $_ -PathType Container))
        {
            try
            {
                New-Item -ItemType Directory -Path $_ -Force
            }
            catch
            {
                throw "Can't create output folder: $_"
            }
        }
        $true
    })]
    [ValidateNotNullOrEmpty()]
    [string]$OutPath,

    [Parameter(ValueFromPipelineByPropertyName = $true)]
    [string]$Encoding = 'Default'
)


if($Encoding -eq 'Default')
{
    # Set default encoding
    $FileEncoding = [System.Text.Encoding]::Default
}
else
{
    # Try to set user-specified encoding
    try
    {
        $FileEncoding = [System.Text.Encoding]::GetEncoding($Encoding)
    }
    catch
    {
        throw "Not valid encoding: $Encoding"
    }
}

$DQuotes = '"'
$Separator = ','
# https://stackoverflow.com/questions/15927291/how-to-split-a-string-by-comma-ignoring-comma-in-double-quotes
$SplitRegex = "$Separator(?=(?:[^$DQuotes]|$DQuotes[^$DQuotes]*$DQuotes)*$)"
# Matches a single code point in the category "letter".
$AlphaNumRegex = '\p{L}'

Write-Verbose "Input folder: $InPath"
Write-Verbose "Output folder: $OutPath"

# Iterate over each CSV file in the $InPath
Get-ChildItem -LiteralPath $InPath -Filter '*.csv' |
    ForEach-Object {
        Write-Verbose "Current file: $($_.FullName)"
        $InFile = New-Object -TypeName System.IO.StreamReader -ArgumentList (
            $_.FullName,
            $FileEncoding
        ) -ErrorAction Stop
        Write-Verbose 'Created new StreamReader'

        $OutFile = New-Object -TypeName System.IO.StreamWriter -ArgumentList (
            (Join-Path -Path $OutPath -ChildPath $_.Name),
            $false,
            $FileEncoding
        ) -ErrorAction Stop
        Write-Verbose 'Created new StreamWriter'

        Write-Verbose 'Processing file...'
        while(($line = $InFile.ReadLine()) -ne $null)
        {
            $tmp = $line -split $SplitRegex |
                        ForEach-Object {
                            # Strip double quotes, if any
                            $item = $_.Trim($DQuotes)

                            if($_ -match $AlphaNumRegex)
                            {
                                # If field has at least one letter - wrap in quotes
                                $DQuotes + $item + $DQuotes
                            }
                            else
                            {
                                # Else, pass it as is
                                $item
                            }
                        }
            # Write line to the new CSV file
            $OutFile.WriteLine($tmp -join $Separator)
        }

        Write-Verbose "Finished processing file: $($_.FullName)"
        Write-Verbose "Processed file is saved as: $($OutFile.BaseStream.Name)"

        # Close open files and cleanup objects
        $OutFile.Flush()
        $OutFile.Close()
        $OutFile.Dispose()

        $InFile.Close()
        $InFile.Dispose()
    }
Community
  • 1
  • 1
beatcracker
  • 6,714
  • 1
  • 18
  • 41
  • Thanks for the suggestion! I am receiving an error when trying this: Import-Csv : Cannot process argument because the value of argument "name" is invalid. Change the value of the "name" argument and run the operation again. At C:\psigenoutput\convertcsv2.ps1:7 char:11 + Import-Csv <<<< C:\psigenoutput\psdet.csv | + CategoryInfo : InvalidArgument: (:) [Import-Csv], PSArgumentException + FullyQualifiedErrorId : 'Argument,Microsoft.PowerShell.Commands.ImportCsvCommand – Jeff Mar 06 '15 at 16:25
  • Hmm, do your CSV files have headers? Because it seems that [they don't](http://pinchii.com/2011/09/cannot-process-argument-because-the-value-of-argument-name-is-invalid/), or have some blank columns at the beginning. If you update your question with an example, e.g. like first 5 lines from actual file, I'll be able to fix the script. – beatcracker Mar 06 '15 at 16:44
  • @Jeff Btw, I can make script to scan folder for CSVs and save processed ones alongside or in subfolder, just name it. Because currently it means you have to manually edit script to process new file and that's boring. What's your workflow with those files? – beatcracker Mar 06 '15 at 16:53
  • The original csv files are generated and placed into a folder with static names psdet.csv and pshead.csv. I can have these files generated in any folder it doesn't really matter. I need the script to pick up the files out of the folder and place them into a different folder with the same names once they have been corrected. I'll paste the first five lines of the file for you in the original question. – Jeff Mar 06 '15 at 21:53
  • I will test the updated script as soon as possible and let you know how it goes. Thanks! – Jeff Mar 11 '15 at 14:28