3

I have large CSV files (50-500 MB each). Running complicated PowerShell commands on these takes forever and/or hits memory issues.

Processing the data requires grouping by common fields, say in ColumnA. So assuming that the data is already sorted by that column, if I split these files arbitrarily (i.e. every x-thousand lines) then matching entries could still end up in different parts. There are thousands of different groups in A, so splitting every group into its own file would create too many files.

How can I split it into files of 10,000-ish lines and not lose the groups? E.g. rows 1-13 would be A1 in ColumnA, rows 14-17 would be A2 etc., and rows 9997-10012 would be A784. In this case I would want the first file to contain rows 1-10012 and the next one to start with row 10013.

Obviously I would want to keep the entire rows (rather than just Column A), so if I pasted all the resulting files together this would be the same as the original file.

Rob
  • 1,048
  • 4
  • 11
  • 22
  • If it's easier to implement then I don't mind 10,000 being the max rather than the min. So in the example above the first file would be rows 1-9996. – Rob Feb 07 '13 at 16:32

4 Answers

5

Not tested. This assumes ColumnA is the first column and it's common comma-delimited data. You'll need to adjust the line that creates the regex to suit your data.

$count = 0
$header = Get-Content file.csv -TotalCount 1

Get-Content file.csv -ReadCount 1000 |
  ForEach-Object {
    # Add the tail entries from the last batch to the beginning of this batch
    $newbatch = $tail + $_

    # Create a regex that matches the last entry (group value) in this batch
    $regex = '^' + [regex]::Escape(($newbatch[-1].split(',')[0]))

    # Write everything that doesn't belong to that last group to the current file part.
    # Add the header first if this is not the first file (the first batch already contains it).
    if ($count) {
      $header | Set-Content "c:\somedir\filepart_$count"
    }
    $newbatch -notmatch $regex |
      Add-Content "c:\somedir\filepart_$count"

    # Extract the tail entries (the possibly incomplete last group) to prepend to the next batch
    $tail = @($newbatch -match $regex)

    # Increment the file counter
    $count++
  }
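If ColumnA is not the first field, or your delimiter differs (both come up in the comments below), only the regex line needs adjusting. As an untested sketch for a semicolon-delimited file with the group value in the second column:

# skip the first field, then anchor on the group value of the batch's last line
$regex = '^[^;]*;' + [regex]::Escape(($newbatch[-1].split(';')[1]))

The leading '^[^;]*;' skips over the first field so the anchor still lines up when the group column is not the first one.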
mjolinor
  • 66,130
  • 7
  • 114
  • 135
  • Very nice. You should however include the csvheader in all parts, at least I would've wanted that. – Frode F. Feb 07 '13 at 16:22
  • Agreed, now works well with headers. Can you add (a pipe?) for sorting the input by ColumnA, which is actually the second column of the file? I have adjusted the [0] to [1], which works. – Rob Feb 07 '13 at 16:38
  • The problem states the assumption is that the data is already sorted by that column.... – mjolinor Feb 07 '13 at 16:45
  • 2
    Thanks! ("I'm stealing that!" is high praise to a scripter) – mjolinor Feb 07 '13 at 16:48
  • That's neat! What strikes me odd is: having semicolon-delimited data, I would have expected I have to adjust this to `$regex = '^' + [regex]::Escape(($newbatch[-1].split(';')[0]))`. However, I forgot that at first glance and it worked just fine with the comma... makes me wonder: what in gods name does the regex do in this case? – Martin Dreher Mar 14 '17 at 09:36
  • Just adjusted the regex to my data, e.g. used a `;` instead of a `,`. After that, not a single file has 1000 rows, but an arbitrary number - with the last one containing the remaining 8000. No big deal since the @mjolinor code worked like a charm... still quite odd. Did I miss something? I use a Swiss localization, where `;` is the standard csv separator. – Martin Dreher Mar 14 '17 at 09:43
2

This is my attempt; it got messy :-P It will load the whole file into memory while splitting it, but this is pure text. It should take less memory than imported objects, but still about the size of the file.

$filepath = "C:\Users\graimer\Desktop\file.csv"
$file = Get-Item $filepath
$content = Get-Content $file
$csvheader = $content[0]
$lines = $content.Count
$minlines = 10000
$filepart = 1

$start = 1

while ($start -lt $lines) { # -lt $lines so a trailing single-line group is not dropped
    #Set minimum $end value (last line)
    if ($start + $minlines -le $lines - 1) { $end = $start + $minlines - 1 } else { $end = $lines - 1 }

    #Value to compare. ColA is first column in my file = [0] .  ColB is second column = [1]
    $avalue = $content[$end].split(",")[0]
    #If not last line in script
    if ($end -ne $lines -1) {
        #Increase $end by 1 while ColA is the same
        while ($content[$end].split(",")[0] -eq $avalue) { $end++ }
        #Return to last line with equal ColA value
        $end--
    }
    #Create new csv-part
    $filename = $file.FullName.Replace($file.BaseName, ($file.BaseName + ".part$filepart"))
    @($csvheader, $content[$start..$end]) | Set-Content $filename

    #Fix counters
    $filepart++
    $start = $end + 1
}

file.csv:

ColA,ColB,ColC
A1,1,10
A1,2,20
A1,3,30
A2,1,10
A2,2,20
A3,1,10
A4,1,10
A4,2,20
A4,3,30
A4,4,40
A4,5,50
A4,6,60
A5,1,10
A6,1,10
A7,1,10

Results (I used $minlines = 5):

file.part1.csv:

ColA,ColB,ColC
A1,1,10
A1,2,20
A1,3,30
A2,1,10
A2,2,20

file.part2.csv:

ColA,ColB,ColC
A3,1,10
A4,1,10
A4,2,20
A4,3,30
A4,4,40
A4,5,50
A4,6,60

file.part3.csv:

ColA,ColB,ColC
A5,1,10
A6,1,10
A7,1,10
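
If, as mentioned in the comments, ColumnA is actually the second column and the data is semicolon-delimited, only the two split() calls need to change. A sketch of the adjusted lines (swap in your own delimiter and column index):

#Value to compare: second column ([1]) in a semicolon-delimited file
$avalue = $content[$end].split(";")[1]
#Increase $end by 1 while the group value is the same
while ($content[$end].split(";")[1] -eq $avalue) { $end++ }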
Frode F.
  • 52,376
  • 9
  • 98
  • 114
  • Looks to be quite close, but the resulting files are twice as big as the original file and don't seem to be the same "CSV" as the original (when opening in Excel it doesn't pick up columns); looks OK in Notepad though. Also, could you amend the input so it sorts by that ColumnA? Note that this isn't actually the first column of the file. – Rob Feb 07 '13 at 16:29
  • Fixed file sizes. You have to replace the delimiter and column number to fit your needs. They are "," and 0 (first) in my sample (see both lines using split() and replace the values). You said the file would already be sorted. If you want to sort the columns in THIS script, the best way is to import the objects from CSV. This will require more memory, which was your problem in the first place. – Frode F. Feb 07 '13 at 16:49
0

This requires PowerShell v3 (due to -append on Export-CSV).

Also, I'm assuming that you have column headers and the first column is named col1. Adjust as necessary.

import-csv MYFILE.csv | foreach-object { $_ | export-csv -notypeinfo -noclobber -append ($_.col1 + ".csv") }

This will create one file for each distinct value in the first column, with that value as the file name.
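As an illustration, with a small hypothetical MYFILE.csv like this:

col1,col2
A1,10
A1,20
A2,30

the one-liner would produce A1.csv (containing the two A1 rows) and A2.csv (containing the single A2 row), each with its own header line.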

alroc
  • 27,574
  • 6
  • 51
  • 97
  • thanks, but as outlined above, this would create far too many files – Rob Feb 07 '13 at 15:38
  • It's not clear to me how the data is set up. Are you saying you want multiple groups per file, but not splitting a single group across multiple files? – alroc Feb 07 '13 at 15:40
  • I want multiple groups per file, but no file to be more than 10,000-ish rows. i.e. once it reaches 10,000 keep going with that group but then stop and start next file with next group – Rob Feb 07 '13 at 15:42
0

To complement the helpful answer from mjolinor, here is a reusable function with a few additional parameters that uses the steppable pipeline, which is about a factor of 8 faster:

function Split-Content {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory=$true)][String]$Path,
        [ULong]$HeadSize,
        [ValidateRange(1, [ULong]::MaxValue)][ULong]$DataSize = [ULong]::MaxValue,
        [Parameter(Mandatory=$true, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]$Value
    )
    begin {
        $Header = [Collections.Generic.List[String]]::new()
        $DataCount = 0
        $PartNr = 1
    }
    Process {
        $ReadCount = 0
        while ($ReadCount -lt @($_).Count -and $Header.Count -lt $HeadSize) {
            if (@($_)[$ReadCount]) { $Header.Add(@($_)[$ReadCount]) }
            $ReadCount++
        }
        if ($ReadCount -lt @($_).Count -and $Header.Count -ge $HeadSize) {
            do {
                if ($DataCount -le 0) { # Should never be less
                    $FileInfo = [System.IO.FileInfo]$ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($Path)
                    $FileName = $FileInfo.BaseName + $PartNr++ + $FileInfo.Extension
                    $LiteralPath = [System.IO.Path]::Combine($FileInfo.DirectoryName, $FileName)
                    $steppablePipeline = { Set-Content -LiteralPath $LiteralPath }.GetSteppablePipeline()
                    $steppablePipeline.Begin($PSCmdlet)
                    $steppablePipeline.Process($Header)
                }
                $Next = [math]::min(($DataSize - $DataCount), @($_).Count)
                if ($Next -gt $ReadCount) { $steppablePipeline.Process(@($_)[$ReadCount..($Next - 1)]) }
                $DataCount = ($DataCount + $Next - $ReadCount) % $DataSize
                if ($DataCount -le 0) { $steppablePipeline.End() }
                $ReadCount = $Next % @($_).Count
            } while ($ReadCount)
        } 
    }
    End {
        if ($steppablePipeline) { $steppablePipeline.End() }
    }
}

Parameters

Value
Specifies the listed content lines to be broken into parts. Multiple lines sent through the pipeline at a time (i.e. sub-arrays like Object[]) will also be passed to the output file at a time (assuming it fits the -DataSize).

Path
Specifies a path to one or more locations. Each filename in the location is suffixed with a part number (starting with 1).

HeadSize
Specifies the number of header lines that will be taken from the input and prepended to each file part. The default is 0, meaning no header lines are copied.

DataSize
Specifies the number of data lines that will be successively taken from the input (after the header) and written to each file part. The default is [ULong]::MaxValue, basically meaning that all data is copied to a single file.

Example 1:

Get-Content -ReadCount 1000 .\Test.Csv |Split-Content -Path .\Part.Csv -HeadSize 1 -DataSize 10000

This will split the .\Test.Csv file into chunks of CSV files with 10,000 rows each.

Note that the performance of this Split-Content function depends heavily on the -ReadCount of the preceding Get-Content cmdlet.
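
For example, passing larger batches from Get-Content should reduce per-call overhead; the 10000 below is just an illustrative value, not a benchmark:

Get-Content -ReadCount 10000 .\Test.Csv |Split-Content -Path .\Part.Csv -HeadSize 1 -DataSize 10000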

Example 2:

Get-Process |Out-String -Stream |Split-Content -Path .\Process.Txt -HeadSize 2 -DataSize 20

This will write chunks of 20 processes to the .\Process<PartNr>.Txt files, each preceded by the standard (2-line) header format:

 NPM(K)    PM(M)      WS(M)     CPU(s)      Id  SI ProcessName
 ------    -----      -----     ------      --  -- -----------
 ... # 20 rows following
iRon
  • 20,463
  • 10
  • 53
  • 79