
I am reading a flat file (fixed-width fields) so that I can import it into SQL Server. This is a new project at my company, and I have not yet worked on parsing huge files in a time-efficient way. Scouring the Internet, I found a PowerShell script that can parse a fixed-width file, and it does the job of parsing the file.

[void][System.Reflection.Assembly]::LoadWithPartialName('Microsoft.VisualBasic')

$Parser = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser('C:\T1046436.dat')

$Parser.TextFieldType = 'FixedWidth'
$Parser.TrimWhiteSpace = $False
$Parser.FieldWidths = @(13,6,11,2,10,10,1,1,7,7,2,17,1,2,2,4,1,10,10,10,10,2,10,10,11,2,2,1,1)

while (!$Parser.EndOfData)
{
    try
    {
        Write-Host $Parser.ReadFields()
    }
    catch [Microsoft.VisualBasic.FileIO.MalformedLineException]
    {
        Write-Host "Error, line $($_.Exception.LineNumber): $($Parser.ErrorLine)"
    }
}

I want to be able to save this to a pipe-delimited file, so that I can simply BCP it into the SQL database. That's my quick-and-dirty way to load data. But it is taking a lot of time (20 minutes for 50,000 records).

Any suggestions to get it quickly/efficiently:

  • converted into a pipe-delimited file, OR
  • imported directly from PowerShell into SQL Server, OR
  • I am open to using any other scripting language under Windows (Python, Perl, etc.) that can quickly parse the flat file into a pipe-delimited file. Any example scripts will be appreciated.
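Since the question invites Python examples, here is a minimal sketch of the slice-and-join approach. The field widths are copied from the PowerShell script above; the function names and the `src`/`dst` file arguments are placeholders, not an existing tool.

```python
# Sketch of a fixed-width-to-pipe-delimited converter in Python.
# WIDTHS matches the FieldWidths array in the PowerShell script.
WIDTHS = [13, 6, 11, 2, 10, 10, 1, 1, 7, 7, 2, 17, 1, 2, 2, 4,
          1, 10, 10, 10, 10, 2, 10, 10, 11, 2, 2, 1, 1]

def make_offsets(widths):
    """Precompute (start, end) slice pairs so each record is sliced once per field."""
    offsets, pos = [], 0
    for w in widths:
        offsets.append((pos, pos + w))
        pos += w
    return offsets

def to_pipe_delimited(line, offsets):
    """Slice one fixed-width record into fields and join them with '|'."""
    return "|".join(line[start:end] for start, end in offsets)

def convert(src, dst, widths=WIDTHS):
    """Stream the input file line by line and write pipe-delimited output."""
    offsets = make_offsets(widths)
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            fout.write(to_pipe_delimited(line.rstrip("\n"), offsets) + "\n")
```

Because it streams one line at a time, memory stays flat regardless of file size; the resulting file can then be loaded with BCP as planned.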
Maciej Gol
DataRiver
  • You should include a sample .dat file and a sample of the wanted result (e.g. you're not saving anything in the script above). Also, you should NEVER use `Write-Host` to output data. `Write-Host` is for text and will only work in an interactive session (console). Just use `$Parser.ReadFields()` or `Write-Output $Parser.ReadFields()` to output the data. – Frode F. Dec 21 '13 at 21:49
  • @Graimer: The reason I did not post any sample data is that I was able to parse the data without issues (as in, the data was parsed as expected). Here it is: `0027322230415367691294302160A 102012-05-082012-06-143342731 ~ 0000000001929.51 39070000~1134117849~ 1750319562~ 2 2012-09-072012-09-10294302160A ~ ~ 95 0030799704021360025281074647A 402013-03-252013-03-2913V5789 ~ N 0000000000000.00 3930000031801884655~ 1851370175~ 1 2013-05-032013-05-06281074647A ~ ~ 00 0028707039287360025580907416A 402012-11-282012-11-281353510 ~ 0000000001111.17` – DataRiver Dec 21 '13 at 21:54
  • Expected output is the same data in a pipe-delimited format. – DataRiver Dec 21 '13 at 21:55
  • Can you please share an example of saving data to a file using `Write-Output`? – DataRiver Dec 21 '13 at 21:55
  • Powershell isn't meant for processing data in volumes like this. Do you know any other programming languages? You have tagged your question with both Perl and Python. If you can use either of these then your program would run very much faster – Borodin Dec 21 '13 at 22:02
  • @Borodin: Yes, I understand powershell is not the way to go, but, that was a quick/ dirty step for me to get started. Can you help with any example of perl/ python? I am not an expert at Perl/ Python, but, can figure out things if I can get a head-start.... Any help is appreciated. – DataRiver Dec 21 '13 at 22:06
  • To save you could do something like `$Parser.ReadFields() -join "|" | Out-File test.csv` Since you're using `.Net` classes anyways, why not just compile a console application? You've almost completed it already with the code above. – Frode F. Dec 21 '13 at 22:16
  • What languages *do* you know? Write something in Perl, Python, or a sensible language of your choice. If you can't get it to work then show it in a new question and you will get help to finish it. It isn't fair to come here expecting us to do your work. – Borodin Dec 21 '13 at 22:21
  • Well, nothing to do with answering the question, but this question is the only place where I seen the use of `TextFieldParser` with `FixedWidth`, so thanks! Upvoted the question just for that. – curropar Mar 28 '19 at 13:10

1 Answer


This PowerShell script will parse a collection of `$files` containing records of 5 fields, each 10 characters wide, and output those records with the fields pipe-delimited.

# Create a regular expression to match the field widths and capture the data.
$regex = [regex]'(.{10})(.{10})(.{10})(.{10})(.{10})'

# Create a filter to insert a pipe character between the captured groups.
filter PipeDelimit {$_ -replace $regex, '$1|$2|$3|$4|$5'}

# Pipe the records through the filter in batches of 1000.
Get-Content $files -ReadCount 1000 | PipeDelimit

You'll need to modify the regex and filter to match your data. I suspect it will take considerably less than 20 minutes to chew through half a million of those records.

The -ReadCount parameter controls memory usage by keeping only 1000 records at a time in the pipeline. They are passed down the pipeline as an array, and the -replace operator in the filter delimits the entire array in one operation, without needing to loop through each record. The filter is admittedly unusual and could be replaced with ForEach-Object, but the filter is marginally faster, and that adds up when you're doing lots of reps.
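The same regex-capture trick ports directly to other languages. As a point of comparison, here is a hedged Python sketch of the identical technique (the 5×10 field layout is the answer's example, and `pipe_delimit` is an illustrative name, not part of any library):

```python
import re

# One capture group per fixed-width field, one substitution per record --
# the same idea as the PowerShell filter above (5 fields of 10 characters).
FIELD_RE = re.compile(r'(.{10})(.{10})(.{10})(.{10})(.{10})')
REPLACEMENT = r'\1|\2|\3|\4|\5'

def pipe_delimit(record):
    """Rewrite one fixed-width record as pipe-delimited fields."""
    return FIELD_RE.sub(REPLACEMENT, record)
```

As with the PowerShell version, the regex and replacement must be adjusted to match the real field widths.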

mjolinor
  • That is presumably PowerShell? You should say so, and provide at least a little commentary on the code. – Borodin Dec 21 '13 at 22:37
  • Much. Thank you. I would perfect it by using markdown to identify code within your paragraphs, but this is pretty good. – Borodin Dec 22 '13 at 00:21
  • Will that get close to what Perl or Python will do? – mjolinor Dec 22 '13 at 05:00
  • @mjolinor: This is awesome. I can run through the 50000 records file in 2 minutes... When I redirect output to a text, its less than 30 seconds for 50000 records. – DataRiver Dec 22 '13 at 12:59
  • What is a quick/dirty fix to write a static number at the end of each row (which represents a batch id)? – DataRiver Dec 23 '13 at 13:32
  • There are probably two or three different ways to do that, and that probably needs to be a new question so other people can weigh in with answers. – mjolinor Dec 23 '13 at 13:36
  • Sure thing. will create another question to get feedback. – DataRiver Dec 23 '13 at 15:34
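For the batch-id follow-up above, one minimal sketch (assuming the rows are already pipe-delimited; `append_batch_id` and `batch_id` are placeholder names for illustration):

```python
def append_batch_id(line, batch_id):
    """Append a constant batch id as one more trailing pipe-delimited field."""
    return f"{line}|{batch_id}"
```

Applied per line while writing the output file, this adds the same identifier to every row of a load batch.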