Removing CR LF improper split line on .txt pipe-delimited flat with Powershell script

Question

Hope all is well! I came across a bit of a tricky issue with a flat file that is exported from Oracle PBCS with some carriage return issues. End users - when inputting data into PBCS - will often press in a specific data field input screen. When the data gets exported representing a specific record with all the data elements representing that data point (intersection) - think like a SQL record - the record element where the user pressed enter causes that record to break at that point - shifting the rest of the data elements in that record to the next line. This is very bad as each record must have the same amount of elements - causing downstream issues in a mapping. In effect one unique record becomes two broken records.

I need a Powershell script that looks at the improper CR LF (Windows system) and reforms each unique record. However, the majority of the records in the flat file are fine so the code will have to be able to discern the "mostly good" from the "very bad" cases.

My flat file is pipe delimited and has a header element. The header element may not need to be considered as I am simply trying to address the fix - a solution could potentially look at the amount of property values for the header record to determine how to format broken records based off a property count using the pipe delimiter - but not sure that is necessary.

I will be honest - there are Jython scripts I tried to no avail - so I felt given that I have employed a couple Powershell scripts for other reasons in the past that I would use this again. I have a basis of a script for a csv file - but this isn't quite working.

$file = Get-Content 'E:\EPM_Cloud\Exports\BUD_PLN\Data\EXPORT_DATA_BUD_PLN.txt'
$file| Foreach-Object {

  foreach ($property in $_.PSObject.Properties) {
  $property.Value = ($property.Value).replace("`r","").replace("`n","")

  }
}

$file|out-file -append 'E:\EPM_Cloud\Exports\BUD_PLN\Data\EXPORT_DATA_BUD_PLN_FINAL.txt'

Here are a few examples of what the before and after case would be if I could get this code to work.

This is supposed to be one record - as you see beginning with "$43K from... the user pressed enter several times. As you see it is pipe delimited - I use the numeric line numbers to show you what I mean since this isn't notepad++. The idea is this should all just be on 1.

Contract TBD|#missing|#missing|#missing|#missing|ORNL to Perform Radio-Chemical  (RCA) Measurements|#missing|#missing|#missing|#missing|"$43K from above
$92,903 
$14,907 

The current $150K to be reprogrammed to XXX, plus another $150K from Fuel Fac for this item to be reprogrammed to RES."|#missing|#missing|#missing|"Summary|X0200_FEEBASED|No_BOC|O-xxxx-B999|xx_xxx_xx_xxx|Plan|Active|FY19|BegBalance"|COMMIT

This is what the output should look like (I have attached screenshots instead). All in 1.

Contract TBD|#missing|#missing|#missing|#missing|ORNL to Perform Radio-Chemical  (RCA) Measurements|#missing|#missing|#missing|#missing|"$43K from above $92,903 $14,907 The current $150K to be reprogrammed to XXX, plus another $150K from Fuel Fac for this item to be reprogrammed to RES."|#missing|#missing|#missing|"Summary|X0200_FEEBASED|No_BOC|O-xxxx-B999|xx_xxx_xx_xxx|Plan|Active|FY19|BegBalance"|COMMIT

In other cases the line breaks just once - all defined just by how many times the user presses enter.enter image description here

As you see in the data image - you see how the line splits - this is the point of the powershell. As you see next to that screenshot image - other lines are just fine.

Do actual record lines end in CR LF? Could you just replace all instances where both CR and LF are found together with null? This gets complicated because you have nested pipes within quotes. Have you tried just reading the file as a CSV? something like `$Records=Import-Csv C:\Path\To\File.csv -Del '|'`? If that reads it in correctly, with some columns containing multi-line strings but all records being intact, you could loop through all records, and all properties of each record, and remove CRLF. — TheMadTechnician, Jun 25 '19 at 21:39
Hi! thanks for your reply. Yes they in fact all end with CR LF for all the proper cases. However, even the bad rows have the CL LF at the end of row. I guess I just need to know - would these be discrete records even if I remove the CRLF? Would that allow the broken line to re-form? — smomotiu, Jun 25 '19 at 21:46
have you talked with the folks in charge of the data entry app about cleaning their data? it is ... rather horrifying ... that this is not being cleaned when the data is entered. removing blank lines in the data is so very easy to do. [*frown*] — Lee_Dailey, Jun 25 '19 at 22:29
Thanks Lee for that - with it being a Cloud app there might be a way using Groovy to determine that but since that part of the data is so abstracted out I am not even sure how I could look into a multidimensional cube and say to certain intersections "remove spaces" — smomotiu, Jun 27 '19 at 13:02

score 0 · Answer 1 · answered Jun 25 '19 at 22:02

0

So after checking locally you should be able to just import the file as a csv, then loop through everything and remove CRLF from each property on each record, and output to a new file (or the same, but its safer to output to a new file).

$Records = Import-Csv C:\Path\To\File.csv -Delimiter '|'
$Properties = $Records[0].psobject.properties.name
ForEach($Record in $Records){
    ForEach($Property in $Properties){
        $Record.$Property = $Record.$Property -replace "[\r\n]"
    }
}
$Records | Export-Csv C:\Path\To\NewFile.csv -Delimiter '|' -NoTypeInfo

answered Jun 25 '19 at 22:02

TheMadTechnician

34,906
3
42
56

Thanks for the reply! If this is an automated process - is there an easy way to import as a csv? If it is a .txt file - I am assuming I have to convert the file first before the above would work? – smomotiu Jun 26 '19 at 01:58
The name of the file doesn't really matter as long as the contents are like what you described in your post. – TheMadTechnician Jun 26 '19 at 22:17
Thanks again! So I ran it and it removed most of the line breaks. However, there are a few stragglers - cases where the end user hit enter more than once on the same record split up the record into 3 split records. The script moves the 1st split line back to the main record and adds the remaining property values as "" with pipe delimited to fill out the remaining properties for that record. Then the 2nd split line is treated as proper and the rest of the remaining properties are filled with "" for each property for the record that is essentially being derived to meet the property count. – smomotiu Jun 27 '19 at 12:42
So what ends up happening is I have now two messed up records for those cases where the user hits enter more than once - thinking of a way to addresss that low volume case. But your script definitely works on the majority - appreciate that! – smomotiu Jun 27 '19 at 12:44

Removing CR LF improper split line on .txt pipe-delimited flat with Powershell script

1 Answers1