0

Requires: Import large tab delimited text file, varying from 3.5-5.5 million lines. Export only matching lines to a separate file.

File Contents (does have tabs at end of line too):

Id  TypeId  SField  File    ADate   AField  CField  Status  RDate   Rec FiId    CRec
15773363    2614    ZZ  HL3 07/06/12    3271.75 3050.00 PPP 11/23/12    698         
25073163    3215    ZZ  TDU 12/04/17    1204.56     CCR     716     P304    
25259159    3215    ZZ  TDU 01/04/18    539.44      CCR     649     P303    
25677650    3259    ZZ  TDU 03/03/18    872.06      CCR     653     P101    
26096141    3259    ZZ  TDU 05/03/18    880.61      CCR     647     P201    
26282137    3215    ZZ  TDU 06/04/18    1247.27     CCR     698     P304    
26979622    3258    ZZ  TDU 09/04/18    1942.33 1000.00 ACT     698     PP89    
27212145    3215    ZZ  TDU 10/03/18    2487.54     CCR     609     P202    
27444640    3258    ZZ  H5  11/03/18    23182.36    2635.00 BBB     640 1234    PP89    
27909630    3216    ZZ  XC8 01/04/19    3370.00     ACT     691     P316    
28560616    3259    ZZ  BDX 04/03/19    460.98      ACT     607     P100    
25073192    3215    ZZ  TDU 12/04/17    445.50      CCR     713 1234    P202    
25259188    3215    ZZ  TDU 01/04/18    899.35      CCR     667     P303    
25677679    3259    ZZ  TDU 03/03/18    3863.95     CCR     18328       P101    
26096170    3259    ZZ  TDU 05/03/18    10321.97        BBB     649 8754    P402    

I need to read in this file and export only lines that have FiId populated to a new file.

To that effect I have:

((Get-Content $InFile -Raw -ReadCount 4000) -Match "(?m)^.*\t[0-9]{1,4}\t[A-Z0-9]{0,9}\t$") |
            Set-Content $OutFile

I've tested the regex and variations [https://regex101.com/r/jJ39oy/1] however I can't get it to export only what I want. It either exports the whole file or it doesn't export anything. In total there are about 30k lines that I need exported every day from this file.

Ryan.James
  • 47
  • 5
  • you don't have any tabs in your sample data ... that makes it nearly impossible to determine what data goes with what column. [*sigh ...*] ///// also, where is the `Fild` column? i see what LOOKS like `Rec FiId ` - is that actually two columns? – Lee_Dailey Oct 23 '19 at 21:53
  • 1
    Try `Get-Content $InFile | Select-String -Pattern '^.*\t[0-9]{1,4}\t[A-Z0-9]{0,9}\s*$' | Select -ExpandProperty line | Set-Content $OutFile`, see [this thread](https://stackoverflow.com/a/41055543/3832970) – Wiktor Stribiżew Oct 23 '19 at 21:53
  • copy/pasting the sample data in again it still converted to tabs. the regex linked has tabs on it. – Ryan.James Oct 23 '19 at 22:04

1 Answers1

0

I think the easiest method would be to import your file using Import-CSV and then filter on the FiId field having a value and export it back out using Export-CSV:

$values = Import-Csv -Path 'file1.ext' -Delimeter "`t" #specifies the tab delimeter
$values | Where-Object {-not [string]::IsNullOrEmpty($_.FiId)} | Export-Csv -Path 'file2.ext' -Delimeter "`t"
(Get-Content -Path 'file2.ext').Replace('"','') | Out-File -FilePath 'file3.tdv'

Unfortunately Export-Csv adds quotes, but they are easy to get rid of. No more RegEx nightmare.

EDIT: In the event you like one-liners:

(Import-Csv -Path .\test.tdv -Delimiter "`t" | Where-Object {-not [string]::IsNullOrEmpty($_.FiId)} | ConvertTo-Csv -Delimiter "`t").Replace('"','') | Out-File -FilePath .\filtered.tdv

This also cuts out the file in the middle. You could use the same technique in a multi-line process and just assign the results to variables.

Scott Heath
  • 830
  • 7
  • 5