
I need to count the number of records in a 600MB text file; some sample data is below. The file is a delimited flat file: the column delimiter is a pipe (|), and every value is qualified with a special character (in this case ±). Some of the values contain a newline character, which throws off my counts. In the example below, reading one line at a time gives 9 records, but the correct count is 7.

±0000958779±|±KR±|±FEOUL±|±2F, 759, YEOKFAM-DONF, FANFNAM-FU±|±±
±0000958774±|±KR±|±BUFAN±|±208-7, CHOEUM-DONF, BUFANJIN-FU±|±±
±0000518874±|±RU±|±M.O, F. Odincovo±|±ZAO " Mremium Otel Menedjment"±|±±
±0000518971±|±RU±|±Famara±|±ul.Molevaya,80,
FamarFkaya ForodFka±|±±
±0000519050±|±RU±|±MoF VniiFFok±|±VlaFenko Ol'Fa VaFil'evna±|±±
±0000519027±|±RU±|±Ft-MeterFburF±|±DorozhinFkaya LariFa Anatol 
evna±|±±
±0000958779±|±KR±|±FEOUL±|±MART AV CLINIC(CLOFED)±|±±
Adrian Klaver
VKS

1 Answer

cat count.csv 
±0000958779±|±KR±|±FEOUL±|±2F, 759, YEOKFAM-DONF, FANFNAM-FU±|±±
±0000958774±|±KR±|±BUFAN±|±208-7, CHOEUM-DONF, BUFANJIN-FU±|±±
±0000518874±|±RU±|±M.O, F. Odincovo±|±ZAO " Mremium Otel Menedjment"±|±±
±0000518971±|±RU±|±Famara±|±ul.Molevaya,80,
FamarFkaya ForodFka±|±±
±0000519050±|±RU±|±MoF VniiFFok±|±VlaFenko Ol'Fa VaFil'evna±|±±
±0000519027±|±RU±|±Ft-MeterFburF±|±DorozhinFkaya LariFa Anatol
evna±|±±
±0000958779±|±KR±|±FEOUL±|±MART AV CLINIC(CLOFED)±|±±


import csv

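# newline='' leaves line-break handling to the csv module, so a newline
# inside a ±-qualified value stays part of that value
with open('count.csv', newline='') as csv_file:
    reader = csv.reader(csv_file, delimiter='|', quotechar='±')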
    ct = 0
    for row in reader:
        print(row)
        ct += 1
    print(ct)

['0000958779', 'KR', 'FEOUL', '2F, 759, YEOKFAM-DONF, FANFNAM-FU', '']
['0000958774', 'KR', 'BUFAN', '208-7, CHOEUM-DONF, BUFANJIN-FU', '']
['0000518874', 'RU', 'M.O, F. Odincovo', 'ZAO " Mremium Otel Menedjment"', '']
['0000518971', 'RU', 'Famara', 'ul.Molevaya,80,\nFamarFkaya ForodFka', '']
['0000519050', 'RU', 'MoF VniiFFok', "VlaFenko Ol'Fa VaFil'evna", '']
['0000519027', 'RU', 'Ft-MeterFburF', 'DorozhinFkaya LariFa Anatol\nevna', '']
['0000958779', 'KR', 'FEOUL', 'MART AV CLINIC(CLOFED)', '']
7
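
For a 600MB file, the same reader can be used without printing every row, so only the running count is kept. The sketch below is an illustration under stated assumptions, not part of the original answer: the function name `count_records`, the `encoding` parameter, and the temporary file are made up for the example. It also shows one possible, unconfirmed way to end up with 9 instead of 7: if a UTF-8 file is decoded as cp1252, each ± arrives as `Â±`, the value no longer starts with the quote character, and the reader splits on every line break.

```python
import csv
import os
import tempfile


def count_records(path, encoding="utf-8"):
    """Count logical records; csv.reader joins multi-line qualified values."""
    # newline='' passes raw line endings through to the csv module.
    with open(path, newline="", encoding=encoding) as f:
        reader = csv.reader(f, delimiter="|", quotechar="±")
        # Generator expression: rows are consumed one at a time, not stored.
        return sum(1 for _ in reader)


# The nine physical lines from the question (seven logical records).
sample = (
    "±0000958779±|±KR±|±FEOUL±|±2F, 759, YEOKFAM-DONF, FANFNAM-FU±|±±\n"
    "±0000958774±|±KR±|±BUFAN±|±208-7, CHOEUM-DONF, BUFANJIN-FU±|±±\n"
    "±0000518874±|±RU±|±M.O, F. Odincovo±|±ZAO \" Mremium Otel Menedjment\"±|±±\n"
    "±0000518971±|±RU±|±Famara±|±ul.Molevaya,80,\n"
    "FamarFkaya ForodFka±|±±\n"
    "±0000519050±|±RU±|±MoF VniiFFok±|±VlaFenko Ol'Fa VaFil'evna±|±±\n"
    "±0000519027±|±RU±|±Ft-MeterFburF±|±DorozhinFkaya LariFa Anatol \n"
    "evna±|±±\n"
    "±0000958779±|±KR±|±FEOUL±|±MART AV CLINIC(CLOFED)±|±±\n"
)

# Hypothetical test file, written as UTF-8 (an assumption for this demo).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8", newline=""
) as tmp:
    tmp.write(sample)

n_ok = count_records(tmp.name)                     # decoded with the right encoding
n_bad = count_records(tmp.name, encoding="cp1252")  # deliberately mismatched
os.remove(tmp.name)

print(n_ok)   # 7
print(n_bad)  # 9
```

If the real file uses some other encoding, pass that one instead; the only point is that the encoding used to open the file has to match the one it was written in, otherwise the quote character may not be recognised.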

Adrian Klaver
  • Hi Adrian, thank you for your response. But the file I'm receiving from the source provider is a .txt (not a .csv). When I try the above script, I'm still getting a count of 9. – VKS Sep 01 '22 at 21:10
  • The file extension is irrelevant, it is the content that matters. If the content is as you show in your question and what I copied into `count.csv` then it should work. The fact that it does not means there is something different on your end. Show your code and its output in your question. – Adrian Klaver Sep 01 '22 at 21:18
  • @VKS This is a good answer. You can change `csv_file` to `f` and `count.csv` to `count.txt` if that removes confusion for you... my answer would have been very similar. – D.L Sep 01 '22 at 23:20
  • I tried with both .csv and .txt but I am still getting a count of 9. `import csv` `with open('C:\\Users\\username\\Test_Customer.txt', newline='') as csv_file:` `reader = csv.reader(csv_file, delimiter='|', quotechar='±')` `ct = 0` `for row in reader:` `print(row)` `ct += 1` `print(ct)` – VKS Sep 02 '22 at 01:01
  • Tried a different way and it worked. Need to check if it works for my 600MB file. `file1 = open('C:\\Users\\username\\Test_Customer.txt', 'r')` `Lines = file1.readlines()` `count = 0` `for line in Lines:` `l = line.strip()` `if l[-1] == '±':` `count += 1` `print(count)` – VKS Sep 02 '22 at 01:06
  • Per the previous instructions, add the code to your question, not as comments. – Adrian Klaver Sep 02 '22 at 04:14
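
The line-ending check described in the comments can also be streamed, so the 600MB file is never loaded into memory in one go the way `readlines()` does. The following is only a sketch of that idea: the function name `count_by_line_ending`, the UTF-8 encoding, and the temporary file are assumptions for the example. It is a heuristic rather than a parser, since it assumes every record ends with the qualifier and that no line break inside a value is immediately preceded by ±.

```python
import os
import tempfile


def count_by_line_ending(path, qualifier="±", encoding="utf-8"):
    """Count physical lines whose last non-blank character is the qualifier."""
    count = 0
    with open(path, encoding=encoding) as f:
        # Iterating the file object reads one line at a time.
        for line in f:
            # endswith() is safe on empty lines, unlike indexing with [-1].
            if line.rstrip().endswith(qualifier):
                count += 1
    return count


# The nine physical lines from the question (seven logical records).
sample = (
    "±0000958779±|±KR±|±FEOUL±|±2F, 759, YEOKFAM-DONF, FANFNAM-FU±|±±\n"
    "±0000958774±|±KR±|±BUFAN±|±208-7, CHOEUM-DONF, BUFANJIN-FU±|±±\n"
    "±0000518874±|±RU±|±M.O, F. Odincovo±|±ZAO \" Mremium Otel Menedjment\"±|±±\n"
    "±0000518971±|±RU±|±Famara±|±ul.Molevaya,80,\n"
    "FamarFkaya ForodFka±|±±\n"
    "±0000519050±|±RU±|±MoF VniiFFok±|±VlaFenko Ol'Fa VaFil'evna±|±±\n"
    "±0000519027±|±RU±|±Ft-MeterFburF±|±DorozhinFkaya LariFa Anatol \n"
    "evna±|±±\n"
    "±0000958779±|±KR±|±FEOUL±|±MART AV CLINIC(CLOFED)±|±±\n"
)

# Hypothetical test file for the demo.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample)

n_lines = count_by_line_ending(tmp.name)
os.remove(tmp.name)

print(n_lines)  # 7
```

Using `endswith` on the stripped line avoids the `IndexError` that `l[-1]` would raise on a blank line. Where the data might break the assumptions above, the `csv.reader` approach from the answer is the safer choice.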