0

I have a .csv file named file01.csv that contains many records. Some records are required and some are not. I find that the required records has a string variable “Mi”, but it is not exist into the unnecessary records. So, I want to select the required records based on string value “Mi” in the field for every records.

Finally I want to delete the subsequent fields of each record from the field that contains value “Mi”. Any suggestion and advice is appreciated.

Optional:

  1. In addition, I want to delete the first column.

  2. Split column BB into two column named as a_id, and c_id. Separate the value by _ (underscore) and left side will go to a_id, and right side will go to c_id.

My fileO.csv is as follows:

   AA      BB       CC       DD     EE      FF    GG
   1       1_1.csv  (=0      =10"   27"     =57   "Mi"
   0.97    0.9      0.8      NaN    0.9     od    0.2
   2       1_3.csv  (=0      =10"   27"     "Mi"  0.5
   0.97    0.5      0.8      NaN    0.9     od    0.4
   3       1_6.csv  (=0      =10"   "Mi"     =53  cnt
   0.97    0.9      0.8      NaN    0.9     od    0.6
   4       2_6.csv  No Bi    000    000     000   000
   5       2_8.csv  No Bi    000    000     000   000
   6       6_9.csv  less     000    000     000   000
   7       7_9.csv  s(=0     =26"   =46"    "Mi"  121     

My Expected results files (outFile.csv):

a_id    b_id    CC    DD    EE    FF    GG             
1       1       0     10    27    57              
1       3       0     10    27
1       6       0     10 
7       9       0     26    46  
Martin Evans
  • 45,791
  • 17
  • 81
  • 97
Tofazzal
  • 71
  • 8

1 Answers1

1

The following approach should work fine using Python csv module:

import csv
import re
import string

output_header = ['a_id', 'b_id', 'CC', 'DD', 'EE', 'FF', 'GG']

sanitise_table = string.maketrans("","")
nodigits_table = sanitise_table.translate(sanitise_table, string.digits)

def find_mi(row):
    for index, col in enumerate(row):
        if col.find('Mi') != -1:
            return index
    return -1

def sanitise_cell(cell):
    return cell.translate(sanitise_table, nodigits_table)       # Keep digits

f_input = open('fileO.csv', 'rb')
f_output = open('outFile.csv', 'wb')

csv_input = csv.reader(f_input)
csv_output = csv.writer(f_output)

input_header = next(f_input)
csv_output.writerow(output_header)

for row in csv_input:
    #print '%2d  %s' % (len(row), row)  
    if len(row) >= 2:
        bb = re.match(r'(\d+)__(\d+).0\.csv', row[1])
        mi = find_mi(row)

        if bb and mi != -1:
            row[:] = row[:mi] + [''] * (len(row) - mi)
            row[:] = [sanitise_cell(col) for col in row]
            row[0] = bb.group(1)
            row[1] = bb.group(2)
            csv_output.writerow(row)

f_input.close()
f_output.close()

outFile.csv will contain the following:

a_id,b_id,CC,DD,EE,FF,GG
1,1,0,10,27,57,
1,3,0,10,27,,
1,6,0,10,,,
7,9,0,26,46,,

Tested using Python 2.6.6

Martin Evans
  • 45,791
  • 17
  • 81
  • 97
  • Hello Evans, Thank you. It shows a simple error as follows: Please take a look. Traceback (most recent call last): File "Test02.py", line 27, in bb = re.match(r'(\d+)_(\d+)\.csv', row[1]) IndexError: list index out of range – Tofazzal Dec 09 '15 at 11:25
  • It implies your input csv file might have some blank lines, or a row with just a single column? I have modified the script to cope with this. – Martin Evans Dec 09 '15 at 12:06
  • Hello Evans, Good morning. There is a simple problem as follows: row[1] = bb.group(2) AttributeError: 'NoneType' object has no attribute 'group'. I am using python 2.6.6. The important point is to select the required records based on the string "Mi" or existance of .csv in the BB column. All the records that do not have .csv extension or string "Mi" should be deleted. Could you please take a look on this issue? Could you please take a look? Thank you. – Tofazzal Dec 10 '15 at 05:15
  • I have added it to the exception list, so it should work now, what version of Python are you using? – Martin Evans Dec 10 '15 at 07:26
  • Evans, Now it is working but produce some unexpeted records. First it select records that does not contain "Mi". And the a_id and b_id as a little problem. a_id contain the value of column AA, and BB contains the full stirng of BB column just deleting the " __". So, I just expect to select only the records that contains "Mi" string or the records that contains .csv in the BB column. Could you pleae just take a look on this issue? Just select the records which has "Mi" string or BB column contain .csv extesion. Thank you. – Tofazzal Dec 10 '15 at 09:25
  • Your data is somehow different to what I am testing with as it is working for me. I have have made some changes to make it more resilient. – Martin Evans Dec 10 '15 at 10:05
  • Evans, It just write the header in the output file. Another simple option is just to select the records where BB column contain the .csv string or extension. Please take look on this. – Tofazzal Dec 10 '15 at 11:02
  • Evans, Thank you. I have tried with the same code and data. It now produce a simple errror like this. File "fileOne.py", line 28, in print '{:2} {}'.format(len(row), row) ValueError: zero length field name in format – Tofazzal Dec 11 '15 at 02:43
  • Hello Evans, Thank you. I can find the display all the rows that contain same data with 7 columns. But the out file produce just the header. 1 ['1,1_1.csv,(=0,"=10""","27""",57,Mi'] 1 ['0.97,0.9,0.8,NaN,0.9,od,0.2'] 1 ['2,1_3.csv,(=0,"=10""","27""",Mi,0.5'] 1 ['0.97,0.5,0.8,NaN,0.9,od,0.4'] 1 ['3,1_6.csv,(=0,"=10""",Mi,53,cnt'] 1 ['0.97,0.9,0.8,NaN,0.9,od,0.6'] 1 ['4, 2_6.csv,No Bi,0,0,0,0'] 1 ['5,2_8.csv,No Bi,0,0,0,0'] 1 ['6,6_9.csv,less,0,0,0,0'] 1 ['7, 7_9.csv,s(=0,"=26""","=46""",Mi,121'] – Tofazzal Dec 11 '15 at 10:47
  • This shows only 1 column is being found per row, not 7. This should work in later versions of Python, what version are you using? – Martin Evans Dec 11 '15 at 11:14
  • Yes, this is the problem. I am using 2.6.6. But I cannot change the version, because other student also use this version. Another way, I find that I can take the required records just partially considering steing in BB column as follows: if '.csv' in row[1]: writer.writerow(row) After that I can take values until find 'Mi' in each records (that you wrote previously). So, now if we can join to take the recored found .csv in BB and then select until 'Mi'. Could you pleae take a look on this? – Tofazzal Dec 11 '15 at 12:20
  • Send me a link to your fileO.csv file and I will then take another look. – Martin Evans Dec 11 '15 at 12:40
  • Evans, Thank you. The file is located in the local server. And access the files from outside of lab not supported. If you give me you mail i can send the file. – Tofazzal Dec 11 '15 at 13:05
  • Evans. Good morning. I paste the file into Obin. here is the ursl: http://0bin.net/paste/ReGcUYT7TAxV0ATF#2zUdcDRyF7VsAhMT9qPfdI6f5x9ki7HWvJ5ysUgbcXf – Tofazzal Dec 12 '15 at 05:01
  • The filenames were not in the correct format. They had two underscores and an additional `.0` at the end. This broke the `re.match`. I get an output file 322 lines long. – Martin Evans Dec 12 '15 at 07:14
  • Evans, Great. It is warking here exactly. – Tofazzal Dec 12 '15 at 11:33
  • Evans, I have posted a new question. It is a manipulation on the results obtained from this program. Could you kindly please take a look on the new question? Here is the link of the new post. http://stackoverflow.com/questions/34241789/add-new-column-in-a-csv-file-and-manipulate-on-the-on-records – Tofazzal Dec 12 '15 at 15:53
  • Evans, Thanks. I expect you will take a loon on my new post in the following link. http://stackoverflow.com/questions/34241789/add-new-column-in-a-csv-file-and-manipulate-on-the-on-records – Tofazzal Dec 14 '15 at 03:51