Delete or remove unexpected records and strings based on multiple criteria by python or R script

Question

I have a .csv file named fileOne.csv that contains many unnecessary strings and records. I want to delete unnecessary records / rows and strings based on multiple condition / criteria using a Python or R script and save the records into a new .csv file named resultFile.csv.

What I want to do is as follows:

Delete the first column.
Split column BB into two column named as a_id, and c_id. Separate the value by _ (underscore) and left side will go to a_id, and right side will go to c_id.
Keep only records that have the .csv file extension in the files column, but do not contain No Bi in cut column.
Assign new name to each of the columns.
Delete the records that contain strings like less in the CC column.
Trim all other unnecessary string from the records.
Delete the reamining filds of each rows after I find the "Mi" in each rows.

My fileOne.csv is as follows:

   AA      BB       CC       DD     EE      FF    GG
   1       1_1.csv  (=0      =10"   27"     =57   "Mi"
   0.97    0.9      0.8      NaN    0.9     od    0.2
   2       1_3.csv  (=0      =10"   27"     "Mi"  0.5
   0.97    0.5      0.8      NaN    0.9     od    0.4
   3       1_6.csv  (=0      =10"   "Mi"     =53  cnt
   0.97    0.9      0.8      NaN    0.9     od    0.6
   4       2_6.csv  No Bi    000    000     000   000
   5       2_8.csv  No Bi    000    000     000   000
   6       6_9.csv  less     000    000     000   000
   7       7_9.csv  s(=0     =26"   =46"    "Mi"  121

My 1st expected results files would be as follows:

a_id    b_id    CC    DD    EE    FF    GG             
1       1       0     10    27    57    Mi              
1       3       0     10    27    Mi    0.5
1       6       0     10    Mi    53    cnt 
7       9       0     26    46    Mi    121

My final expected results files would be as follows:

a_id    b_id    CC    DD    EE    FF    GG             
1       1       0     10    27    57              
1       3       0     10    27
1       6       0     10 
7       9       0     26    46

Martin Evans · Accepted Answer · 2015-12-09T09:30:53.420

1

This can be achieved with the following Python script:

import csv
import re
import string

output_header = ['a_id', 'b_id', 'CC', 'DD', 'EE', 'FF', 'GG']

sanitise_table = string.maketrans("","")
nodigits_table = sanitise_table.translate(sanitise_table, string.digits)

def sanitise_cell(cell):
    return cell.translate(sanitise_table, nodigits_table)       # Keep digits

with open('fileOne.csv') as f_input, open('resultFile.csv', 'wb') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)

    input_header = next(f_input)
    csv_output.writerow(output_header)

    for row in csv_input:
        bb = re.match(r'(\d+)_(\d+)\.csv', row[1])

        if bb and row[2] not in ['No Bi', 'less']:
            # Remove all columns after 'Mi' if present
            try:
                mi = row.index('Mi')
                row[:] = row[:mi] + [''] * (len(row) - mi)
            except ValueError:
                pass

            row[:] = [sanitise_cell(col) for col in row]
            row[0] = bb.group(1)
            row[1] = bb.group(2)
            csv_output.writerow(row)

To simply remove Mi columns from an existing file the following can be used:

import csv

with open('input.csv') as f_input, open('output.csv', 'wb') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)

    for row in csv_input:
        try:
            mi = row.index('Mi')
            row[:] = row[:mi] + [''] * (len(row) - mi)
        except ValueError:
            pass

        csv_output.writerow(row)

Tested using Python 2.7.9

edited Dec 09 '15 at 09:30

answered Dec 04 '15 at 07:22

Martin Evans

45,791
17
81
97

Hello Evans, Thank you very much. I got a simple error as follows, please take a look: sanitise_table = string.maketrans("","") AttributeError: 'module' object has no attribute 'maketrans' – Tofazzal Dec 04 '15 at 08:18
Try using Python 2.x. – Martin Evans Dec 04 '15 at 08:20
Hello Evans, Thank you. I have run the program in python 2.7.10. There is no error now. But the program produce only he header into the new file. May be there is a small problem. Could you please take a look. Thanks – Tofazzal Dec 04 '15 at 17:19
This means your input file does not match the sample data you have given in the question. If you add `print bb` I am guessing it will be `None`. It is looking for `X_Y.csv` as a filename. – Martin Evans Dec 06 '15 at 14:35
Hello Evans, Thank you. Based on your sample, finally I have correct my data fram and got the result. I need a little more work. I find tha I do not need any informaion after I find 'Mi' in each records. Mi means Missing. So, after Missing I do not need any field value in each records. That means delete everythin all after find 'Mi'. Could you please let me know how I can do this in a separate python program, i.e., an individual program to do this task. Thank you very much. – Tofazzal Dec 09 '15 at 07:27
Hello Evans, Thank you. I find a simple error as follow. In addition, is it possible to write a separate program to delete from Mi. I may need it to use it different purpose. File "fileONe.py", line 28 except ValueError, e: ^ SyntaxError: invalid syntax – Tofazzal Dec 09 '15 at 09:11
Depending on your Python version, trying removing `, e` from the line, or replacing it with `as e` – Martin Evans Dec 09 '15 at 09:16
Evans, Thank you very much. Great and excellent! I have got the result. Thank you. – Tofazzal Dec 09 '15 at 10:03
Hello Evans, Another simple point. Is it possible if I want to take only records which have this condition. For instance, in my fileOne.csv (above) I have many records, finally I find that I only need the records which have "Mi". So, based on condition 'Mi', I can take all the records which has "Mi" and then remove fields from 'Mi'. Is this possible? – Tofazzal Dec 09 '15 at 10:14
It is very straight forward, but to avoid confusion on this question (for other readers), I suggest you start a new question. – Martin Evans Dec 09 '15 at 10:16
Hello Evans, I posted a nes thread in the following url. http://stackoverflow.com/questions/34176960/select-records-based-on-the-specific-index-string-value-and-then-remove-subseque – Tofazzal Dec 09 '15 at 10:56

Delete or remove unexpected records and strings based on multiple criteria by python or R script

1 Answers1

Linked