Extracting selected columns from a datafile using python

Question

I have a data file like this

0.000       1.185e-01  1.185e-01  3.660e-02  2.962e-02  0.000e+00  0.000e+00  0.000e+00  0.000e+00  0.000e+00
0.001       1.185e-01  1.185e-01  3.660e-02  2.962e-02  -1.534e-02  -1.534e-02  8.000e-31  8.000e-31  0.000e+00
0.002       1.185e-01  1.185e-01  3.659e-02  2.961e-02  -1.541e-02  -1.541e-02  -6.163e-01  -6.163e-01  -4.284e-05
0.003       1.186e-01  1.186e-01  3.657e-02  2.959e-02  -1.547e-02  -1.547e-02  -8.000e-31  -8.000e-31  0.000e+00
0.004       1.186e-01  1.186e-01  3.657e-02  2.959e-02  -2.005e-32  -2.005e-32  -8.000e-31  -8.000e-31  0.000e+00
0.005       1.186e-01  1.186e-01  3.657e-02  2.959e-02  -2.005e-32  -2.005e-32  -8.000e-31  -8.000e-31  0.000e+00
0.006       1.187e-01  1.186e-01  3.657e-02  2.959e-02  -2.005e-32  -2.005e-32  -8.000e-31  -8.000e-31  0.000e+00
0.007       1.187e-01  1.187e-01  3.657e-02  2.959e-02  -2.005e-32  -2.005e-32  -8.000e-31  -8.000e-31  0.000e+00
0.008       1.188e-01  1.187e-01  3.657e-02  2.959e-02  -2.005e-32  -2.005e-32  -8.000e-31  -8.000e-31  0.000e+00
0.009       1.188e-01  1.187e-01  3.657e-02  2.959e-02  -2.005e-32  -2.005e-32  -8.000e-31  -8.000e-31  0.000e+00

I want to copy only selected columns from this file to another file. Suppose I copy the 1st, 2nd and 6th columns to a file, then that file should look like

0.000       1.185e-01  0.000e+00
0.001       1.185e-01  -1.534e-02
0.002       1.185e-01  -1.541e-02
0.003       1.186e-01  -1.547e-02
0.004       1.186e-01  -2.005e-32
0.005       1.186e-01  -2.005e-32
0.006       1.187e-01  -2.005e-32
0.007       1.187e-01  -2.005e-32
0.008       1.188e-01  -2.005e-32
0.009       1.188e-01  -2.005e-32

This is a very large formatted text file which was initially written like this

f=open('myMD.dat','w')
s='%8.3e  %8.3e  %8.3e  %8.3e  %8.3e  %8.3e  %8.3e  %8.3e  %8.3e\t\t'%(xpos1[i],ypos1[i],xvel1[i],yvel1[i],xacc1[i],yacc1[i],xforc[i],yforc[i],potn[i])
f.write(s)
f.close()

I am programming in python. How can I do this?

I think you should have a go at this yourself first, with pandas, lists of lists, or something else, and then ask on SO if you run into specific issues. Personally, I'd recommend loading this data into a pandas DataFrame. — Noel Evans, Apr 21 '16 at 15:35
Pandas is overkill for this, I think. I'd use `numpy.loadtxt` to read it into a numpy array. then `np.transpose` to get columns, then copy those to a new array and save that array with `np.savetxt`. — roadrunner66, Apr 21 '16 at 15:43
@roadrunner, this data file is huge. 500 columns, 10000 rows. Would it be good to read it to a numpy array? — kanayamalakar, Apr 21 '16 at 15:51
@NoelEvans, I have never used pandas before. Can you please tell me in more details what I need to do? — kanayamalakar, Apr 21 '16 at 15:54
@kanayamalakar: That does not seem like a very large file to me. 500 x 16 x 10000 / 1024 / 1024 = 76.3 MB. That should easily fit into memory unless you are doing this on an embedded system with limited resources. — , Apr 21 '16 at 17:16
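
The NumPy route suggested in the comments above could look roughly like the sketch below. It assumes NumPy is installed and that the file is purely numeric; the file names and the two sample rows are placeholders for illustration. Unlike the string-based answers further down, this converts every value to a float, so the output formats have to be chosen explicitly.

```python
import os
import tempfile

import numpy as np

# Two sample rows in the question's layout (whitespace-separated, 10 columns).
sample = (
    "0.000       1.185e-01  1.185e-01  3.660e-02  2.962e-02  0.000e+00  0.000e+00  0.000e+00  0.000e+00  0.000e+00\n"
    "0.001       1.185e-01  1.185e-01  3.660e-02  2.962e-02  -1.534e-02  -1.534e-02  8.000e-31  8.000e-31  0.000e+00\n"
)

# Hypothetical file names, created in a scratch directory for the demo.
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "input.dat")
out_path = os.path.join(workdir, "selected.dat")
with open(in_path, "w") as f:
    f.write(sample)

# usecols is zero-based: columns 1, 2 and 6 are indices 0, 1 and 5.
# Only the requested columns are kept in the resulting array.
data = np.loadtxt(in_path, usecols=(0, 1, 5))

# One format per output column, chosen here to mimic the original layout.
np.savetxt(out_path, data, fmt=["%.3f", "%.3e", "%.3e"], delimiter="  ")
```

Because `usecols` picks the columns while reading, no transpose or copy step is needed afterwards.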

score 1 · Answer 1 · 2016-04-21T17:46:44.557

This will read a given input file and select columns using a given comma-separated list of column numbers (1-based):

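import sys

input_name = sys.argv[1]
# Column numbers on the command line are 1-based; convert to 0-based indices.
column_list = [(int(x) - 1) for x in sys.argv[2].split(',')]
with open(input_name) as input_file:
    for line in input_file:
        row = line.split()
        for col in column_list:
            print(row[col], end=" ")  # Python 3 form of `print row[col],`
        print()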

It reads and prints one line at a time, which means it should be able to handle an arbitrarily large input file. Using your example data as input.txt, I ran it like this:

python selected_columns.py input.txt 1,2,6

It produced the following output (ellipsis used to show lines removed for brevity):

0.000 1.185e-01 0.000e+00 
0.001 1.185e-01 -1.534e-02 
...
0.009 1.188e-01 -2.005e-32

You can save the output to a file using redirection:

python selected_columns.py input.txt 1,2,6 > output.txt
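
If you would rather write the output file from Python than rely on shell redirection, the same line-by-line idea can be wrapped in a function. This is only a sketch: `select_columns`, its parameters and the file names are made up for illustration, and it assumes every non-blank line has at least as many fields as the largest column number requested.

```python
import os
import tempfile


def select_columns(input_name, output_name, columns):
    """Copy the given 1-based columns of each line to output_name."""
    indices = [c - 1 for c in columns]  # convert to 0-based indices
    with open(input_name) as src, open(output_name, "w") as dst:
        for line in src:  # one line at a time, so memory use stays small
            row = line.split()
            if row:  # skip blank lines
                dst.write(" ".join(row[i] for i in indices) + "\n")


# Demo on two sample rows from the question (paths are hypothetical).
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "input.txt")
dst_path = os.path.join(workdir, "output.txt")
with open(src_path, "w") as f:
    f.write("0.000       1.185e-01  1.185e-01  3.660e-02  2.962e-02  0.000e+00  0.000e+00\n")
    f.write("0.001       1.185e-01  1.185e-01  3.660e-02  2.962e-02  -1.534e-02  -1.534e-02\n")

select_columns(src_path, dst_path, [1, 2, 6])
with open(dst_path) as f:
    result = f.read()
print(result, end="")
```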

edited Apr 21 '16 at 17:46

answered Apr 21 '16 at 17:30

I don't know if I used it right, but it returns `IndexError: list index out of range` at the line ` input_name = sys.argv[1]`. – kanayamalakar Apr 21 '16 at 17:46
@kanayamalakar: Did you pass a file name on the command line? – Apr 21 '16 at 17:48
Yes I did, actually I had not compiled the way you had mentioned. But now it works absolutely fine. Thanks for your time. – kanayamalakar Apr 21 '16 at 17:51

nigel222 · Accepted Answer · 2016-04-21T18:00:27.633

Far simpler, yet quite versatile. Assuming none of the fields contain any spaces you can simply use the split method on each line to get a list of fields and then print the ones you want. Here's a script that lets you specify which columns and a separator string for the output.

Note: at no point are we converting between string and float. This preserves the original formatting of the numbers and, for a huge file, saves a lot of CPU!

COLS = 0, 1, 5  # the columns you want. The first is numbered zero.
                # NB it's a tuple: COLS = 0, for one column (the trailing comma is mandatory)

SEP = ', '  # the string you want to separate the columns in the output

INFILE = 't.txt'      # file to read from
OUTFILE = 'out.txt'   # file to write to

# `with` closes both files when done, which also flushes the output to disk.
with open(INFILE, 'r') as f, open(OUTFILE, 'w') as g:
    for line in f:  # iterate lazily instead of loading every line with readlines()
        x = line.split()
        if x:  # ignore blank lines
            y = [x[i] for i in COLS]
            outline = SEP.join('{}'.format(q) for q in y)
            g.write(outline + '\n')

Just realized that `'{}'.format(q) for q in y` is overkill here. `y` is a list of strings to be output unchanged, so `SEP.join(y)` is all you need. But showing the pattern for applying a format to a list of similar elements is probably useful.
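
To make that distinction concrete, here is a small sketch of both variants. The helper names `pick` and `pick_formatted` and the `{:.4f}` format are illustrative choices, not part of the script above. The first keeps the strings untouched; the second shows where a format would go if you did want to convert and reformat the numbers.

```python
def pick(line, cols, sep=", "):
    """Join the selected fields unchanged (no string-to-float conversion)."""
    fields = line.split()
    return sep.join(fields[i] for i in cols)


def pick_formatted(line, cols, sep=", ", fmt="{:.4f}"):
    """Convert the selected fields to float and apply one format to each."""
    fields = line.split()
    return sep.join(fmt.format(float(fields[i])) for i in cols)


line = "0.001       1.185e-01  1.185e-01  3.660e-02  2.962e-02  -1.534e-02"
print(pick(line, (0, 1, 5)))            # 0.001, 1.185e-01, -1.534e-02
print(pick_formatted(line, (0, 1, 5)))  # 0.0010, 0.1185, -0.0153
```

The second variant costs a float conversion per field and changes how the numbers look, which is exactly what the note above warns about.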

score 0 · Answer 3 · answered Apr 21 '16 at 15:42

What kind of file is this? Comma delimited? Plain text? If it is a *.csv file you can try this:

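import csv

# Note: with delimiter=' ', every single space separates fields, so runs of
# spaces produce empty strings between the values.
with open('filepath', 'r') as openFile:
    dataIn = csv.reader(openFile, delimiter=' ')
    col1, col2, col6 = [], [], []
    for rows in dataIn:
        col1.append(rows[0])
        col2.append(rows[1])
        col6.append(rows[5])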
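
For data padded with several spaces between values, as in the question, one possible adjustment is the `skipinitialspace=True` option of `csv.reader`, which ignores spaces that directly follow a delimiter. The sketch below assumes space-separated fields with no trailing spaces at the end of a line, and uses an in-memory sample instead of a real file.

```python
import csv
import io

# In-memory stand-in for the data file (first six columns of two sample rows).
sample = io.StringIO(
    "0.000       1.185e-01  1.185e-01  3.660e-02  2.962e-02  0.000e+00\n"
    "0.001       1.185e-01  1.185e-01  3.660e-02  2.962e-02  -1.534e-02\n"
)

# skipinitialspace=True makes a run of spaces behave like a single delimiter.
reader = csv.reader(sample, delimiter=" ", skipinitialspace=True)

col1, col2, col6 = [], [], []
for row in reader:
    if not row:  # skip blank lines
        continue
    col1.append(row[0])
    col2.append(row[1])
    col6.append(row[5])

print(col1, col2, col6)
```

For files that mix tabs and spaces, plain `line.split()` is usually the simpler tool.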

Ma0

It is a formatted text file- .dat format. Will this code work for this file? – kanayamalakar Apr 21 '16 at 15:58
Nope. But from your edit above i see that you are the one writing it as well. Why not create the second file at the same time ? Just add an "s2" and an "f2" object. No? – Ma0 Apr 21 '16 at 16:04
Yeah , that's because this is simulation data and thus I need to make my program write the minimum in order to save computation time , and then do all analysis later. – kanayamalakar Apr 21 '16 at 16:17

NonlinearFruit · Answer 4 · 2016-04-21T17:17:22.587

Column Data

This method will work for any data file that meets these requirements:

The data is separated by whitespace (i.e. space, tab, or newline)
The data entries do not contain whitespace

The sample data given meets these requirements. This method uses Python 3 and Regular Expressions to pull out specific columns from the data.

To use this, simply:

Call the init(file) function once
- Pass in the path to the data file
Then call getColm(i) as many times as needed
- Pass in the zero-based index of the column you need
- It will return a list of that column's entries.

Here is the code. Make sure to import the regular expression library re.

import re

matrixOfFile = []

# Prep the matrixOfFile variable
def init(filepath):
    global matrixOfFile
    matrixOfFile = []  # start fresh if init() is called again
    # Read the file content
    with open(filepath, 'r') as file:
        fileContent = file.read()
    # Split the file into rows
    rows = fileContent.split("\n")

    # Split rows into entries and add them to matrixOfFile
    for row in rows:
        # Find all runs of non-whitespace characters in the row
        entries = re.findall(r"\S+", row)
        if entries:  # skip blank lines, e.g. the empty string after the final newline
            matrixOfFile.append(entries)

# Returns the ith column of the matrixOfFile
# i should be an int with 0 <= i < len(matrixOfFile[0])
def getColm(i):
    if i < 0 or i >= len(matrixOfFile[0]):
        raise ValueError('Column ' + str(i) + ' does not exist')
    colum = []
    for row in matrixOfFile:  # For each row, add whatever is in the ith column to colum
        colum.append(row[i])

    return colum

# Absolute filepath might be necessary ( eg "C:/Windows/Something/Users/Documents/data.dat" )
init("data.dat")
# Gets the first, second and sixth columns of data
print(getColm(0))
print(getColm(1))
print(getColm(5))
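
As an aside, the same rows-to-columns idea can be sketched without regular expressions or a global variable by using `str.split()` and `zip`. The function name `read_columns` is made up for this illustration, and the sketch assumes every non-blank line has the same number of fields (`zip` silently truncates to the shortest row otherwise).

```python
def read_columns(lines):
    """Return a list of columns, each a list of strings, from whitespace-separated lines."""
    rows = [line.split() for line in lines if line.strip()]  # drop blank lines
    return [list(col) for col in zip(*rows)]  # transpose rows into columns


sample_lines = [
    "0.000       1.185e-01  1.185e-01  3.660e-02  2.962e-02  0.000e+00\n",
    "0.001       1.185e-01  1.185e-01  3.660e-02  2.962e-02  -1.534e-02\n",
    "\n",
]
columns = read_columns(sample_lines)
print(columns[0])  # first column
print(columns[5])  # sixth column
```

An open file object can be passed in place of `sample_lines`, since iterating over a file yields its lines.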

Thanks, but it shows `AttributeError: 'file' object has no attribute 'split'` at the line `rows = fileContent.split("\n")`. — kanayamalakar, Apr 21 '16 at 16:28
@kanayamalakar Sorry about that. It should be fixed now. I was using an online compiler to test my code and didn't get to test the file i/o. — NonlinearFruit, Apr 21 '16 at 17:22
