12

I need to read a .dat file in python which has 12 columns in total and millions of lines of rows. I need to divide column 2,3 and 4 with column 1 for my calculation. So before I load that .dat file, do I need to delete all the other unwanted columns? If not, how do I selectively declare the column and ask python to do the math?

an example of the .dat file would be data.dat

I am new to python , so a little instruction to open , read and calculation would be appreciated.

I have added the code I am using as a starter from your suggestion:

from sys import argv

import pandas as pd



script, filename = argv

txt = open(filename)

print "Here's your file %r:" % filename
print txt.read()

def your_func(row):
    return row['x-momentum'] / row['mass']

columns_to_keep = ['mass', 'x-momentum']
dataframe = pd.read_csv('~/Pictures', delimiter="," , usecols=columns_to_keep)
dataframe['new_column'] = dataframe.apply(your_func, axis=1)

and also the error I get through it:

Traceback (most recent call last):
  File "flash.py", line 18, in <module>
    dataframe = pd.read_csv('~/Pictures', delimiter="," , usecols=columns_to_keep)
  File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 529, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 295, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 612, in __init__
    self._make_engine(self.engine)
  File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 747, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/home/trina/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 1119, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 518, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:5030)
ValueError: No columns to parse from file
bhjghjh
  • 889
  • 3
  • 16
  • 42
  • I'd guess `'~/Pictures'` does not contain the data you're looking for. Try reading the correct file. – Jan Christoph Terasa Jun 22 '16 at 05:13
  • I checked both the .dat and .py is in my ~/Pictures folder – bhjghjh Jun 22 '16 at 07:00
  • 1
    Do you expect `pandas.read_csv` to guess which file you want to open? Try `pd.read_csv('~/Pictures/data.dat', delimiter="," , usecols=columns_to_keep)`. Alternatively, remove `txt.read()` and pass `txt` as a first argument to `read_csv`. – Jan Christoph Terasa Jun 22 '16 at 12:41
  • thanks, I didn't notice that, now I have a new error, do I also have to define the headings here? [ValueError: 'x-momentum' is not in list] – bhjghjh Jun 22 '16 at 13:51
  • @bhjghjh The code assumes that your data file actually has a first row defining field names of "x-momentum", etc. If your data file doesn't have a first row defining field names, you can change the code to use the column indexes instead, e.g. columns_to_keep = [2, 3, 4, 7]. Or add a first row with column names. – Bill Jun 22 '16 at 13:57
  • @Bill, I think the file already has a first row with column names, (the # lines, right?) if you please click on the link to see the .dat file and double check it , I would very much appreciate that – bhjghjh Jun 22 '16 at 14:07
  • Your file is 6MB. Please upload a small, minimal example. A service like pastebin.com is far more fitting, because people can immediately see the file: http://pastebin.com/XNQie3cv – Jan Christoph Terasa Jun 23 '16 at 07:16

6 Answers6

12

After looking at your flash.dat file, it's clear you need to do a little clean up before you process it. The following code converts it to a CSV file:

import csv

# read flash.dat to a list of lists
datContent = [i.strip().split() for i in open("./flash.dat").readlines()]

# write it as a new CSV file
with open("./flash.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(datContent)

Now, use Pandas to compute new column.

import pandas as pd

def your_func(row):
    return row['x-momentum'] / row['mass']

columns_to_keep = ['#time', 'x-momentum', 'mass']
dataframe = pd.read_csv("./flash.csv", usecols=columns_to_keep)
dataframe['new_column'] = dataframe.apply(your_func, axis=1)

print dataframe
Bill
  • 1,247
  • 8
  • 12
  • I think the `delimiter` cannot be `,` – ppaulojr Jun 22 '16 at 00:06
  • 1
    @ppaulojr according to docs, ',' is the default. I just included it so it was obvious it could be changed http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html – Bill Jun 22 '16 at 00:17
  • 1
    yeah, but since you are doing a functional example, it would be interesting to use the right delimiter to his case – ppaulojr Jun 22 '16 at 00:19
  • 1
    ValueError: No columns to parse from file, that's the error I got while executing the code, any help? – bhjghjh Jun 22 '16 at 02:34
  • @bhjghjh took a look at your flash.dat, updated with a complete working solution – Bill Jun 22 '16 at 14:55
  • @Bill, works perfect thanks a lot. I would ask for a last help from you, is there a way I can retrieve all the rows in that frame? I need to plot that column against time. (it shows till 29s of data, then skips to last few rows). or can the plotting be made along with data making, maybe with some plot command? – bhjghjh Jun 22 '16 at 15:33
  • @bhjghjh Can you post this as a new question? The moderators don't like ongoing comment threads that stray from the original question. – Bill Jun 22 '16 at 15:36
  • The number of coolumn headers does not match the number of columns in your example. This will lead to _wrong_ results. You have to make absolutely sure that there are no whitespace in the column names! – Jan Christoph Terasa Jun 23 '16 at 07:43
  • `len(datContent[0]) != len(datContent[1])`! The data is not NxM, and hence is not a rectangular matrix. All the column headers after the first one with a whitespace are wrong. – Jan Christoph Terasa Jun 23 '16 at 08:30
3
train=pd.read_csv("Path",sep=" ::",header=None)

Now you can access the dat file.

train.columns=["A","B","C"]# Number of columns you can see in the dat file.

then you can use this as csv files.

aristotll
  • 8,694
  • 6
  • 33
  • 53
Nisarg Bhatt
  • 379
  • 1
  • 13
2

Try something like:

datContent = [i.strip().split() for i in open("filename.dat").readlines()]

Then you'll have your data in a list.

If you want to have something more sophisticated you can use Pandas, see the linked cookbook.

ppaulojr
  • 3,579
  • 4
  • 29
  • 56
2

Consider using the general read_table() function (of which read_csv() is a special type) where pandas can easily import the specific .dat file specifying the space separator, sep='\s+'. Additionally, no defined function with apply() is needed for column by column calculation.

Below numpy is used to condition for division by zero. Also, the example .dat file's first column is #time and columns 2, 3, 4 are x-momentum, y-momentum, and mass (different expression in your code but revise as needed).

import pandas as pd
import numpy as np

columns_to_keep = ['#time', 'x-momentum', 'y-momentum', 'mass']
df = pd.read_table("flash.dat", sep="\s+", usecols=columns_to_keep)

df['mass_per_time'] = np.where(df['#time'] > 0, df['mass']/df['#time'], np.nan)
df['x-momentum_per_time'] = np.where(df['#time'] > 0, df['x-momentum']/df['#time'], np.nan)
df['y-momentum_per_time'] = np.where(df['#time'] > 0, df['y-momentum']/df['#time'], np.nan)
Parfait
  • 104,375
  • 17
  • 94
  • 125
1

The problem you face here is that the column header names have whitespaces in them. You need to fix/ignore that to make pandas.read_csv behave nicely. This will read the column header names into a list based on the fixed length of the field name strings:

import pandas

with open('flash.dat') as f:
    header = f.readline()[2:-1]
    header_fixed = [header[i*23:(i+1)*23].strip() for i in range(26)]
    header_fixed[0] = header_fixed[0][1:] # remove '#' from time

    # pandas doesn't handle "Infinity" properly, read Infinity as NaN, then convert back to infinity
    df = pandas.read_csv(f, sep='\s+', names=header_fixed, na_values="Infinity")
    df.fillna(pandas.np.inf, inplace=True)

# processing
df['new_column'] = df['x-momentum'] / df['mass']
Jan Christoph Terasa
  • 5,781
  • 24
  • 34
0

Load the dat file as a list, and then convert the list to the desired data format, i.e., the numpy array.

For example, to load the Australian Credit Approval Data Set (found in UCI machine learning repository) in dat format, I use the following code:

dataset=[i.strip().split() for i in open("./data2/australian.dat").readlines()]

australian=np.full([len(dataset),len(dataset[0])],None)

for i in range(len(dataset)):

for j in range(len(dataset[0])):

    if j==0:
        australian[i,j]=int(dataset[i][j])    
    else:
        australian[i,j]=float(dataset[i][j])

np.save('./data2/australian.npy', australian)
Mingming Qiu
  • 333
  • 4
  • 9