0

I am attempting to read in a large data file and convert it to a format that my other scripts can better handle.

Each datafile has a series of headers followed by two columns referring the relevant data points. This is then followed by another series of headers (in the same column) and by the next set of relevant data points. So for example:

Sample of a data file

I need to sort through the lines and write them to a file made up of multiple columns. So the first column for each set of data is the same (the frequency), so what I'm trying to get should look like the following:

Sorted data set

I'm new to python and as yet have to find any even half successful way of managing this. I've tried a basic if statement:

def LoadData(filename):

Datafile = open(filename,'r')

# Define empty lists to read the values into
a1 = []
data=Datafile.readlines()

index = 1   
for line in range(14,len(data)):
    w=data[line].split()
    if type(w[0]) == float:
        a1.append(w[index])
    if re.findall(r'[\w.]THz', w[0]):
        index = index +1

return a1

But since I can't define a list to be multidimensional I don't know how I can progress to have it assign the next series of data values to another column. Defining a numpy array doesn't help me either since I need to know the exact dimensions to start with.

I'm certain there must be a relatively straight forward way to do this, but I've failed to find it. I'd appreciate any help!

This is the data opened with notepad as requested in a comment: Notepad version of data

user1827345
  • 41
  • 1
  • 4
  • 1
    Where you are parsing data from text files, check [the ``csv`` module](http://docs.python.org/3/library/csv.html) - despite it's name it'll handle different formats too. – Gareth Latty Dec 03 '12 at 15:40
  • Is the data in Excel? Or - are the examples just how to display the data? – Jon Clements Dec 03 '12 at 15:42
  • The data is a tab delimited text file. At present I was using excel to manually sort the data as needed, and that's what I've displayed above. But it's now getting to the point where I'll have several of these files and doing each manually is a bit time consuming, and I'm assuming unnecessary... – user1827345 Dec 03 '12 at 15:47
  • Can you post the raw data before the Excel step? I is probably easier to just read the raw files in Python rather than an intermediate format from Excel –  Dec 03 '12 at 16:17
  • Well the data is just a tab delimited text file. I was only viewing it in Excel. If opened in a text editor it looks as attached above in my edit... – user1827345 Dec 04 '12 at 08:47

3 Answers3

0

Althought I have not used it (yet), pandas look that will be a good tool for your usecase. Check the dataframe data structure http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe

and the IO tools http://pandas.pydata.org/pandas-docs/stable/io.html

etzourid
  • 331
  • 4
  • 18
0

You can have lists in lists to emulate your desired two-dimensional structure:

arr= []
arr.append(['R1C1', 'R1C2', 'R1C3'])
arr.append(['R2C1', 'R2C2', 'R2C3'])
arr.append(['R3C1', 'R3C2', 'R3C3', 'R3C4'])
# each list can have as many elements as required
print(arr[-1][1]) # last row, second column

Should your algorithm require sparse arrays and instant access to elements, you can use a dictionary with a tuple as index:

arr= {}
arr[1,2]= 'R1C2'
arr[10,5]= 'R10C5'
print(arr[1,2])
tzot
  • 92,761
  • 29
  • 141
  • 204
0

It is not entirely clear to me how the numbers in your example are connected to the result table you want to create, but here is a function that will read in a tab delimited file and convert it to a 2d array where each field that represents a float is converted to a Python float and the rest remains a string.

def load_data(filename):
    data = []
    with open(filename) as f:
        for line in f:
            fields = []
            for field in line.strip().split("\t"):
                try:
                    fields.append(float(field))
                except Exception, e:
                    fields.append(field)
            data.append(fields)
    return data

Example:

8.57    0.21
8.58    0.22
8.59    0.23

Curvelabel    = Balance [ 1 (6) ]
Filename      = bil

8.57    0.21
8.58    0.22
8.59    0.23

Output:

[[8.57, 0.21], [8.58, 0.22], [8.59, 0.23],
[''],
['Curvelabel    = Balance [ 1 (6) ]'],
['Filename      = bil'],
[''],
[8.57, 0.21], [8.58, 0.22], [8.59, 0.23]]
hochl
  • 12,524
  • 10
  • 53
  • 87