
I am trying to read a .csv file whose columns hold several different data formats. I am using Python 3.2 and NumPy 1.9, and I am reading the data with the numpy genfromtxt function. I was hoping I could convert the data as I read it, so that it is stored appropriately instead of being processed later; for this I am passing converter functions through the converters option.

Using multiple converter functions seems to cause a problem. The code, the input, and the output of the code are listed below. As you can see, the first line of output comes from a different column of the input file than the others.

Has anyone used this feature before? Is there a bug in my code somewhere?

CODE:

import numpy as np
from datetime import datetime

# Column 0: parse timestamps such as b'4/2/2015 2:13:44 PM'.
converterfunc_time = lambda x: datetime.strptime(x.decode('UTF-8'), '%m/%d/%Y %I:%M:%S %p')

def converterfunc_lat(x):
    # Debug output: show the raw bytes and the decoded string the converter receives.
    print(x)
    print(x.decode('UTF-8'))
    #return float(x.decode('utf-8').split('N')[1])

def converterfunc_san(x):
    #print(x)
    return x.decode('UTF-8')

class input_file_processing():
    def __init__(self):
        self.input_data = np.genfromtxt(
            'filename', skip_header=1, dtype=None, delimiter=',',
            usecols=(0, 1, 6, 7, 8, 9, 10, 13),
            names="Date,SAN,LatDeg,LatMin,LonDeg,LonMin,Beam,EsNo",
            converters={0: converterfunc_time, 1: converterfunc_san, 6: converterfunc_lat})

INPUT

input, file, 1
4/2/2015 2:13:44 PM,DSN001000557867,03-01-01,0010155818,0,0,N33,00.546,W118,00.638,3,11,1,104,102,82,6,18,2048,4039587
4/2/2015 2:13:55 PM,DSN001000861511,03-01-02,0010416164,0,0,N33,00.883,W118,00.208,3,11,1,106,102,88,6,18,2048,2792940
4/2/2015 2:14:44 PM,DSN001000871692,03-01-04,0010408734,0,0,N33,00.876,W118,00.110,3,11,1,105,102,80,6,18,2048,312623
4/2/2015 2:14:52 PM,DSN001000864906,03-01-05,0010055143,0,0,N33,08.000,W118,03.000,3,11,1,107,99,83,6,18,2048,3056425
4/2/2015 2:15:00 PM,DSN001000838651,03-01-06,0010265541,0,0,N33,09.749,W118,00.317,3,11,1,100,110,74,6,14,2048,3737937
4/2/2015 2:15:08 PM,DSN001000609313,03-01-07,0010152885,0,0,N33,05.854,W118,04.107,3,11,1,94,95,62,6,14,2048,8221318
4/2/2015 2:15:19 PM,DSS31967278,03-01-08,0010350817,0,0,N33,04.551,W118,02.359,3,11,1,127,105,77,6,21,2048,21157710
4/2/2015 2:16:08 PM,DSN001000822728,03-01-10,0010051377,0,0,N33,00.899,W118,00.132,3,11,1,116,95,61,6,19,2048,3526254

OUTPUT

b'03-01-01'
03-01-01
b'N33'
N33
b'N33'
N33
b'N33'
N33
b'N33'
N33
b'N33'

Thanks

  • Questions like this will get more help if a test case can be copy-n-pasted. I've got it running, but it required some guess work. – hpaulj May 15 '15 at 23:04

1 Answer


I'm not entirely sure what is going on. But this script runs:

import numpy as np
from datetime import datetime

txt = b"""input, file, 1
4/2/2015 2:13:44 PM,DSN001000557867,03-01-01,0010155818,0,0,N33,00.546,W118,00.638,3,11,1,104,102,82,6,18,2048,4039587
4/2/2015 2:13:55 PM,DSN001000861511,03-01-02,0010416164,0,0,N34,00.883,W118,00.208,3,11,1,106,102,88,6,18,2048,2792940
4/2/2015 2:14:44 PM,DSN001000871692,03-01-04,0010408734,0,0,N35,00.876,W118,00.110,3,11,1,105,102,80,6,18,2048,312623
4/2/2015 2:14:52 PM,DSN001000864906,03-01-05,0010055143,0,0,N36,08.000,W118,03.000,3,11,1,107,99,83,6,18,2048,3056425
4/2/2015 2:15:00 PM,DSN001000838651,03-01-06,0010265541,0,0,N33,09.749,W118,00.317,3,11,1,100,110,74,6,14,2048,3737937
4/2/2015 2:15:08 PM,DSN001000609313,03-01-07,0010152885,0,0,N33,05.854,W118,04.107,3,11,1,94,95,62,6,14,2048,8221318
"""
txt = txt.splitlines()
#txt = txt[1:]
txt = txt[:3]
converterfunc_time = lambda x : (datetime.strptime(x.decode('UTF-8'),'%m/%d/%Y %I:%M:%S %p'))
def converterfunc_lat(x):
    print('lat ',x, x.decode('UTF-8'))
    x1 = x.decode('utf-8').split('N')
    if len(x1)>1:
        x1 = float(x1[1])
        print('float',x1)
        return x1
    else:
        print('error')
        return "error"
def converterfunc_san(x):
    #print(x)
    return x.decode('UTF-8')

data = np.genfromtxt(txt, skip_header=1,
                    dtype=None,
                    usecols=(0,1,6,7,8,9,10,13),
                    names="Date,SAN,LatDeg,LatMin,LonDeg,LonMin,Beam,EsNo",
                    delimiter=',')
print(data)
print()
input_data=np.genfromtxt(txt,
            skip_header=1,
            dtype='O,a20,f',
            usecols=(0,1,6,), #(0,1,6,7,8,9,10,13),
            names="Date,SAN,LatDeg,LatMin,LonDeg,LonMin,Beam,EsNo",
            converters={0:converterfunc_time,
                        1:converterfunc_san,
                        6:converterfunc_lat},
            delimiter=',')
print(input_data)

and produces

1552:~/mypy$ python3 stack30269235.py 
[ (b'4/2/2015 2:13:44 PM', b'DSN001000557867', b'N33', 0.546, b'W118', 0.638, 3, 104)
 (b'4/2/2015 2:13:55 PM', b'DSN001000861511', b'N34', 0.883, b'W118', 0.208, 3, 106)]

lat  b'03-01-01' 03-01-01
error
lat  b'N33' N33
float 33.0
lat  b'N34' N34
float 34.0
[(datetime.datetime(2015, 4, 2, 14, 13, 44), b'DSN001000557867', 33.0)
 (datetime.datetime(2015, 4, 2, 14, 13, 55), b'DSN001000861511', 34.0)]

I've had to fill in some pieces that were missing in your question.

I've added an explicit dtype to make sure I was getting the string and float columns.

And I modified the lat converter so it does not choke on the '03-01-01' input. ...
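
A tolerant converter along those lines could look like the sketch below. The name lat_degrees is hypothetical, and two choices here are my assumptions rather than what the script above does: it returns NaN (instead of the string "error") for input it cannot parse, and it treats an 'S' prefix as negative degrees. It also accepts either bytes or str, in case the converter is handed an already decoded string.

```python
import math

def lat_degrees(x):
    """Parse a field like b'N33' into degrees; return NaN if it does not fit."""
    # Accept bytes or str (assumption: the caller may pass either).
    text = x.decode('utf-8') if isinstance(x, bytes) else str(x)
    text = text.strip()
    # Expect a hemisphere letter followed by a number, e.g. 'N33'.
    if len(text) < 2 or text[0] not in 'NS':
        return math.nan
    try:
        degrees = float(text[1:])
    except ValueError:
        return math.nan
    # Assumption: southern latitudes are stored as negative degrees.
    return degrees if text[0] == 'N' else -degrees

print(lat_degrees(b'N33'))       # 33.0
print(lat_degrees(b'03-01-01'))  # nan
```

Returning a float in every case keeps the column numeric, so a stray test value such as '03-01-01' cannot change the inferred type.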


genfromtxt makes some sort of test run of your converters:

    # Find the value to test:
    if len(first_line):
        testing_value = first_values[i]
    else:
        testing_value = None
    converters[i].update(conv, locked=True,
                         testing_value=testing_value,
                         default=filling_values[i],
                         missing_values=missing_values[i],)
    uc_update.append((i, conv))

Looks like it is taking the first data line:

4/2/2015 2:13:44 PM,DSN001000557867,03-01-01,0010155818,0,0,N33

splitting it on the delimiter, and using the 3rd string, 03-01-01, as the test value. That is, instead of column 6 it is using the position of 6 within your usecols parameter, which is index 2. It seems to be having problems matching up the usecols, the converters keys, the names, and maybe the dtype.
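
The mismatch can be reproduced with plain Python. This is only an illustration of the indexing that the debug output suggests, not genfromtxt's actual internals; the variable names expected and observed are mine.

```python
# First data line from the question, cut off after the field of interest.
line = "4/2/2015 2:13:44 PM,DSN001000557867,03-01-01,0010155818,0,0,N33,00.546"
first_values = line.split(',')

usecols = (0, 1, 6, 7, 8, 9, 10, 13)
file_col = 6  # the converters key: a column index in the file

# What the converter should be tested on: column 6 of the file.
expected = first_values[file_col]
# What the output suggests was used: the position of 6 inside usecols (index 2).
observed = first_values[usecols.index(file_col)]

print(expected)  # N33
print(observed)  # 03-01-01
```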

The purpose of this test value is to determine the dtype for the column. This is needed in the dtype=None case. I don't know how it is used if you specify the dtype. Evidently it still runs it.

In tests where I am not skipping columns, it has no problem matching converters and test values.
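
If the index confusion is the problem, one way around it is to split every line in full and only then pick and convert columns, so a converter key always refers to a file column. The sketch below swaps in the standard library csv module in place of genfromtxt, purely to show the idea. The sample rows are the question's first two data lines shortened to 14 fields, and the converters for columns 6, 7 and 13 are my assumptions about the intended types.

```python
import csv
import io
from datetime import datetime

sample = """input, file, 1
4/2/2015 2:13:44 PM,DSN001000557867,03-01-01,0010155818,0,0,N33,00.546,W118,00.638,3,11,1,104
4/2/2015 2:13:55 PM,DSN001000861511,03-01-02,0010416164,0,0,N33,00.883,W118,00.208,3,11,1,106
"""

# Converters keyed by the column index in the file, as in the question.
converters = {
    0: lambda s: datetime.strptime(s, '%m/%d/%Y %I:%M:%S %p'),
    6: lambda s: float(s.lstrip('N')),  # assumed: 'N33' -> 33.0
    7: float,                           # assumed: latitude minutes
    13: int,                            # assumed: EsNo as an integer
}
usecols = (0, 1, 6, 7, 13)

reader = csv.reader(io.StringIO(sample))
next(reader)  # skip the header line
rows = []
for fields in reader:
    # Index the full row first; columns without a converter stay as strings.
    rows.append(tuple(converters.get(i, str)(fields[i]) for i in usecols))

for row in rows:
    print(row)
```

The resulting list of tuples can be passed to np.array with a structured dtype afterwards if an array is needed.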

hpaulj
  • You are right that genfromtxt seems to have an issue matching the "usecols" and "converters" options. The "names" option matches with "usecols". It works fine for me if I don't use the converters option and process the data afterwards. I don't know if this is a bug or if my usage is incorrect. I think for now I will continue down that path and not use the converter functions. – user4905443 May 18 '15 at 14:37
  • Another thing I noticed, which might be related to how the function runs the converter functions: when I run the above code with the input data truncated to just one line (to make it easier to read), I get this as the output `b'03-01-01' 03-01-01 b'N33' N33 b'N33' N33`, which is 3 entries when there should have been just 2 (if I include the test run). – user4905443 May 18 '15 at 15:07
  • Do your test on at least 2 lines. It handles 1-line files differently. – hpaulj May 18 '15 at 15:13