
I am trying to use NumPy's genfromtxt to read CSV files of bond lengths and energies into arrays. I will use them with scipy.interpolate to generate a potential energy surface and reaction path, so I need every value.

The problem is that genfromtxt reads the first value of every CSV file as NaN. How do I fix this?

As an example, I have the following data in dcm_oh_lengths.csv:

1.0763,1.1263,1.1763,1.2263,1.2763,1.3263,1.3763,1.4263,1.4763,1.5263,1.5763

I read it with the following call, where solv is 'dcm':

oh_all = np.genfromtxt(solv+'_oh_lengths.csv',dtype=float,delimiter=',')

And oh_all returns

array([   nan, 1.1263, 1.1763, 1.2263, 1.2763, 1.3263, 1.3763, 1.4263,
       1.4763, 1.5263, 1.5763])

So the first datapoint is read as missing. If I change the data to

,1.0763,1.1263,1.1763,1.2263,1.2763,1.3263,1.3763,1.4263,1.4763,1.5263,1.5763

Then doing the same thing returns

array([   nan, 1.0763, 1.1263, 1.1763, 1.2263, 1.2763, 1.3263, 1.3763,
       1.4263, 1.4763, 1.5263, 1.5763])

As a longer example, the first few lines of the energies file (dcm_energies.csv) are:

-7162979.201,-7163010.482,-7163033.634,-7163043.279,-7163060.113,-7163068.894,-7163076.255,-7163078.541,-7163080.908,-7163056.179,-7163081.743
-7163005.74,-7163031.808,-7163050.794,-7163056.603,-7163064.619,-7163070.65,-7163080.606,-7163080.682,-7163081.125,-7163052.444,-7163078.824
-7163024.746,-7163046.199,-7163061.278,-7163063.603,-7163068.336,-7163071.692,-7163079.11,-7163077.25,-7163075.861,-7163043.325,-7163070.561 (...)

And calling it through genfromtxt as above gives:

array([[         nan, -7163010.482, -7163033.634, -7163043.279,
        -7163060.113, -7163068.894, -7163076.255, -7163078.541,
        -7163080.908, -7163056.179, -7163081.743],
       [-7163005.74 , -7163031.808, -7163050.794, -7163056.603,
        -7163064.619, -7163070.65 , -7163080.606, -7163080.682,
        -7163081.125, -7163052.444, -7163078.824],
       [-7163024.746, -7163046.199, -7163061.278, -7163063.603,
        -7163068.336, -7163071.692, -7163079.11 , -7163077.25 ,
        -7163075.861, -7163043.325, -7163070.561], (...)
Clare Birch
  • Please post a code example and some sample data. Otherwise, it's anyone's guess. – Colin Basnett May 09 '19 at 18:55
  • You are probably going to need to provide some form of example data and code to reproduce this behavior, or else all that can be offered is speculation. You can probably just share the first 10 lines or so in the question, if it is just a csv. – juanpa.arrivillaga May 09 '19 at 18:58
  • The default dtype is float. genfromtxt returns nan when the column string is not a valid float. Use `usecols` to skip the problem columns, or use a custom dtype, names, and/or header. Read the docs carefully. – hpaulj May 09 '19 at 19:22

2 Answers


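My guess is that the file begins with a "byte order mark" (BOM). Those invisible leading bytes end up attached to the first field, so it no longer parses as a float and comes back as nan. How was the file created?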

Try opening the file yourself with the utf-8-sig encoding, which skips a leading BOM if one is present:

with open('dcm_oh_lengths.csv', 'r', encoding='utf-8-sig') as f: 
    oh_all = np.genfromtxt(f, dtype=float, delimiter=',')
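To check the BOM guess before changing anything, a minimal sketch like the one below could be used. It assumes the file is UTF-8; the helper name `starts_with_utf8_bom` and the demo file `demo_lengths.csv` are made up for illustration and are not part of the question. It writes a small file that deliberately starts with a BOM, confirms the BOM is there, and then reads the file with the utf-8-sig approach shown above.

```python
import codecs
import os
import tempfile

import numpy as np


def starts_with_utf8_bom(path):
    """Return True if the file at `path` begins with the UTF-8 byte order mark."""
    with open(path, 'rb') as f:
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8


# Hypothetical demo file: three bond lengths preceded by a UTF-8 BOM.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_lengths.csv')
with open(demo_path, 'wb') as f:
    f.write(codecs.BOM_UTF8 + b'1.0763,1.1263,1.1763\n')

has_bom = starts_with_utf8_bom(demo_path)

# Opening in text mode with utf-8-sig drops the BOM before genfromtxt sees it.
with open(demo_path, 'r', encoding='utf-8-sig') as f:
    lengths = np.genfromtxt(f, dtype=float, delimiter=',')

print(has_bom)
print(lengths)
```

In recent NumPy versions genfromtxt also takes an `encoding` argument, so passing `encoding='utf-8-sig'` together with the file name may work as a one-line alternative; that depends on the installed version, so treat it as something to try rather than a guarantee.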
Warren Weckesser

As Warren has pointed out, it is a BOM issue.

A possibly easier solution I found online is to open your CSV file in Notepad++. The encoding indicator at the bottom right shows whether the file is UTF-8 with a BOM.

If it is, click Encoding, select UTF-8, and save the file. This removes the BOM, so no further code is needed.
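If many files are affected, or Notepad++ is not available, the same one-time conversion could be scripted instead. This is only a sketch under the assumption that the files are UTF-8; `strip_utf8_bom` is a hypothetical helper name, and it overwrites the file in place, so keep a backup.

```python
import codecs


def strip_utf8_bom(path):
    """Rewrite `path` without a leading UTF-8 BOM.

    Returns True if a BOM was found and removed, False if the file was left as-is.
    """
    with open(path, 'rb') as f:
        raw = f.read()
    if not raw.startswith(codecs.BOM_UTF8):
        return False
    # Write back everything after the BOM bytes, leaving the data untouched.
    with open(path, 'wb') as f:
        f.write(raw[len(codecs.BOM_UTF8):])
    return True
```

After running it once on a file such as dcm_oh_lengths.csv, the original genfromtxt call should be able to parse the first value without any change.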

SeanK