How can I format a file to a multidimensional numpy array for my AI

Question

I have a training set called data.txt stored for my AI. The AI should get 5 X inputs each run and one solution/answer, which goes into the array y. The array X should look like the following: [[x1,x2,x3,x4,x5],[x1,x2....x5],....] I tested it with 2 * 5 inputs and the following came out:

    [2.21600000e+05 2.02000000e+03 2.43738600e+06 1.09990343e+01
 9.11552347e-01 2.21600000e+05 2.02000000e+03 2.43738600e+06
 1.09990343e+01 9.11552347e-01 1.00000000e+01 1.00000000e+00
 5.72000000e+02 5.72000000e+01 1.00000000e+01]

What I want is the following:

[[221600,2020,2437386,10.999034296028881,0.9115523465703971],
 [10,1,572,57.2,10.0]]

The answer array y is fine. It is: [0.,0.]

The code:

import numpy as np
X=np.array([])
y=np.array([])
lineX=np.array([])
i=0
linenumber=0
with open('data.txt') as file:
    for line in file:
        dataline=line.rstrip()
        dataline=float(dataline)
        i+=1
        linenumber+=1

        if i != 6:
            lineX=np.append(lineX,dataline)
        else:
            X=np.append(X,lineX,axis=0)
            i=0
            y=np.append(y,dataline)
print(X)
print(y)

And the file (the original has about 800 lines, so I shortened it):

221600
2020
2437386
10.999034296028881
0.9115523465703971
0
10
1
572
57.2
10.0
0

The first five lines in the file are the inputs x1-x5 and the sixth line is y (the answers) and so on.

How can I get it working?

Eumel · Accepted Answer · 2022-04-19T12:44:02.123

0

We will need two steps for this:

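import numpy as np

data = []
with open('data.txt') as file:
    for line in file:
        # Each line holds one number: strip the newline and convert to float.
        dataline = line.rstrip()
        dataline = float(dataline)
        data.append(dataline)
# Flat 1-D array with all values in file order.
data = np.array(data)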

First we put everything in a numpy array. There are probably more efficient ways to read the file, e.g. having pandas read it as a CSV, but for 800 values that shouldn't matter.
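As a minimal sketch of one such alternative (not part of the original answer), np.loadtxt parses a file with one number per line straight into a 1-D float array. The sample values are the ones from the question, and an in-memory io.StringIO stands in for data.txt so that the snippet runs on its own.

```python
import io
import numpy as np

# Stand-in for data.txt (values taken from the question); with the real
# file you would call np.loadtxt('data.txt') instead.
sample = io.StringIO(
    "221600\n2020\n2437386\n10.999034296028881\n0.9115523465703971\n0\n"
    "10\n1\n572\n57.2\n10.0\n0\n"
)

# loadtxt converts every line to float and returns a flat 1-D array.
data = np.loadtxt(sample)
print(data.shape)  # (12,)
```

The result has the same shape and dtype as the array built by the loop, so the reshape step works on it unchanged.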

data = data.reshape(-1,6)
X = data[:,0:5]
y = data[:,5]

In the second step we reshape the flat array so that each row is one full sample of six values, then split it: columns 0-4 are your X values and column 5 is your y value.
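Applied to the twelve sample values from the question, the reshape and split can be sketched as follows. The divisibility check is an addition for illustration only: reshape(-1, 6) raises a ValueError when the number of values is not a multiple of 6, for instance if the file ends with an incomplete sample.

```python
import numpy as np

# The 12 sample values from the question, already read into a flat array.
data = np.array([221600, 2020, 2437386, 10.999034296028881,
                 0.9115523465703971, 0,
                 10, 1, 572, 57.2, 10.0, 0], dtype=float)

# Guard (illustrative addition): each sample needs exactly 6 values.
if data.size % 6 != 0:
    raise ValueError(f"expected a multiple of 6 values, got {data.size}")

data = data.reshape(-1, 6)  # one row per sample
X = data[:, 0:5]            # columns 0-4: inputs x1..x5
y = data[:, 5]              # column 5: the answer

print(X.shape)  # (2, 5)
print(y)        # [0. 0.]
```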

EDIT, Tangent on float values:

Integers are well defined in binary, e.g. 1101 is 13. Floats are harder: with a fixed number of bits you have to make a tradeoff between precision (how many significant digits you can store) and range (the smallest and largest representable values), so that ordinary calculations don't constantly overflow. A float therefore has a fixed number of bits for its significant digits and another fixed number for its exponent. You can read up on it here.

This number in memory is always the same. What you are observing is its representation as a string when you print it. When the values in an array span a wide range, NumPy prints them in scientific notation, which is the same as format(x,'1.8e') for floats: 2.21600000e+05 means 2.216 × 10^5, i.e. 221600. If you want to print a number differently, use a format string, for example format(x,'1.1f') gives you the full number with a single decimal place.
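A short sketch of that point, using the first sample from the question: both format strings describe the same stored float, and np.printoptions (one way to change how arrays are printed, chosen here for illustration) switches the array output to fixed-point notation.

```python
import numpy as np

x = np.array([221600, 2020, 2437386, 10.999034296028881, 0.9115523465703971])

# Same stored value, two different string representations.
sci = format(x[0], '1.8e')    # '2.21600000e+05'
fixed = format(x[0], '1.1f')  # '221600.0'
print(sci, fixed)

# Both strings parse back to the identical float.
assert float(sci) == float(fixed) == x[0]

# Print the whole array in fixed-point notation: suppress=True turns off
# scientific notation for values of this size, precision limits the decimals.
with np.printoptions(suppress=True, precision=3):
    text = str(x)
print(text)
```

Only the printed text changes; the values held in X and y, and therefore what the AI receives, stay the same.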

edited Apr 19 '22 at 12:44

answered Apr 19 '22 at 07:34

Eumel


note that in my case X is spelled uppercase – france1 Apr 19 '22 at 09:09
I don't understand why float makes numbers like 4.45500000e+03, are you sure it's working? It works without float but I need float numbers. – france1 Apr 19 '22 at 09:24
@france1 those are floats, it's just a different output representation. There are some numpy print options you can set to force one kind or another, but in memory this is a float. – Eumel Apr 19 '22 at 09:33
They are too large. The code is running into buffer overflows. Something could be wrong with float(). If I run the code without the `dataline=float(dataline)` it works how I want except that the numbers are strings, so unusable for an AI. Isn't 4.45500000e+03 4.45500000^3? Why are those zeroes here? – france1 Apr 19 '22 at 09:52
I have to shorten those numbers. I'd like to have a maximum of 3 decimal places so I did following: dataline=round(dataline,3) Still huge numbers, it doesn't work. `RuntimeWarning: overflow encountered in double_scalars error=(y_ -s)**2` Once I fixed my problem with float() I can mark this question as the solution. – france1 Apr 19 '22 at 09:58
I don't understand it. 4.45500000e+03 should be 4455.0 so why is it doing it so complicated. – france1 Apr 19 '22 at 10:08
There are cases where s is nan or inf. s is the weighted x array. – france1 Apr 19 '22 at 10:14
@france1 I added an explanation on floats. NaNs and infs in your data are a completely different problem, on which you can find some info here: https://stackoverflow.com/questions/11620914/removing-nan-values-from-an-array – Eumel Apr 19 '22 at 12:47
It turns out that the numbers are too large. There's a problem in the AI which - I tested - should work. I think the overflow causes the inf, -inf and endless nans afterwards. I can round the numbers but still have huge numbers. I rounded up to 3 decimal places and the numbers are crazy big in the X array but not in the data.txt. What might it be? I rounded -3 and the AI still gets this: `S is: 0.0 S is: -8.776351721574362e+301 S is: inf S is: nan S is: nan` But no overflows – france1 Apr 21 '22 at 08:08
Oh the error message is on the beginning of the output, that's why I didn't see it. – france1 Apr 21 '22 at 08:29
