I have a data set with three columns: the first two columns are features and the third column contains the class. There are 4 classes. Part of the data is shown below.

[screenshot of part of the data set]

The data set is big, say 100,000 rows and 3 columns (two feature columns and one class column), so I apply a moving window of length 50 to it before training my deep learning model. So far I have tried two different ways of slicing the data set, with no good results, and I am pretty sure the data itself is good. First, I applied a moving window to the entire data set, which gave 2000 samples of 50 rows and 2 columns each, i.e. shape (2000, 50, 2). Because some windows contain mixed classes, I kept only the windows in which every row has the same class, and took the average of the class column to assign each window a single class. I have not gotten good results with this. Here is my code:

import itertools as it
import numpy as np


def moving_window(data_, length, step=1):
    # tee() makes `length` copies of the row iterator; copy i starts at row i
    # and strides by step * length, so zipping them yields consecutive,
    # non-overlapping windows of `length` rows when step=1.
    streams = it.tee(data_, length)
    return zip(*[it.islice(stream, i, None, step * length) for stream, i in zip(streams, it.count(step=step))])


# data_ holds the full data set, one row per sample: [feature_1, feature_2, class]
X, Y = [], []
data = np.asarray(list(moving_window(data_, 50)))  # shape: (n_windows, 50, 3)
# print(len(data))
for i in data:
    # Keep the window only if every row has the same class as its first row.
    if np.all(i[:, 2] == i[0, 2]):
        X.append(i[:, 0:2])
        # All 50 labels are identical here, so their mean is just that label.
        Y.append(sum(i[:, 2]) / len(i[:, 2]))
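
For non-overlapping windows, the same slicing can also be written as a plain NumPy reshape. The following is only a sketch under assumptions: `make_windows` is a hypothetical helper that is not part of the original code, and the toy array stands in for the real data set. It drops trailing rows that do not fill a whole window and keeps only the windows that contain a single class.

```python
import numpy as np


def make_windows(data, length=50):
    """Hypothetical helper: cut an (n_rows, 3) array into non-overlapping
    windows and keep only those whose rows all share one class."""
    n_windows = len(data) // length
    # Drop the incomplete tail, then view the rest as (n_windows, length, 3).
    windows = data[: n_windows * length].reshape(n_windows, length, data.shape[1])
    labels = windows[:, :, 2]
    # A window is "pure" if every label equals the first label of that window.
    pure = np.all(labels == labels[:, :1], axis=1)
    X = windows[pure, :, 0:2]
    y = labels[pure, 0].astype(int)
    return X, y


# Toy stand-in for the data set: 120 rows of class 1, then 130 rows of class 2.
rng = np.random.default_rng(0)
toy = np.column_stack([rng.normal(size=(250, 2)), np.repeat([1, 2], [120, 130])])

X, y = make_windows(toy, length=50)
print(X.shape, y)
```

In the toy data, the third window straddles the class boundary and is discarded, so four of the five windows survive.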

I also tried another way: I collected the features belonging to each class into separate lists (4 lists, since I have 4 classes), then applied the moving window to each list separately and assigned the resulting windows to that class. No good results either. Here is my code:

import itertools as it
import numpy as np
import pandas as pd

# One-hot encode the class indices. Note that range(5) yields 5-wide
# one-hot rows; only rows 0 to 3 are used below.
labels = list(range(5))
yy = pd.get_dummies(labels).values.astype(np.float32)


def moving_window(x, length, step=1):
    streams = it.tee(x, length)
    return zip(*[it.islice(stream, i, None, step * length) for stream, i in zip(streams, it.count(step=step))])


# x1..x4 hold the feature rows of the four classes. X and Y collect one
# array of windows and one list of one-hot labels per class.
X, Y = [], []
for k, x_k in enumerate([x1, x2, x3, x4]):
    windows = np.asarray(list(moving_window(x_k, 50)))  # shape: (n_windows, 50, 2)
    X.append(windows)
    Y.append([yy[k]] * len(windows))
    # print(windows.shape)
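
After the per-class step, `X` and `Y` are lists with one entry per class, so they still have to be joined into single arrays before training. A minimal sketch of that step, assuming four per-class window arrays of shape `(n_k, 50, 2)`. The toy arrays, the `np.eye` one-hot rows and the shuffle are illustrative assumptions, not the original code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes = 4

# Toy stand-ins for the per-class window arrays (3, 5, 2 and 4 windows).
per_class = [rng.normal(size=(n, 50, 2)) for n in (3, 5, 2, 4)]

# Row k of the identity matrix is the one-hot vector for class index k.
one_hot = np.eye(n_classes, dtype=np.float32)

# Stack all windows into one array and repeat each class's one-hot row
# once per window of that class.
X = np.concatenate(per_class, axis=0)
Y = np.concatenate(
    [np.tile(one_hot[k], (len(w), 1)) for k, w in enumerate(per_class)], axis=0
)

# Shuffle samples and labels with the same permutation so they stay aligned.
perm = rng.permutation(len(X))
X, Y = X[perm], Y[perm]

print(X.shape, Y.shape)
```

Without the shuffle, all windows of one class would sit next to each other, which can matter if batches or validation splits are taken in order.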

The architecture of the model I am trying to train works perfectly with other data sets, so I think I am missing something in how I process the data. What am I missing in the way I process the data before training? Is there any other way? All the work is done in Python.

Gambit1614
dm5
  • What is your question about? We know nothing about the data or the ML model you are using, so how do you expect help on how to format the data? To get an answer here you need to supply raw data and a sample expected output. As it stands, this question is extremely unclear – DJK Aug 21 '17 at 19:39
  • I want to use the data to train a CNN model. The expected data format is (?, ?, 2), with each sample belonging to a single class. Say we have 50,000 data points belonging to class 1 and we use a moving window of 50: the result has shape (1000, 50, 2) with class/label 1 (0100, as I will use one-hot encoding), and the same goes for the rest of the classes. My data set is a CSV in the format shown in the screenshot above, which shows only part of it, as each class contains more than 40,000 rows. Sorry for not being clear, I hope I am clear now @djk47463 – dm5 Aug 22 '17 at 12:45
  • When you say moving window, it's very confusing, because that is tied to moving statistics, when instead I think you just mean the second dimension, correct? – DJK Aug 22 '17 at 13:35
  • By moving window I mean that I slice the data set into 50 rows, then move on to the next block of 50 rows. It is like a sliding window whose step size equals the window size @djk47463 – dm5 Aug 22 '17 at 17:11
  • I would answer this for you, but I think you need to learn CNNs better. You are combining random sets of features and saying all those features belong together, but they may not, even though the features are of the same class. CNNs work for images because a single image has multiple features that belong together. If you think in terms of pictures, you're essentially taking random parts of pictures and putting them together. That won't work right. – DJK Aug 22 '17 at 18:07
  • Also, 2000 samples is tiny for such a deep network. These networks are usually trained on hundreds of thousands of samples. Also, averaging the classes is not a good approach. The average of 1 and 4 is 2.5, which rounds to 3. How does a combination of 1 and 4 belong in class 3, or in class 2 if you round down? That approach does not make sense. It seems you would be better off with a standard fully dense network with 2 inputs – DJK Aug 22 '17 at 18:09
  • Actually it worked with both of my approaches; I just needed to normalize my features @djk47463 – dm5 Aug 24 '17 at 19:05
  • One point to make clear is why I calculated the mean; it makes sense if you follow my code. After applying a moving window of 50, I ended up with several windows of shape (50, 3). I kept ONLY the windows with a single common class (see the first block of code in my post). I wanted each window to have a single class label, but I had 50 identical values. Taking the mean of these 50 identical values gives one value, equal to each of them. Instead of this you can also slice the class column and take a single value, i.e. i[0,2]. So each window ends up with only one class. @djk47463 – dm5 Aug 25 '17 at 11:29
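
The point about averaging discussed in the comments can be checked on two toy windows. This is an illustrative sketch only; the label values are made up and are not taken from the data set:

```python
import numpy as np

# A window whose 50 labels are all class 3, and one that mixes classes 1 and 4.
pure = np.full(50, 3.0)
mixed = np.concatenate([np.full(25, 1.0), np.full(25, 4.0)])

for name, window_labels in (("pure", pure), ("mixed", mixed)):
    # Same purity test as in the question: compare every label to the first.
    is_pure = bool(np.all(window_labels == window_labels[0]))
    print(name, is_pure, window_labels.mean())
```

For the pure window the mean equals the shared label, so averaging is harmless once mixed windows have been filtered out. For the mixed window the mean is 2.5, which is not a class at all.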

1 Answer

I finally managed to train my CNN model and achieved good training, validation and testing accuracy. The only thing I added was normalization of my input data, with the following lines:

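from sklearn import preprocessing
# Fit on the raw features x, then rescale each feature column to [0, 1].
minmax_scale = preprocessing.MinMaxScaler().fit(x)
X = minmax_scale.transform(x)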

The rest remains the same.
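
For reference, min-max scaling to [0, 1] maps each feature column to (x - min) / (max - min), which is what `MinMaxScaler` computes with its default settings. The sketch below reproduces that in plain NumPy on toy data so the effect is visible. The toy array, and the idea of scaling the 2-D feature matrix before windowing, are assumptions, since the answer does not say at which step the scaler was applied:

```python
import numpy as np

# Toy features on very different scales (made-up values).
x = np.array([
    [0.0, 100.0],
    [5.0, 300.0],
    [10.0, 200.0],
    [2.5, 150.0],
])

# Per-column minimum and maximum, as a scaler would learn them during fit.
col_min = x.min(axis=0)
col_max = x.max(axis=0)

# Rescale each column to [0, 1]; this sketch assumes no column is constant,
# otherwise the denominator would be zero.
x_scaled = (x - col_min) / (col_max - col_min)

print(x_scaled)
```

After scaling, both features span the same range, so neither dominates the other purely because of its units.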

dm5