How to prepare a multilevel, multivalued training dataset in Python

Question

I am a beginner in machine learning. My academic project involves detecting human posture from accelerometer and gyroscope data, and I am stuck at the very beginning. The accelerometer and the gyroscope each give x, y, z values, stored in the files acc.csv and gyro.csv. I want to classify the postures 'standing', 'sitting', 'walking' and 'lying'. The idea is to train a supervised ML algorithm and then feed it a new acc + gyro dataset to predict what the subject is doing at that moment. I am facing the following problems:

Constructing a training dataset: I think the activities will be the dependent variable, and the acc and gyro axis readings will be the independent variables. If I want to combine them in a single matrix where each element again has its own set of acc and gyro values (something like a main matrix and sub-matrices), how can I do that? Or is there an alternative way to do the same?
How can I put the data of multiple activities, each with multiple readings, into a single training matrix? I mean 10 walking recordings, each with its own acc (xyz) and gyro (xyz), plus 10 standing recordings, each with its own acc (xyz) and gyro (xyz), plus 10 sitting recordings, each with its own acc (xyz) and gyro (xyz), and so on.
Each data file has a different number of records and different timestamps; how do I bring them onto a common footing? I know I am asking very basic things, but these are the confusing parts that nobody has clearly explained to me. I feel like I am standing in front of a big closed door: very interesting things are happening inside, but I cannot take part at the moment with my limited knowledge. My mathematical background is high school level only. Please help.

I have gone through some activity recognition projects on GitHub, but they are far too complicated for a beginner like me.

import os
import warnings

import pandas as pd
from sklearn.utils import shuffle

warnings.filterwarnings('ignore')
os.listdir('../input/testtraindata/')

base_train_dir = '../input/testtraindata/Train_Set/'

# Train data: one row per sensor reading, with the activity label first
train_data = pd.DataFrame(columns=['activity', 'ax', 'ay', 'az', 'gx', 'gy', 'gz'])
train_folders = os.listdir(base_train_dir)

# Read every CSV file in every folder and append its rows
for tf in train_folders:
    files = os.listdir(base_train_dir + tf)
    for f in files:
        df = pd.read_csv(base_train_dir + tf + '/' + f)
        train_data = pd.concat([train_data, df], axis=0)

# Shuffle the rows and renumber the index
train_data = shuffle(train_data)
train_data.reset_index(drop=True, inplace=True)
train_data.head()

[Image: The Data Set]

[Image: Problem in Train_set]

Surprisingly, if I remove the last 'gz' from

train_data = pd.DataFrame(columns=['activity', 'ax', 'ay', 'az', 'gx', 'gy', 'gz'])

everything works fine.

score 1 · Answer 1 · answered May 22 '19 at 10:59

Do you have the data labeled, i.e. does each set of x, y, z readings map to a posture?

I cannot say much about the values (I have not seen the dataset and know little about positions, accelerometers or gyroscopes), but I'm guessing you should have the dataset in a matrix with x, y, z as feature columns and a target column, "position".

If you need all six values (three from one CSV and three from the other) to define a position, you can use six feature columns plus the position column.

Something like: x_1, y_1, z_1, x_2, y_2, z_2 plus a position label (the "position" column).
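A minimal sketch of that layout, assuming the two sensor files of one recording hold the same number of rows in the same order. The small frames below are made-up stand-ins for acc.csv and gyro.csv, and their column names are illustrative only:

```python
import pandas as pd

# Made-up stand-ins for one recording's acc.csv and gyro.csv
# (assumption: same number of rows, in the same order).
acc = pd.DataFrame({"x": [0.1, 0.2, 0.3], "y": [9.8, 9.7, 9.8], "z": [0.0, 0.1, 0.0]})
gyro = pd.DataFrame({"x": [1.0, 1.1, 0.9], "y": [0.5, 0.4, 0.5], "z": [0.2, 0.2, 0.3]})

# Rename so the two sensors do not share column names.
acc = acc.rename(columns={"x": "x_1", "y": "y_1", "z": "z_1"})
gyro = gyro.rename(columns={"x": "x_2", "y": "y_2", "z": "z_2"})

# Put the six readings side by side and attach the label of this recording.
combined = pd.concat([acc, gyro], axis=1)
combined["position"] = "sitting"

print(combined)
```

Frames built this way for the other recordings can then be stacked underneath each other with pd.concat to form one training table.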

You can also give each position its own column ("sitting", "walking", etc.), with 1 for true and 0 for false as the values.
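As a rough illustration of the 0/1 columns, here is a sketch using pandas' get_dummies on a made-up label column. The labels are invented for the example:

```python
import pandas as pd

# Made-up label column with one posture name per row.
labels = pd.DataFrame({"position": ["sitting", "walking", "standing", "sitting"]})

# One column per posture; 1 marks the posture of that row, 0 everything else.
one_hot = pd.get_dummies(labels["position"], dtype=int)

print(one_hot)
```

Many scikit-learn classifiers accept a single column of label strings directly, so this step may be unnecessary depending on the algorithm you choose.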

Does the timestamp matter for the position? If it is not an important feature, I would just drop it. If it matters in some way, you might want to bin it.
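Both options can be sketched as follows. The column name "timestamp", the values and the one-second bin edges are all assumptions made for the example:

```python
import pandas as pd

# Made-up readings with a timestamp in seconds.
df = pd.DataFrame({
    "timestamp": [0.00, 0.45, 1.10, 1.60, 2.30],
    "ax": [0.1, 0.2, 0.1, 0.3, 0.2],
})

# Option 1: the timestamp carries no information, so drop the column.
without_time = df.drop(columns=["timestamp"])

# Option 2: group the timestamps into one-second bins [0, 1), [1, 2), [2, 3).
binned = df.copy()
binned["time_bin"] = pd.cut(
    binned["timestamp"], bins=[0, 1, 2, 3], right=False, labels=False
)

print(without_time)
print(binned)
```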

Here is a beginner's guide from Medium that shows a bit of how to preprocess your data. It also covers one-hot encoding :)

https://medium.com/hugo-ferreiras-blog/dealing-with-categorical-features-in-machine-learning-1bb70f07262d

Also try googling "preprocessing your data"; you will probably find the right recipe.

answered May 22 '19 at 10:59

Muriel

Thanks for the elaborate reply. I have labelled the data by posture: sit1 = xyz of acc + xyz of gyro, sit2 = xyz of acc + xyz of gyro, ..., sit10 = xyz of acc + xyz of gyro, in the folder 'sitting'. Similarly for the other activities. – Bukaida May 22 '19 at 15:28
Timestamp is not important – Bukaida May 22 '19 at 15:29
Then I think the second approach might be best: set it up as one CSV with x_1 ... and a separate category for each posture (sit, etc.)? – Muriel May 22 '19 at 15:35
Ok, doing it and posting some datasets here so that you can have a look. – Bukaida May 22 '19 at 15:38
Actually, I got as far as the combination step: a CSV file sit1.csv is formed from 100 records each of acc.json and gyro.json, with the fields sitting, Ax, Ay, Az, Gx, Gy, Gz. Now I have narrowed down the problem: for each y-train value (sitting) I have 100 rows of x-train values. How do I feed that to x-train? All the examples I have seen use one row of x-train for one row of y-train (a one-to-one relation), but in my case there are 100 rows of x-train for one row of y-train (a many-to-one relation). I think a 2-D list could be useful, but I have no idea how to implement it. – Bukaida May 23 '19 at 06:58
Can you make the matrix and post it here? – Muriel May 23 '19 at 07:04
Here is the link to download the file. It is in the format sitting Ax Ay Az Gx Gy Gz https://drive.google.com/file/d/1AaywS79GT8psL8-Sb2eljo2cGO4QuYqw/view?usp=sharing – Bukaida May 23 '19 at 08:10
Post the matrix itself instead if you need more feedback: the output of df.head(). – Muriel May 24 '19 at 07:13
I have tried to construct the matrix but am getting a NaN error for one column. I am sharing the screenshot, code and dataset. – Bukaida May 25 '19 at 04:07
Have you removed the NaN values from the dataset (fillna)? – Muriel Jun 03 '19 at 12:20
Yes, there was an error in input data format. Now it is working. – Bukaida Jun 03 '19 at 16:56
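
For the many-to-one situation raised in the comments (100 rows of readings for one label), two common ways to shape the data are sketched below. This is not something the thread settled on; the "recording" column, the values and the choice of mean/std as summary statistics are all assumptions, and only two of the six sensor columns are shown to keep the example short:

```python
import pandas as pd

# Made-up data: two recordings of three rows each (real files have about 100 rows).
rows = pd.DataFrame({
    "recording": ["sit1", "sit1", "sit1", "walk1", "walk1", "walk1"],
    "activity": ["sitting"] * 3 + ["walking"] * 3,
    "Ax": [0.0, 0.1, 0.2, 1.0, 2.0, 3.0],
    "Gx": [0.5, 0.5, 0.5, 4.0, 6.0, 8.0],
})

# Option A: treat every row as one sample; the label is repeated on each row,
# which turns the many-to-one relation into a one-to-one relation.
X_rows = rows[["Ax", "Gx"]]
y_rows = rows["activity"]

# Option B: collapse each recording into one row of summary statistics.
features = rows.groupby("recording")[["Ax", "Gx"]].agg(["mean", "std"])
features.columns = ["_".join(col) for col in features.columns]  # e.g. "Ax_mean"
y_recordings = rows.groupby("recording")["activity"].first()

print(features)
print(y_recordings)
```

Option A keeps all the data but ignores how readings evolve within a recording; Option B gives one sample per recording, which is closer to the "one label per file" picture described above.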
