1

Can you please help me?

I have two pandas dataframes and I need to extract every tenth value from the specific column from the first dataframe in a list. Then I need to create a new column in a second dataframe and put these values in this column. I also need my number of a row to match in both dataframes. Here is the code:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

# we will use random forest classifier as our classifier
forest_classifier=RandomForestClassifier(max_depth=4)

# reading in accelerometer data
train_time = pd.read_csv("https://courses.edx.org/assets/courseware/v1/b98039c3648763aae4f153a6ed32f38b/asset-v1:HarvardX+PH526x+3T2022+type@asset+block/train_time_series.csv", index_col=0)
train_labels = pd.read_csv("https://courses.edx.org/assets/courseware/v1/d64e74647423e525bbeb13f2884e9cfa/asset-v1:HarvardX+PH526x+3T2022+type@asset+block/train_labels.csv", index_col=0)

x = []
y = []
z = []

# making lists out of the x, y, z columns
for i in range(3, len(train_time), 10):
    x.append(train_time.iloc[i]["x"])
    y.append(train_time.iloc[i]["y"])
    z.append(train_time.iloc[i]["z"])
    
print(z)  # checking the list

# making new columns in train_labels file with the obtained lists
train_labels["x"] = pd.Series([x])
train_labels["y"] = pd.Series([y])
train_labels["z"] = pd.Series([z])

train_labels.head()

But I get the output, where the created columns have just the values of "NaN"

The output should be the dataframe with the created x, y, z columns, which have corresponding numbers of observations.

enter image description here

1 Answers1

2

You can stack x, y and z and use:

train_labels[['x', 'y', 'z']] = np.stack([x, y, z], axis=1)

Output:

>>> train_labels
           timestamp                 UTC time  label         x         y         z
20589  1565109931087  2019-08-06T16:45:31.087      1 -0.053802 -0.987701  0.068985
20599  1565109932090  2019-08-06T16:45:32.090      1  0.013718 -0.852371 -0.000870
20609  1565109933092  2019-08-06T16:45:33.092      1  0.145584 -1.007843 -0.036819
20619  1565109934094  2019-08-06T16:45:34.094      1 -0.099380 -1.209686  0.304489
20629  1565109935097  2019-08-06T16:45:35.097      1  0.082794 -1.001434 -0.025375
...              ...                      ...    ...       ...       ...       ...
24289  1565110302030  2019-08-06T16:51:42.030      4 -0.641953 -1.469177  0.301041
24299  1565110303032  2019-08-06T16:51:43.032      4 -0.171616 -0.366074 -0.059082
24309  1565110304034  2019-08-06T16:51:44.034      4  0.401810 -1.077698  0.258911
24319  1565110305037  2019-08-06T16:51:45.037      4  0.330338 -1.470062  0.303894
24329  1565110306039  2019-08-06T16:51:46.039      4  0.689346 -0.991043  0.034973

[375 rows x 6 columns]

Note 1: your code doesn't work because you create a pd.Series without using indexes from train_labels so the indexes are misaligned:

train_labels['x'] = pd.Series(x, index=train_labels.index)
train_labels['y'] = pd.Series(y, index=train_labels.index)
train_labels['z'] = pd.Series(z, index=train_labels.index)

Note 2: you can also avoid the loop:

x, y, z = train_time.iloc[3::10][['x', 'y', 'z']].T.values

So the code could be:

x, y, z = train_time.iloc[3::10][['x', 'y', 'z']].T.values
train_labels[['x', 'y', 'z']] = np.stack([x, y, z], axis=1)

# Or (if you needn't x, y, and z)
train_labels[['x', 'y', 'z']] = train_time.iloc[3::10][['x', 'y', 'z']
Corralien
  • 109,409
  • 8
  • 28
  • 52