Divide dataframe into two sets according to a column

Question

I have a DataFrame df. I selected some of its columns and want to split them into xtrain and xtest according to a column called Service, so that rows with 1 or 0 go into xtrain and rows with NaN go into xtest.

Service
1
0
0
1
NaN
NaN

xtrain = df.loc[df['Service'].notnull(), ['Age','Fare', 'GSize','Deck','Class', 'Profession_title' ]]

EDITED

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    ytrain = df['Service'].dropna()
    xtest = df.loc[df['Service'].isnull(), ['Age','Fare','GSize','Deck','Class','Profession_title']]
    logistic = LogisticRegression()
    logistic.fit(xtrain, ytrain)
    logistic.predict(xtest)

I get this error from logistic.predict(xtest):

X has 220 features per sample; expecting 307

jezrael · Accepted Answer · 2017-01-01T13:56:32.460


I think you need isnull:

xtest = df.loc[df['Service'].isnull(), ['Age','Fare','GSize','Deck','Class','Profession_title']]

Another solution is to invert the boolean mask with ~:

mask = df['Service'].notnull()
xtrain = df.loc[mask, ['Age','Fare', 'GSize','Deck','Class', 'Profession_title' ]]
xtest = df.loc[~mask, ['Age','Fare', 'GSize','Deck','Class', 'Profession_title' ]]
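
As a quick sanity check of the mask approach, the sketch below uses a small made-up frame (the Age and Fare values are invented for illustration, not taken from the question) to show that mask and ~mask split the rows into two disjoint parts. It also takes ytrain with the same mask, which is one way to make sure the labels line up with xtrain row for row.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, mirroring the Service column from the question.
df = pd.DataFrame({'Service': [1, 0, 0, 1, np.nan, np.nan],
                   'Age': [22, 38, 26, 35, 54, 2],
                   'Fare': [7.25, 71.28, 7.92, 53.1, 51.86, 21.07]})

features = ['Age', 'Fare']
mask = df['Service'].notnull()

xtrain = df.loc[mask, features]
ytrain = df.loc[mask, 'Service']   # same rows as xtrain, so the indexes match
xtest = df.loc[~mask, features]

# Every row lands in exactly one of the two parts.
print(len(xtrain), len(xtest), len(df))
print(xtrain.index.equals(ytrain.index))
```

Selecting the labels with df.loc[mask, 'Service'] gives the same rows as df['Service'].dropna(); it just makes the shared mask explicit.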

EDIT:

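import numpy as np
import pandas as pd

# small sample frame: two labelled rows and two rows with missing Service
df = pd.DataFrame({'Service':[1,0,np.nan,np.nan],
                   'Age':[4,5,6,5],
                   'Fare':[7,8,9,5],
                   'GSize':[1,3,5,7],
                   'Deck':[5,3,6,2],
                   'Class':[7,4,3,0],
                   'Profession_title':[6,7,4,6]})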

print (df)
   Age  Class  Deck  Fare  GSize  Profession_title  Service
0    4      7     5     7      1                 6      1.0
1    5      4     3     8      3                 7      0.0
2    6      3     6     9      5                 4      NaN
3    5      0     2     5      7                 6      NaN

from sklearn.linear_model import LogisticRegression

# rows with a known Service value are used for training
ytrain = df['Service'].dropna()
xtrain = df.loc[df['Service'].notnull(), ['Age','Fare', 'GSize','Deck','Class', 'Profession_title' ]]
# rows with a missing Service value are the ones to predict
xtest = df.loc[df['Service'].isnull(), ['Age','Fare','GSize','Deck','Class','Profession_title']]
logistic = LogisticRegression()
logistic.fit(xtrain, ytrain)
print (logistic.predict(xtest))
[ 0.  0.]

edited Jan 01 '17 at 13:56

answered Jan 01 '17 at 13:27


Thanks. Do you have any idea why I get this error: X has 220 features per sample; expecting 307 – Jan 01 '17 at 13:42
It seems to be a problem with the data. I tested it with a small sample and it works, see the edit. – jezrael Jan 01 '17 at 13:57
Thank you for accepting. I tried your code with your CSV and got the same problem. The problem is that `xtrain` and `xtest` have a different number of columns; you can check with `print (xtrain.info()); print (xtest.info())` – jezrael Jan 01 '17 at 14:21
And the solution is `xtest = xtest.reindex(columns=xtrain.columns, fill_value=0); print(logistic.predict(xtest))` – jezrael Jan 01 '17 at 14:25
[`reindex`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html) works by columns here and puts `0` where it would otherwise produce `NaN`, so this code adds all missing columns to `xtest` and fills them with `0` – jezrael Jan 01 '17 at 14:27
Thank you so much. – Jan 01 '17 at 14:33
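
The column mismatch described in the comments can be reproduced and fixed on a toy frame. The thread never shows how the real data was preprocessed, so the sketch below rests on an assumption: that the two parts were one-hot encoded separately with `pd.get_dummies`, which would give each part its own set of dummy columns. The categorical `Deck` values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'Deck' is categorical here, unlike the numeric sample in the answer.
df = pd.DataFrame({'Service': [1, 0, 1, np.nan, np.nan],
                   'Age': [4, 5, 6, 5, 7],
                   'Deck': ['A', 'B', 'C', 'A', 'D']})

mask = df['Service'].notnull()

# Assumed cause of the mismatch: encoding each part on its own yields
# different dummy columns, because each part contains different categories.
xtrain = pd.get_dummies(df.loc[mask, ['Age', 'Deck']], dtype=int)
xtest = pd.get_dummies(df.loc[~mask, ['Age', 'Deck']], dtype=int)

print(xtrain.columns.tolist())  # ['Age', 'Deck_A', 'Deck_B', 'Deck_C']
print(xtest.columns.tolist())   # ['Age', 'Deck_A', 'Deck_D']

# Align xtest to the training columns: columns missing from xtest are added
# and filled with 0, and columns the training set never saw (Deck_D) are dropped.
xtest_aligned = xtest.reindex(columns=xtrain.columns, fill_value=0)
print(xtest_aligned.columns.tolist())
```

Note that `reindex` also silently drops test-only columns such as `Deck_D`, so a category that appears only in the unlabelled rows contributes nothing to the prediction. If the mismatch does come from separate encoding, encoding the full frame once before splitting avoids the problem altogether.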
