
I have this code:

import numpy as np
import pandas as pd
from sklearn import tree

# load the training data
train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_url)
train["Sex"][train["Sex"] == "male"] = 0
train["Sex"][train["Sex"] == "female"] = 1
train["Embarked"] = train["Embarked"].fillna("S")
train["Age"] = train["Age"].fillna(train["Age"].median())
train["Embarked"][train["Embarked"] == "S"] = 0
train["Embarked"][train["Embarked"] == "C"] = 1
train["Embarked"][train["Embarked"] == "Q"] = 2
target = train["Survived"].values
features_one = train[["Pclass", "Sex", "Age", "Fare"]].values
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(features_one, target)
test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_url)
test.Fare[152] = test["Fare"].median()
test["Sex"][test["Sex"] == "male"] = 0
test["Sex"][test["Sex"] == "female"] = 1
test["Embarked"] = test["Embarked"].fillna("S")
test["Age"] = test["Age"].fillna(test["Age"].median())
test["Embarked"][test["Embarked"] == "S"] = 0
test["Embarked"][test["Embarked"] == "C"] = 1
test["Embarked"][test["Embarked"] == "Q"] = 2
test_features = test[["Pclass", "Sex", "Age", "Fare"]].values
# predict on the test set and write the submission file
my_prediction = my_tree_one.predict(test_features)
PassengerId = np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(my_prediction, PassengerId)
my_solution.to_csv("5.csv", index_label = ["PassangerId", "Survived"])

As you can see, I only want to save a CSV with two columns, but when I look at the file 5.csv there is an extra column called 0. Does anybody know why?

Ulises 2010

2 Answers


You're seeing this behaviour because you're passing two index_label values when there is only one index; the prediction column itself was never given a name, so pandas falls back to its default integer column name, 0.

You can instead name your single column explicitly:

my_solution.columns = ['Survived']

And then label your index like so:

my_solution.to_csv("5.csv", index_label=["PassengerId"])
mechanical_meat

Try this slightly optimized solution:

import pandas as pd
from sklearn import tree

train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
cols = ["Pclass", "Sex", "Age", "Fare"]

mappings = {
  'Sex': {'male':0, 'female':1},
}

def cleanup(df, mappings=mappings):
    # map non-numeric columns
    for c in mappings.keys():
        df[c] = df[c].map(mappings[c])
    # replace NaNs in any remaining column with that column's mean
    for c in df.columns[df.isnull().any()]:
        df[c] = df[c].fillna(df[c].mean())
    return df

# parse train data set
train = cleanup(pd.read_csv(train_url, usecols=cols + ['Survived']))
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one.fit(train.drop('Survived', axis=1), train['Survived'])

# parse test data set
test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_url, usecols=cols+['PassengerId'])
result = test.pop('PassengerId').to_frame('PassengerId')
test = cleanup(test)

result['Survived'] = my_tree_one.predict(test)
result.to_csv("5.csv", index=False)
MaxU - stand with Ukraine