
I am looking at the famous Titanic dataset from the Kaggle competition found here: http://www.kaggle.com/c/titanic-gettingStarted/data

I have loaded and processed the data using:

# import required libraries
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# load the data from the file
df = pd.read_csv('./data/train.csv')

# import the scatter_matrix functionality
from pandas.tools.plotting import scatter_matrix

# define colors list, to be used to plot survived either red (=0) or green (=1)
colors=['red','green']

# make a scatter plot
scatter_matrix(df,figsize=[20,20],marker='x',c=df.Survived.apply(lambda x:colors[x]))

df.info()

[Figure: scatter_matrix from matplotlib]

How can I add the categorical columns like Sex and Embarked to the plot?

cchamberlain
Geoffrey Stoel
  • A scatter plot is not a good choice for categorical variables, so it wouldn't really make sense to "add" those variables to this scatter matrix. You could do a different set of plots involving those variables (for instance, boxplots of each numeric variable grouped by the categories). – BrenBarn Jan 19 '15 at 22:15
  • BrenBarn, thanks. I do not fully agree with you: when the factors are limited (like gender: male, female, unknown), I find it very insightful to treat them as integers like 1, 2 and 3 and plot them in a scatter plot. If I remember correctly, R treats the factors in a data frame like this when plotting a scatter matrix. I was hoping I could do the same with pandas. – Geoffrey Stoel Jan 19 '15 at 22:18
  • I think you'll want to look at seaborn's facetgrids and pairgrids for this type of plot: http://web.stanford.edu/~mwaskom/software/seaborn/examples/scatterplot_matrix.html – Paul H Jan 19 '15 at 22:31
  • Note that for pandas version >0.19, the `from pandas.tools.plotting import scatter_matrix` should be replaced by `from pandas.plotting import scatter_matrix` (cf. [reference answer](https://stackoverflow.com/a/44102537/3047493)) – Luc M Apr 03 '19 at 11:41
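The import change described in the last comment can be handled with a fallback. This is only a sketch, assuming the same script may have to run on both an old and a recent pandas:

```python
# Prefer the current module path; fall back to the old one on pandas <= 0.19.
try:
    from pandas.plotting import scatter_matrix
except ImportError:
    from pandas.tools.plotting import scatter_matrix
```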

3 Answers


You need to transform the categorical variables into numbers to plot them.

Example (assuming that the column 'Sex' holds the gender data, with 'M' for males and 'F' for females):

import numpy as np

# start with NaN so that unknown values stay missing
df['Sex_int'] = np.nan
df.loc[df['Sex'] == 'M', 'Sex_int'] = 0
df.loc[df['Sex'] == 'F', 'Sex_int'] = 1

Now all males are represented by 0 and females by 1. Unknown genders (if there are any) stay NaN and will be ignored.

The rest of your code should process the updated dataframe nicely.
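The reason the encoding matters is that scatter_matrix only draws numeric columns, so a string column is silently left out while its numeric copy is picked up. Here is a minimal sketch on made-up data; the toy frame is hypothetical, and it uses the 'male'/'female' labels that appear in the other answers rather than 'M'/'F':

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train.csv
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Sex": ["male", "female", "female", None],
})

# Same pattern as in the answer, with 'male'/'female' as the labels
toy["Sex_int"] = np.nan
toy.loc[toy["Sex"] == "male", "Sex_int"] = 0
toy.loc[toy["Sex"] == "female", "Sex_int"] = 1

# Columns that count as numeric, i.e. the ones a scatter matrix can draw
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

The row with the unknown value keeps NaN in Sex_int, so it simply contributes no point to any panel involving that column.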

knightofni

After some googling, I remembered the .map() function and fixed it in the following way:

colors=['red','green'] # color codes for survived : 0=red or 1=green

# create mapping Series for gender so it can be plotted
gender = pd.Series([0,1],index=['male','female'])
df['gender']=df.Sex.map(gender)

# create mapping Series for Embarked so it can be plotted
embarked = pd.Series([0,1,2,3],index=df.Embarked.unique())
df['embarked']=df.Embarked.map(embarked)

# add survived also back to the df
# (target is not defined in this snippet; presumably it holds the Survived column)
df['survived']=target

Now I can plot it again and drop the added columns afterwards.

Thanks everyone for responding.
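The whole round trip (encode, plot, drop) might look like the sketch below. The toy data and the explicit dict mappings are assumptions for illustration. A dict is used instead of a mapping Series so that each label's code is fixed rather than depending on the order returned by unique(). The toy rows contain no missing values, which keeps the list of colours the same length as the plotted data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical toy data with the same column names as train.csv
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Age": [22.0, 38.0, 26.0, 35.0, 27.0, 54.0],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Embarked": ["S", "C", "S", "S", "Q", "C"],
})

# Encode the categorical columns with explicit mappings
df["gender"] = df["Sex"].map({"male": 0, "female": 1})
df["embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2})

# Colour each point by survival: 0 = red, 1 = green
colors = ["red", "green"]
point_colors = df["Survived"].map(lambda s: colors[s]).tolist()

# Only numeric columns are drawn, so the two new columns now appear
axes = scatter_matrix(df, figsize=(8, 8), c=point_colors)

# Drop the helper columns once the plot has been made
df = df.drop(columns=["gender", "embarked"])
```

With real data that contains missing values (Age, for example), it may be safer to drop or fill those rows before building the colour list, so that colours and points stay aligned.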

Geoffrey Stoel

Here is my solution:

# convert string column to category
df.Sex = df.Sex.astype('category')
# create additional column for its codes
df['Sex_code'] = df.Sex.cat.codes
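One detail worth knowing with this approach: missing values belong to no category, and cat.codes marks them with -1, so they would show up in a scatter plot as an extra level unless they are filtered out. A small sketch on made-up values:

```python
import pandas as pd

# Hypothetical values, including one missing entry
sex = pd.Series(["male", "female", None, "female"]).astype("category")

print(list(sex.cat.categories))  # categories are sorted: ['female', 'male']
print(sex.cat.codes.tolist())    # [1, 0, -1, 0]; -1 marks the missing value
```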
Aray Karjauv