14

I have a dataframe in pandas that I'm using to produce a scatterplot, and want to include a regression line for the plot. Right now I'm trying to do this with polyfit.

Here's my code:

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from numpy import *

table1 = pd.DataFrame.from_csv('upregulated_genes.txt', sep='\t', header=0, index_col=0)
table2 = pd.DataFrame.from_csv('misson_genes.txt', sep='\t', header=0, index_col=0)
table1 = table1.join(table2, how='outer')

table1 = table1.dropna(how='any')
table1 = table1.replace('#DIV/0!', 0)

# scatterplot
plt.scatter(table1['log2 fold change misson'], table1['log2 fold change'])
plt.ylabel('log2 expression fold change')
plt.xlabel('log2 expression fold change Misson et al. 2005')
plt.title('Root Early Upregulated Genes')
plt.axis([0,12,-5,12])

# this is the part I'm unsure about
regres = polyfit(table1['log2 fold change misson'], table1['log2 fold change'], 1)

plt.show()

But I get the following error:

TypeError: cannot concatenate 'str' and 'float' objects

Does anyone know where I'm going wrong here? I'm also unsure how to add the regression line to my plot. Any other general comments on my code would also be hugely appreciated, I'm still a beginner.

TimStuart
  • 434
  • 3
  • 6
  • 9
  • at what line do you get the error? – usethedeathstar Oct 15 '13 at 11:18
  • @usethedeathstar `regres = polyfit(table1['log2 fold change misson'], table1['log2 fold change'], 1)` – TimStuart Oct 15 '13 at 11:38
  • sure there are no NaN values in the tables? because pylab.scatter just does not plot the x,y points where x or y are NaN (meaning that it doesnt give errors either), but maybe polyfit doesnt know that? (just guessing at where the problem might be here- how are not-a-number values stored in your csv-file?) – usethedeathstar Oct 15 '13 at 11:48
  • Nope no NaN values. The only not-a-number values were '#DIV/0!', which I removed – TimStuart Oct 15 '13 at 11:58
  • what is the type of table1['log2 fold change misson'] and table1['log2 fold change'] ? (as far as i know they should be numpy.array with float as dtype (and both should have the same shape)) – usethedeathstar Oct 15 '13 at 12:01
  • They're both pandas series – TimStuart Oct 15 '13 at 12:26

1 Answers1

28

Instead of replacing '#DIV/0!' by hand, force the data to be numeric. This does two things at once: it ensures that the result is numeric type (not str), and it substitutes NaN for any entries that cannot be parsed as a number. Example:

In [5]: Series([1, 2, 'blah', '#DIV/0!']).convert_objects(convert_numeric=True)
Out[5]: 
0     1
1     2
2   NaN
3   NaN
dtype: float64

This should fix your error. But, on the general subject of fitting a line to data, I keep handy two ways of doing this that I like better than polyfit. The second of the two is more robust (and can potentially return much more detailed information about the statistics) but it requires statsmodels.

from scipy.stats import linregress
def fit_line1(x, y):
    """Return slope, intercept of best fit line."""
    # Remove entries where either x or y is NaN.
    clean_data = pd.concat([x, y], 1).dropna(0) # row-wise
    (_, x), (_, y) = clean_data.iteritems()
    slope, intercept, r, p, stderr = linregress(x, y)
    return slope, intercept # could also return stderr

import statsmodels.api as sm
def fit_line2(x, y):
    """Return slope, intercept of best fit line."""
    X = sm.add_constant(x)
    model = sm.OLS(y, X, missing='drop') # ignores entires where x or y is NaN
    fit = model.fit()
    return fit.params[1], fit.params[0] # could also return stderr in each via fit.bse

To plot it, do something like

m, b = fit_line2(x, y)
N = 100 # could be just 2 if you are only drawing a straight line...
points = np.linspace(x.min(), x.max(), N)
plt.plot(points, m*points + b)
Dan Allan
  • 34,073
  • 6
  • 70
  • 63
  • Thanks! Forcing the data to be numeric has fixed the error I was getting, but I'm getting NaN output from polyfit and the code you suggested... Any idea why this could be? – TimStuart Oct 15 '13 at 12:29
  • Some NaNs, or all NaNs? Can you reproduce the problem with a small subset of your data, and share that here? – Dan Allan Oct 15 '13 at 12:30
  • Sorry that was just a mistake by me, it's working now. Do you know how I would add this as a line to my scatterplot? – TimStuart Oct 15 '13 at 12:44
  • See the bottom of my answer. If it's not showing up in the same plot, try adding the keyword argument ``ax=plt.gca()`` in ``plot``. – Dan Allan Oct 15 '13 at 12:47