1

I received this error when trying to compile my code. I extracted data from xlsx file and created a dataframe ,replaced null values with 0, converted all the values to sting to be able to scatterplot and when i tried to show the results of my linear regression I received this error.

 TypeError: unsupported operand type(s) for /: 'str' and 'int'

and this is the code I did so far

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
def predict(x):
return slope * x + intercept
from scipy import stats
xlsxfile = pd.ExcelFile("C:\\Users\\AchourAh\\Desktop\\PL14_IPC_03_09_2018_SP_Level.xlsx") 
data = xlsxfile.parse('Sheet1', index_col = None, header = None) 
data1 = data.fillna(0) #Replace null values of the whole dataset with 0
data1 = data1.astype(str)
print(data1)
X = data1.iloc[0:len(data1),1] 
print(X)
Y = data1.iloc[0:len(data1),2] 
print(Y)
axes = plt.axes()
axes.grid() 
plt.scatter(X,Y)     
slope, intercept, r_value, p_value, std_err = stats.linregress(X, Y)

To notice that I am a beginner with this. The last line is causing the error This is the first columns COP COR and PAUS of the dataframe which I am trying to apply some Linear regression on:

 0            PP   SP000045856 COP COR  SP000045856 PAUS   
 1          201723                    0              2000   
 2          201724                12560             40060   
 3          201725               -17760             15040   
 4          201726                -5840             16960   
 5          201727                10600             4480   
 6          201728                    0             14700   
 7          201729                 4760             46820  

... till line 27

Ahmed Achour
  • 105
  • 1
  • 11
  • Hello, welcome to SO. Why `converted all the values to string to be able to scatterplot`?? Why do you think you need strings to be able to create a scatter plot? However - `stats.linregress` needs array(s) of numbers, because it calculates the linear regression of e.g. some measured, i.e. e.g. noisy data, over some independant variable, e.g. time or whatever... Perhaps you have a short look at the documentation of this function: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html – SpghttCd Sep 25 '18 at 07:59
  • And just to be complete - my recommendation is: just do not cast your data to string for that what you want to achieve. The fact, that scatter _works_ and _throws no error_ with two string arrays as arguments doesn't necessarily mean, that the result is useful or meaningful for you. – SpghttCd Sep 25 '18 at 08:05
  • if I remove the astype line I receive this: TypeError: 0 is not a string in the scatter plot line that's why I converted to string do you have any idea how to scatter plot without receiving this error – Ahmed Achour Sep 25 '18 at 08:09
  • Please post your dataframe as a sample, so that we can see the data you're dealing with. (No screenshot please, post it like code, and if it's too large perhaps just `data.head()`) – SpghttCd Sep 25 '18 at 08:11
  • I could reproduce your error by providing a list with both strings _and_ ints to `plt.scatter`. Is it possible, that you perhaps have header names in the first row of your data...? – SpghttCd Sep 25 '18 at 08:18
  • Exactly. Try first with `header=True` in the line where you parse your data... However - perhaps you could optimize your data import a little. I'm not too experienced with Excel-imports, but I'd try first simply `data = pd.read_excel(filename, sheetname, ....)` But I don't know your Excelfile, perhaps you have a reason to do it differently... – SpghttCd Sep 25 '18 at 08:21
  • changing to header=True I received this: TypeError: Passing a bool to header is invalid. Use header=None for no header or header=int or list-like of ints to specify the row(s) making up the column names – Ahmed Achour Sep 25 '18 at 08:24
  • Hmm - hard to guess now, as I do not know how your data is structured in the Excel sheet. But the way for you to go is definitely to fix your data import. It's not working up to now. I can not imagine which `bool`could be meant, which pandas tries to use as header... For a first test you could use an additional parameter `skiprows=1` together with `header=None` to really get only the numbers. But in the end you should be able to read the sheet including header information, so that you do nat have to add it manually afterwards. And to achieve this, the import has to suit the data structure. – SpghttCd Sep 25 '18 at 08:32
  • Perhaps you can post the first lines of the excel file, too... – SpghttCd Sep 25 '18 at 08:34
  • the firstlines are exactly like theones I added in the description of my problem – Ahmed Achour Sep 25 '18 at 10:01
  • I found my mistake: the opposite of `header=None`is not setting it to True. This in fact leads to the error message `TypeError: Passing a bool to header is invalid.` Just delete the header-kwarg completely, then it should work. Sorry... – SpghttCd Sep 25 '18 at 10:41

1 Answers1

1

The data in your Excel file has header information in the first row, so setting header=None is the reason why there are string values in your data instead of putting it as column names.
If you delete the header kwarg

xlsxfile = pd.ExcelFile("C:\\Users\\AchourAh\\Desktop\\PL14_IPC_03_09_2018_SP_Level.xlsx") 
data = xlsxfile.parse('Sheet1', index_col = None)

everything should work and you should get a dataframe like this:

data

   0      PP  SP000045856 COP COR  SP000045856 PAUS
0  1  201723                    0              2000
1  2  201724                12560             40060
2  3  201725               -17760             15040
3  4  201726                -5840             16960
4  5  201727                10600              4480
5  6  201728                    0             14700
6  7  201729                 4760             46820

However, you could do the same thing even a little shorter by directly using the read_excel-function of pandas:

data = pd.read_excel('C:\\Users\\AchourAh\\Desktop\\PL14_IPC_03_09_2018_SP_Level.xlsx', 'Sheet1')

Your scatter-plot can then be done e.g. like

data.plot('SP000045856 COP COR', 'SP000045856 PAUS', 'scatter')

or perhaps better readable but identical:

data.plot.scatter('SP000045856 COP COR', 'SP000045856 PAUS')

And the linear regression could be done like

slope, intercept, r_value, p_value, std_err = stats.linregress(data['SP000045856 COP COR'], data['SP000045856 PAUS'])
SpghttCd
  • 10,510
  • 2
  • 20
  • 25