I'm experimenting with building a machine learning model, and the results are really bad despite the answer being fairly simple. I know I'm doing something wrong, but I'm not sure where.
Here's what I'm doing:
- I have a data set of corporate financials.
- To make it easy, I'm trying to predict gross profit (total revenue - cost of revenue).
- To make it even easier, I'm actually calculating that value into a pandas column myself:
df['grossProfit'] = df['totalRevenue'] - df['costOfRevenue']
- To make it a little challenging, I have a column called Exchange, which is a category that gets encoded into numerical format (1, 2, 3, etc.).
My goal is simply to predict grossProfit, which I thought would be easy since 100% of the data needed to calculate it is in my dataset, but when I run the model I get at most 6% accuracy. I would expect it to be closer to 100%, since the model should figure out that totalRevenue - costOfRevenue = grossProfit.
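To spell out the relationship I'm expecting the model to learn, here's a minimal sanity check (just a sketch, using the column names from above):

import numpy as np

# the target is an exact function of two of the inputs
print(np.allclose(df['grossProfit'], df['totalRevenue'] - df['costOfRevenue']))  # should print True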
Here's my data:
grossProfit totalRevenue Exchange costOfRevenue
0 9.839200e+10 2.601740e+11 NASDAQ 1.617820e+11
1 9.839200e+10 2.601740e+11 NASDAQ 1.617820e+11
2 1.018390e+11 2.655950e+11 NASDAQ 1.637560e+11
3 1.018390e+11 2.655950e+11 NASDAQ 1.637560e+11
4 8.818600e+10 2.292340e+11 NASDAQ 1.410480e+11
... ... ... ... ...
186 4.224500e+10 9.113400e+10 NYSE 4.888900e+10
187 4.078900e+10 9.629300e+10 NYSE 5.550400e+10
188 3.748200e+10 8.913100e+10 NYSE 5.164900e+10
189 3.397500e+10 8.118600e+10 NYSE 4.721100e+10
190 3.597700e+10 8.586600e+10 NYSE 4.988900e+10
191 rows × 4 columns
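In case anyone wants to run the snippets below without my full dataset, here is a tiny frame built from the first five rows shown above (the real dataset has 191 rows):

import pandas as pd

# first five rows of the dataset, copied from the printout above
df = pd.DataFrame({
    'grossProfit':   [9.8392e10, 9.8392e10, 1.01839e11, 1.01839e11, 8.8186e10],
    'totalRevenue':  [2.60174e11, 2.60174e11, 2.65595e11, 2.65595e11, 2.29234e11],
    'Exchange':      ['NASDAQ', 'NASDAQ', 'NASDAQ', 'NASDAQ', 'NASDAQ'],
    'costOfRevenue': [1.61782e11, 1.61782e11, 1.63756e11, 1.63756e11, 1.41048e11],
})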
Here is my code to normalize/scale the data:
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, MinMaxScaler

df['grossProfit'] = df['totalRevenue'] - df['costOfRevenue'] #very bad REMOVE ASAP just for testing
variableToPredict = ['grossProfit']
df2=df[['grossProfit','totalRevenue','Exchange', 'costOfRevenue']]
#isolate the data (rows with vs. without the target)
PredictionDataSet=df2[df2[variableToPredict].notnull().all(1)].copy() # contains no missing values
X_missing=df2[df2[variableToPredict].isnull().all(1)].copy() #---> contains missing values
#gather numeric/catergory objects
columnsNumeric = list(PredictionDataSet.select_dtypes(include=['float']).columns)
columnsObjects = list(PredictionDataSet.select_dtypes(include=['object']).columns)
#encode the categorical column
encoder = OrdinalEncoder()
PredictionDataSet["Exchange"] = encoder.fit_transform(PredictionDataSet.Exchange.values.reshape(-1, 1))
#create test/train datasets
X_train=PredictionDataSet
X_train = X_train.drop(columns=variableToPredict)
y_train=PredictionDataSet[variableToPredict]
#scale the numeric columns of PredictionDataSet
PredictionDataSet[columnsNumeric] = MinMaxScaler().fit_transform(PredictionDataSet[columnsNumeric])
#transforming the input features
scaler_features = MinMaxScaler()
scaler_features.fit(X_train)
X_train = scaler_features.transform(X_train)
#transforming the target values
scaler_values = MinMaxScaler()
y_train=np.asarray(y_train).reshape(-1,1)
scaler_values.fit(y_train)
y_train=scaler_values.transform(y_train)
print("Shape of input features: {} ".format(X_train.shape))
print("Shape of input target values : {} ".format(y_train.shape))
numInputColumns = X_train.shape[1]
Shape of input features: (191, 3)
Shape of input target values : (191, 1)
3
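In case the encoding/scaling step matters, this is roughly how I would double-check what OrdinalEncoder and MinMaxScaler produced (a sketch, using the encoder and scaler_features objects fitted above):

# OrdinalEncoder keeps the learned mapping; with this data it should be ['NASDAQ', 'NYSE'] -> 0, 1
print(encoder.categories_)

# after MinMaxScaler, every feature column should lie in [0, 1]
print(X_train.min(axis=0), X_train.max(axis=0))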
Here's my model:
###### model
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential() #using tensorflow keras
model.add(layers.Dense(64, activation='relu', input_shape=(numInputColumns,)))
model.add(layers.Dense(128))
model.add(layers.Dense(128))
model.add(layers.Dense(128))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
model.fit(X_train,y_train,epochs=10,validation_split=0.2)
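For completeness, mapping the model's output back to the original dollar scale would look roughly like this (a sketch, reusing the scaler_values fitted above):

# predict on the scaled features and undo the target scaling
predicted_scaled = model.predict(X_train)
predicted_dollars = scaler_values.inverse_transform(predicted_scaled)
actual_dollars = scaler_values.inverse_transform(y_train)
print(predicted_dollars[:5], actual_dollars[:5])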
I am certain I'm making some big mistake somewhere; I'm just new to machine learning, so I'm not exactly sure where.