
I'm trying to play around with building a machine learning model, and the results are really bad despite the target being fairly simple to predict. I know I'm doing something wrong, but I'm not sure where.

Here's what I'm doing:

  1. I have a data set of corporate financials.
  2. To make it easy, I'm trying to predict gross profit (total revenue - cost of revenue).
  3. To make it even easier, I'm actually calculating that value into a pandas column myself: df['grossProfit'] = df['totalRevenue'] - df['costOfRevenue']
  4. To make it a little challenging, I have a column called Exchange, which is a category that gets encoded to numerical format (1, 2, 3, etc.).

My goal is to simply predict grossProfit, which I thought would be easy since 100% of the data needed to calculate it is in my dataset, but when I run the model I get up to 6% accuracy. I would expect it to be closer to 100%, since the model should figure out that totalRevenue - costOfRevenue = grossProfit.

Here's my data:

    grossProfit totalRevenue    Exchange    costOfRevenue
0   9.839200e+10    2.601740e+11    NASDAQ  1.617820e+11
1   9.839200e+10    2.601740e+11    NASDAQ  1.617820e+11
2   1.018390e+11    2.655950e+11    NASDAQ  1.637560e+11
3   1.018390e+11    2.655950e+11    NASDAQ  1.637560e+11
4   8.818600e+10    2.292340e+11    NASDAQ  1.410480e+11
... ... ... ... ...
186 4.224500e+10    9.113400e+10    NYSE    4.888900e+10
187 4.078900e+10    9.629300e+10    NYSE    5.550400e+10
188 3.748200e+10    8.913100e+10    NYSE    5.164900e+10
189 3.397500e+10    8.118600e+10    NYSE    4.721100e+10
190 3.597700e+10    8.586600e+10    NYSE    4.988900e+10
191 rows × 4 columns

Here is my code to normalize/scale the data:

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

df['grossProfit'] = df['totalRevenue'] - df['costOfRevenue'] # very bad, REMOVE ASAP, just for testing
variableToPredict = ['grossProfit']
df2 = df[['grossProfit', 'totalRevenue', 'Exchange', 'costOfRevenue']]

#isolate the rows where the target is present/missing
PredictionDataSet = df2[df2[variableToPredict].notnull().all(1)]  # contains no missing values
X_missing = df2[df2[variableToPredict].isnull().all(1)]           # contains missing values

#gather numeric/category columns
columnsNumeric = list(PredictionDataSet.select_dtypes(include=['float']).columns)
columnsObjects = list(PredictionDataSet.select_dtypes(include=['object']).columns)

#encode the categorical column
encoder = OrdinalEncoder()
PredictionDataSet["Exchange"] = encoder.fit_transform(PredictionDataSet.Exchange.values.reshape(-1, 1))

#create the training features/target
X_train = PredictionDataSet.drop(columns=variableToPredict)
y_train = PredictionDataSet[variableToPredict]

#scale the input features
scaler_features = MinMaxScaler()
X_train = scaler_features.fit_transform(X_train)

#scale the target values
scaler_values = MinMaxScaler()
y_train = scaler_values.fit_transform(np.asarray(y_train).reshape(-1, 1))

print("Shape of input features: {} ".format(X_train.shape))
print("Shape of input target values : {} ".format(y_train.shape))
numInputColumns = X_train.shape[1]

Shape of input features: (191, 3) 
Shape of input target values : (191, 1) 
3

Here's my model:

###### model

model = tf.keras.Sequential() #using tensorflow keras
model.add(layers.Dense(64, activation='relu', input_shape=(numInputColumns,)))
model.add(layers.Dense(128))
model.add(layers.Dense(128))
model.add(layers.Dense(128))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
model.fit(X_train,y_train,epochs=10,validation_split=0.2)

I am certain I am making some big mistake somewhere, I'm just new to machine learning so I'm not exactly sure where.

asked by Lostsoul; edited by halfer

1 Answer


For starters:

  1. You are in a regression setting, where accuracy is meaningless (it is only used in classification problems). Remove metrics=['accuracy'] from your model compilation and don't bother with it - you should evaluate the performance of your model with the same quantity you use as a loss (here MSE).

  2. For the same reason (regression problem), you should not use a sigmoid activation for your last layer, but the linear one (leaving it just as Dense(1) will also do the job, since linear is the default activation in Keras layers).

  3. Intermediate layers with linear activations (as yours here) collapse into a single linear transformation, adding practically no expressive power; add a relu activation to all of your intermediate layers (just as you do for your first one).
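Point 3 is easy to check numerically: stacking Dense layers without an activation just composes linear maps, and a composition of linear maps is itself a single linear map. Here is a minimal NumPy sketch of that collapse; the shapes mirror the model above, but the weights are random stand-ins, not values from any trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked "linear Dense layers": y = W2 @ (W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(128, 3)), rng.normal(size=128)
W2, b2 = rng.normal(size=(1, 128)), rng.normal(size=1)
x = rng.normal(size=3)

stacked = W2 @ (W1 @ x + b1) + b2

# ...collapse algebraically to one linear layer: y = W @ x + b
W, b = W2 @ W1, W2 @ b1 + b2
single = W @ x + b

print(np.allclose(stacked, single))  # the two outputs are numerically identical
```

This is why a "deep" network without nonlinearities has no more capacity than plain linear regression.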

All in all, here is what your starting point for experimentation should look like:

model = tf.keras.Sequential() 
model.add(layers.Dense(64, activation='relu', input_shape=(numInputColumns,)))
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')

When you are done with this part, you will eventually reach an unfortunate conclusion: in contrast with classification problems, where we can say immediately whether the accuracy is "good", "not good enough", "bad", etc., the performance metrics of regression problems, like MSE, do not lend themselves to such easy assessments. Even worse, your MSE is calculated on your scaled y data. Read my answer in How to interpret MSE in Keras Regressor to see how you can calculate the MSE on your initial, unscaled data and take its square root, so that you can assess the error in the units of your original data and see whether it is satisfactory for your case (a part usually omitted in ML tutorials)...
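As a sketch of that last step, here is one way to invert the target scaler and report the RMSE in the original units; y_true and y_pred_scaled below are toy stand-ins, not the asker's actual data or model output:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error

# Toy stand-ins for y_train and the model's (scaled) predictions
y_true = np.array([[9.8392e10], [1.01839e11], [8.8186e10]])
scaler_values = MinMaxScaler().fit(y_true)
y_scaled = scaler_values.transform(y_true)
y_pred_scaled = y_scaled + 0.01  # pretend the model is off by 0.01 in scaled units

# Undo the scaling, then compute the RMSE in the original units (here, dollars)
y_pred = scaler_values.inverse_transform(y_pred_scaled)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"RMSE in original units: {rmse:.3e}")
```

A tiny MSE on scaled data (e.g. 0.001) can still correspond to an error of hundreds of millions of dollars once mapped back, which is why the inverse transform matters.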

answered by desertnaut
  • Thank you so much. I made the changes and the MSE is low - MSE: 0.001. Not sure if I can ask in this question, but it leads me to another one: is MSE the best evaluation metric for regression problems? – Lostsoul Mar 26 '20 at 16:39
  • Or as a follow-up, what is the best way to assess the performance of a regression model? – Lostsoul Mar 26 '20 at 16:39
  • 1
    @Lostsoul unsurprisingly, there is no such "best" evaluation metric (if it existed, everything else would eventually be forgotten!). Read the updated part at the end and the linked answer to see how to translate your predictions back to the scale of your original data. Only then will you be able to judge if you are really satisfied with the results of your model. – desertnaut Mar 26 '20 at 16:47
  • Thank you very much. Your answer and links are super helpful. – Lostsoul Mar 26 '20 at 17:19