The REAL QUESTION IS MUCH FURTHER BELOW, but this is background info
My end goal is to make a general purpose python file (script? .py file) that opens an excel file, determines how the data is organized, "cleans" the unusable data, then runs a Multiple Linear Regression analysis. I'm stuck at the "cleaning" portion, but I have ideas, just not sure how to go about it.
This is what the data looks like (inside an excel file):
y value data 1 data 2 data 3 data 4 data 5 data 6
282 1 215 169 14 147
148 0 250 307 232 134
351 1 191 343 189 9
31 0 32 327 8 201
33 0 503 484 85 166
973 0 651 134 128
329 0 300 186 195
271 1 543 18
814 1 544 123
274 1 349 209
425 1
So, here is my code so far (with explanations)
import pandas as pd
import numpy as np
import statistics
df =pd.read_excel (r'D:\...data.xlsx')
## find the longest column and longest rows.........https://www.kite.com/python/answers/how-to-count-the-number-of-rows-in-a-pandas-dataframe-in-python
index = df.index
number_of_rows = len(index)
#####find number of columns ............https://www.w3resource.com/python-exercises/pandas/python-pandas-data-frame-exercise-57.php
number_of_columns = len(df.columns)
if number_of_rows>number_of_columns:
row_based = input("it appears that this is a row based file, in other words, the data go from top to bottom, and the top row are the datas' title. If so, press enter, otherwise press any other keyboard and enter. " )
if row_based == "":
######https://stackoverflow.com/questions/23979184/how-to-know-if-a-user-has-pressed-the-enter-key-using-python.................You should use \n as enter. It means the newline character.
Up until now, all I've done is determine which direction the data is in. Now I need to "clean" the data.
######this is the real work indent...##########################
#quantify how many how many empty values are in each column..............https://datascience.stackexchange.com/questions/12645/how-to-count-the-number-of-missing-values-in-each-row-in-pandas-dataframe
num_of_empty_cells_in_columns = df.isnull().sum(axis=0)
# sort the coulmns based on how many empty values they have............https://data-flair.training/blogs/sort-pandas-dataframes-series-array/
columns_pd_sorted = num_of_empty_cells_in_columns.sort_values(ascending=True)
Now that I 'sorted' the columns (based on how many empty cells are in the entire column) I just have to pick the first column as the "lowest value". This means this is either the Y value (for the Multiple Linear Regression analysis later) or a data field that has just as much data as the y value.
#find the lowest value (this is the value of the already sorted array)
lowest_value=columns_pd_sorted[0]
I also wanted to take the average of the all the empty fields. This average (mean) will be used later (I think).
mean_empty_cells=statistics.mean(columns_pd_sorted)
#if the user says its horizontal data (instead of vertical data)
else:
print(" this code hasn't been built yet..")
#if the number of columns exceeds the number of rows (indicates horizontal data series)
else:
print(" this code hasn't been built yet..")
My end goal is:
- ask the user to verify (or choose) the y variable.
- eliminate the columns with mostly empty fields.
- eliminate the rows that don't have have full data (all columns with data).
** !!!!!!!!!!!!!!!!!!!!!How I think I can solve it (THIS IS THE REAL QUESTION)!!!!!!!!!!!!!!!! **
I think I can solve all of this by doing the following, but have NO idea how to write the code. Ultimately my issue is I can't find a way to force pandas to give me the column name AND data.
- Make a list of columns that have equivalent empty cells as lowest_value. I assume some sort of iteration? This code below is literally a stab in the dark.
y_variable_candidates = []
for col in num_of_empty_cells_in_columns:
if col=lowest_value:
y_variable_candidates=y_variable_candidates + col
y_variable = y_variable_candidates[1]
y_variable_confirmation = input('currently your y variable is ' + str(y_variable) +' it appears that there are many y variable candidates, such as' + str(y_variable_candidates) + 'press enter if the current y variable is okay, otherwise press a number key to indicate which column should be the y variable')
#... more code later on
- Again, make a list of all columns that have empty cells that exceed the mean (mean_empty_cells). Again I think I can do this via iteration
- set the y_variable as the first "lowest_value" but ask the user to confirm
- go back to df and drop all columns that match the
mostly_empty_columns = []
for col in num_of_empty_cells_in_columns:
if col>mean_empty_cells:
mostly_empty_columns=mostly_empty_columns + col
#some code to get user to confirm to delete all the selected columns
desired end data:
y value data 1 data 2 data 3 data 5 data 6
282 1 215 169 14 147
148 0 250 307 232 134
351 1 191 343 189 9
31 0 32 327 8 201
33 0 503 484 85 166
I want to the run a Multiple Linear Regression Analysis on the above desired end data.
ANY HELP IS MUCH APPRECIATED THANK YOU!!!