
The REAL QUESTION IS MUCH FURTHER BELOW, but this is background info

My end goal is to make a general-purpose Python script (a .py file) that opens an Excel file, determines how the data is organized, "cleans" the unusable data, then runs a Multiple Linear Regression analysis. I'm stuck at the "cleaning" portion; I have ideas, but I'm not sure how to go about it.

This is what the data looks like (inside an excel file):

y value data 1  data 2  data 3  data 4  data 5  data 6
282     1       215     169             14      147
148     0       250     307             232     134
351     1       191     343             189     9
31      0       32      327             8       201
33      0       503     484             85      166
973     0       651                     134     128
329     0       300                     186     195
271     1       543                             18
814     1       544                             123
274     1       349                             209
425     1                   

So, here is my code so far (with explanations)

import pandas as pd
import numpy as np
import statistics

df = pd.read_excel(r'D:\...data.xlsx')

## find the longest column and longest rows.........https://www.kite.com/python/answers/how-to-count-the-number-of-rows-in-a-pandas-dataframe-in-python
index = df.index
number_of_rows = len(index)

#####find number of columns ............https://www.w3resource.com/python-exercises/pandas/python-pandas-data-frame-exercise-57.php
number_of_columns = len(df.columns)

if number_of_rows>number_of_columns:
    row_based = input("It appears that this is a row-based file; in other words, the data go from top to bottom, and the top row holds the data titles. If so, press enter, otherwise press any other key and then enter. ")
    if row_based == "":
    ######https://stackoverflow.com/questions/23979184/how-to-know-if-a-user-has-pressed-the-enter-key-using-python.................You should use \n as enter. It means the newline character.

Up until now, all I've done is determine which direction the data is in. Now I need to "clean" the data.

        ######this is the real work indent...##########################

        #quantify how many empty values are in each column..............https://datascience.stackexchange.com/questions/12645/how-to-count-the-number-of-missing-values-in-each-row-in-pandas-dataframe
        num_of_empty_cells_in_columns = df.isnull().sum(axis=0)
        # sort the columns based on how many empty values they have............https://data-flair.training/blogs/sort-pandas-dataframes-series-array/
        columns_pd_sorted = num_of_empty_cells_in_columns.sort_values(ascending=True)

Now that I've "sorted" the columns (based on how many empty cells each one has), I just have to pick the first column as the "lowest value". This means it is either the Y value (for the Multiple Linear Regression analysis later) or a data field that has just as much data as the y value.

        #find the lowest value (the first value of the already-sorted Series; use .iloc for position-based indexing)
        lowest_value = columns_pd_sorted.iloc[0]
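As a side note, a pandas Series can hand back the column name and its count directly, without sorting: `idxmin()` returns the label of the smallest value and `min()` returns the value itself. A minimal sketch on a made-up frame (the column names and values here are hypothetical, just mirroring the sample layout):

```python
import pandas as pd
import numpy as np

# hypothetical frame mirroring the sample layout (NaN = empty cell)
df = pd.DataFrame({
    "y value": [282, 148, 351],
    "data 3": [169.0, np.nan, np.nan],
})

num_of_empty_cells_in_columns = df.isnull().sum(axis=0)

# idxmin() gives the column *name* with the fewest empty cells,
# min() gives the count itself -- name AND data, no sort needed
best_column_name = num_of_empty_cells_in_columns.idxmin()
lowest_value = num_of_empty_cells_in_columns.min()
print(best_column_name, lowest_value)  # y value 0
```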

I also wanted to take the average (mean) number of empty cells per column. This mean will be used later (I think).

        mean_empty_cells=statistics.mean(columns_pd_sorted)

    #if the user says its horizontal data (instead of vertical data)
    else:
        print(" this code hasn't been built yet..")

#if the number of columns exceeds the number of rows (indicates horizontal data series)
else:
    print(" this code hasn't been built yet..")

My end goal is:

  1. ask the user to verify (or choose) the y variable.
  2. eliminate the columns with mostly empty fields.
  3. eliminate the rows that don't have have full data (all columns with data).

** !!!!!!!!!!!!!!!!!!!!!How I think I can solve it (THIS IS THE REAL QUESTION)!!!!!!!!!!!!!!!! **

I think I can solve all of this by doing the following, but have NO idea how to write the code. Ultimately my issue is I can't find a way to force pandas to give me the column name AND data.

  1. Make a list of columns that have as many empty cells as lowest_value. I assume some sort of iteration? Iterating over the Series with .items() yields each column name together with its count:
        y_variable_candidates = []
        for col_name, empty_count in num_of_empty_cells_in_columns.items():
           if empty_count == lowest_value:
              y_variable_candidates.append(col_name)

        y_variable = y_variable_candidates[0]

        y_variable_confirmation = input('Currently your y variable is ' + str(y_variable) + '. There are several y variable candidates: ' + str(y_variable_candidates) + '. Press enter if the current y variable is okay; otherwise type the number of the column that should be the y variable. ')
        #... more code later on
  2. Again, make a list of all columns whose empty-cell count exceeds the mean (mean_empty_cells). Again I think I can do this via iteration.
  3. Set the y_variable to the first "lowest_value" column, but ask the user to confirm.
  4. Go back to df and drop all columns that match:
        mostly_empty_columns = []
        for col_name, empty_count in num_of_empty_cells_in_columns.items():
           if empty_count > mean_empty_cells:
              mostly_empty_columns.append(col_name)

        #some code to get user to confirm to delete all the selected columns
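Putting the plan together, boolean indexing on the Series avoids the explicit loops entirely. A runnable sketch on a small made-up frame (the input() confirmation steps are omitted so the sketch stays self-contained; column names and values are hypothetical):

```python
import pandas as pd
import numpy as np

# small frame in the shape of the sample data (hypothetical values)
df = pd.DataFrame({
    "y value": [282, 148, 351, 31],
    "data 1":  [1, 0, 1, 0],
    "data 3":  [169.0, np.nan, np.nan, np.nan],  # mostly empty
})

num_of_empty_cells_in_columns = df.isnull().sum(axis=0)
lowest_value = num_of_empty_cells_in_columns.min()
mean_empty_cells = num_of_empty_cells_in_columns.mean()

# 1. columns tied for the fewest empty cells -> y-variable candidates
y_variable_candidates = list(
    num_of_empty_cells_in_columns[num_of_empty_cells_in_columns == lowest_value].index
)
y_variable = y_variable_candidates[0]  # default choice; confirm with input() in practice

# 2./4. columns with more empty cells than the mean -> drop them
mostly_empty_columns = list(
    num_of_empty_cells_in_columns[num_of_empty_cells_in_columns > mean_empty_cells].index
)
df = df.drop(columns=mostly_empty_columns)

print(y_variable, mostly_empty_columns, list(df.columns))
```

Indexing a Series with a boolean condition keeps both the labels (column names) and the counts, which is exactly the "column name AND data" pairing described above.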

desired end data:

y value data 1  data 2  data 3  data 5  data 6
282     1       215     169     14      147
148     0       250     307     232     134
351     1       191     343     189     9
31      0       32      327     8       201
33      0       503     484     85      166

I want to then run a Multiple Linear Regression analysis on the desired end data above.
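For that last step, one minimal way to fit the regression is ordinary least squares via numpy's lstsq (statsmodels or scikit-learn would give richer diagnostics like p-values and R²; this sketch assumes the cleaned frame has the shape of the desired end data, with made-up truncated values):

```python
import numpy as np
import pandas as pd

# cleaned frame in the shape of the desired end data (subset of columns for brevity)
df = pd.DataFrame({
    "y value": [282, 148, 351, 31, 33],
    "data 1":  [1, 0, 1, 0, 0],
    "data 2":  [215, 250, 191, 32, 503],
})

y = df["y value"].to_numpy(dtype=float)
X = df.drop(columns=["y value"]).to_numpy(dtype=float)
X = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column

# ordinary least squares: coefficients minimizing ||X b - y||
coeffs, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
predictions = X @ coeffs
print(coeffs)
```

The first coefficient is the intercept; the rest pair up with the predictor columns in order.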

ANY HELP IS MUCH APPRECIATED THANK YOU!!!

seeker

1 Answer


Here is an approach that uses the dropna() function. First, we have the initial data frame:

print(df)   # initial data frame
    y_value  data_1  data_2  data_3  data_4  data_5  data_6
0       282       1   215.0   169.0     NaN    14.0   147.0
1       148       0   250.0   307.0     NaN   232.0   134.0
2       351       1   191.0   343.0     NaN   189.0     9.0
3        31       0    32.0   327.0     NaN     8.0   201.0
4        33       0   503.0   484.0     NaN    85.0   166.0
5       973       0   651.0   134.0     NaN   128.0     NaN
6       329       0   300.0   186.0     NaN   195.0     NaN
7       271       1   543.0    18.0     NaN     NaN     NaN
8       814       1   544.0   123.0     NaN     NaN     NaN
9       274       1   349.0   209.0     NaN     NaN     NaN
10      425       1     NaN     NaN     NaN     NaN     NaN

Next, (a) we drop columns if every value is NaN and then (b) drop rows if any value is NaN:

# un-comment the next line to transpose the data frame (e.g., based on user input / user confirmation)
# df = df.transpose()

# delete columns with all NaN
df = df.dropna(axis=1, how='all')

# delete rows with 1 or more NaN
df = df.dropna(axis=0, how='any')

print(df)

   y_value  data_1  data_2  data_3  data_5  data_6
0      282       1   215.0   169.0    14.0   147.0
1      148       0   250.0   307.0   232.0   134.0
2      351       1   191.0   343.0   189.0     9.0
3       31       0    32.0   327.0     8.0   201.0
4       33       0   503.0   484.0    85.0   166.0

See the pandas dropna() documentation for the full set of options.
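If "mostly empty" columns (not just all-NaN ones) should also go, dropna() accepts a thresh parameter: the minimum number of non-NaN values a column needs to be kept. A small sketch with a hypothetical frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "keep":         [1.0, 2.0, 3.0, 4.0],
    "mostly_empty": [1.0, np.nan, np.nan, np.nan],
})

# keep only columns where at least half the values are present
df = df.dropna(axis=1, thresh=len(df) // 2)
print(list(df.columns))  # ['keep']
```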

jsmart