0

I want to build a for loop to only select row 5, row 10 and row 14 in pandas.

enter image description here

The actual file include thousands of rows in similar format. Please teach me a function that can go over the entire file.

Many Thanks !!!

Attached is my current progress:

df = pd.read_csv('C:/Users/ymx19/Desktop/EHS/Location/results/Batch3_enterprise_with_missing_level/HOU.csv',header = 0)
df = df.dropna(axis ='columns',how ='all')
headers_list = [x for x in df.columns]
count = len(headers_list)
k = headers_list[-1]
maxlevel = df[df[k].notna()].drop_duplicates(subset= headers_list, keep="last")
while count > 3:
    k = headers_list[-1]
    headers_list.pop()
    z = headers_list[-1]
    lower_level = df.drop_duplicates(subset=headers_list, keep="last")
    lower_level = lower_level[lower_level[z].notna() & lower_level[k].isna()]
    maxlevel.append(lower_level)
    count -= 1 
maxlevel.to_csv('C:\\Users\\ymx19\\Desktop/EHS\\Location\\results\\test\\HOU.csv', index = False)

Question: The the final maxlevel.csv didn't include any appended values from the for loop

Yumeng Xu
  • 179
  • 1
  • 2
  • 11

1 Answers1

1

In general, you can use slicing for this with df.iloc[start_row:end_row, start_column:end_column] or you can select specific rows with df.iloc[[4,9,13]].

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html

If you want to remove duplicates you can use:

df.drop_duplicates(subset=["Customer", "Level1", "Level2"], keep="last).

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html

Arne Decker
  • 808
  • 1
  • 3
  • 9
  • Hi Arne, thank you for your suggestions. The challenge is all rows are unique, for example row 2-4 are duplicate with row 5 in meaning but not in actual format. The drop_duplicates wont delete any rows. – Yumeng Xu Aug 31 '21 at 17:47
  • I'm not quite sure wether I understand you, but the check for duplicates can be controlled by the `substitue` parameter. You pass all columns that are considered for the check. If you pass `substitute=["Customer", "Level1", "Level2"]`, then the first four rows are detected as duplicate, the next five rows, too and so on. – Arne Decker Aug 31 '21 at 17:51
  • Hi Arne, Your suggestion works, I really appreciate for your help! I combined your suggestion with dropna() and isna() function to solve the problem. Many thanks ! One more thing, the parameter should be 'subset'. – Yumeng Xu Aug 31 '21 at 18:33