pandas DataFrame select specific data

Question

I want to build a for loop to only select row 5, row 10 and row 14 in pandas.

The actual file include thousands of rows in similar format. Please teach me a function that can go over the entire file.

Many Thanks !!!

Attached is my current progress:

df = pd.read_csv('C:/Users/ymx19/Desktop/EHS/Location/results/Batch3_enterprise_with_missing_level/HOU.csv',header = 0)
df = df.dropna(axis ='columns',how ='all')
headers_list = [x for x in df.columns]
count = len(headers_list)
k = headers_list[-1]
maxlevel = df[df[k].notna()].drop_duplicates(subset= headers_list, keep="last")
while count > 3:
    k = headers_list[-1]
    headers_list.pop()
    z = headers_list[-1]
    lower_level = df.drop_duplicates(subset=headers_list, keep="last")
    lower_level = lower_level[lower_level[z].notna() & lower_level[k].isna()]
    maxlevel.append(lower_level)
    count -= 1 
maxlevel.to_csv('C:\\Users\\ymx19\\Desktop/EHS\\Location\\results\\test\\HOU.csv', index = False)

Question: The the final maxlevel.csv didn't include any appended values from the for loop

as far as i understand you could use .loc and .iloc. but for better answer please share your piece of dataframe. — kağan hazal koçdemir, Aug 31 '21 at 16:48
for example, row 2-5 describe the same topic. row 5 include all information of that topic. I only want to keep row 5 and deleted all other duplicate rows (2-4) — Yumeng Xu, Aug 31 '21 at 17:26
Row 10 is the longest row, so it is relative easy to select, but the challenge I am facing now is how to avoid selecting row 9 together with row 5 and row 14 in my for loop. — Yumeng Xu, Aug 31 '21 at 17:35

Arne Decker · Accepted Answer · 2021-08-31T20:05:08.750

1

In general, you can use slicing for this with df.iloc[start_row:end_row, start_column:end_column] or you can select specific rows with df.iloc[[4,9,13]].

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html

If you want to remove duplicates you can use:

df.drop_duplicates(subset=["Customer", "Level1", "Level2"], keep="last).

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html

edited Aug 31 '21 at 20:05

answered Aug 31 '21 at 17:33

Arne Decker

808
1
3
9

Hi Arne, thank you for your suggestions. The challenge is all rows are unique, for example row 2-4 are duplicate with row 5 in meaning but not in actual format. The drop_duplicates wont delete any rows. – Yumeng Xu Aug 31 '21 at 17:47
I'm not quite sure wether I understand you, but the check for duplicates can be controlled by the `substitue` parameter. You pass all columns that are considered for the check. If you pass `substitute=["Customer", "Level1", "Level2"]`, then the first four rows are detected as duplicate, the next five rows, too and so on. – Arne Decker Aug 31 '21 at 17:51
Hi Arne, Your suggestion works, I really appreciate for your help! I combined your suggestion with dropna() and isna() function to solve the problem. Many thanks ! One more thing, the parameter should be 'subset'. – Yumeng Xu Aug 31 '21 at 18:33

pandas DataFrame select specific data

1 Answers1