Cleaning Excel File Using Pandas

Question

I have an excel file which I've read using Pandas the output is as follows:

+------------+-----+-----+-----+
| Type       | 1   | 2   | 3   |
| Category   | A   | A   | C   |
| Dates      | NaN | NaN | NaN |
| 01/01/2021 | 12  | 12  | 9   |
| 02/01/2021 | 10  | 10  | 2   |
| 03/01/2021 | 30  | 16  | NaN |
| 04/01/2021 | 15  | 23  | 4   |
| 05/01/2021 | 14  | 20  | 5   |
+------------+-----+-----+-----+

The first two rows gives information for each time series column wise. So for column 1 the Type is 1 and Category is A. I want to melt the time series and not quite sure how to approach the problem given the structure of the sheet.

Expected Output:

+------------+-------+----------+------+
|   Dates    | Price | Category | Type |
+------------+-------+----------+------+
| 01/01/2021 |    12 | A        |    1 |
| 02/01/2021 |    10 | A        |    1 |
| 03/01/2021 |    30 | A        |    1 |
| 04/01/2021 |    15 | A        |    1 |
| 05/01/2021 |    14 | A        |    1 |
| 01/01/2021 |    12 | B        |    2 |
| 02/01/2021 |    10 | B        |    2 |
| 03/01/2021 |    16 | B        |    2 |
| 04/01/2021 |    23 | B        |    2 |
| 05/01/2021 |    20 | B        |    2 |
| 01/01/2021 |     9 | C        |    3 |
| 02/01/2021 |     2 | C        |    3 |
| 04/01/2021 |     4 | C        |    3 |
| 05/01/2021 |     5 | C        |    3 |
+------------+-------+----------+------+

In case of Type 3 and Category C since value is NaN we drop that date. How can achieve the expected output?

score 2 · Accepted Answer · answered Sep 13 '21 at 17:32

Assuming the following input dataframe:

         col0 col1 col2 col3
0        Type    1    2    3
1    Category    A    A    C
2       Dates  NaN  NaN  NaN
3  01/01/2021   12   12    9
4  02/01/2021   10   10    2
5  03/01/2021   30   16  NaN
6  04/01/2021   15   23    4
7  05/01/2021   14   20    5

Here is a working pipeline:

(df.iloc[3:]
   .set_index('col0').rename_axis('Date') # set first column aside
   # next 3 lines to rename columns index
   .T
   .set_index(pd.MultiIndex.from_arrays(df.iloc[:2, 1:].values, names=df.iloc[:2, 0]))
   .T
   .stack(level=[0,1]) # columns to rows
   .rename('Price')    # rename last unnamed column
   .reset_index()      # all indexes back to columns
)

output:

          Date Type Category Price
0   01/01/2021    1        A    12
1   01/01/2021    2        A    12
2   01/01/2021    3        C     9
3   02/01/2021    1        A    10
4   02/01/2021    2        A    10
5   02/01/2021    3        C     2
6   03/01/2021    1        A    30
7   03/01/2021    2        A    16
8   04/01/2021    1        A    15
9   04/01/2021    2        A    23
10  04/01/2021    3        C     4
11  05/01/2021    1        A    14
12  05/01/2021    2        A    20
13  05/01/2021    3        C     5

score 1 · Answer 2 · answered Sep 13 '21 at 17:39

Another solution:

df = df.T
df.columns = df.loc["col0"]
df = df.iloc[1:]

print(
    df.melt(["Type", "Category", "Dates"])
    .drop(columns="Dates")
    .rename(columns={"col0": "Dates", "value": "Price"})
    .sort_values(by=["Type", "Category", "Dates"])
    .dropna()
    .reset_index(drop=True)
)

Prints:

   Type Category       Dates Price
0     1        A  01/01/2021    12
1     1        A  02/01/2021    10
2     1        A  03/01/2021    30
3     1        A  04/01/2021    15
4     1        A  05/01/2021    14
5     2        B  01/01/2021    12
6     2        B  02/01/2021    10
7     2        B  03/01/2021    16
8     2        B  04/01/2021    23
9     2        B  05/01/2021    20
10    3        C  01/01/2021     9
11    3        C  02/01/2021     2
12    3        C  04/01/2021     4
13    3        C  05/01/2021     5

df used:

         col0 col1 col2 col3
0        Type    1    2    3
1    Category    A    B    C
2       Dates  NaN  NaN  NaN
3  01/01/2021   12   12    9
4  02/01/2021   10   10    2
5  03/01/2021   30   16  NaN
6  04/01/2021   15   23    4
7  05/01/2021   14   20    5

Cleaning Excel File Using Pandas

2 Answers2