I want to reshape multiple files in DASK with many different columns. ↓Code
def convert_1d_to_2d(l, cols):
return [l[i:i + cols] for i in range(0, len(l), cols)]
def read_excel(inputs, **kwargs):
return from_map(pd.read_excel, inputs, **kwargs)
files = glob.glob(r'/content/*.xlsx')
ddf = read_excel(files)
# Columns required for pivot_table: i:columns, product_name:index
ddf['i'] = str(1)
ddf['product_name'] = np.nan
ddf=ddf.categorize(columns=['product_name','i'])
product_columns=ddf.columns[:2]
date_list = ddf.columns[2:-2]
date_list = convert_1d_to_2d(date_list, 11)
# I want to pivot_table so I turn it by date
for i in range(len(date_list)):
columns_lis = []
columns_list.append(list(product_columns))
columns_list.append(list(date_list[i]))
columns_list = list(itertools.chain.from_iterable(columns_list))
# Combine columns because only one index column can be specified for dask.reshape.pivot_table
ddf['product_name'] = ddf['Unnamed: 0']+','+ddf['Unnamed: 1']+','+columns_list[2]
dff = dd.reshape.pivot_table(ddf, values=columns_list[2:], index='product_name', columns='i')
When I run the code, I get the following error in categorize().
ValueError: The columns in the computed data do not match the columns in the provided metadata
Extra:['01/02/2022.part1', '01/02/2022.part2', '02/02/2022.part1', '02/02/2022.part2', ..., '31/02/2022.part1', '31/02/2022.part2']
Missing:['01/01/2022.part1', '01/01/2022.part2', '02/01/2022.part1', '02/01/2022.part2', ..., '31/01/2022.part1', '31/01/2022.part2']
I know the cause of this, because I compute() inside categorize().So to be precise, an error is occurring in the compute() part. (An error occurred when actually using compute().)
Below is the file data to read. (Actually, there are about 20 of them.)
file1(File January 2022)
Unnamed: 0 | Unnamed: 1 | 01/01/2022.part1 | 01/01/2022.part2 | 02/01/2022.part1 | 02/01/2022.part2 | ... | 31/01/2022.part1 | 31/01/2022.part2 |
---|---|---|---|---|---|---|---|---|
product_name | product_code | money | quantity | money | quantity | ... | money | quantity |
product_name | product_code | money | quantity | money | quantity | ... | money | quantity |
file2(File February 2022)
Unnamed: 0 | Unnamed: 1 | 01/02/2022.part1 | 01/02/2022.part2 | 02/02/2022.part1 | 02/02/2022.part2 | ... | 28/02/2022.part1 | 28/02/2022.part2 |
---|---|---|---|---|---|---|---|---|
product_name | product_code | money | quantity | money | quantity | ... | money | quantity |
product_name | product_code | money | quantity | money | quantity | ... | money | quantity |
Ultimately, the data I want to output is below.
expected final output:
product_name | money | quantity |
---|---|---|
ball,001,01/01/2022 | 30,000 | 10,000 |
ball,001,01/01/2022 | 15,000 | 10,000 |
I read the link below, but I figured it wouldn't work for data like this as it transforms different columns into the same column.