
I want to reshape multiple files in Dask, each with many different columns. Here is my code:

import glob
import itertools

import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.dataframe import from_map

def convert_1d_to_2d(l, cols):
    return [l[i:i + cols] for i in range(0, len(l), cols)]

def read_excel(inputs, **kwargs):
    return from_map(pd.read_excel, inputs, **kwargs)

files = glob.glob(r'/content/*.xlsx')
ddf = read_excel(files)

# Columns required for pivot_table: i:columns, product_name:index
ddf['i'] = '1'
ddf['product_name'] = np.nan
ddf = ddf.categorize(columns=['product_name', 'i'])

product_columns = ddf.columns[:2]
date_list = ddf.columns[2:-2]
date_list = convert_1d_to_2d(date_list, 11)

# I want to pivot_table so I turn it by date
for i in range(len(date_list)):
    columns_list = []
    columns_list.append(list(product_columns))
    columns_list.append(list(date_list[i]))
    columns_list = list(itertools.chain.from_iterable(columns_list))
    # Combine columns because only one index column can be specified for dd.reshape.pivot_table
    ddf['product_name'] = ddf['Unnamed: 0'] + ',' + ddf['Unnamed: 1'] + ',' + columns_list[2]
    dff = dd.reshape.pivot_table(ddf, values=columns_list[2:], index='product_name', columns='i')

When I run the code, I get the following error in categorize().

ValueError: The columns in the computed data do not match the columns in the provided metadata
Extra:['01/02/2022.part1', '01/02/2022.part2', '02/02/2022.part1', '02/02/2022.part2', ..., '31/02/2022.part1', '31/02/2022.part2']
Missing:['01/01/2022.part1', '01/01/2022.part2', '02/01/2022.part1', '02/01/2022.part2', ..., '31/01/2022.part1', '31/01/2022.part2']

I know the cause: categorize() calls compute() internally, so to be precise the error occurs in the compute() step. (The same error appears when I call compute() directly.)
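As far as I can tell, this happens because from_map infers its metadata from the first file when no meta= is passed, so the January columns become the expected schema and the February partitions no longer match it. A minimal sketch of the situation (the file names are hypothetical):

import pandas as pd
from dask.dataframe import from_map

# hypothetical paths standing in for the real monthly files
files = ['file_jan_2022.xlsx', 'file_feb_2022.xlsx']

# with no explicit meta=, from_map derives the expected columns from the
# first input only, i.e. the January date columns
ddf = from_map(pd.read_excel, files)
print(ddf.columns)  # includes '01/01/2022.part1', ..., not the February dates

# anything that triggers compute() (categorize() does so internally) then
# fails on the February partition, whose date columns differ, raising the
# ValueError shown above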

Below is the data in the files to be read (in reality there are about 20 such files).

file1 (January 2022):

Unnamed: 0 Unnamed: 1 01/01/2022.part1 01/01/2022.part2 02/01/2022.part1 02/01/2022.part2 ... 31/01/2022.part1 31/01/2022.part2
product_name product_code money quantity money quantity ... money quantity
product_name product_code money quantity money quantity ... money quantity

file2 (February 2022):

Unnamed: 0 Unnamed: 1 01/02/2022.part1 01/02/2022.part2 02/02/2022.part1 02/02/2022.part2 ... 28/02/2022.part1 28/02/2022.part2
product_name product_code money quantity money quantity ... money quantity
product_name product_code money quantity money quantity ... money quantity

Ultimately, the data I want to output is below.

expected final output:

product_name money quantity
ball,001,01/01/2022 30,000 10,000
ball,001,01/01/2022 15,000 10,000
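One way to get from the wide per-date layout to this long format (the approach suggested in the comments below) is to reshape each file with plain pandas before handing it to Dask. A minimal sketch, where read_and_reshape is a hypothetical helper and the mapping part1 -> money, part2 -> quantity is an assumption based on the samples above:

import pandas as pd

def read_and_reshape(path):
    # read one monthly file and return it in the long layout shown above
    df = pd.read_excel(path)
    df = df.rename(columns={'Unnamed: 0': 'product_name',
                            'Unnamed: 1': 'product_code'})
    # wide -> long: one row per (product, date.part) pair
    long = df.melt(id_vars=['product_name', 'product_code'],
                   var_name='date_part', value_name='value')
    # '01/01/2022.part1' -> date '01/01/2022', part 'part1'
    long[['date', 'part']] = long['date_part'].str.rsplit('.', n=1, expand=True)
    # assumption: part1 holds money, part2 holds quantity
    long['part'] = long['part'].map({'part1': 'money', 'part2': 'quantity'})
    out = long.pivot_table(index=['product_name', 'product_code', 'date'],
                           columns='part', values='value',
                           aggfunc='first').reset_index()
    # single combined key, matching the expected output
    out['product_name'] = (out['product_name'] + ',' +
                           out['product_code'].astype(str) + ',' +
                           out['date'])
    return out[['product_name', 'money', 'quantity']]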

I read the link below, but I figured it wouldn't work for data like this, as it transforms different columns into the same column.

  • All the dataframes need to have identical columns and data types before you concatenate them. So do some preprocessing to align the columns, alongside `read_excel`, in the function that you call with `from_map` – Michael Delgado Nov 27 '22 at 12:07
  • Oh - you're trying to merge the dataframes horizontally, not concat them? Partitions in Dask only work row-wise, not column-wise. So if the data is small enough to fit into memory, just use pandas with pd.merge. Otherwise, you'll need to partition each Excel file or transpose them so the dates are in the rows. – Michael Delgado Nov 27 '22 at 14:20
  • @MichaelDelgado Thanks for your quick response. Also, thank you for correcting the table. Is specifying the argument enforce_metadata=False not a recommended way to operate? The date in the column names is also important, so I don't want to unify the columns across the files by dropping the date, which is the annoyance. (I eventually need the date because I want to reshape the data using dask.reshape.pivot_table().) It also vertically concatenates the dataframes. Final output table: | money | quantity | date | |------|----------|------| | 1600 | 1.0 | 01/01/2022 | | 1000 | 2.0 | 01/02/2022 | – kaai Nov 27 '22 at 18:34
  • Yeah you can’t put markdown in comments. If your question could use additional information or clarification, just [edit] it directly rather than providing more info in comments. Thanks! – Michael Delgado Nov 27 '22 at 19:16
  • If you’re eventually planning to reshape the data, best to state that up front. Reshape operations are really challenging and memory intensive for larger-than-memory datasets. Can you clarify exactly what your goal is? – Michael Delgado Nov 27 '22 at 19:19
  • @MichaelDelgado Thank you for teaching me so much. I edited the question directly to be more specific about what I want to do. I was wondering if you could read it. – kaai Nov 28 '22 at 07:29
  • Try reshaping and cleaning each excel file on read in a custom function, then mapping that, rather than just mapping read_excel and then reshaping – Michael Delgado Nov 28 '22 at 08:05
  • 1
    @MichaelDelgado As you said, I used a custom function to reshape and clean each read excel file and it ran without errors.Thank you so much for your help! – kaai Nov 28 '22 at 12:18
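
For reference, the fix confirmed in the last comment can be sketched as follows, reusing the hypothetical read_and_reshape helper from above: because every cleaned file now has identical columns (product_name, money, quantity) and dtypes, from_map can concatenate the partitions without a metadata mismatch.

import glob
from dask.dataframe import from_map

files = glob.glob(r'/content/*.xlsx')
# each partition is one fully reshaped file, so all partitions share the
# same schema and compute() no longer fails
ddf = from_map(read_and_reshape, files)
result = ddf.compute()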
