
I want to reshape multiple files in Dask, each with many different columns. Here is my code:

import glob
import itertools

import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.dataframe import from_map

def convert_1d_to_2d(l, cols):
    return [l[i:i + cols] for i in range(0, len(l), cols)]

def read_excel(inputs, **kwargs):
    return from_map(pd.read_excel, inputs, **kwargs)

files = glob.glob(r'/content/*.xlsx')
ddf = read_excel(files)

# Columns required for pivot_table: i:columns, product_name:index
ddf['i'] = '1'
ddf['product_name'] = np.nan
ddf = ddf.categorize(columns=['product_name', 'i'])

product_columns = ddf.columns[:2]
date_list = ddf.columns[2:-2]
date_list = convert_1d_to_2d(date_list, 11)

# I want to pivot_table so I turn it by date
for i in range(len(date_list)):
    columns_list = []
    columns_list.append(list(product_columns))
    columns_list.append(list(date_list[i]))
    columns_list = list(itertools.chain.from_iterable(columns_list))
    # Combine columns because only one index column can be specified for dd.reshape.pivot_table
    ddf['product_name'] = ddf['Unnamed: 0'] + ',' + ddf['Unnamed: 1'] + ',' + columns_list[2]
    dff = dd.reshape.pivot_table(ddf, values=columns_list[2:], index='product_name', columns='i')

When I run the code, I get the following error in categorize().

ValueError: The columns in the computed data do not match the columns in the provided metadata
Extra:['01/02/2022.part1', '01/02/2022.part2', '02/02/2022.part1', '02/02/2022.part2', ..., '31/02/2022.part1', '31/02/2022.part2']
Missing:['01/01/2022.part1', '01/01/2022.part2', '02/01/2022.part1', '02/01/2022.part2', ..., '31/01/2022.part1', '31/01/2022.part2']

I know the cause: categorize() calls compute() internally, so to be precise the error occurs in the compute() step. (The same error appears when I call compute() directly.)
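As far as I can tell, this happens because from_map infers its metadata from the first file when no meta= is passed, so the January columns become the expected schema and the February partitions no longer match it. A minimal sketch of the situation (the file names are hypothetical):

import pandas as pd
from dask.dataframe import from_map

# hypothetical paths standing in for the real monthly files
files = ['file_jan_2022.xlsx', 'file_feb_2022.xlsx']

# with no explicit meta=, from_map derives the expected columns from the
# first input only, i.e. the January date columns
ddf = from_map(pd.read_excel, files)
print(ddf.columns)  # includes '01/01/2022.part1', ..., not the February dates

# anything that triggers compute() (categorize() does so internally) then
# fails on the February partition, whose date columns differ, raising the
# ValueError shown above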

Below is the data in the files to be read (in reality there are about 20 such files).

file1 (January 2022):

Unnamed: 0 Unnamed: 1 01/01/2022.part1 01/01/2022.part2 02/01/2022.part1 02/01/2022.part2 ... 31/01/2022.part1 31/01/2022.part2
product_name product_code money quantity money quantity ... money quantity
product_name product_code money quantity money quantity ... money quantity

file2 (February 2022):

Unnamed: 0 Unnamed: 1 01/02/2022.part1 01/02/2022.part2 02/02/2022.part1 02/02/2022.part2 ... 28/02/2022.part1 28/02/2022.part2
product_name product_code money quantity money quantity ... money quantity
product_name product_code money quantity money quantity ... money quantity

Ultimately, the data I want to output is below.

expected final output:

product_name money quantity
ball,001,01/01/2022 30,000 10,000
ball,001,01/01/2022 15,000 10,000
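One way to get from the wide per-date layout to this long format (the approach suggested in the comments below) is to reshape each file with plain pandas before handing it to Dask. A minimal sketch, where read_and_reshape is a hypothetical helper and the mapping part1 -> money, part2 -> quantity is an assumption based on the samples above:

import pandas as pd

def read_and_reshape(path):
    # read one monthly file and return it in the long layout shown above
    df = pd.read_excel(path)
    df = df.rename(columns={'Unnamed: 0': 'product_name',
                            'Unnamed: 1': 'product_code'})
    # wide -> long: one row per (product, date.part) pair
    long = df.melt(id_vars=['product_name', 'product_code'],
                   var_name='date_part', value_name='value')
    # '01/01/2022.part1' -> date '01/01/2022', part 'part1'
    long[['date', 'part']] = long['date_part'].str.rsplit('.', n=1, expand=True)
    # assumption: part1 holds money, part2 holds quantity
    long['part'] = long['part'].map({'part1': 'money', 'part2': 'quantity'})
    out = long.pivot_table(index=['product_name', 'product_code', 'date'],
                           columns='part', values='value',
                           aggfunc='first').reset_index()
    # single combined key, matching the expected output
    out['product_name'] = (out['product_name'] + ',' +
                           out['product_code'].astype(str) + ',' +
                           out['date'])
    return out[['product_name', 'money', 'quantity']]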

I read the link below, but I figured it wouldn't work for data like this, as it transforms different columns into the same column.

  • All the dataframes need to have identical columns and data types before you concatenate them. So do some preprocessing to align the columns, alongside `read_excel`, in the function that you call with `from_map` – Michael Delgado Nov 27 '22 at 12:07
  • Oh - you're trying to merge the dataframes horizontally, not concat them? Partitions in Dask only work row-wise, not column-wise. So if the data is small enough to fit into memory, just use pandas with pd.merge. Otherwise, you'll need to partition each Excel file or transpose them so the dates are in the rows. – Michael Delgado Nov 27 '22 at 14:20
  • @MichaelDelgado Thanks for your quick response. Also, thank you for correcting the table. Is specifying the argument enforce_metadata=False not a recommended way to operate? The date in the column names is also important, so I don't want to unify the columns across the files by dropping the date, which is the annoyance. (I eventually need the date because I want to reshape the data using dask.reshape.pivot_table().) It also vertically concatenates the dataframes. Final output table: | money | quantity | date | |------|----------|------| | 1600 | 1.0 | 01/01/2022 | | 1000 | 2.0 | 01/02/2022 | – kaai Nov 27 '22 at 18:34
  • Yeah you can’t put markdown in comments. If your question could use additional information or clarification, just [edit] it directly rather than providing more info in comments. Thanks! – Michael Delgado Nov 27 '22 at 19:16
  • If you’re eventually planning to reshape the data, best to state that up front. Reshape operations are really challenging and memory intensive for larger-than-memory datasets. Can you clarify exactly what your goal is? – Michael Delgado Nov 27 '22 at 19:19
  • @MichaelDelgado Thank you for teaching me so much. I edited the question directly to be more specific about what I want to do. I was wondering if you could read it. – kaai Nov 28 '22 at 07:29
  • Try reshaping and cleaning each excel file on read in a custom function, then mapping that, rather than just mapping read_excel and then reshaping – Michael Delgado Nov 28 '22 at 08:05
  • 1
    @MichaelDelgado As you said, I used a custom function to reshape and clean each read excel file and it ran without errors.Thank you so much for your help! – kaai Nov 28 '22 at 12:18
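
For reference, the fix confirmed in the last comment can be sketched as follows, reusing the hypothetical read_and_reshape helper from above: because every cleaned file now has identical columns (product_name, money, quantity) and dtypes, from_map can concatenate the partitions without a metadata mismatch.

import glob
from dask.dataframe import from_map

files = glob.glob(r'/content/*.xlsx')
# each partition is one fully reshaped file, so all partitions share the
# same schema and compute() no longer fails
ddf = from_map(read_and_reshape, files)
result = ddf.compute()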
