I have multiple .npz files with the same structure in a folder, and I want to merge all of them into a single .npz file in a given folder.

I tried the code below, but it does not seem to merge the multiple .npz files into a single .npz file. Here is the code:

import numpy as np

# Raw strings keep the Windows backslashes from being read as escape sequences
file_list = [r'image-embeddings\img-emb-1.npz', r'image-embeddings\img-emb-2.npz']
data_all = [np.load(fname) for fname in file_list]
merged_data = {}
for data in data_all:
    for k, v in data.items():
        merged_data[k] = v
np.savez('new_file.npz', **merged_data)

img-emb-1.npz and img-emb-2.npz contain different values.

  • Does your operating system have an archive tool? – hpaulj Jan 30 '23 at 07:00
  • I am trying to merge image-embeddings\img-emb-1.npz and image-embeddings\img-emb-2.npz into a single new_file.npz file; both npz files have the same structure but different data. – Sanidhya Chuahan Jan 30 '23 at 07:04
  • So `data.keys()` are the same for both? Then your code just writes the 2nd archive's values to the new archive? What is it supposed to do with duplicate keys/names? I believe an archive tool would ask whether you want to overwrite, or change names in the case of duplicates. An npz contains npy files with names taken from the dict. – hpaulj Jan 30 '23 at 07:29
  • I think we need a small example - create 2 `npz` with actual data, and show the desired merged `npz`. – hpaulj Jan 30 '23 at 08:14
  • With the default Windows 10 archive tool, I can extract the `npy` files from a `file.zip` to a directory. If I try to extract duplicates into a directory, it asks whether I want to overwrite or skip. I can't test my Linux tools at the moment. – hpaulj Jan 30 '23 at 08:17
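
To make the point about duplicate names concrete, here is a minimal sketch of what the question's loop does when both archives store an array under the same name. The key name `embeddings`, the array shapes, and the temporary directory are assumptions made up for illustration; the real files may use different names.

```python
import os
import tempfile

import numpy as np

tmp_dir = tempfile.mkdtemp()
path_1 = os.path.join(tmp_dir, "img-emb-1.npz")
path_2 = os.path.join(tmp_dir, "img-emb-2.npz")
out_path = os.path.join(tmp_dir, "new_file.npz")

# Two archives that share one array name ("embeddings" is a hypothetical key)
np.savez(path_1, embeddings=np.zeros((2, 3)))
np.savez(path_2, embeddings=np.ones((4, 3)))

merged_data = {}
for path in (path_1, path_2):
    with np.load(path) as data:
        print(os.path.basename(path), data.files)  # array names stored in this archive
        for key in data.files:
            merged_data[key] = data[key]  # a repeated key replaces the earlier array

np.savez(out_path, **merged_data)

# Inspect what actually ended up in the merged archive
with np.load(out_path) as merged:
    saved_keys = list(merged.files)
    saved_shape = merged["embeddings"].shape

print(saved_keys, saved_shape)
```

Printing `data.files` for each input is a quick way to check whether the archives really do share names. If they do, only the last file's arrays survive in the merged dict.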

1 Answer

Maybe try the following to construct merged_data:

arrays_read = dict(
    chain.from_iterable(np.load(file(arr_name)).items() for arr_name in arrays.keys())
)

Full example:

import os
from itertools import chain
import numpy as np

os.makedirs("arrays", exist_ok=True)  # np.savez does not create missing directories

def file(name):
    return f"arrays/{name}.npz"

# Create data
arrays = {f"arr{i:02d}": np.random.randn(10, 20) for i in range(10)}

# Save data in separate files
for arr_name, arr in arrays.items():
    np.savez(file(arr_name), **{arr_name: arr})

# Read all files into a dict
arrays_read = dict(
    chain.from_iterable(np.load(file(arr_name)).items() for arr_name in arrays.keys())
)

# Save into a single file
np.savez(file("arrays"), **arrays_read)

# Load to compare
arrays_read_single = dict(np.load(file("arrays")).items())

assert arrays_read.keys() == arrays_read_single.keys()
for k in arrays_read.keys():
    assert np.array_equal(arrays_read[k], arrays_read_single[k])
paime
  • You are putting each array into a separate `npz` file, one array per file. You could just as well have used `np.save` and written separate `npy` files. I think the OP has several arrays with duplicate array names. – hpaulj Jan 30 '23 at 08:12
  • Yes maybe, I made a guess, without further indication from OP for reproducibility ... – paime Jan 30 '23 at 08:45
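
If the files do share array names, as the comments suspect, one possible approach is to concatenate the same-named arrays instead of overwriting them. The sketch below assumes that every array has at least one dimension and that arrays with the same name agree on all dimensions except the first. The helper `merge_npz`, the key `embeddings`, and the file locations are hypothetical names used only for illustration.

```python
import os
import tempfile
from collections import defaultdict

import numpy as np


def merge_npz(paths, out_path):
    """Concatenate same-named arrays from several .npz files along axis 0."""
    collected = defaultdict(list)
    for path in paths:
        with np.load(path) as data:
            for key in data.files:
                collected[key].append(data[key])
    # Assumes arrays under one key differ only in their first dimension
    merged = {key: np.concatenate(parts, axis=0) for key, parts in collected.items()}
    np.savez(out_path, **merged)
    return merged


tmp_dir = tempfile.mkdtemp()
path_1 = os.path.join(tmp_dir, "img-emb-1.npz")
path_2 = os.path.join(tmp_dir, "img-emb-2.npz")
out_path = os.path.join(tmp_dir, "new_file.npz")

# Hypothetical inputs: the same key in both files, with different data
np.savez(path_1, embeddings=np.zeros((2, 3)))
np.savez(path_2, embeddings=np.ones((4, 3)))

merged = merge_npz([path_1, path_2], out_path)

# Reload the merged archive to confirm both inputs are present
with np.load(out_path) as check:
    merged_shape = check["embeddings"].shape

print(merged_shape)
```

If the arrays should stay separate rather than be stacked, an alternative is to give each one a distinct name when building the dict, for example by prefixing the key with the source file name.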