
I'm struggling with an H5 file from which I need to extract data and save it as a multi-column CSV. As shown in the picture, the structure of the H5 file consists of main groups (Genotypes, Positions, and taxa). The main group Genotypes contains more than 1500 subgroups (partial genotype names), and each subgroup contains sub-subgroups (complete genotype names). There are about 1 million datasets (named "calls"), each one laid in one sub-subgroup, and I need each of them written to a separate column. The problem is that when I use h5py (the group.get function) I have to use the path of every calls dataset. I extracted all the paths ending in "calls", but I can't reach all 1 million calls to get them into a CSV file. Could anybody help me extract the "calls" datasets, which are 8-bit integers, as separate columns in a CSV file? By running the code in the first answer I get this error:

  Traceback (most recent call last):
    File "path/file.py", line 32, in h5r.visititems(dump_calls2csv) #NOTE: function name is NOT a string!
    File "path/file.py", line 565, in visititems return h5o.visit(self.id, proxy)
    File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
    File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
    File "h5py\h5o.pyx", line 355, in h5py.h5o.visit
    File "h5py\defs.pyx", line 1641, in h5py.defs.H5Ovisit_by_name
    File "h5py\h5o.pyx", line 302, in h5py.h5o.cb_obj_simple
    File "path/file.py", line 564, in proxy return func(name, self[name])
    File "path/file.py", line 10, in dump_calls2csv np.savetxt(csvfname, arr, fmt='%5d', delimiter=',')
    File "<__array_function__ internals>", line 6, in savetxt
    File "path/file.py", line 1377, in savetxt open(fname, 'wt').close()
  OSError: [Errno 22] Invalid argument: 'Genotypes_ArgentineFlintyComposite-C(1)-37-B-B-B2-1-B25-B2-B?-1-B:100000977_calls.csv'
  • What is a `sub-sun group`? `There are [many] data sets [i need written] in a separate column` do you intend *row*? *line*? – greybeard May 11 '20 at 11:41
  • I looked at your schema. I assume you are writing each dataset in `/Genotypes/GroupN/SubGroupN/calls` to individual CSV files (where N identifies the Groups and Subgroups). This will create GroupN x GroupSubgroupN files. You can do this with `.visititems()` to recursively loop on the Genotype groups. Check the object type (group or dataset) in the `.visititems()` callable function. When you find a `calls` dataset: a) read the data into a Numpy array, b) create a unique file name (based on GroupN_GroupSubgroupN), c) write the data to the file with `numpy.savetxt()`. – kcw78 May 11 '20 at 12:54
  • BTW, why do you want to export HDF5 data to CSV? – kcw78 May 11 '20 at 13:17

1 Answer


16-May-2020 Update:

  • Added a second example that reads and exports the data using PyTables (aka tables) with .walk_nodes(). I prefer this method over h5py's .visititems().
  • For clarity, I separated the code that creates the example file from the 2 examples that read and export the CSV data.

Enclosed below are 2 simple examples that show how to recursively loop over all objects below the root group. For completeness, the code to create the test file is at the end of this post.

Example 1: with h5py
This example uses the .visititems() method with a callable function (dump_calls2csv).
Summary of this procedure:
1) Checks for dataset objects with calls in the name.
2) When it finds a matching object it does the following:
a) reads the data into a Numpy array,
b) creates a unique file name (using string substitution on the HDF5 group/dataset path name to ensure uniqueness),
c) writes the data to the file with numpy.savetxt().

import h5py
import numpy as np

def dump_calls2csv(name, node):    

    if isinstance(node, h5py.Dataset) and 'calls' in node.name :
       print ('visiting object:', node.name, ', exporting data to CSV')
       csvfname = node.name[1:].replace('/','_') +'.csv'
       arr = node[:]
       np.savetxt(csvfname, arr, fmt='%5d', delimiter=',')

##########################    

with h5py.File('SO_61725716.h5', 'r') as h5r :        
    h5r.visititems(dump_calls2csv) #NOTE: function name is NOT a string!

If you want to get fancy, you can replace arr in np.savetxt() with node[:].
Also, if you want headers in your CSV, extract and reference the dtype field names from the dataset (I did not create any in this example).
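
As a minimal, untested sketch of that idea (assuming the dataset has a compound dtype with named fields; the plain integer test datasets below do not, so this is illustrative only), the np.savetxt() call inside dump_calls2csv could become:

if node.dtype.names is not None:
    # compound dtype: use the field names as the CSV header row
    hdr = ','.join(node.dtype.names)
    np.savetxt(csvfname, arr, fmt='%5d', delimiter=',', header=hdr, comments='')
else:
    # plain array: no field names, so no header row
    np.savetxt(csvfname, arr, fmt='%5d', delimiter=',')

Passing comments='' stops numpy from prefixing the header row with '# '.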

Example 2: with PyTables (tables)
This example uses the .walk_nodes() method with a filter: classname='Leaf'. In PyTables, a leaf can be any of the storage classes (Arrays and Tables).
The procedure is similar to the method above. .walk_nodes() simplifies the process of finding the datasets and does NOT require a separate callable function.

import tables as tb
import numpy as np

with tb.File('SO_61725716.h5', 'r') as h5r :     
    for node in h5r.walk_nodes('/',classname='Leaf') :         
       print ('visiting object:', node._v_pathname, 'export data to CSV')
       csvfname = node._v_pathname[1:].replace('/','_') +'.csv'
       np.savetxt(csvfname, node.read(), fmt='%d', delimiter=',')
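
Note that this loop exports every leaf, including string datasets such as the /Positions/Chromosomes node mentioned in the comments below, and it builds the CSV file name directly from the HDF5 path, which in the question's file contains characters like '?' and ':' that are not valid in Windows file names (the OSError in the question). A rough sketch of a more defensive version (the name filter, the character list, and the '%s' fallback format are assumptions added here, not part of the answer above):

import re
import tables as tb
import numpy as np

with tb.File('SO_61725716.h5', 'r') as h5r:
    for node in h5r.walk_nodes('/', classname='Leaf'):
        # skip nodes PyTables cannot read (e.g. variable-length strings) and leaves that are not 'calls'
        if isinstance(node, tb.UnImplemented) or 'calls' not in node._v_name:
            continue
        # replace '/' and characters that are illegal in Windows file names (e.g. '?' and ':')
        csvfname = re.sub(r'[\\/:*?"<>|]', '_', node._v_pathname[1:]) + '.csv'
        arr = node.read()
        # pick the output format from the array dtype: integers vs. everything else
        fmt = '%d' if arr.dtype.kind in 'iu' else '%s'
        np.savetxt(csvfname, arr, fmt=fmt, delimiter=',')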

For completeness, use the code below to create the test file used in the examples.

import h5py
import numpy as np

ngrps = 2
nsgrps = 3
nds = 4
nrows = 10
ncols = 2

with h5py.File('SO_61725716.h5', 'w') as h5w :
    for gcnt in range(ngrps):
        # top level groups: Group_0, Group_1, ...
        grp1 = h5w.create_group('Group_'+str(gcnt))
        for scnt in range(nsgrps):
            # second level groups: SubGroup_0, SubGroup_1, ...
            grp2 = grp1.create_group('SubGroup_'+str(scnt))
            for dcnt in range(nds):
                # datasets of random integers: calls_0, calls_1, ...
                i_arr = np.random.randint(1,100, (nrows,ncols) )
                ds = grp2.create_dataset('calls_'+str(dcnt), data=i_arr)
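
Since the question mentions first extracting all of the paths that end in "calls", here is a small hedged sketch (an illustration added here, not part of the answer above) that only collects those paths with .visititems(); it can be used to check how many datasets would be exported before writing roughly 1 million CSV files:

import h5py

calls_paths = []   # full HDF5 paths of every 'calls' dataset

def collect_calls(name, node):
    # record the path of every dataset whose name starts with 'calls'
    if isinstance(node, h5py.Dataset) and name.split('/')[-1].startswith('calls'):
        calls_paths.append(node.name)

with h5py.File('SO_61725716.h5', 'r') as h5r:
    h5r.visititems(collect_calls)

print(len(calls_paths), 'calls datasets found')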
kcw78
  • That is fantastic. Could you please explain what `name` and `node` are in `def dump_calls2csv(name, node)`? – Pouya M. Noparvar May 16 '20 at 10:59
  • `name` is the name of the object and `node` is a Group or Dataset instance (or object). (Node is HDF5 terminology.) Name is the relative pathname from the reference group ('/' in this case). If you add `print(name)` to `dump_calls2csv` above, you will get: `Group_0/SubGroup_0/calls_0` (and so on) for each dataset. This is similar to the `node.name` output (without the leading '/'). I use `node` in the type test. If you only need the object name, you can use `group.visit(callable)`. – kcw78 May 16 '20 at 15:05
  • Thanks a lot, this was enormously useful, but I get an error when running the code. – Pouya M. Noparvar May 17 '20 at 07:24
  • Is the error in your code? If so, what is the error? To diagnose, start with a very simple callable function. Maybe print `name` and `node.name`. Once that works, test whether the object is an instance of h5py.Dataset or h5py.Group. Also, use a simple main function to open the HDF5 file and then call `.visititems()`. I prefer to develop, test, and debug in an interactive console. – kcw78 May 17 '20 at 18:29
  • I have tested the code on a sample h5 file and it worked correctly, but on my main h5 file it gives the error which I added to the question above. – Pouya M. Noparvar May 18 '20 at 03:36
  • And for the second example I get: TypeError: Mismatch between array dtype ('|S1') and format specifier ('%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d') – Pouya M. Noparvar May 18 '20 at 03:45
  • value = self._g_getattr(self._v_node, name) value = self._g_getattr(self._v_node, name) v = format % tuple(row) + newline TypeError: %d format: a number is required, not numpy.bytes_ During handling of the above exception, another exception occurred: – Pouya M. Noparvar May 18 '20 at 03:49
  • np.savetxt(csvfname, node.read(), fmt= '%d', delimiter=',') File "<__array_function__ internals>", line 6, in savetxt % (str(X.dtype), format)) TypeError: Mismatch between array dtype ('|S1') and format specifier ('%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d') – Pouya M. Noparvar May 18 '20 at 03:50
  • It seems that the numpy arrays are not of equal dimension and an error occurs. What can I do about this? – Pouya M. Noparvar May 18 '20 at 03:53
  • When you export the data, the datatypes in `np.savetxt(fmt=)` have to match the HDF5 datatypes you are writing. You get an error because you have an array of strings (`|S1`), and you used my integer format: `fmt='%5d'`. `fmt=` is optional. Remove it and see if it works. Or, try `fmt='%s'`. If you need formatted output, you will have to create a matching format based on the dtype of the dataset/array. Additional code will be required if you want to join the array of S1 strings and print them as a single string. – kcw78 May 18 '20 at 13:02
  • I tried with/without fmt but the error still appears: `value = self._g_getattr(self._v_node, name)`; `variable length strings are not supported yet`. The rest of the errors are added in the next comment. – Pouya M. Noparvar May 19 '20 at 04:03
  • The leaf will become an ``UnImplemented`` node. % (self._g_join(childname), exc)) np.savetxt(csvfname, node.read(), fmt= '%s', delimiter=',') AttributeError: 'UnImplemented' object has no attribute 'read' visiting object: /Positions/Chromosomes export data to CSV – Pouya M. Noparvar May 19 '20 at 04:04
  • This has turned into a real challenge for me. – Pouya M. Noparvar May 19 '20 at 04:05
  • At this point, your question has evolved from the original. I suggest a new question that shows your current code and the errors you are getting. Ideally you would also provide an example h5 file. It looks like you are now using PyTables, so you need to use that tag in the new question. – kcw78 May 19 '20 at 20:33