0

I have a huge CSV and I have to create a NumPy array for each distinct value in a certain column (there are about 10 of them), but I have a problem with my list: it gets too big and Python crashes:

    import numpy as np
    import pandas as pd

    # image, gt and file_csv are defined elsewhere (not shown here)
    def mem():
        file = pd.read_csv(file_csv)
        x = []
        y = []
        path_prec = 0
        for index, row in file.iterrows():
            if path_prec == 0:
                path_prec = row[0]
            if path_prec != row[0]:
                # a new path starts: stack the patches collected so far
                X = np.stack(x, axis=0)
                Y = np.stack(y, axis=0)
                # save X and Y
                x = []
                y = []
                path_prec = row[0]
            # do some stuff and create a list
            top = int(row[2])
            bottom = int(row[3])
            left = int(row[4])
            right = int(row[5])

            patch = image[top:bottom, left:right]
            patch_gt = gt[top:bottom, left:right]
            x.append(patch)
            y.append(patch_gt)

How can I manage such huge data? With a generator? How?

Edit: this huge CSV contains information to locate data in the file system.

leoScomme
  • You need to give some information about the variables, and maybe even a small example case. How big is `file` (I assume that load works)? `image` and `gt`? It looks like you collect slices of these arrays in the `x` and `y` lists, and then periodically turn them into arrays (`stack`) and `save` those. What do you mean by `save`? How big are `X` and `Y`? – hpaulj Oct 13 '17 at 16:31
  • file is a CSV with about 56k rows and 6 cols, but I have about 10-20 different paths, and for each row with the same path I take a 224x224 matrix and append it to x (and y), so x and y are very big lists: approximately 2500 arrays of 224x224 each. The CSV is ordered by path, and when I find a new path I save (to my file system) the NumPy arrays X and Y and restart with the next path – leoScomme Oct 13 '17 at 16:44
  • can you predict the final size (shape) of the NumPy array you would need? – norok2 Oct 13 '17 at 18:33
  • No, I can't... – leoScomme Oct 13 '17 at 18:58
  • If you cannot predict at least an upper bound either, then even `memmap` can be difficult to use. Your memory issues arise with the `list` and not with `numpy`. Possibly, `pytables` can be useful here. Anyway, in general, you cannot expect the same speed when working on disk. In your code, it looks like you are letting `x`/`y` grow just to be lost at the end of the function. Also `image` is undefined. I would reconsider the approach (e.g. what are you going to do with `x` or `y`). If you want to avoid all this, you can also just wait longer and see if you really get a memory error or not. – norok2 Oct 13 '17 at 20:35
  • Does it fail while growing `x`, or when creating `X`? It looks like `x` is a list of views, so it shouldn't have that big of a memory footprint. But `X` would copy all those views into one large array. – hpaulj Oct 14 '17 at 12:55
  • It fails while growing x... – leoScomme Oct 14 '17 at 16:20

2 Answers

1

You could create a NumPy memmap object.

According to its documentation, this will:

Create a memory-map to an array stored in a binary file on disk.

Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory.

Probably, you want to parse the CSV manually to fill the memmap iteratively, for example using the chunksize option. See this question for some code on how to use chunksize for a similar purpose: loading csv column into numpy memmap (fast)
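
For illustration only, here is a minimal sketch of that combination. The upper bound on the number of patches, the `uint8` dtype and the output file name are assumptions on my part; `file_csv` and `image` are the names from the question, and the per-path split and the `gt`/`y` handling are omitted for brevity:

    import numpy as np
    import pandas as pd

    PATCH = 224           # patch size, taken from the comments above
    MAX_PATCHES = 5000    # assumed upper bound on the number of patches

    # Disk-backed array: slices are written straight to a file instead of
    # accumulating in a Python list in RAM.
    X = np.memmap('X.dat', dtype=np.uint8, mode='w+',
                  shape=(MAX_PATCHES, PATCH, PATCH))

    n = 0
    # chunksize keeps only 1000 CSV rows in memory at a time.
    for chunk in pd.read_csv(file_csv, chunksize=1000):
        for _, row in chunk.iterrows():
            top, bottom = int(row[2]), int(row[3])
            left, right = int(row[4]), int(row[5])
            X[n] = image[top:bottom, left:right]   # assumes every slice is 224x224
            n += 1

    X.flush()   # only the first n entries of X are meaningful

If the true number of patches is known only after the fact, you can record `n` separately and ignore the unused tail of the memmap when reading it back.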

norok2
  • Sorry, I don't understand how memmap can help with my problem... I have a CSV; with the data in each row I have to do something and save the result in a NumPy array. The problem is (see my code) that x and y are too big. Are you saying that x should be a memmap? – leoScomme Oct 13 '17 at 16:14
  • Read the docs a bit. The combination of chunked reading and memmap is a viable approach. Only one warning is needed: access performance depends heavily on access statistics. Grabbing sequential sub-blocks is fast; getting randomly-permuted single indices is slow as hell. – sascha Oct 13 '17 at 16:16
  • My impression is that the problem arises while collecting a bunch of `image` slices in the list `x` (and similarly for `y`). I assume he's reading the `csv` fine with the `pandas` reader. `file` is, presumably, a valid dataframe. I don't see how `memmap` applies. – hpaulj Oct 13 '17 at 16:28
  • Yes, the problem is that x and y grow too much. How can I solve this? – leoScomme Oct 13 '17 at 17:33
  • @hpaulj if you replace `x` and `y`'s `list` with `memmap`, there you go, but of course with unknown size (not even an upper bound) this can be problematic https://github.com/numpy/numpy/issues/4198 – norok2 Oct 14 '17 at 10:45
0

Quick naive solution: use more than one NumPy array for each path (for what I have to do this does not matter, so the simplest solution works).
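
A minimal sketch of how that might look, assuming a fixed batch size and made-up file names; `patches_for_current_path` and `path_prec` are hypothetical stand-ins for the per-path loop in the question:

    import numpy as np

    CHUNK = 500   # assumed number of patches per saved file

    def flush(batch, path_id, part):
        # Stack the current batch and write it out as its own file.
        np.save('X_{}_{}.npy'.format(path_id, part), np.stack(batch, axis=0))

    x = []
    part = 0
    for patch in patches_for_current_path:   # hypothetical iterable of 224x224 slices
        x.append(patch)
        if len(x) == CHUNK:                   # the list never grows past CHUNK entries
            flush(x, path_prec, part)
            x = []
            part += 1
    if x:                                     # remainder for this path
        flush(x, path_prec, part)

This keeps the in-memory list bounded at the cost of having several files per path instead of one.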

leoScomme
  • Although that may solve your specific problem, it does not seem to be a solution to the original problem (i.e. problem: my data is getting too big; solution: do not use that data). You should probably consider editing the question so it reflects your situation more closely, or consider removing the question altogether. Right now it is very misleading. – norok2 Oct 16 '17 at 15:12
  • Sorry, my bad: I'll delete the question :) – leoScomme Oct 22 '17 at 16:16