
I am working with data that has thousands of rows, but the columns are uneven, as shown below:

AB  12   43   54

DM  33   41   45   56   33   77  88

MO  88   55   66   32   34 

KL  10   90   87   47   23  48  56  12

First, I want to read the data into a list or array and then find the length of the longest row.
Then, I will pad the shorter rows with zeros to match the longest one, so that I can iterate over them as a 2D array.

I have tried the approaches from a couple of similar questions, but could not get them to work for this problem.

I believe there is a way in Python to do this. Could anyone please help me out?

PyLabour
  • How would you like to actually use the data (the "2D array")? I suspect a heterogeneous 2D array is not the right data structure for this problem. If all you need is the longest row then you don't need numpy at all. – Ali Nov 26 '15 at 11:51
  • I am thinking of adding zeros at the end of the short rows to make them as long as the longest one. Then I can index the whole data easily. – PyLabour Nov 26 '15 at 11:56
  • Is your data a list of lists, or is it in an external file? If you need the longest row from a list of lists, you could find it easily with a list comprehension. – Anton Protopopov Nov 26 '15 at 12:02
  • It is in a text file. I need to read the file first and then find the longest row. I have edited my question above; please have a look. – PyLabour Nov 26 '15 at 12:07

2 Answers


I don't see any easier way to figure out the maximum row length than to do one pass to find it. Then we build the 2D array in a second pass. Something like:

from __future__ import print_function
import numpy as np
from itertools import chain

data = '''AB 12 43 54
DM 33 41 45 56 33 77 88
MO 88 55 66 32 34
KL 10 90 87 47 23 48 56 12'''

max_row_len = max(len(line.split()) for line in data.splitlines())

def padded_lines():
    for uneven_line in data.splitlines():
        line = uneven_line.split()
        line += ['0']*(max_row_len - len(line))
        yield line

# I will get back to the line below shortly, it unnecessarily creates the array
# twice in memory:
array = np.array(list(chain.from_iterable(padded_lines())), np.dtype(object))

array.shape = (-1, max_row_len)

print(array)

This prints:

[['AB' '12' '43' '54' '0' '0' '0' '0' '0']
 ['DM' '33' '41' '45' '56' '33' '77' '88' '0']
 ['MO' '88' '55' '66' '32' '34' '0' '0' '0']
 ['KL' '10' '90' '87' '47' '23' '48' '56' '12']]

The above code is inefficient in the sense that it creates the array twice in memory: the list(...) call materializes every padded element in a Python list before numpy copies them into the array. I will get back to it; I think I can fix that.

However, numpy arrays are supposed to be homogeneous. You want to put strings (the first column) and integers (all the other columns) in the same 2D array. I still think you are on the wrong track here and should rethink the problem and pick another data structure or organize your data differently. I cannot help you with that since I don't know how you want to use the data.
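For what it's worth, here is a minimal sketch (my own illustration, not part of the question or this answer) of one such alternative: keep the labels in a plain list and the numbers in a homogeneous, zero-padded integer array. It assumes the first field of every line is a label and all remaining fields are integers:

    import numpy as np

    labels = []
    rows = []
    with open('/tmp/input.txt') as f:      # same file path as used further below
        for line in f:
            fields = line.split()
            labels.append(fields[0])                   # e.g. 'AB'
            rows.append([int(x) for x in fields[1:]])  # e.g. [12, 43, 54]

    # Homogeneous integer array; shorter rows are padded with zeros.
    values = np.zeros((len(rows), max(len(r) for r in rows)), dtype=int)
    for i, r in enumerate(rows):
        values[i, :len(r)] = r

    # labels[i] is the row label, values[i] is the padded numeric row.

With a layout like that you can do real arithmetic on values (sums, means, slicing) without the string/integer mix getting in the way.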

(I will get back to the array created twice issue shortly.)


As promised, here is the solution to the efficiency issues. Note that my concerns were about memory consumption.

    import numpy as np
    from itertools import chain

    def main():

        # First pass: find the longest row (number of whitespace-separated fields).
        with open('/tmp/input.txt') as f:
            max_row_len = max(len(line.split()) for line in f)

        # Second pass: find the longest individual field, so we can pick a
        # fixed-width string dtype that is just wide enough.
        with open('/tmp/input.txt') as f:
            str_len_max = len(max(chain.from_iterable(line.split() for line in f), key=len))

        # Third pass: yield each row padded with '0' up to max_row_len.
        def padded_lines():
            with open('/tmp/input.txt') as f:
                for uneven_line in f:
                    line = uneven_line.split()
                    line += ['0']*(max_row_len - len(line))
                    yield line

        fmt = '|S%d' % str_len_max
        array = np.fromiter(chain.from_iterable(padded_lines()), np.dtype(fmt))
        array.shape = (-1, max_row_len)

This code could be made nicer but I will leave that up to you.
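The line-by-line numbers below were produced with memory_profiler. In case it helps, here is a minimal sketch of how such a report can be generated (it assumes the memory_profiler package is installed; the body of main() is the one shown above):

    from memory_profiler import profile   # pip install memory_profiler

    @profile          # prints a line-by-line memory report when main() runs
    def main():
        pass          # body as shown above

    if __name__ == '__main__':
        main()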

Here is the memory consumption, measured with memory_profiler on a randomly generated input file with 1,000,000 lines and row lengths uniformly distributed between 1 and 20:

Line #    Mem usage    Increment   Line Contents
================================================
     5   23.727 MiB    0.000 MiB   @profile
     6                             def main():
     7                                 
     8   23.727 MiB    0.000 MiB       with open('/tmp/input.txt') as f:
     9   23.727 MiB    0.000 MiB           max_row_len = max(len(line.split()) for line in f)
    10                                     
    11   23.727 MiB    0.000 MiB       with open('/tmp/input.txt') as f:
    12   23.727 MiB    0.000 MiB           str_len_max = len(max(chain.from_iterable(line.split() for line in f), key=len))
    13                                 
    14   23.727 MiB    0.000 MiB       def padded_lines():
    15                                     with open('/tmp/input.txt') as f:
    16   62.000 MiB   38.273 MiB               for uneven_line in f:
    17                                             line = uneven_line.split()
    18                                             line += ['0']*(max_row_len - len(line))
    19                                             yield line
    20                                 
    21   23.727 MiB  -38.273 MiB       fmt = '|S%d' % str_len_max
    22                                 array = np.fromiter(chain.from_iterable(padded_lines()), np.dtype(fmt))
    23   62.004 MiB   38.277 MiB       array.shape = (-1, max_row_len)

With the code from eumiro's answer, and with the same input file:

Line #    Mem usage    Increment   Line Contents
================================================
     5   23.719 MiB    0.000 MiB   @profile
     6                             def main():
     7   23.719 MiB    0.000 MiB       with open('/tmp/input.txt') as f:
     8  638.207 MiB  614.488 MiB           arr = np.array(list(it.izip_longest(*[line.split() for line in f], fillvalue='0'))).T

Comparing the memory consumption increments: my updated code uses about one sixteenth of the memory of eumiro's (614.488/38.273 ≈ 16).

As for speed: my updated code processes this input in 3.321 s and eumiro's code in 5.687 s; that is, mine is about 1.7x faster on my machine. (Your mileage may vary.)
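If you want to reproduce a comparison like this, a rough harness could look like the sketch below. It is entirely my own; the file path, the label format, and the row-length range merely follow the description above, and it times the main() defined earlier in this answer:

    import random
    import time

    # Generate a test file roughly matching the description above: many lines,
    # each with a short label followed by a uniformly random number of fields.
    def make_input(path='/tmp/input.txt', n_lines=1000000):
        with open(path, 'w') as f:
            for i in range(n_lines):
                fields = [str(random.randint(0, 99)) for _ in range(random.randint(1, 20))]
                f.write('R%d %s\n' % (i, ' '.join(fields)))

    make_input()
    start = time.time()
    main()                       # main() as defined earlier in this answer
    print('elapsed: %.3f s' % (time.time() - start))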

If efficiency is your primary concern (as suggested by your comment "Hi eumiro, I suppose this is more efficient." and by your subsequent change of the accepted answer), then I am afraid you have accepted the less efficient solution.

Don't get me wrong: eumiro's code is really concise, and I certainly learned a lot from it. If efficiency were not my primary concern, I would go with eumiro's solution too.

Ali
  • Hi Ali, your code has worked excellently. This is exactly what I wanted. Thank you very much. Much appreciated!!! – PyLabour Nov 26 '15 at 12:55
  • @user30337 Please check my updated answer. My solution consumes 16x less memory and is 1.7x faster than the currently accepted answer. – Ali Nov 26 '15 at 14:59
  • Thank you once again. I have little understanding of the speed details, but I am grateful to both of you for your kind help. – PyLabour Nov 27 '15 at 22:33

You can use itertools.izip_longest, which finds the longest line for you:

import itertools as it
import numpy as np

with open('filename.txt') as f:
    arr = np.array(list(it.izip_longest(*[line.split() for line in f], fillvalue='0'))).T

arr is now:

array([['a', '1', '2', '0'],
       ['b', '3', '4', '5'],
       ['c', '6', '0', '0']], 
      dtype='|S1')
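A side note: izip_longest exists only on Python 2; on Python 3 it was renamed to itertools.zip_longest, so an equivalent sketch for Python 3 would be:

    import itertools as it
    import numpy as np

    # Python 3: izip_longest was renamed to zip_longest.
    with open('filename.txt') as f:
        arr = np.array(list(it.zip_longest(*[line.split() for line in f], fillvalue='0'))).T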
eumiro