
I've got a line of code that is producing a pretty huge memory drain.

Here is the line:

X_train = (np.array(a).reshape(1000,100) for a in X_train)

Simply put, I'm reshaping each row of my dataset. The problem is, this creates a memory error that crashes my kernel, both locally and on AWS.

How can I rewrite that line using the xrange function, to reduce memory usage?

(Or any other way that would reduce memory usage)

Thanks!!!

Ashley O
    How did you determine that this line causes the crash? It creates a generator. – cs95 Jul 06 '17 at 21:51
  • Well when I run it with X_train[0:25000], it says Memory Error. When I run it with X_train[0:20000] or less, it works. I know for sure this causes a memory error, I'm just looking for a less memory dependent method – Ashley O Jul 06 '17 at 21:54
  • That line by itself just defines a generator expression - the code inside the parentheses won't actually get executed until you iterate over the generator, which must be happening later on in your code. – ali_m Jul 06 '17 at 21:54
  • What was the original value of `X_train` (before the line shown above)? Was it a numpy array? If so, what were its dimensions and dtype? What are you using the new value of `X_train` for (the generator expression)? – ali_m Jul 06 '17 at 21:57
  • X_train was a column of a dataframe. X_train contains numpy arrays of type float, overall type 'O'. Is that helpful? I'm afraid I'm out of my depth here, I'm just reading what generator expressions are right now. EDIT: The shape is 1000 arrays of 100 numbers each. This is all for a neural network input – Ashley O Jul 06 '17 at 21:59
  • So if I understand correctly, each element in the dataframe column is a numpy array containing 1000*100 float values. How many rows are there in the dataframe? And more importantly, what are you using the "new" value of `X_train` (the generator expression) for? – ali_m Jul 06 '17 at 22:06

1 Answer


As mentioned by @ali_m, that line by itself only creates a generator expression. None of the reshaped arrays are actually computed until the generator is iterated over. Somewhere later in your code you must be consuming the whole generator and storing the results, for example by calling list(X_train) or by appending every element of X_train to a list. That produces a collection with the same number of elements as the original X_train, with every reshaped array held in memory at once, which is what causes the memory error when the dataset is too big.

The original X_train cannot be garbage collected while the generator expression still refers to it, so by building a list of the new X_train you end up holding two huge collections in memory at the same time, which is probably why it runs out of memory.
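To make the laziness concrete, here is a minimal sketch with made-up data (50 elements, each a flat array of 1000*100 floats, roughly mirroring what you described in the comments). The generator line itself allocates almost nothing; the cost only appears when something consumes it:

    import numpy as np

    # Made-up stand-in for the dataframe column described in the comments:
    # each element is a flat array of 1000 * 100 floats.
    X_train = [np.random.rand(1000 * 100) for _ in range(50)]

    # This line allocates essentially nothing -- it only builds a generator object.
    X_train = (np.array(a).reshape(1000, 100) for a in X_train)

    # The memory cost is paid only when the generator is consumed, e.g.:
    # reshaped = list(X_train)  # materialises every (1000, 100) array at once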

In this case, xrange won't make your code more efficient, because the line is already lazy: it is a generator expression. The best thing to do is to look at how X_train is used later in your code and, wherever possible, iterate over it (for _ in X_train) rather than turning it into a list (list(X_train)).
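For example, a rough sketch of consuming the generator lazily, where X_train is the generator expression from the question and process_batch is just a hypothetical placeholder for whatever you do with each reshaped array (e.g. feeding it to the network):

    def process_batch(batch):
        # Placeholder for the real work done with each (1000, 100) array.
        return batch.mean()

    # Only one reshaped array lives in memory per iteration...
    for batch in X_train:
        process_batch(batch)

    # ...whereas list(X_train) would hold every reshaped array simultaneously.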

Tuomas Laakkonen