
I have come across this problem several times. The problem is that I cannot understand how to iterate through a pandas series in a DataFrame to access individual values.

In this particular case I am trying to find the maximum value for each row in a specific column in a pandas DataFrame, some rows of which contain lists.

df is as such:

  Date            Number
0 2000-01-01        [1.0]
1 2000-01-02        [2.2, 5, 7.8]
2 2000-01-03        [8.2]
3 2000-01-04        [4, 11.78, 24.66]

My attempt was based on the answers to this question:

Find the max of two or more columns with pandas

However, I am trying to replace the current column, and for some reason the result leaves my column with an empty list.

Desired output would be the following:

  Date            Number
0 2000-01-01        1.0
1 2000-01-02        7.8
2 2000-01-03        8.2
3 2000-01-04        24.66

That is, take the max of each row's list and replace the original value with it. Any suggestions as to how to do this?

Thanks in advance.

geds133

2 Answers

1

If the lists are stored as strings, first convert them to real lists with ast.literal_eval. Then use a list comprehension with if-else:

import ast
df.Number = df.Number.apply(ast.literal_eval)

df.Number = [max(i, default=0) if isinstance(i, list) else i for i in df.Number]

Alternative with apply:

df.Number = df.Number.apply(lambda i: max(i, default=0) if isinstance(i, list) else i)

print (df)
         Date  Number
0  2000-01-01       1
1  2000-01-02       7
2  2000-01-03       8
3  2000-01-04      24
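
As a minimal sketch of the list comprehension on the float data from the question, assuming every cell already holds a real Python list: the fifth row with an empty list is a hypothetical addition, included only to show what `default=0` does.

```python
import pandas as pd

# Sample frame mirroring the question; the last row (empty list) is hypothetical.
df = pd.DataFrame({
    "Date": ["2000-01-01", "2000-01-02", "2000-01-03", "2000-01-04", "2000-01-05"],
    "Number": [[1.0], [2.2, 5, 7.8], [8.2], [4, 11.78, 24.66], []],
})

# max() raises ValueError on an empty list unless a default is supplied.
# Non-list values (already scalars) are passed through unchanged.
df["Number"] = [max(v, default=0) if isinstance(v, list) else v for v in df["Number"]]

print(df)
```

The first four rows reduce to 1.0, 7.8, 8.2 and 24.66, matching the desired output in the question, and the empty list becomes 0.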
jezrael
  • No error occurs here, but the rows still hold multiple numbers in their lists, so it is not taking the max of each row. – geds133 Jan 21 '19 at 15:09
  • @geds133 - are the values lists and integers? If you use `import ast` and `df.Number = df.Number.apply(ast.literal_eval)` before my solution, does it work? – jezrael Jan 21 '19 at 15:10
  • The numbers in the lists in my DataFrame are floats. I should have made the example floats and have now edited it. Would this make a difference? – geds133 Jan 21 '19 at 15:11
  • @geds133 - I think there is no problem with floats. If you check `print (df.Number.apply(type))`, does it return lists and integers as expected? – jezrael Jan 21 '19 at 15:13
  • It appears the type is `` although an example of a row of `df.Number` looks as such: `[403.578, 403.578, 403.578, 586.9599999999999,...` – geds133 Jan 21 '19 at 15:16
  • ok, so use `import ast` and `df.Number = df.Number.apply(ast.literal_eval)` before my solution – jezrael Jan 21 '19 at 15:17
  • I get an error: `ValueError: max() arg is an empty sequence` – geds133 Jan 21 '19 at 15:20
  • @geds133 - It means there are some empty list(s) - what do you need in the output for them? `0` or `NaN`? – jezrael Jan 21 '19 at 15:23
  • 0 would be most appropriate although `In[38]: df.Number.isnull().sum() Out[38]: 0` – geds133 Jan 21 '19 at 15:24
  • @geds133 - so use `df.Number = [max(i, default=0) if isinstance(i, list) else i for i in df.Number]` – jezrael Jan 21 '19 at 15:29
  • Solution has worked, many thanks for your help as always @jezrael – geds133 Jan 21 '19 at 15:33
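
The steps worked out in the comments above can be sketched end to end as follows. This assumes the lists are stored as strings; the sample frame is hypothetical and only illustrates the situation. A string is not a `list`, so the `isinstance` check leaves it untouched until it has been parsed.

```python
import ast
import pandas as pd

# Hypothetical frame: each list is stored as a string, not as a Python list.
df = pd.DataFrame({
    "Date": ["2000-01-01", "2000-01-02", "2000-01-03"],
    "Number": ["[1.0]", "[2.2, 5, 7.8]", "[]"],
})

# Inspect the Python type of each cell; here every cell is a str.
print(df["Number"].apply(type).unique())

# Parse each string into a real list, then take the max.
# default=0 covers empty lists, which would otherwise raise ValueError.
parsed = df["Number"].apply(ast.literal_eval)
df["Number"] = [max(v, default=0) for v in parsed]

print(df)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for this kind of parsing.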
1

Your data is messy. I suggest you first try to ensure consistent data is fed into your dataframe, ideally as a float series. Failing this, you can use nested try / except blocks to handle the different scenarios your messy data presents:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Dat': ['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
                           '2000-01-05', '2000-01-06', '2000-01-07'],
                   'Number': ['1', ['2.2', '5.0', '7.8'], '8', ['4', '11.78', '24.66'],
                              np.nan, None, []]})

def calc_max(x):
    try:
        return float(x)
    except TypeError:
        try:
            return max(map(float, x), default=np.nan)
        except TypeError:
            return np.nan

# apply function to each value in 'Number'
df['Number'] = list(map(calc_max, df['Number']))

print(df)

          Dat  Number
0  2000-01-01    1.00
1  2000-01-02    7.80
2  2000-01-03    8.00
3  2000-01-04   24.66
4  2000-01-05     NaN
5  2000-01-06     NaN
6  2000-01-07     NaN

Why your data is messy

Check df['Number'].dtype. If your data is clean / Pandas-friendly, you'll see int or float. But here you see object. This represents a sequence of pointers to arbitrary Python objects. Some of those objects are lists, and a list is itself a sequence of pointers. Hence you have a nested list of pointers as opposed to a numeric array stored in a contiguous block of memory.
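
A minimal sketch of that dtype check, using two small hypothetical series (`clean` and `messy` are illustrative names, not from the question):

```python
import pandas as pd

# A numeric series is backed by a contiguous float64 array.
clean = pd.Series([1.0, 7.8, 8.2, 24.66])

# A series of lists can only be stored as generic Python objects.
messy = pd.Series([[1.0], [2.2, 5, 7.8], [8.2], [4, 11.78, 24.66]])

print(clean.dtype)  # float64
print(messy.dtype)  # object

# Reducing each list to a single number and casting gives a numeric series again.
reduced = messy.map(max).astype(float)
print(reduced.dtype)  # float64
```

Once the column is numeric, vectorised operations such as `reduced.max()` or `reduced.mean()` work directly, without a Python-level loop over lists.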

jpp