Creating new columns with Pandas df.apply

Question

I have a huge NetFlow database, (it contains a Timestamp, Source IP, Dest IP, Protocol, Source and Dest Port Num., Packets Exchanged, Bytes and more). I want to create custom attributes based on the current and previous rows.

I want to calculate new columns based on the source ip and timestamp of the current row. This what i want to do logically:

Get the source ip for the current row.
Get the Timestamp for the current row.
Based on the source IP, and Timestamp, I want to get all the Previous rows of the entire dataframe, that matches the source IP, and the communicaton happened in the last half an hour. This is very important.
For the rows(Flows, in my example), that matches the criteria (source ip and happened in the last half hour), I want to count the sum and mean of all the packets and all the bytes.

One row from the dataset

Snippets of relevant code:

df = pd.read_csv(path, header = None, names=['ts','td','sa','da','sp','dp','pr','flg','fwd','stos','pkt','byt','lbl'])

df['ts'] = pd.to_datetime(df['ts'])

def prev_30_ip_sum(ts,sa,size):
global joined
for (x,y) in zip(df['sa'], df['ts']):
    ...
return sum

df['prev30ipsumpkt'] = df.apply(lambda x: prev_30_ip_sum(x['ts'],x['sa'],x['pkt']), axis = 1)

I know that there's probably a better, more efficient way to do this, but I'm sadly not the best programmer.

Thanks.

`df = pd.read_csv(path, header = None, names=['te','td','sa','da','sp','dp','pr','flg','fwd','stos','pkt','byt','lbl'])` I don't see any `'ts'` in this line. But you are converting it in datetime in another row. What does te and td represent in dataframe ? — Punit Vara, Oct 07 '20 at 11:13
td is time duration, and te supposed to be time established. basically the timestamp. I edited the snippet, to be consistent. sorry. — chad, Oct 07 '20 at 11:18
thanks @chad. So let me know if I am right. You want to figure out all the rows which has same source address you are looking for and time stamp should be within last 30 min right ? — Punit Vara, Oct 07 '20 at 11:24
You say "I want to get all the Previous rows of the entire dataframe, that matches the source IP, and the communicaton happened in the last half an hour. ". Does it really need to be "the previous rows", or what you're looking for is that the timestamp of the rows you should consider should have happened within the 30 mins before the current timestamp? — tania, Oct 07 '20 at 11:25
that is exactly right @PunitVara. And I want to create the new column(new attribute for the dataset) based on those rows. So I want to calculate this for every row in the dataset. — chad, Oct 07 '20 at 11:33
@tania My dataset is ordered by timestamp, so by looking at the last 30 mins, you're basically looking for the previous rows that are in that timeframe — chad, Oct 07 '20 at 11:36
@chad please check the answer. Accept answer if it is useful — Punit Vara, Oct 07 '20 at 12:27

score 2 · Answer 1 · answered Oct 07 '20 at 11:53

Documented inline

from datetime import timedelta

def fun(df, i):
  # Current timestamp
  current = df.loc[i, 'ts']
  # timestamp of last 30 minutes
  last = current - timedelta(minutes=30)
  # Current IP
  ip = df.loc[i, 'sa']
  
  # df matching the criterian
  adf = df[(last <= df['ts']) & (current > df['ts']) & (df['sa'] == ip)]

  # Return sum and mean
  return adf['pkt'].sum(), adf['pkt'].mean()

# Apply the fun over each row
result = [fun(df, i) for i in df.index]

# Create new columns
df['sum'] = [i[0] for i in result]
df['mean'] = [i[1] for i in result]

Punit Vara · Accepted Answer · 2020-10-08T10:24:35.087

df = pd.read_csv(path, header = None, names=['ts','td','sa','da','sp','dp','pr','flg','fwd','stos','pkt','byt','lbl'])
        
df['ts'] = pd.to_datetime(df['ts'])
   
def prev_30_ip_sum(df, i):
  #current time from current row
  current = df.loc[i, 'ts']
  # timestamp of last 30 minutes 
  last = current - timedelta(minutes=30)

  # Current source address
  sa = df.loc[i, 'sa']

  # new dataframe for timestamp less than 30 min and same ip as current one
  new_df = df[(last <= df['ts']) & (current > df['ts']) & (df['sa'] == sa)]

  # Return sum and mean
  return new_df['pkt'].sum(), new_df['pkt'].mean()


# Take sa and timestamp of each row and create new dataframe
result = [prev_30_ip_sum(df, i) for i in df.index]

# Create new columns in current database.
df['sum'] = [i[0] for i in result]
df['mean'] = [i[1] for i in result]

refer this to understand timedelta

I get the following error. : current = df.loc(i, 'ts') TypeError: __call__() takes from 1 to 2 positional arguments but 3 were given — chad, Oct 08 '20 at 06:37

Creating new columns with Pandas df.apply

2 Answers2