
I'm creating a new DataFrame from scratch, but I'm not sure the way I'm doing it is the most efficient way.

I'm creating:

  • column Never with 3070 rows = 1
  • column Occasional with 1100 rows = 1
  • column Frequent with 2200 rows = 1

I'm also creating a new column Police:

  • 70 rows = 1 where column Never = 1
  • 110 rows = 1 where column Occasional = 1
  • 220 rows = 1 where column Frequent = 1

Code:

import pandas as pd

# create dataframes for each column
df1 = pd.concat([pd.DataFrame([1], columns=['NEVER']) for i in range(3070)],
          ignore_index=True)

df2 = pd.concat([pd.DataFrame([1], columns=['OCCASIONAL']) for i in range(1100)],
          ignore_index=True)

df3 = pd.concat([pd.DataFrame([1], columns=['FREQUENT']) for i in range(2200)],
          ignore_index=True)

# combine dataframes into one
frames = [df1, df2, df3]
df = pd.concat(frames)

# reset index
df = df.reset_index(drop=True)

df['POLICE'] = 0.0

# replace police column values
df.loc[0:69,'POLICE']=1.0
df.loc[3071:3180,'POLICE']=1.0
df.loc[5271:5490,'POLICE']=1.0

# convert NaN into 0
values=(0.0)
df = df.fillna(value=values)

I think I've done it, but my code takes ages to run. Is that normal because I'm creating 6,000+ rows, or is my code inefficient?

— laminado (edited by Henry Ecker)

2 Answers


You can fill the columns with ones and zeros using np.ones() and np.zeros(). Using NumPy you can obtain a substantial speedup, because each column is allocated in one step instead of concatenating thousands of one-row DataFrames.

import pandas as pd
import numpy as np

# create dataframes for each column
df1 = pd.DataFrame(np.ones(3070), columns=['NEVER'])

df2 = pd.DataFrame(np.ones(1100), columns=['OCCASIONAL'])

df3 = pd.DataFrame(np.ones(2200), columns=['FREQUENT'])

# combine dataframes into one
frames = [df1, df2, df3]
df = pd.concat(frames)

# reset index
df = df.reset_index(drop=True)

df['POLICE'] = np.zeros(6370)

# replace police column values
df.loc[0:69,'POLICE']=np.ones(70)
df.loc[3071:3180,'POLICE']=np.ones(110)
df.loc[5271:5490,'POLICE']=np.ones(220)

# convert NaN into 0
df = df.fillna(0.0)

In my machine - original code:

Process finished --- 2.513995409011841 seconds ---

Modified code:

Process finished --- 0.0069921016693115234 seconds ---
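Taking the same idea one step further (a sketch of an alternative, not the answer's exact code): allocate a single zero-filled NumPy array for all four columns up front and flip the relevant slices to 1. This also removes the need for the final concat and fillna, since no NaNs are ever created.

```python
import numpy as np
import pandas as pd

# One zero-filled array covering all four columns; flip slices to 1.
n = 3070 + 1100 + 2200                # 6370 rows
data = np.zeros((n, 4))
data[:3070, 0] = 1.0                  # NEVER block
data[3070:4170, 1] = 1.0              # OCCASIONAL block
data[4170:, 2] = 1.0                  # FREQUENT block
data[0:70, 3] = 1.0                   # POLICE within NEVER
data[3071:3181, 3] = 1.0              # POLICE within OCCASIONAL
data[5271:5491, 3] = 1.0              # POLICE within FREQUENT

df = pd.DataFrame(data, columns=['NEVER', 'OCCASIONAL', 'FREQUENT', 'POLICE'])
```

Note the NumPy slice ends are exclusive, so `3071:3181` covers the same 110 rows as the label-based (end-inclusive) `df.loc[3071:3180]` in the original code.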
— frab

I suggest an entirely different approach that is far more efficient than the original: build a 2D list of your data, then turn it into a DataFrame in one piece.

import pandas as pd

lst = []
for row in range(6370):
    lst.append([None, None, None, None])
    for col in range(4):
        if (col == 0 and row < 3070)\
                or (col == 1 and row >= 3070 and row < 4170)\
                or (col == 2 and row >= 4170)\
                or (col == 3 and row < 70)\
                or (col == 3 and row > 3070 and row <= 3180)\
                or (col == 3 and row > 5270 and row <= 5490):
            lst[row][col] = 1.0
        else:
            lst[row][col] = 0.0

df = pd.DataFrame(lst)
df.columns = ["NEVER", "OCCASIONAL", "FREQUENT", "POLICE"]
print(df)

The output is a 6370 × 4 DataFrame with the four 0/1 columns (screenshot omitted).
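As a quick sanity check (a self-contained sketch, restating the same row ranges as explicit per-column rules), the column totals should match the intended group sizes: 3070, 1100, 2200, and 70 + 110 + 220 = 400.

```python
import pandas as pd

# Rebuild the frame with explicit per-column rules and total up each column.
rows = []
for row in range(6370):
    rows.append([
        1.0 if row < 3070 else 0.0,                      # NEVER
        1.0 if 3070 <= row < 4170 else 0.0,              # OCCASIONAL
        1.0 if row >= 4170 else 0.0,                     # FREQUENT
        1.0 if (row < 70 or 3070 < row <= 3180
                or 5270 < row <= 5490) else 0.0,         # POLICE
    ])

df = pd.DataFrame(rows, columns=["NEVER", "OCCASIONAL", "FREQUENT", "POLICE"])
print(df.sum().to_dict())
# {'NEVER': 3070.0, 'OCCASIONAL': 1100.0, 'FREQUENT': 2200.0, 'POLICE': 400.0}
```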

— pakpe