
I am working on the 2019 Data Science Bowl. Reading the training and test data with pandas is taking a long time, and I want to reduce that time so the machine can run the analysis efficiently.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import seaborn as sns
import plotly as py
import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True) 

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
keep_cols = ['event_id', 'game_session', 'installation_id', 'event_count', 'event_code', 'title', 'game_time', 'type', 'world']
specs_df = pd.read_csv('/kaggle/input/data-science-bowl-2019/specs.csv')
train_df = pd.read_csv('/kaggle/input/data-science-bowl-2019/train.csv', usecols=keep_cols)
test_df = pd.read_csv('/kaggle/input/data-science-bowl-2019/test.csv')
train_labels_df = pd.read_csv('/kaggle/input/data-science-bowl-2019/train_labels.csv')
– Chitreshkr

1 Answer

Pandas' read_csv method has a chunksize argument that yields a certain number of rows at a time as an iterator. This is useful for very large data sets, where you can work on a smaller subset of the data iteratively.

More information on iterating through files is described in the documentation here.
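A minimal sketch of what chunked reading could look like for this data set, reusing the path and keep_cols from the question; the chunk size of 1,000,000 rows is an arbitrary choice, not something from the original post:

import pandas as pd

keep_cols = ['event_id', 'game_session', 'installation_id', 'event_count',
             'event_code', 'title', 'game_time', 'type', 'world']

# Read train.csv in chunks of 1,000,000 rows instead of loading it all at once.
reader = pd.read_csv('/kaggle/input/data-science-bowl-2019/train.csv',
                     usecols=keep_cols, chunksize=1_000_000)

chunks = []
for chunk in reader:
    # Filter or aggregate each chunk here to keep memory usage down,
    # then collect whatever remains.
    chunks.append(chunk)

train_df = pd.concat(chunks, ignore_index=True)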

– hoffee
  • In addition, if you specify your dtypes and datetime formats beforehand, this greatly speeds up reading in dataframes. Have a read of this very useful article: https://realpython.com/fast-flexible-pandas/ – Umar.H Nov 13 '19 at 22:07
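A rough sketch of the commenter's suggestion. The dtype mapping below is an assumption (plausible types for the kept columns) and should be verified against the actual data; the timestamp column is not in keep_cols here, so no date parsing is shown:

# Hypothetical dtype mapping for the kept columns; verify against the data
# before relying on it.
dtypes = {
    'event_id': 'object',
    'game_session': 'object',
    'installation_id': 'object',
    'event_count': 'int32',
    'event_code': 'int16',
    'title': 'category',
    'game_time': 'int32',
    'type': 'category',
    'world': 'category',
}

# Passing dtype avoids pandas inferring types column by column,
# which can noticeably speed up the read and reduce memory usage.
train_df = pd.read_csv('/kaggle/input/data-science-bowl-2019/train.csv',
                       usecols=keep_cols, dtype=dtypes)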