
I'm using Python 3.5 and pandas 0.19.2.

Using pandas.read_table, is there a way to filter when reading data?

In my example below, I read in my initial data frame and then subset the rows I want based on a condition. Is there a better way to do this, or any way to dramatically speed up the example below? I couldn't see anything in the pandas.read_table docs (link) that showed how to do this.

Currently it takes around 3 minutes.

```python
import pandas as pd
from datetime import datetime

start_time = datetime.now()
# reading table
df = pd.read_table('https://download.bls.gov/pub/time.series/ce/ce.data.0.AllCESSeries', sep='\t', header=0)
# subsetting
df = df[df['series_id'].str.contains("CEU0000000001")]

end_time = datetime.now()
run_time = end_time - start_time

print(run_time)
```
patrickjlong1
  • Possible duplicate of [pandas: filter lines on load in read\_csv](https://stackoverflow.com/questions/13651117/pandas-filter-lines-on-load-in-read-csv) – erasmortg Jul 29 '17 at 14:26
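The linked duplicate's approach is to filter while reading by iterating over the file in chunks, so the full table is never held in memory at once. A minimal sketch of that idea (using a small in-memory stand-in for the tab-separated BLS file; the column names follow the question):

```python
import io
import pandas as pd

# Toy stand-in for the tab-separated BLS file.
raw = io.StringIO(
    "series_id\tyear\tvalue\n"
    "CEU0000000001\t2015\t140743.0\n"
    "CEU0500000002\t2015\t119146.0\n"
    "CEU0000000001\t2016\t143307.0\n"
)

# Read in chunks and keep only the matching rows from each chunk.
chunks = pd.read_table(raw, chunksize=2)
df = pd.concat(c[c['series_id'].str.contains('CEU0000000001')]
               for c in chunks)
print(len(df))  # 2
```

This still parses every row once, so it mainly saves memory rather than time, but it avoids materializing the whole table before filtering.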

1 Answer


Consider using an alternative storage format if you want to speed up reading from disk significantly.

I'd consider using the HDF5 or Feather format.

PS: an HDF store lets us index the data and query by that index, so we read from disk only the data we need; there is no need to read everything into memory and then filter it there.
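A minimal sketch of that workflow (this requires the PyTables package, and uses a toy stand-in for the BLS table with the question's column names; the file path and key are placeholders):

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the BLS table.
df = pd.DataFrame({
    'series_id': ['CEU0000000001', 'CEU0500000002', 'CEU0000000001'],
    'year':      [2015, 2015, 2016],
    'value':     [140743.0, 119146.0, 143307.0],
})

path = os.path.join(tempfile.mkdtemp(), 'ces.h5')

# One-time conversion: 'table' format plus data_columns makes
# series_id queryable on disk.
df.to_hdf(path, key='ces', format='table', data_columns=['series_id'])

# Later reads fetch only the matching rows from disk.
subset = pd.read_hdf(path, 'ces', where="series_id == 'CEU0000000001'")
print(len(subset))  # 2
```

The one-time conversion pays for itself on repeated reads, since each subsequent query touches only the rows it selects.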

MaxU - stand with Ukraine