
I'm using Python 3.5 and pandas 0.19.2.

Using pandas.read_table, is there a way to filter when reading data?

In my example below, I read in my initial data frame and then subset the rows I want based on a condition. Is there a better way to do this, or any way to dramatically speed up the example below? I couldn't see anything in the pandas.read_table docs (link) that showed how to do this.

Currently it takes around 3 minutes.

```python
import pandas as pd
from datetime import datetime

start_time = datetime.now()
# reading table
df = pd.read_table('https://download.bls.gov/pub/time.series/ce/ce.data.0.AllCESSeries', sep='\t', header=0)
# subsetting
df = df[df['series_id'].str.contains("CEU0000000001")]

end_time = datetime.now()
run_time = end_time - start_time

print(run_time)
```
patrickjlong1
  • Possible duplicate of [pandas: filter lines on load in read\_csv](https://stackoverflow.com/questions/13651117/pandas-filter-lines-on-load-in-read-csv) – erasmortg Jul 29 '17 at 14:26
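The linked duplicate's approach is to filter while reading by iterating over the file in chunks, so the full table is never held in memory at once. A minimal sketch of that idea (using a small in-memory stand-in for the tab-separated BLS file; the column names follow the question):

```python
import io
import pandas as pd

# Toy stand-in for the tab-separated BLS file.
raw = io.StringIO(
    "series_id\tyear\tvalue\n"
    "CEU0000000001\t2015\t140743.0\n"
    "CEU0500000002\t2015\t119146.0\n"
    "CEU0000000001\t2016\t143307.0\n"
)

# Read in chunks and keep only the matching rows from each chunk.
chunks = pd.read_table(raw, chunksize=2)
df = pd.concat(c[c['series_id'].str.contains('CEU0000000001')]
               for c in chunks)
print(len(df))  # 2
```

This still parses every row once, so it mainly saves memory rather than time, but it avoids materializing the whole table before filtering.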

1 Answer


Consider using an alternative storage format if you want to speed up reading from disk significantly.

I'd consider using the HDF5 or Feather format.

PS: an HDF store lets us index the data and query by that index, so we read from disk only the data we need; there is no need to read everything into memory and then filter it there.
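A minimal sketch of that workflow (this requires the PyTables package, and uses a toy stand-in for the BLS table with the question's column names; the file path and key are placeholders):

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the BLS table.
df = pd.DataFrame({
    'series_id': ['CEU0000000001', 'CEU0500000002', 'CEU0000000001'],
    'year':      [2015, 2015, 2016],
    'value':     [140743.0, 119146.0, 143307.0],
})

path = os.path.join(tempfile.mkdtemp(), 'ces.h5')

# One-time conversion: 'table' format plus data_columns makes
# series_id queryable on disk.
df.to_hdf(path, key='ces', format='table', data_columns=['series_id'])

# Later reads fetch only the matching rows from disk.
subset = pd.read_hdf(path, 'ces', where="series_id == 'CEU0000000001'")
print(len(subset))  # 2
```

The one-time conversion pays for itself on repeated reads, since each subsequent query touches only the rows it selects.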

MaxU - stand with Ukraine