
Is there a way to filter data before, or while, reading it into a data frame?

For example, I have the following csv datafile:

 time       Event    price     Volume
00:00:00.000, B,    920.5,    57
00:00:00.000, A,    920.75,   128
00:00:00.898, T,    920.75,   1
00:00:00.898, T,    920.75,   19
00:00:00.906, B,    920.5,    60
00:00:41.284, T,    920.75,   5
00:00:57.589, B,    920.5,    53
00:01:06.745, T,    920.75,   3
00:01:06.762, T,    920.75,   2

I would like to read only the rows where 'Event' == 'T' and 'Volume' >= 100. It is easy to read the entire dataset in and then filter it (which is what I am doing right now).

Each file I have is about 10 MB, and there are thousands of them (about 15 GB of data in total), so this procedure takes forever. I am wondering if there is a way to filter the data while reading it in, or some other method to speed things up a little. Maybe use a database instead?
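One workaround (a sketch, using pandas' `chunksize` option; the column names and in-memory sample data are assumptions standing in for your real files) is to read each file in chunks and keep only the matching rows, so the full unfiltered data never sits in memory at once:

```python
import io
import pandas as pd

# Hypothetical sample in the same shape as the question's CSV files
csv_data = """time,Event,price,Volume
00:00:00.000,B,920.5,57
00:00:00.898,T,920.75,150
00:00:41.284,T,920.75,5
00:01:06.745,T,920.75,300
"""

# Read in chunks and filter each chunk before concatenating;
# only rows with Event == 'T' and Volume >= 100 are retained.
chunks = pd.read_csv(io.StringIO(csv_data), chunksize=2)
filtered = pd.concat(chunk[(chunk["Event"] == "T") & (chunk["Volume"] >= 100)]
                     for chunk in chunks)

print(len(filtered))  # 2 matching rows
```

This does not avoid parsing every row, but it bounds memory use and lets you loop over thousands of files while accumulating only the filtered subset.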

Brian Tompsett - 汤莱恩
user6396
    possible duplicate here? http://stackoverflow.com/questions/13651117/pandas-filter-lines-on-load-in-read-csv – JohnE Jun 26 '15 at 01:31

2 Answers


I don't believe there is a way to filter rows while reading a csv file.

Try using HDFStore. It offers much better read and write performance. You could read all your data from CSV once, save it to H5 files, and then use those H5 files as your database. Some comparison results are on this page:

http://pandas.pydata.org/pandas-docs/dev/io.html
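As a sketch of the conversion step (the file name `ticks.h5`, key `ticks`, and sample data are assumptions; this requires the PyTables package), writing in `table` format with `data_columns` lets you filter rows on disk with a `where` clause, so only matching rows are ever loaded:

```python
import pandas as pd

# Hypothetical sample in the question's shape
df = pd.DataFrame({
    "Event": ["B", "T", "T"],
    "price": [920.5, 920.75, 920.75],
    "Volume": [57, 150, 5],
})

# format='table' enables on-disk querying; data_columns makes
# Event and Volume usable inside 'where' expressions.
df.to_hdf("ticks.h5", key="ticks", format="table",
          data_columns=["Event", "Volume"])

# Only rows satisfying the condition are read from disk.
result = pd.read_hdf("ticks.h5", "ticks",
                     where="Event == 'T' & Volume >= 100")
```

After converting all the CSVs once, repeated queries hit the H5 files directly instead of re-parsing 15 GB of text.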

I am copying the results here for comparison.

Write performance:

In [15]: %timeit test_hdf_fixed_write(df)
1 loops, best of 3: 237 ms per loop

In [26]: %timeit test_hdf_fixed_write_compress(df)
1 loops, best of 3: 245 ms per loop

In [16]: %timeit test_hdf_table_write(df)
1 loops, best of 3: 901 ms per loop

In [27]: %timeit test_hdf_table_write_compress(df)
1 loops, best of 3: 952 ms per loop

In [17]: %timeit test_csv_write(df)
1 loops, best of 3: 3.44 s per loop

Read performance:

In [19]: %timeit test_hdf_fixed_read()
10 loops, best of 3: 19.1 ms per loop

In [28]: %timeit test_hdf_fixed_read_compress()
10 loops, best of 3: 36.3 ms per loop

In [20]: %timeit test_hdf_table_read()
10 loops, best of 3: 39 ms per loop

In [29]: %timeit test_hdf_table_read_compress()
10 loops, best of 3: 60.6 ms per loop

In [22]: %timeit test_csv_read()
1 loops, best of 3: 620 ms per loop
nitin

One option is to filter the csv files before loading them, using tools like awk.

A useful awk example for filtering on a numerical range is here.

A useful awk example for filtering on string values is here.
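As a sketch of how this can combine with pandas (assuming awk is available on the system; the file name `ticks.csv` and sample contents are hypothetical), awk does the row filtering and pandas only parses what survives:

```python
import io
import subprocess
import pandas as pd

# Hypothetical sample file in the question's format
with open("ticks.csv", "w") as f:
    f.write("time,Event,price,Volume\n"
            "00:00:00.000,B,920.5,57\n"
            "00:00:00.898,T,920.75,150\n"
            "00:00:41.284,T,920.75,5\n")

# awk keeps the header line (NR == 1) plus data rows whose second
# field is 'T' and whose fourth field is at least 100.
awk = subprocess.run(
    ["awk", "-F,", 'NR == 1 || ($2 == "T" && $4 + 0 >= 100)', "ticks.csv"],
    capture_output=True, text=True, check=True)

# pandas parses only the pre-filtered rows.
df = pd.read_csv(io.StringIO(awk.stdout))
```

Since awk streams the files line by line, this keeps memory flat even across thousands of 10 MB files.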

Jariani