
I am running into issues with how I am using the append operator in Python 3.x. In my code, I am trying to remove data points that have a y value of 0. My data looks like this:

x          y
400.01  0.000e0
420.02  0.000e0
450.03  10.000e0
48.04   2.000e0
520.05  0.000e0
570.06  0.000e0
570.23  5.000e0
600.24  0.000e0
620.25  3.600e-1
700.26  8.400e-1
900.31  2.450e0

I want to extract the data that fall within a certain x range. For instance, I would like to get the x and y values where x is greater than 520 but less than 1000.

The desired output would look like this:

  x        y
520.05  0.000e0
570.06  0.000e0
570.23  5.000e0
600.24  0.000e0
620.25  3.600e-1
700.26  8.400e-1
900.31  2.450e0

The code I have so far is below:

import numpy as np
import os

myfiles = os.listdir('input')

for file in myfiles:
    with open('input/'+file, 'r') as f:
        data = np.loadtxt(f,delimiter='\t') 


        for row in data: ## remove data points where y is zero
            data_filtered_both = data[data[:,1] != 0.000]
            x_array=(data_filtered_both[:,0])
            y_array=(data_filtered_both[:,1])
            y_norm=(y_array/np.max(y_array))
            x_and_y= np.array([list (i) for i in zip(x_array,y_array)])

    precursor_x=[]
    precursor_y=[]
    for precursor in row: ## get data points where x is 
        precursor = x_and_y[:, np.abs(x_and_y[0,:]) > 520 and np.abs(x_and_y[0,:]) <1000]
        precursor_x=np.array(precursor[0])
        precursor_y=np.array(precursor[1])   

I get an error message that says:

  File "<ipython-input-45-0506fab0ad9a>", line 4, in <module>
    precursor = x_and_y[:, np.abs(x_and_y[0,:]) > 2260 and np.abs(x_and_y[0,:]) <2290]

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

How should I go about this? Is there a recommended operator I could use?

P.S. I realize a pandas DataFrame is quite useful for dealing with a dataset like this. I am not very familiar with pandas, but I am open to using it if necessary, so I will add pandas as a tag as well.

user7852656

4 Answers


You can use between with boolean indexing:

df = df[df['x'].between(520,1000)]
print (df)
         x     y
4   520.05  0.00
5   570.06  0.00
6   570.23  5.00
7   600.24  0.00
8   620.25  0.36
9   700.26  0.84
10  900.31  2.45

...and to also remove rows where y is 0:

df = df[df['x'].between(520,1000) & (df['y'] != 0)]
print (df)
         x     y
6   570.23  5.00
8   620.25  0.36
9   700.26  0.84
10  900.31  2.45
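Since the question loads the data with np.loadtxt rather than pd.read_csv, here is a minimal sketch of going from that NumPy array to a DataFrame first (the sample rows here are just a subset of the question's data):

```python
import numpy as np
import pandas as pd

# `data` stands in for the array returned by np.loadtxt in the question
data = np.array([[400.01, 0.0],
                 [570.23, 5.0],
                 [620.25, 0.36]])

df = pd.DataFrame(data, columns=['x', 'y'])
df = df[df['x'].between(520, 1000) & (df['y'] != 0)]
print(df)
```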

Or use query, as 2Obe commented:

df = df.query("x>500 & x<1000")
print (df)
         x     y
4   520.05  0.00
5   570.06  0.00
6   570.23  5.00
7   600.24  0.00
8   620.25  0.36
9   700.26  0.84
10  900.31  2.45

If you also need to filter out 0 in column y:

df = df.query("x>500 & x<1000 & y != 0")
print (df)
         x     y
6   570.23  5.00
8   620.25  0.36
9   700.26  0.84
10  900.31  2.45
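If the bounds live in Python variables rather than literals, query can reference them with the @ prefix (a small sketch using made-up sample data, not the question's full file):

```python
import pandas as pd

df = pd.DataFrame({'x': [400.01, 570.23, 620.25],
                   'y': [0.0, 5.0, 0.36]})

lo, hi = 520, 1000
out = df.query("x > @lo & x < @hi & y != 0")
print(out)
```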
jezrael

I agree with the previous answers that use Pandas, since it is much simpler. If you would like to go without it, I would suggest breaking the logic into two parts:

row = np.array([[1, 0], [2, 0], [3, 7], [4, 8], [5, 9]])
print(row)

array([[1, 0], [2, 0], [3, 7], [4, 8], [5, 9]])

x_and_y = []
for x, y in row: ## remove data points where y is zero
    if y > 0:
        x_and_y.append((x, y))
print(x_and_y)

[(3, 7), (4, 8), (5, 9)]

precursor_x = []
precursor_y = []
for x, y in x_and_y: ## get data points where x is
    if x > 3 and x < 9:
        precursor_x.append(x)
        precursor_y.append(y)
print(precursor_x, precursor_y)

[4, 5] [8, 9]

This gets you all of the X into precursor_x and all of the Y into precursor_y. You could then zip them if you wish:

np.array(list(zip(precursor_x, precursor_y)))

array([[4, 8], [5, 9]])
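If you do want to stay fully vectorized in NumPy, the same two filters can be combined with the elementwise & operator; using the Python `and` keyword on arrays is exactly what raised the ValueError in the question:

```python
import numpy as np

row = np.array([[1, 0], [2, 0], [3, 7], [4, 8], [5, 9]])

# boolean mask: y != 0 AND 3 < x < 9
# note the & operator and the parentheses around each comparison
mask = (row[:, 1] != 0) & (row[:, 0] > 3) & (row[:, 0] < 9)
print(row[mask])  # [[4 8]
                  #  [5 9]]
```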

Carlos Muñiz
df[(df['x'] > 520) & (df['x'] < 1000) & (df['y'] != 0)]

Update:

I remember reading somewhere that query() is faster for huge DataFrames, but my simple benchmarks show that df[(df['x'] > 520) & (df['x'] < 1000)] is always faster than query().

df1 = pd.DataFrame({"X":np.random.randint(100,1300,10000),"Y":np.random.randint(0,200,10000)})
df2 = pd.DataFrame({"X":np.random.randint(100,1300,1000000),"Y":np.random.randint(0,200,1000000)})
df3 = pd.DataFrame({"X":np.random.randint(100,1300,100000000),"Y":np.random.randint(0,200,100000000)})

Small DataFrame:

%timeit df1[(df1['X'] > 520) & (df1['X'] < 1000)]
%timeit df1.query('X > 520 & X < 1000')
%timeit df1[df1['X'].between(520,1000)]
#2.46 ms ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#5.05 ms ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#2.45 ms ± 27.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Medium DataFrame:

%timeit df2[(df2['X'] > 520) & (df2['X'] < 1000)]
%timeit df2.query('X > 520 & X < 1000')
%timeit df2[df2['X'].between(520,1000)]
#31.2 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#42.8 ms ± 799 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#32.3 ms ± 178 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Large DataFrame:

%timeit df3[(df3['X'] > 520) & (df3['X'] < 1000)]
%timeit df3.query('X > 520 & X < 1000')
%timeit df3[df3['X'].between(520,1000)]
#4.04 s ± 23.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#6.37 s ± 56.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#3.68 s ± 38.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Interestingly, df3[df3['X'].between(520,1000)] is the fastest for the biggest df. The relative difference between the first and third options shrinks as the df gets bigger, so maybe at some point (or in some other situations) query() performs better.

Alz
import numpy as np
import pandas as pd

df = pd.DataFrame({"X": np.random.randint(100, 1000, 10), "Y": np.random.randn(10)})

df2 = df.query("X>5 & X<1000")

print(df2)

     X         Y
0  188 -0.923096
1  953  1.327985
2  190 -0.970169
3  975  0.819512
4  900 -0.782465
5  180  0.357470
6  874  1.746500
7  369  0.078113
8  287  1.642208
9  739  2.238841
2Obe