
I am trying to get pandas to select the range of rows under the "ClosePrice" row from the CSV structured as below and store them in DataFrames. The file contains many identifiers, but I only want to process the blocks whose identifiers are in the list below. The number of rows per block is also not always the same.

list = ['ABC0123', 'DEF0123']

>  Column 1  Column 2   Column 3    Column 4   Column 5   Column 6   Column 7
>   "Date"   20170101 "Identifier"   ABC0123
> "OpenPrice"   500     "Currency"      USD
> "ClosePrice"  550       "foo"         bar
>     foo       foo        foo          foo       foo       foo        foo          
>     foo       foo        foo          foo       foo       foo        foo      
>     foo       foo        foo          foo       foo       foo        foo
>   "Date"   20170101 "Identifier"   SOMEOTHER
>     ...
>     ...
>     ...
>   "Date"   20170101 "Identifier"   DEF0123
> "OpenPrice"  600     "Currency"      USD
> "ClosePrice" 650       "foo"         bar
>    foo       foo        foo          foo       foo       foo        foo          
>    foo       foo        foo          foo       foo       foo        foo      
>    foo       foo        foo          foo       foo       foo        foo    
>    foo       foo        foo          foo       foo       foo        foo          
>    foo       foo        foo          foo       foo       foo        foo      
>    foo       foo        foo          foo       foo       foo        foo    
>    foo       foo        foo          foo       foo       foo        foo          
>    foo       foo        foo          foo       foo       foo        foo      
>    foo       foo        foo          foo       foo       foo        foo

I am getting the first row of each table I am interested in with a for-i-loop and:

df.iloc[df[df['Column 4'].isin(list)].index + 3,:]

which jumps to the top-left cell with a "foo" value and selects that whole row, but I am trying to figure out how to select the rows below that starting point and stop before the next

"Date"   20170101 "Identifier"   SOMEOTHER

One approach I was thinking of was to check the length of the cell value in Column 5 under the last row, which would be 0, but I have not been able to reproduce this logic in a script. Other approaches are more than welcome.
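That stopping logic can also be expressed by locating the next "Date" row instead of probing empty cells. A minimal sketch, assuming the file is already loaded into `df` with the column names from the question (the sample data here is hypothetical, mirroring the layout above):

```python
import pandas as pd

# Hypothetical data mirroring the CSV layout above
df = pd.DataFrame({
    'Column 1': ['Date', 'OpenPrice', 'ClosePrice', 'foo', 'foo', 'Date'],
    'Column 4': ['ABC0123', 'USD', 'bar', 'foo', 'foo', 'SOMEOTHER'],
})

# Positions where a new block starts ("Date" in Column 1)
starts = df.index[df['Column 1'] == 'Date'].tolist()
starts.append(len(df))  # sentinel so the last block has an end

# Slice each block: from its "Date" row up to (not including) the next one
blocks = [df.iloc[s:e] for s, e in zip(starts[:-1], starts[1:])]
```

Each element of `blocks` is one identifier's table, regardless of how many rows it has.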

bloo

1 Answer


First, don't use `list` as a variable name, because it shadows the built-in function.

Create a helper column g that distinguishes the groups with unique numbers via cumsum. Then get all groups that contain values from L, and select all their rows with another isin:

L = ['ABC0123', 'DEF0123']
df['g'] = df['Column 1'].eq('Date').cumsum()
vals = df.loc[df['Column 4'].isin(L), 'g']
df = df[df['g'].isin(vals)]
print (df)
      Column 1  Column 2    Column 3 Column 4 Column 5 Column 6 Column 7  g
0         Date  20170101  Identifier  ABC0123      NaN      NaN      NaN  1
1    OpenPrice       500    Currency      USD      NaN      NaN      NaN  1
2   ClosePrice       550         foo      bar      NaN      NaN      NaN  1
3          foo       foo         foo      foo      foo      foo      foo  1
4          foo       foo         foo      foo      foo      foo      foo  1
5          foo       foo         foo      foo      foo      foo      foo  1
9         Date  20170101  Identifier  DEF0123      NaN      NaN      NaN  3
10   OpenPrice       600    Currency      USD      NaN      NaN      NaN  3
11  ClosePrice       650         foo      bar      NaN      NaN      NaN  3
12         foo       foo         foo      foo      foo      foo      foo  3
13         foo       foo         foo      foo      foo      foo      foo  3

Last, if necessary, remove the g column:

df = df.drop('g', axis=1)

A similar solution working with the index:

L = ['ABC0123', 'DEF0123']
df.index = df['Column 1'].eq('Date').cumsum()
vals = df.index[df['Column 4'].isin(L)]
df = df.loc[vals].reset_index(drop=True)
print (df)
      Column 1  Column 2    Column 3 Column 4 Column 5 Column 6 Column 7
0         Date  20170101  Identifier  ABC0123      NaN      NaN      NaN
1    OpenPrice       500    Currency      USD      NaN      NaN      NaN
2   ClosePrice       550         foo      bar      NaN      NaN      NaN
3          foo       foo         foo      foo      foo      foo      foo
4          foo       foo         foo      foo      foo      foo      foo
5          foo       foo         foo      foo      foo      foo      foo
6         Date  20170101  Identifier  DEF0123      NaN      NaN      NaN
7    OpenPrice       600    Currency      USD      NaN      NaN      NaN
8   ClosePrice       650         foo      bar      NaN      NaN      NaN
9          foo       foo         foo      foo      foo      foo      foo
10         foo       foo         foo      foo      foo      foo      foo

EDIT:

L1 = ['Date','OpenPrice','ClosePrice']
L = ['ABC0123', 'DEF0123']

#if necessary filter rows by L1 
df = df[df['Column 1'].isin(L1)]
df['g'] = df['Column 1'].eq('Date').cumsum()
vals = df.loc[df['Column 4'].isin(L), 'g']
df = df[df['g'].isin(vals)]
print (df)
      Column 1  Column 2    Column 3 Column 4 Column 5 Column 6 Column 7  g
0         Date  20170101  Identifier  ABC0123      NaN      NaN      NaN  1
1    OpenPrice       500    Currency      USD      NaN      NaN      NaN  1
2   ClosePrice       550         foo      bar      NaN      NaN      NaN  1
9         Date  20170101  Identifier  DEF0123      NaN      NaN      NaN  3
10   OpenPrice       600    Currency      USD      NaN      NaN      NaN  3
11  ClosePrice       650         foo      bar      NaN      NaN      NaN  3

To work on the groups individually, it is possible to use groupby with a flexible apply:

def f(x):
    print (x)
    #some another code
    return x

df1 = df.groupby('g').apply(f)
print (df1)
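If a separate DataFrame per identifier is needed (as discussed in the comments), a dictionary of DataFrames can be built from those same groups. A sketch, assuming the helper column g from above has already been assigned (the sample frame here is hypothetical):

```python
import pandas as pd

# Hypothetical frame with the helper column 'g' already assigned
df = pd.DataFrame({
    'Column 1': ['Date', 'ClosePrice', 'foo', 'Date', 'ClosePrice', 'foo'],
    'Column 4': ['ABC0123', 'bar', 'foo', 'DEF0123', 'bar', 'foo'],
    'g': [1, 1, 1, 2, 2, 2],
})

# Map each group's identifier (taken from its "Date" row) to its rows
dfs = {
    grp.loc[grp['Column 1'] == 'Date', 'Column 4'].iat[0]: grp
    for _, grp in df.groupby('g')
}
# dfs['ABC0123'] now holds all rows of the first block
```

This keeps each identifier's rows addressable by name instead of by a running counter.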

EDIT:

Final code working with the real data:

L1 = ["Date", "OpenPrice", "ClosePrice"]
L = ['ABC0123', 'DEF0123']
g = 1
for i in L:
    df['g'] = df['Column 4'].isin(L).cumsum()
    vals = df.loc[df['Column 4'].isin(L), 'g']
    df = df[df['g'].isin(vals)]
    dfFinal = df.loc[(df['g'] == g) & ~df['Column 1'].isin(L1)]
    g = g + 1
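The same per-identifier extraction can also be written without the manual counter, by grouping on the helper column and dropping the header rows in each group. A sketch under the question's assumed column layout (the sample frame is hypothetical):

```python
import pandas as pd

L1 = ['Date', 'OpenPrice', 'ClosePrice']
L = ['ABC0123', 'DEF0123']

# Hypothetical frame shaped like the question's data
df = pd.DataFrame({
    'Column 1': ['Date', 'OpenPrice', 'ClosePrice', 'foo', 'foo',
                 'Date', 'OpenPrice', 'ClosePrice', 'foo'],
    'Column 4': ['ABC0123', 'USD', 'bar', 'foo', 'foo',
                 'DEF0123', 'USD', 'bar', 'foo'],
})

# Number the blocks by their identifier rows, keep only wanted blocks
df['g'] = df['Column 4'].isin(L).cumsum()
wanted = df.loc[df['Column 4'].isin(L), 'g']
df = df[df['g'].isin(wanted)]

# One DataFrame of "foo" rows per block
foo_frames = [grp[~grp['Column 1'].isin(L1)]
              for _, grp in df.groupby('g')]
```

Each element of `foo_frames` is exactly the data rows of one identifier, however many there are.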
jezrael
  • I am still trying to implement your suggestions on my dataset, but I wanted to ask you why, in your printed output, the rows with Date, OpenPrice and ClosePrice still come up. My goal is to get one dataframe (with the help of a loop) per range of rows, e.g. one dataframe for the first identifier with only those 3 "foo" rows, then another dataframe with however many "foo" rows it has, and so on and so forth. – bloo Jun 18 '17 at 18:55
  • do you need a dictionary of dataframes? – jezrael Jun 18 '17 at 19:01
  • no, just a dataframe, since what I am trying to do with my script is to check for the identifiers, go through a loop for the first identifier, pull the respective "foo" rows into a dataframe, do some more manipulation with that dataframe, and then do the same with the next identifier and its respective "foo" rows. – bloo Jun 18 '17 at 19:13
  • Yes, but it is possible with a dictionary of dataframes. I am working on a solution, give me some time. – jezrael Jun 18 '17 at 19:14
  • by all means, take your time, I really appreciate your help on this! – bloo Jun 18 '17 at 19:17
  • I added new code to the answer. Now I am going to bed, so if you get some problem let me know and tomorrow I will try to find a solution. Good luck! – jezrael Jun 18 '17 at 19:30
  • Thank you again for your effort but I think what your suggestion does now, is the exact opposite of the goal. I am trying to only get the "foo" rows not the rest. – bloo Jun 18 '17 at 19:47
  • 1
    Then invert the mask with ~: instead of `df = df[df['Column 1'].isin(L1)]` use `df = df[~df['Column 1'].isin(L1)]` – jezrael Jun 18 '17 at 19:51
  • I am sorry, but the above solutions are not achieving what I am trying to do. Essentially, what I am trying to implement is a logic where the script goes and picks up only the "foo" rows for any identifier. So we need the script to know where to start and where to stop for the 1st identifier, and the for-i-loop will take care of the rows for the rest of the identifiers. However, all of the above suggestions, including the edit, only remove the unnecessary rows; they do not specify how many rows should be picked up for each identifier. – bloo Jun 18 '17 at 20:22
  • Sorry, then it is possible I don't understand what you need exactly. Is it possible to communicate via email? The best would be if you send me some data sample with the desired output; I still don't understand the logic for selecting rows. My solution works for groups which are separated by the text Date in Column 1. Is that wrong? Or do you need to find rows by L and then select e.g. 5 rows after the matching rows? Now I am only on my phone, so tomorrow I will try to find a solution. Thank you. – jezrael Jun 18 '17 at 20:34
  • after a lot of testing and experimenting it seems that you had the right idea but the execution was off by a line of code. One of the lines was misplaced, and it was filtering out data before it could actually apply the numbers in the helper column g. `L1 = ["Date", "OpenPrice", "ClosePrice"]` `g = 1` `for i in L:` `df['g'] = df['Column 4'].isin(L).cumsum()` `vals = df.loc[df['Column 4'].isin(L), 'g']` `df = df[df['g'].isin(vals)]` `dfFinal = df.loc[(df['g'] == g) & ~df['Column 1'].isin(L1)]` `g = g + 1` – bloo Jun 19 '17 at 00:29
  • which essentially means that: step 1: set a variable equal to 1; step 2: create a list of values that are going to be used for cleanup later; step 3: give the helper column values, starting from 1, based on the identifiers; step 4: a boolean loc that picks up the rows that correspond to the first identifier and cleans up the dataframe; step 5: increment g, so that when it loops back it picks up the rows for the second identifier. Please update your answer with my above comments; I will accept your reply since your approach makes sense and works — there was just some mix-up in the order. Thanks! – bloo Jun 19 '17 at 00:29