
I am trying to find all pairs of rows in a DataFrame where one row's id_proof matches the second part of another row's adr_proof (an adr_proof that starts with a fixed word, say PARENT) and the first row's id_value also equals the second row's adr_value. Both rows belong to the same DataFrame.

For example, given this DataFrame:

import pandas as pd

main = {'account_number' : [1,2,3,4,5,6,7,8,9,10,11,12],
    'id_proof' : ['A','B','B','A','C','C','X','Y','X','Y','Y','X'],
    'id_value' : [101,201,301,401,501,601,111,222,333,444,555,666],
    'adr_proof' : ['Z','E','E','G','G','I','PARENT A','PARENT B','PARENT   B','PARENT C','PARENT C','PARENT A'],
    'adr_value' : [11,22,33,44,55,66,101,201,301,501,601,401]}
main = pd.DataFrame(main)

I am trying to produce:

node1    node2    relation
  1        7      parent-child
  2        8      parent-child
  3        9      parent-child
  4       12      parent-child
  5       10      parent-child
  6       11      parent-child

Below is my code. I am aware that it is incomplete. I am stuck on the split() call: I am new to Python and pandas, and I am not sure how to invoke pandas' split() function rather than Python's built-in str.split(). I have gone through this question

import pandas as pd

main = {'account_number' : [1,2,3,4,5,6,7,8,9,10,11,12],
    'id_proof' : ['A','B','B','A','C','C','X','Y','X','Y','Y','X'],
    'id_value' : [101,201,301,401,501,601,111,222,333,444,555,666],
    'adr_proof' : ['Z','E','E','G','G','I','PARENT A','PARENT B','PARENT B','PARENT C','PARENT C','PARENT A'],
    'adr_value' : [11,22,33,44,55,66,101,201,301,501,601,401]}
main = pd.DataFrame(main)

df_group_count = pd.DataFrame({'count' : main.groupby(['adr_proof']).size()}).reset_index()
adr_type = df_group_count['adr_proof']
adr_type_parent = adr_type.loc[adr_type.str.startswith('PARENT',na=False)]

df_j_ = pd.DataFrame()
for j in adr_type_parent:
    dfn_j = main.loc[(main['adr_proof'] == j)]
    adr_type_parent_type = j.split(' ',expand=True,n=1)
    res = main.loc[(main['id_proof'] == adr_type_parent_type[1]) & (main['id_value'] == dfn_j['adr_value'])]

res

Please suggest a way to achieve my goal. The output should be another DataFrame. Please excuse any bad code or violations. A completely different approach is also welcome. Thank you.

koushal
  • Try using `re.split(' +',j,maxsplit=1)` or a similar fine-tuned version instead, from the `re` module. – Andras Deak -- Слава Україні Mar 31 '17 at 11:31
  • I would like to use pandas.Series.str.split() so that the result can be a DataFrame. – koushal Mar 31 '17 at 11:48
  • And what kind of dataframe do you intend to construct from a string that is split? In your loop `j` is just a string, so you need to do whatever you want to do with the split string yourself. But your next line uses `main['id_proof'] == adr_type_parent_type[1]` and that should work fine with a string on the right-hand side...doesn't it? – Andras Deak -- Слава Україні Mar 31 '17 at 11:50
  • The resulting dataframe I intend to construct from that line is a dataframe with two columns first having values 'PARENT' and second column with values 'A', 'B', 'C'. – koushal Mar 31 '17 at 11:57

3 Answers

1

You can't invoke str.split() of the pandas library in your particular case because you are using the DataFrame object, and this particular object does not implement str.split(). Only the Series object implements str.split().
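As a rough illustration of that distinction (a sketch only; the variable names adr, parts and builtin_accepts_expand are made up for this example), the snippet below calls the pandas version through the .str accessor of a Series and then tries the same keywords on a plain Python string. It also produces the two-column frame described in the comments on the question, with 'PARENT' in one column and 'A', 'B', 'C' in the other.

```python
import pandas as pd

# A Series of strings exposes pandas' vectorised split through the .str accessor.
adr = pd.Series(['PARENT A', 'PARENT B', 'PARENT C'])

# expand=True returns a DataFrame; n=1 limits the split to the first separator.
parts = adr.str.split(' ', n=1, expand=True)
print(parts)

# A plain Python string only has the built-in str.split, which takes
# sep and maxsplit and does not know the expand or n keywords.
try:
    'PARENT A'.split(' ', expand=True, n=1)
    builtin_accepts_expand = True
except TypeError:
    builtin_accepts_expand = False
print(builtin_accepts_expand)
```

Inside a loop over individual strings, the built-in form would be j.split(' ', 1), which returns a list rather than a DataFrame.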

Boštjan Mejak
1

Since your main question seems to be how to incorporate pandas split function:

You can isolate the rows containing the keyword 'PARENT' using this:

parent_main = main[main.adr_proof.str.split(' ').str[0] == 'PARENT']

Now, you can easily extract the second value:

parent_main.adr_proof.str.split(' ').str[-1]
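One possible way to carry this through to the node1/node2 table from the question is sketched below. It is not part of the original answer: it swaps in DataFrame.merge for the matching step, and the names children, parent_proof, pairs and result are illustrative only. It uses the sample data from the question, including the 'PARENT   B' entry with extra spaces.

```python
import pandas as pd

main = pd.DataFrame({
    'account_number': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'id_proof': ['A', 'B', 'B', 'A', 'C', 'C', 'X', 'Y', 'X', 'Y', 'Y', 'X'],
    'id_value': [101, 201, 301, 401, 501, 601, 111, 222, 333, 444, 555, 666],
    'adr_proof': ['Z', 'E', 'E', 'G', 'G', 'I', 'PARENT A', 'PARENT B',
                  'PARENT   B', 'PARENT C', 'PARENT C', 'PARENT A'],
    'adr_value': [11, 22, 33, 44, 55, 66, 101, 201, 301, 501, 601, 401],
})

# Child rows: adr_proof starts with the keyword.
children = main[main['adr_proof'].str.startswith('PARENT', na=False)].copy()

# str.split() without a pattern splits on any run of whitespace,
# so 'PARENT   B' yields 'B' as its last piece as well.
children['parent_proof'] = children['adr_proof'].str.split().str[-1]

# Match each potential parent (id_proof, id_value) against
# each child's (parent_proof, adr_value).
pairs = main.merge(
    children,
    left_on=['id_proof', 'id_value'],
    right_on=['parent_proof', 'adr_value'],
    suffixes=('_parent', '_child'),
)

result = pd.DataFrame({
    'node1': pairs['account_number_parent'],
    'node2': pairs['account_number_child'],
    'relation': 'parent-child',
}).sort_values('node1').reset_index(drop=True)
print(result)
```

The merge is an inner join by default, so rows without a matching parent or child simply drop out of the result.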
Arco Bast
  • I suppose the second line should be: parent_main.adr_proof.str.split(' ').str[-1] – koushal Mar 31 '17 at 12:08
  • This works! Thanks, it solves my problem. But I am a little curious to know the answer to the crux of the question. Can we overshadow the function? – koushal Mar 31 '17 at 12:13
  • What exactly do you mean? You can experiment with something like `from module import something as str`. However I don't see the benefit here. – Arco Bast Mar 31 '17 at 12:16
  • I want to know if I can invoke the pandas.Series.str.split function on a string object rather than Python's built-in string split function. – koushal Mar 31 '17 at 13:56
  • You can use the pandas version of split on a pandas series containing strings like I showed you, but you cannot use it on a string object directly. – Arco Bast Mar 31 '17 at 14:47
0

After investigating this and discussing it in the #python channel on the freenode.net IRC network, I have an answer for you. You can't overshadow Python's str.split() with the str.split() of the pandas library.

Also, the DataFrame object has no str.split(). I have read the whole API and also played with from ... import ... to somehow import str.split() from pandas and overshadow the str.split() of Python.

The DataFrame object you are using in your code has no str.split(). In your loop, j is a plain Python string, so j.split() resolves to Python's built-in str.split(), and the built-in does not accept the expand or n keywords.

The only pandas object I could find that has str.split() is the Series object, pandas.Series.str.split(). But you're not using the Series object, you are using the DataFrame object. I'm sorry, there's nothing to be done.

If you ask me, the structure of pandas is broken. You can't just import str.split(), because str is basically a StringMethods object and this object lives under the strings package, which lives in the core package, which lives in the pandas top-level package. It's a mess! I wasted 2 hours of my life to understand its package/module/object structure.

Also, pandas.Series.str.split() is basically pandas.core.series.Series.str.split(). I just gave up!

Try to import str.split() from pandas and you'll get a Nobel prize!

Boštjan Mejak
  • The pandas.Series.str.split method is simply not meant to be imported. However, it is super useful and extremely fast for doing string operations on Series. – Arco Bast Mar 31 '17 at 15:01
  • Well, the OP wanted to overshadow the built-in str.split of Python with the str.split of pandas. But I already got the information from the Python experts on the IRC network Freenode that this is not possible, so this question wasted our lives. Moving on… – Boštjan Mejak Mar 31 '17 at 15:11