
I am working with the Python "pattern.en" package, which gives me the subject, object and other details of a given sentence.

I want to store this output in another variable or a DataFrame for further processing, but I have not been able to do so.

Any input on this would be helpful.

Sample code is included below for reference.

from pattern.en import parse
from pattern.en import pprint
import pandas as pd

input = parse('I want to go to the Restaurant as I am hungry very much')
print(input)    
I/PRP/B-NP/O want/VBP/B-VP/O to/TO/I-VP/O go/VB/I-VP/O to/TO/O/O the/DT/B-NP/O Restaurant/NNP/I-NP/O as/IN/B-PP/B-PNP I/PRP/B-NP/I-PNP am/VBP/B-VP/O hungry/JJ/B-ADJP/O very/RB/I-ADJP/O much/JJ/I-ADJP/O

pprint(input)

      WORD   TAG    CHUNK    ROLE   ID     PNP    LEMMA                                                
         I   PRP    NP       -      -      -      -       
      want   VBP    VP       -      -      -      -       
        to   TO     VP ^     -      -      -      -       
        go   VB     VP ^     -      -      -      -       
        to   TO     -        -      -      -      -       
       the   DT     NP       -      -      -      -       
Restaurant   NNP    NP ^     -      -      -      -       
        as   IN     PP       -      -      PNP    -       
         I   PRP    NP       -      -      PNP    -       
        am   VBP    VP       -      -      -      -       
    hungry   JJ     ADJP     -      -      -      -       
      very   RB     ADJP ^   -      -      -      -       
      much   JJ     ADJP ^   -      -      -      -       

Please note the output of both the print and pprint statements. I am trying to store either one of them in a variable. Ideally I would store the pprint output in a DataFrame, since it is already printed in tabular format.

But when I try to do so, I get the error shown below:

df = pd.DataFrame(input)

ValueError: DataFrame constructor not properly called!

pacholik
JKC
  • Seems basic, have you read the documentation of Pandas? https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html Your error says you're not calling the constructor correctly - and that seems indeed to be the case. – Jacob Bruinsma Oct 20 '17 at 12:00
  • Thanks @Jacob. But my problem is not how to resolve the error I got. It is how to store the output of the pattern.en package in a variable or DataFrame. So please let me know if you have any idea on that. I hope this is not a basic question, and that you will reconsider the downvote if you agree. – JKC Oct 20 '17 at 12:43

1 Answer


Starting from the source of pattern's table function, I came up with this:

from pattern.en import parse
from pattern.text.tree import WORD, POS, CHUNK, PNP, REL, ANCHOR, LEMMA, IOB, ROLE, MBSP, Text
import pandas as pd

def sentence2df(sentence, placeholder="-"):
    tags  = [WORD, POS, IOB, CHUNK, ROLE, REL, PNP, ANCHOR, LEMMA]
    tags += [tag for tag in sentence.token if tag not in tags]
    def format(token, tag):
        # Returns the token tag as a string.
        if   tag == WORD   : s = token.string
        elif tag == POS    : s = token.type
        elif tag == IOB    : s = token.chunk and (token.index == token.chunk.start and "B" or "I")
        elif tag == CHUNK  : s = token.chunk and token.chunk.type
        elif tag == ROLE   : s = token.chunk and token.chunk.role
        elif tag == REL    : s = token.chunk and token.chunk.relation and str(token.chunk.relation)
        elif tag == PNP    : s = token.chunk and token.chunk.pnp and token.chunk.pnp.type
        elif tag == ANCHOR : s = token.chunk and token.chunk.anchor_id
        elif tag == LEMMA  : s = token.lemma
        else               : s = token.custom_tags.get(tag)
        return s or placeholder

    columns = [[format(token, tag) for token in sentence] for tag in tags]
    columns[3] = [columns[3][i]+(iob == "I" and " ^" or "") for i, iob in enumerate(columns[2])]
    del columns[2]
    header = ['word', 'tag', 'chunk', 'role', 'id', 'pnp', 'anchor', 'lemma']+tags[9:]

    if not MBSP:
        del columns[6]
        del header[6]

    return pd.DataFrame(
        [[x[i] for x in columns] for i in range(len(columns[0]))],
        columns=header,
    )
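The last step of sentence2df turns the per-tag columns into per-token rows before handing them to pd.DataFrame. As a small illustration of that transposition only, here is a sketch using made-up sample data (the three short lists below are hypothetical, not produced by pattern); it shows that zip gives the same rows as the index-based comprehension.

```python
# Hypothetical sample data: one inner list per tag, shaped like `columns`
# inside sentence2df (these values are typed by hand, not parsed by pattern).
columns = [
    ["I", "want", "to"],     # word
    ["PRP", "VBP", "TO"],    # tag
    ["NP", "VP", "VP ^"],    # chunk
]

# Index-based transposition, as written in sentence2df.
rows_by_index = [[x[i] for x in columns] for i in range(len(columns[0]))]

# The same transposition expressed with zip.
rows_by_zip = [list(row) for row in zip(*columns)]

print(rows_by_zip)
```

Either form yields one list per token, which is the row-oriented layout that the DataFrame constructor is given together with the header.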

Usage

>>> string = parse('I want to go to the Restaurant as I am hungry very much')
>>> sentence = Text(string, token=[WORD, POS, CHUNK, PNP])[0]
>>> df = sentence2df(sentence)
>>> print(df)
          word  tag   chunk role id  pnp lemma
0            I  PRP      NP    -  -    -     -
1         want  VBP      VP    -  -    -     -
2           to   TO    VP ^    -  -    -     -
3           go   VB    VP ^    -  -    -     -
4           to   TO       -    -  -    -     -
5          the   DT      NP    -  -    -     -
6   Restaurant  NNP    NP ^    -  -    -     -
7           as   IN      PP    -  -  PNP     -
8            I  PRP      NP    -  -  PNP     -
9           am  VBP      VP    -  -    -     -
10      hungry   JJ    ADJP    -  -    -     -
11        very   RB  ADJP ^    -  -    -     -
12        much   JJ  ADJP ^    -  -    -     -
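
If only the columns shown by print are needed, a lighter alternative might be to split the tagged string directly. The sketch below is not part of pattern: tagged_to_rows and strip_iob are hypothetical helper names, and it assumes every token has exactly the word/POS/chunk/PNP layout shown in the question and that no word contains a literal slash. Sentence boundaries are not preserved.

```python
def strip_iob(label, placeholder="-", mark_inside=False):
    """Map 'B-NP' -> 'NP', 'I-NP' -> 'NP' (or 'NP ^'), and 'O' -> placeholder."""
    if label == "O":
        return placeholder
    prefix, _, name = label.partition("-")
    if mark_inside and prefix == "I":
        return name + " ^"   # same continuation marker that pprint shows
    return name


def tagged_to_rows(tagged, placeholder="-"):
    """Turn 'word/POS/chunk/pnp' tokens into a list of row dicts."""
    rows = []
    for token in tagged.split():
        # Split from the right so the last three fields are always the tags.
        word, pos, chunk, pnp = token.rsplit("/", 3)
        rows.append({
            "word": word,
            "tag": pos,
            "chunk": strip_iob(chunk, placeholder, mark_inside=True),
            "pnp": strip_iob(pnp, placeholder),
        })
    return rows


# The tagged string printed in the question, copied verbatim.
tagged = (
    "I/PRP/B-NP/O want/VBP/B-VP/O to/TO/I-VP/O go/VB/I-VP/O to/TO/O/O "
    "the/DT/B-NP/O Restaurant/NNP/I-NP/O as/IN/B-PP/B-PNP I/PRP/B-NP/I-PNP "
    "am/VBP/B-VP/O hungry/JJ/B-ADJP/O very/RB/I-ADJP/O much/JJ/I-ADJP/O"
)
rows = tagged_to_rows(tagged)
for row in rows:
    print(row)
```

With pandas installed, pd.DataFrame(rows) accepts this list of dicts directly and uses the dict keys as column names. The role, id and lemma columns are left out here because the string in the question does not carry them.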
pacholik