Can I convert a 3-gram text file to IOB format for CRFsuite?

Question

The text file contains 3-grams in this format:

None,None,kgo,gop,ope,Test_Sepedi
None,kgo,gop,ope,pel,Test_Sepedi
kgo,gop,ope,pel,elo,Test_Sepedi
gop,ope,pel,elo,None,Test_Sepedi
ope,pel,elo,None,None,Test_Sepedi
None,None,gag,ago,None,Test_Sepedi
None,gag,ago,None,None,Test_Sepedi
None,None,gan,ann,nnw,Test_Sepedi
None,gan,ann,nnw,nwe,Test_Sepedi
gan,ann,nnw,nwe,None,Test_Sepedi
ann,nnw,nwe,None,None,Test_Sepedi
None,None,tla,None,None,Test_Sepedi

I want it to be in a format that CRFsuite will accept for training, for example:

London JJ B-NP
shares NNS I-NP
closed VBD B-VP
moderately RB B-ADVP
lower JJR I-ADVP
in IN B-PP
thin JJ B-NP
trading NN I-NP

If it can be converted using Python, that would be highly appreciated.

How is this input and desired output related? I can't see any correlation between the two. What have you tried so far to solve this yourself? — SiHa, Jan 04 '17 at 07:51
What I want to achieve is any format that the CRF can take for training. I tried using this code for the conversion, but I'm getting errors: sentences = file.readlines() for sent in sentences: sent = re.sub('\r\n', '', sent) sent = re.sub(' +', ' ', sent) sent = sent.replace("\$", "$") sent = sent.replace("---", "--") sent = sent.replace("&", "&") sent = sent.strip() print >>file2, re.sub('<[^>]*>', '', sent) print >>file2, sent print >>file2, nltk.pos_tag(nltk.word_tokenize(re.sub('<[^>]*>', '', sent))) file.close() file2.close() — Juwaki Ledwaba, Jan 04 '17 at 08:04
Please [edit] your original post to include the code - comments are not the place for it. Also, you should include the Traceback you are getting — SiHa, Jan 04 '17 at 08:05

score 0 · Answer 1 · answered Jan 04 '17 at 10:45

I can't see what you are trying to do, so I'll just give you my thoughts:

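with open('./in') as in_file, open('./out', 'w') as out_file:
    for line in in_file:
        # do whatever you want with the input line here;
        # as a placeholder, this only strips the trailing newline
        result = line.rstrip('\n')
        # write the result to the output file
        out_file.write(result + '\n')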

Hope this is helpful.

score 0 · Accepted Answer · answered Jan 09 '17 at 06:15

From the look of the question, I assume that the input file is in CSV format and that the IOB2 format consists of space- or tab-separated tokens. So the simplest way to achieve that format would be to read each line and replace the comma delimiter with a space.



    # fill in your paths here, do not copy and paste
    output = open(OUTFILE_PATH, 'w')
    infile = open(INPUT_PATH, 'r')  # renamed so the built-in input() is not shadowed
    data = infile.readlines()
    infile.close()

    for line in data:
        output_line = line.replace("\n", "")
        # if the format requires a space then replace with a space
        # if the format requires a tab then replace with a tab
        # since your file seems to be comma separated,
        # that is why I replace the comma below with a space

        output_line = output_line.replace(",", " ")
        output.write(output_line + '\n')
    output.close()
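
For reference, CRFsuite's own training format differs slightly from the CoNLL-style sample in the question: each line holds a label followed by tab-separated attributes, and an empty line ends a sequence. The sketch below is one possible conversion under stated assumptions, not a confirmed solution. It assumes that the last column (Test_Sepedi) is the label, that the five columns before it are a context window around the middle 3-gram, and that a row whose first two columns are None starts a new word. The attribute names g[-2] to g[2] and the function names are made up for illustration.

```python
# Hypothetical attribute names for the five context columns.
ATTR_NAMES = ("g[-2]", "g[-1]", "g[0]", "g[1]", "g[2]")


def row_to_item(row):
    """Turn one comma-separated row into 'label<TAB>name=value<TAB>...'."""
    # Assumption: the last column is the label, the rest are context 3-grams.
    *context, label = row.strip().split(",")
    attrs = ["{}={}".format(name, value)
             for name, value in zip(ATTR_NAMES, context)]
    return "\t".join([label] + attrs)


def convert(rows):
    """Convert input rows to item lines, with an empty line between words."""
    items = []
    for row in rows:
        if not row.strip():
            continue  # skip blank input lines
        fields = row.strip().split(",")
        # Assumption: two leading None columns mark the first 3-gram of a word.
        if items and fields[:2] == ["None", "None"]:
            items.append("")  # an empty line ends the previous sequence
        items.append(row_to_item(row))
    return items


if __name__ == "__main__":
    sample = [
        "None,None,gag,ago,None,Test_Sepedi",
        "None,gag,ago,None,None,Test_Sepedi",
        "None,None,tla,None,None,Test_Sepedi",
    ]
    print("\n".join(convert(sample)))
```

The lines returned by convert() can be joined with newlines and written to the training file. Whether one sequence per word is the right unit depends on what the model is meant to learn, so treat the boundary rule as a starting point.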

Hope this helps!

Check the out_file line: where did you declare it? — Juwaki Ledwaba, Jan 12 '17 at 12:02
Sorry, out_file was meant to be output, typo! — Avi, Jan 17 '17 at 13:58
