
I have a txt file in, theoretically, CoNLL format. Like this:

a O
nivel B-INDC
de O
la O
columna B-ANAT
anterior I-ANAT
del I-ANAT
acetabulo I-ANAT


existiendo O
minimos B-INDC
cambios B-INDC
edematosos B-DISO
en O
la O
medular B-ANAT
(...)

I need to convert it into a list of sentences, but I can't find a way to do it. I tried the parser from the conllu library:

from conllu import parse
sentences = parse("location/train_data.txt")

but it raises the error: ParseException: Invalid line format, line must contain either tabs or two spaces.

How can I get this output?

["a nivel de la columna anterior del acetabulo", "existiendo minimos cambios edematosos en la medular", ...]

Thanks

Andrea NR

3 Answers


For NLP problems, my first starting point is always Huggingface :D There is a nice example for your problem: https://huggingface.co/transformers/custom_datasets.html

Here they show a function that does exactly what you want:

from pathlib import Path
import re

def read_wnut(file_path):
    file_path = Path(file_path)

    raw_text = file_path.read_text().strip()
    raw_docs = re.split(r'\n\t?\n', raw_text)
    token_docs = []
    tag_docs = []
    for doc in raw_docs:
        tokens = []
        tags = []
        for line in doc.split('\n'):
            # the original example splits on '\t'; the file in the question is
            # space-separated, so split on any whitespace instead
            token, tag = line.split()
            tokens.append(token)
            tags.append(tag)
        token_docs.append(tokens)
        tag_docs.append(tags)

    return token_docs, tag_docs

texts, tags = read_wnut("location/train_data.txt")
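Since the question asks for plain sentence strings rather than token lists, the output of read_wnut can be joined afterwards. A minimal sketch, using made-up sample data in place of the real file:

```python
# hypothetical sample of what read_wnut returns: one token list per sentence
texts = [
    ["a", "nivel", "de", "la", "columna", "anterior", "del", "acetabulo"],
    ["existiendo", "minimos", "cambios", "edematosos", "en", "la", "medular"],
]

# join each token list into the plain sentence strings the question asks for
sentences = [" ".join(tokens) for tokens in texts]
print(sentences[0])  # a nivel de la columna anterior del acetabulo
```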
MarkusOdenthal
  • the following URL is no longer valid https://huggingface.co/transformers/custom_datasets.html – Mai Aug 13 '22 at 10:50

The simplest thing is to iterate over the lines of your file and retrieve the first column. No imports required.

result = [[]]
with open(YOUR_FILE, "r") as f:  # "input" would shadow a builtin, so use another name
    for l in f:
        if not l.startswith("#"):          # skip CoNLL comment lines
            if l.strip() == "":            # blank line = sentence boundary
                if len(result[-1]) > 0:
                    result.append([])
            else:
                result[-1].append(l.split()[0])   # first column = the token
result = [" ".join(row) for row in result if row]  # "if row" drops a trailing empty group

In my experience, writing these by hand is the most effective way, because CoNLL formats are terribly diverse (but usually in trivial ways, such as the order of columns) and you don't want to bother with other people's code for anything that can be solved so simply. The code quoted by @markusodenthal will, for example, keep CoNLL comments (lines starting with #) -- which may not be what you want.

The other thing is that writing the loop yourself allows you to process sentence by sentence rather than reading everything into an array first. If you don't need to process everything en bloc, this will be both faster and more scalable.
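A streaming version of that idea could look like the following sketch (iter_sentences is a made-up name); it yields one sentence string at a time instead of building the full list:

```python
def iter_sentences(lines):
    """Yield one space-joined sentence at a time from CoNLL-style lines."""
    tokens = []
    for line in lines:
        line = line.strip()
        if line.startswith("#"):            # skip CoNLL comment lines
            continue
        if not line:                        # blank line = sentence boundary
            if tokens:
                yield " ".join(tokens)
                tokens = []
        else:
            tokens.append(line.split()[0])  # first column = surface token
    if tokens:                              # flush if the file lacks a trailing blank line
        yield " ".join(tokens)

# works on any iterable of lines, e.g. an open file:
# with open("location/train_data.txt") as f:
#     for sentence in iter_sentences(f):
#         print(sentence)
```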

Chiarcos

You can use the conllu library. Note that parse() takes the file contents as a string, not a file path -- passing a path makes the library try to parse the path itself as data, which is what triggers the "Invalid line format" error in the question.

Install using pip install conllu.

A sample use-case is shown below.

>>> from conllu import parse
>>>
>>> data = """
# text = The quick brown fox jumps over the lazy dog.
1   The     the    DET    DT   Definite=Def|PronType=Art   4   det     _   _
2   quick   quick  ADJ    JJ   Degree=Pos                  4   amod    _   _
3   brown   brown  ADJ    JJ   Degree=Pos                  4   amod    _   _
4   fox     fox    NOUN   NN   Number=Sing                 5   nsubj   _   _
5   jumps   jump   VERB   VBZ  Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   0   root    _   _
6   over    over   ADP    IN   _                           9   case    _   _
7   the     the    DET    DT   Definite=Def|PronType=Art   9   det     _   _
8   lazy    lazy   ADJ    JJ   Degree=Pos                  9   amod    _   _
9   dog     dog    NOUN   NN   Number=Sing                 5   nmod    _   SpaceAfter=No
10  .       .      PUNCT  .    _                           5   punct   _   _

"""
>>> sentences = parse(data)
>>> sentences
[TokenList<The, quick, brown, fox, jumps, over, the, lazy, dog, .>]
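The two-column file in the question will still fail here, though, because its columns are separated by a single space while conllu requires tabs or two spaces. One workaround (a sketch, assuming exactly two whitespace-separated columns per non-blank line) is to re-delimit the text with tabs before parsing:

```python
def to_tabs(text):
    """Re-delimit whitespace-separated CoNLL columns with tabs so conllu accepts them."""
    lines = []
    for line in text.splitlines():
        # blank lines produce an empty join, so sentence boundaries are preserved
        lines.append("\t".join(line.split()))
    return "\n".join(lines)
```

The converted string can then be handed to parse(); if I remember the API correctly, you can also pass fields=["form", "tag"] so the two columns get sensible names instead of the default CoNLL-U ones.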

AVISHEK GARAIN