I am currently working on a project in which I want to classify text. To do that, I first had to annotate the text data. I did this with a web tool and now have the corresponding JSON files (containing the annotations) and the plain .txt files (containing the raw text). I now want to train different classifiers on this data and eventually predict the desired outcome.
However, I am struggling with where to start. I haven't found what I'm looking for on the internet, which is why I'm asking here.
How should I proceed with the .json and .txt files? As far as I understand, I have to convert this information into a .csv that contains the labels and the text, and also "none" for text that has not been annotated. I guess that is why I need the .txt files: to merge them with the annotation files so I can detect whether a sentence (or word) has a label or not. I could then load the .csv data into the model.
Could someone give me a hint on where to start or how to proceed? Everything I've found so far covers the case where the data is already converted and ready for preprocessing, but I am struggling with what to do with the output of the annotation process.
My JSON looks something like this:
{"annotatable":{"parts":["s1p1"]},
"anncomplete":true,
"sources":[],
"metas":{},
"entities":[{"classId":"e_1","part":"s1p1","offsets":
[{"start":11,"text":"This is the text"}],"coordinates":[],"confidence":
{"state":"pre-added","who":["user:1"],"prob":1},"fields":{"f_4":
{"value":"3","confidence":{"state":"pre-added","who":
["user:1"],"prob":1}}},"normalizations":{}}],
"relations":[]}
Each text is given a classId (e_1 in this case) and a field value (f_4, with the value 3 in this case). I need to extract these step by step: first the entity with its corresponding text (adding "none" where nothing has been annotated), and then, in a second step, the field information with its corresponding text.
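To make the extraction step concrete, here is a rough sketch of what I imagine for a single annotation file. The function name extract_entities and the output column names are made up for illustration, and the sample is a trimmed-down version of my JSON. I am assuming that every entity has the keys shown above.

```python
import json

def extract_entities(annotation):
    """Return one row per annotated span: class id, text, start offset, field values."""
    rows = []
    for entity in annotation.get("entities", []):
        # Flatten the nested fields dict to {field_id: value}, e.g. {"f_4": "3"}
        fields = {fid: f.get("value") for fid, f in entity.get("fields", {}).items()}
        for offset in entity.get("offsets", []):
            rows.append({
                "class_id": entity.get("classId"),
                "text": offset.get("text"),
                "start": offset.get("start"),
                "fields": fields,
            })
    return rows

# Trimmed-down sample with the same structure as my annotation file
sample = json.loads(
    '{"entities": [{"classId": "e_1", "part": "s1p1",'
    ' "offsets": [{"start": 11, "text": "This is the text"}],'
    ' "fields": {"f_4": {"value": "3"}}}], "relations": []}'
)
print(extract_entities(sample))
```

This only covers the annotated spans; the "none" rows would still have to come from comparing against the raw text.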
The corresponding .txt file simply looks like this:
This is the text
I have all the .json files in one folder and all the .txt files in another.
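For the merge into a .csv, this is roughly what I have in mind, as a sketch only. It rests on several assumptions: that each .json file has the same base name as its .txt file, that one line of the .txt counts as one unit of text, and that a line gets an entity's classId when the annotated text occurs in it (otherwise "none"). The names label_lines and build_csv and the CSV columns are hypothetical.

```python
import csv
import json
from pathlib import Path

def label_lines(text, annotation):
    """Return (line, label) pairs; label is the classId of the first entity
    whose annotated text occurs in the line, otherwise "none"."""
    entities = [
        (off["text"], ent["classId"])
        for ent in annotation.get("entities", [])
        for off in ent.get("offsets", [])
    ]
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        label = next((cid for ent_text, cid in entities if ent_text in line), "none")
        rows.append((line, label))
    return rows

def build_csv(json_dir, txt_dir, out_path):
    """Pair files by base name (assumption) and write file/text/label rows."""
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["file", "text", "label"])
        for json_path in sorted(Path(json_dir).glob("*.json")):
            txt_path = Path(txt_dir) / (json_path.stem + ".txt")
            if not txt_path.exists():
                continue  # no matching raw text file
            annotation = json.loads(json_path.read_text(encoding="utf-8"))
            text = txt_path.read_text(encoding="utf-8")
            for line, label in label_lines(text, annotation):
                writer.writerow([json_path.stem, line, label])
```

I would then call something like build_csv("annotations", "texts", "dataset.csv"), where the folder and file names are placeholders. Substring matching is a simplification; using the start offsets would be more exact, but I am not sure how they map onto the raw text.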