I am trying to generate some JSON data from txt files.

The txt files are generated from books via OCR, which makes them both precious (I can't arbitrarily change the characters I don't like, since they could be important) and unreliable (the OCR could have gone wrong, or the author could have inserted symbols that would break my code).

As of now, I have this:

    output_folder = Path(output_folder)

    value = json.loads('{"nome": "' + file_name[:len(file_name)-4] + '", "testu": "' + (Path(filename).read_text()) + '"}')
    path = output_folder / (file_name[:len(file_name)-4] + "_opare.json")
    with path.open(mode="w+") as working_file:
        working_file.write("[" + str(value) + "]")
        working_file.close()

This raises the error `json.decoder.JSONDecodeError: Invalid control character`, which, as I understand it, is caused by my book starting (yes) with a ' (a quote).

I've read about string literals, which seem relevant to my case, but I didn't understand how I could use them.

What can I do?

Thanks

Orsu
  • Probably the worst approach would be to read word by word with try/except and throw out the words that raise an exception, but I think that would certainly work – Matiiss Mar 31 '21 at 22:29
  • There are a lot of things that may or may not be a problem here. It would help if we could be sure of the *exact* content of your source file (at a binary level, not just what you think the text is). It's important to make sure you know the encoding of the file. That said, you should not try to build JSON data this way (work the other way around, as in @LuizFerraz's answer). As for "string literals", I think you are confused as to what that means. All a string literal is, is a string that appears "literal"ly in your code. For example, `'{"nome": "'`, or `"["`. – Karl Knechtel Mar 31 '21 at 22:57
  • Yeah, the difficulty here is that I don't know what the books will be (this is the batch processing phase). I followed @Luiz Ferraz's answer, which is correct. As for the literals, I hoped they would help me escape all the problematic characters, but json.dump() already does that – Orsu Apr 01 '21 at 06:10

2 Answers

Why would you build a JSON string just to parse it again? You can create the dictionary directly:

value = {
    "nome": file_name[:len(file_name)-4],
    "testu": Path(filename).read_text(),
}
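
To illustrate why this sidesteps the error, here is a minimal sketch. The sample text is made up to stand in for an OCR'd book (it starts with a quote and contains a raw newline), and `"book"` is a placeholder name. In this sketch the decode error is triggered by the raw newline, since a single quote is legal inside a JSON string.

```python
import json

# Hypothetical sample standing in for OCR output: a leading single quote
# and a raw newline in the middle of the text.
text = "'Twas a dark night.\nChapter 1"

# Hand-built JSON: the raw newline ends up inside the JSON string,
# and JSON does not allow unescaped control characters there.
hand_built = '{"nome": "book", "testu": "' + text + '"}'
try:
    json.loads(hand_built)
    error_message = None
except json.JSONDecodeError as exc:
    error_message = str(exc)

# Building the dict directly and letting json.dumps do the escaping.
value = {"nome": "book", "testu": text}
encoded = json.dumps(value)    # the newline is written as the two characters \n
decoded = json.loads(encoded)  # round-trips back to the original dict
```

Because the escaping is left to `json.dumps`, the text itself never has to be modified, whatever characters the OCR produced.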
Pedro Lobito
Luiz Ferraz
  • You are absolutely right, why did I do that? It's not exactly the way I was going, but you are 100% correct. Thank you – Orsu Mar 31 '21 at 22:36

Reading between the lines, the JSONDecodeError doesn't actually come from this code, does it? It comes from the code that's reading your file later.

You can't write a dict to a JSON file using str(value). Python's dict-to-string conversion uses single quotes, which are not legal string delimiters in JSON. You need to convert it back to JSON:

    with path.open(mode="w+") as working_file:
        json.dump([value], working_file)
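
Putting the pieces together, a round-trip sketch might look like the following. The temporary directory, the sample file name `book.txt`, and its contents are assumptions for illustration only. `Path.stem` is used here in place of the `[:len(file_name)-4]` slice, and UTF-8 is assumed as the encoding, which may not match the real OCR files.

```python
import json
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    output_folder = Path(tmp)

    # Hypothetical input file standing in for one OCR'd book.
    source = output_folder / "book.txt"
    source.write_text("'Twas a \"dark\" night.\nChapter 1", encoding="utf-8")

    value = {"nome": source.stem, "testu": source.read_text(encoding="utf-8")}

    # str() produces Python's repr (single-quoted keys), which is not JSON.
    try:
        json.loads(str([value]))
        repr_is_json = True
    except json.JSONDecodeError:
        repr_is_json = False

    # json.dump writes valid JSON and escapes quotes and newlines itself.
    path = output_folder / (source.stem + "_opare.json")
    with path.open(mode="w", encoding="utf-8") as working_file:
        json.dump([value], working_file, ensure_ascii=False)

    # Reading the file back yields the same list of dicts.
    with path.open(encoding="utf-8") as working_file:
        loaded = json.load(working_file)
```

The `with` block closes the file on exit, so no explicit `close()` call is needed. `ensure_ascii=False` is optional and only keeps non-ASCII characters readable in the output file.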
Tim Roberts