
I have a txt file that contains two columns (filename and text). The separator used when generating the txt file is a tab. An example input file is below:

text.txt

23.jpg   még
24.jpg   több

The expected output_file.jsonl, in JSON Lines format:

{"file_name": "23.jpg", "text": "még"}
{"file_name": "24.jpg", "text": "több"}

But I got an issue with the Unicode/encoding format:

{"file_name": "23.jpg", "text": "m\u00c3\u00a9g"}
{"file_name": "24.jpg", "text": "t\u00c3\u00b6bb"}

It seems it doesn't recognize the Hungarian special characters áéíóöőúüű, in either lowercase or uppercase.

For example, the resulting *.jsonl file contains an ASCII escape sequence such as \u00c3\u00a9 instead of the letter é.

I wrote this small script to convert the *.txt file (in Hungarian) to a *.jsonl file (also in Hungarian):

import pandas as pd

train_text = 'text.txt'
df = pd.read_csv(train_text, header=None, delimiter='\t', encoding="utf8")  # tab-delimited input
df.rename(columns={0: "file_name", 1: "text"}, inplace=True)

# convert the rows to JSON Lines
import json

records = df.to_dict(orient="records")
with open("output_file.jsonl", "w") as f:
    for line in records:
        f.write(json.dumps(line) + "\n")

My expected output_file.jsonl, in JSON Lines format:

{"file_name": "23.jpg", "text": "még"}
{"file_name": "24.jpg", "text": "több"}
Mohammed

2 Answers


From the docs, json.dumps includes an ensure_ascii flag. It's always puzzled me why it's True by default, but that is what inserts Unicode escape sequences instead of writing multibyte UTF-8 characters. The escaped output should still be fine... other parsers will decode it. But to fix the problem, do:

f.write(json.dumps(line, ensure_ascii=False) + "\n")
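A minimal demonstration of the flag's effect: the default escapes é as \u00e9, while ensure_ascii=False writes the character as-is.

import json

line = {"file_name": "23.jpg", "text": "még"}

print(json.dumps(line))                      # {"file_name": "23.jpg", "text": "m\u00e9g"}
print(json.dumps(line, ensure_ascii=False))  # {"file_name": "23.jpg", "text": "még"}

Note that once ensure_ascii=False emits raw non-ASCII characters, the output file should be opened with an explicit encoding="utf-8"; the open() call in the question relies on the platform default.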
tdelaney

The answer https://stackoverflow.com/a/75594200/218663 by @tdelaney is great and I upvoted it. If you want to skip the intermediate step of converting to a list of dictionaries, you could also do:

import pandas

df = pandas.DataFrame([
    {"file_name": "23.jpg", "text": "még"},
    {"file_name": "24.jpg", "text": "több"}
])

with open("out.json", "w", encoding="utf-8") as file_out:
    df.to_json(file_out, orient="records", force_ascii=False)

It is essentially the same "force_ascii" issue.
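If you need the newline-delimited format from the question rather than a single JSON array, to_json also accepts lines=True together with orient="records". A minimal sketch:

import pandas as pd

df = pd.DataFrame([
    {"file_name": "23.jpg", "text": "még"},
    {"file_name": "24.jpg", "text": "több"},
])

# lines=True writes one JSON object per line (JSON Lines / .jsonl)
df.to_json("output_file.jsonl", orient="records", lines=True, force_ascii=False)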

JonSG
  • Thanks, @tdelaney @JonSG, this also works. But when reading big text files (GB-sized) with pandas, it seems to miss some values. I don't know why; for example, the filename (111.jpg) gets saved as (111. p). Meanwhile, I have opened another issue for virtualization, can you please take a look? – Mohammed Feb 28 '23 at 16:27
  • This is a great solution, but it should be noted that the original code wrote a bunch of small newline-delimited JSON strings (dicts in this case), whereas this solution writes a single JSON string with an outer list holding the dicts. Whatever consumes this file in the future needs to know the difference. – tdelaney Feb 28 '23 at 18:34
  • @MahmmoudAbedSuleiman - JSON encoders/decoders shouldn't miss anything. If you could narrow this down to a small example where it happens - something we could test - that would be interesting. – tdelaney Feb 28 '23 at 18:36