I have a JSON file like this:
{"text": "ABCD (before the photon detections) becomes"},
{"text": "under the additional assumption that optimal distillation will occur for an initial symmetric state through symmetric local PS and squeezing operations, r \u2032 A = r \u2032 B = r \u2032."},
{"text": "In our scheme, we propose to use on-offphoton detectors, as commonly employed in quantum optics experiments. Such a detector is represented by two measurement outcomes: 'off' when no photons are detected and 'on' when one or more photons are detected. Through a successful distillation event, modes C and D are projected onto non-vacuum components and the state of modes A and B is reduced to \u02dc \u03c1 = Tr${CD}$ [\u03c1${ABCD}$I${AB}$ \u2297 \u02c6 \u03a0 (on) C \u2297 \u02c6 \u03a0 (on) D] /P${succ}$, with \u02c6 \u03a0 (on) = $^{I}$\u221e-| 0 \u232a\u2329 0 | = \u2211 \u221e n =1 | n \u232a\u2329 n |. Throughout, we use I$_{m}$ to represent an m-dimensional identity matrix. In order to obtain analytical results, we shall again employ the phase-space formalism. In fact, although the single-mode operator \u02c6 \u03a0 (on) leads to a non-Gaussian Wigner function, W (x) = 1 2 \u03c0-1 \u03c0 exp[x $^{T}$I x] [19], by expressing every single operator through a Wigner function and carrying out the corresponding integrals, we find that the Wigner function of the distilled state is a linear combination of four Gaussian functions:"}
The code below is meant to clean the text while keeping the Unicode characters, but I also want to keep the equations, which are delimited by $...$ in the text, for example $_{AB}$.
My current code is this:
import os
import json
import re

directory = 'C://Users/Elbek/Downloads/json'

# Define regular expressions for matching equations and unicodes
equation_regex = r'\$\S+?\$'
unicode_regex = r'\\u[0-9a-fA-F]{4}'

# Define a function to clean text
def clean_text(text):
    # Replace equations and unicodes with placeholders
    text = re.sub(equation_regex, lambda x: f'EQUATION_{x.group()}EQUATION', text)
    text = re.sub(unicode_regex, 'UNICODE', text)
    # Remove punctuation and other non-alphanumeric characters
    text = re.sub(r'[^\w\s]', '', text)
    # Lowercase the text
    text = text.lower()
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Loop through all files in the directory
for filename in os.listdir(directory):
    if filename.endswith('.json'):
        # Open the file and load the JSON data
        with open(os.path.join(directory, filename)) as file:
            data = json.load(file)
        # Clean the text in each "text" field
        for item in data:
            item['text'] = clean_text(item['text'])
            # Replace equations and unicodes back with their original form
            item['text'] = re.sub('EQUATION_(.*?)_EQUATION', lambda x: x.group(1), item['text'])
            item['text'] = re.sub('UNICODE', lambda x: re.sub('\s+', ' ', x.group()), item['text'])
        # Write the cleaned data back to the file
        with open(os.path.join(directory, filename), 'w') as file:
            json.dump(data, file)
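One thing that may be relevant to the code above: as far as I understand, `json.load` already turns `\uXXXX` escapes into real characters, so a pattern that looks for a literal backslash followed by `u` would find nothing in the loaded strings. Here is a small check of that, using a made-up sample string rather than the real file:

```python
import json
import re

# Hypothetical sample, not taken from the real file
raw = '{"text": "r \\u2032 A = r \\u2032 B"}'
item = json.loads(raw)

unicode_regex = r'\\u[0-9a-fA-F]{4}'

# In the raw file contents the escape is literal text, so the pattern matches
matches_in_raw = re.findall(unicode_regex, raw)

# After parsing, each escape has become the single character U+2032,
# so the same pattern no longer matches anything
matches_in_parsed = re.findall(unicode_regex, item['text'])

# json.dumps / json.dump write non-ASCII characters back as \uXXXX
# by default (ensure_ascii=True)
round_trip = json.dumps(item)

print(matches_in_raw)
print(matches_in_parsed)
print(round_trip)
```

If that is right, the Unicode placeholder step never does anything on the loaded data, and the escapes in the output file come from `json.dump` itself.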
I would appreciate your help!
I was expecting to clean the text by removing punctuation, lowercasing, etc.
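To make the goal concrete, here is a minimal sketch of the behaviour I am after, not a tested solution for the whole dataset. It assumes that an equation is any `$...$` span with no `$` inside it, and that "punctuation" means only the ASCII characters in `string.punctuation`, so non-ASCII symbols stay. The names `EQUATION_RE`, `PUNCT_RE` and `clean_plain` are made up for this example.

```python
import re
import string

# Assumption: an equation is any "$...$" span with no "$" inside it
EQUATION_RE = re.compile(r'(\$[^$]*\$)')

# Assumption: only ASCII punctuation is removed, so symbols such as
# U+2297 or U+03C1 are left in place
PUNCT_RE = re.compile('[' + re.escape(string.punctuation) + ']')


def clean_plain(chunk):
    """Lowercase a non-equation chunk and drop its ASCII punctuation."""
    return PUNCT_RE.sub('', chunk.lower())


def clean_text(text):
    # Because the pattern has a capturing group, re.split keeps the
    # equations in the result, at the odd indices
    parts = EQUATION_RE.split(text)
    cleaned = [part if i % 2 else clean_plain(part)
               for i, part in enumerate(parts)]
    # Collapse runs of whitespace once the pieces are joined again
    # (this also affects whitespace inside an equation)
    return re.sub(r'\s+', ' ', ''.join(cleaned)).strip()


# Shortened sample in the style of the "text" fields above
sample = "Tr${CD}$ [\u03c1${ABCD}$I${AB}$ \u2297 X] /P${succ}$, Done."
print(clean_text(sample))
```

Splitting on the equations avoids placeholders altogether, so the punctuation and lowercasing steps never see the `$`, `{` and `}` characters. A lone `$` without a partner is treated as ordinary punctuation and removed.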