Keeping equations while text cleaning

Question

I have a JSON file like this: {"text": "ABCD (before the photon detections) becomes"}, {"text": "under the additional assumption that optimal distillation will occur for an initial symmetric state through symmetric local PS and squeezing operations, r \u2032 A = r \u2032 B = r \u2032."}, {"text": "In our scheme, we propose to use on-offphoton detectors, as commonly employed in quantum optics experiments. Such a detector is represented by two measurement outcomes: 'off' when no photons are detected and 'on' when one or more photons are detected. Through a successful distillation event, modes C and D are projected onto non-vacuum components and the state of modes A and B is reduced to \u02dc \u03c1 = Tr${CD}$ [\u03c1${ABCD}$I${AB}$ \u2297 \u02c6 \u03a0 (on) C \u2297 \u02c6 \u03a0 (on) D] /P${succ}$, with \u02c6 \u03a0 (on) = $^{I}$\u221e-| 0 \u232a\u2329 0 | = \u2211 \u221e n =1 | n \u232a\u2329 n |. Throughout, we use I$_{m}$ to represent an m-dimensional identity matrix. In order to obtain analytical results, we shall again employ the phase-space formalism. In fact, although the single-mode operator \u02c6 \u03a0 (on) leads to a non-Gaussian Wigner function, W (x) = 1 2 \u03c0-1 \u03c0 exp[x $^{T}$I x] [19], by expressing every single operator through a Wigner function and carrying out the corresponding integrals, we find that the Wigner function of the distilled state is a linear combination of four Gaussian functions:"}

My current code is meant to keep the Unicode characters, but I also want to keep the equations, which are delimited by $...$ in this data, for example $_{AB}$.

My current code is this:

import os
import json
import re

directory = 'C://Users/Elbek/Downloads/json'

# Define regular expressions for matching equations and unicodes
equation_regex = r'\$\S+?\$'
unicode_regex = r'\\u[0-9a-fA-F]{4}'

# Define a function to clean text
def clean_text(text):
    # Replace equations and unicodes with placeholders
    text = re.sub(equation_regex, lambda x: f'EQUATION_{x.group()}EQUATION', text)
    text = re.sub(unicode_regex, 'UNICODE', text)
    # Remove punctuation and other non-alphanumeric characters
    text = re.sub(r'[^\w\s]', '', text)
    # Lowercase the text
    text = text.lower()
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Loop through all files in the directory
for filename in os.listdir(directory):
    if filename.endswith('.json'):
        # Open the file and load the JSON data
        with open(os.path.join(directory, filename)) as file:
            data = json.load(file)
        # Clean the text in each "text" field
        for item in data:
            item['text'] = clean_text(item['text'])
            # Replace equations and unicodes back with their original form
            item['text'] = re.sub('EQUATION_(.*?)_EQUATION', lambda x: x.group(1), item['text'])
            item['text'] = re.sub('UNICODE', lambda x: re.sub('\s+', ' ', x.group()), item['text'])
        # Write the cleaned data back to the file
        with open(os.path.join(directory, filename), 'w') as file:
            json.dump(data, file)

I would appreciate your help!

I was expecting the text to be cleaned (punctuation removed, lowercased, etc.) while the equations stay intact.
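One possible direction, as a minimal sketch rather than a definitive fix. In the code above, the placeholder written in clean_text (EQUATION_...EQUATION) does not match the pattern used later to restore it (EQUATION_(.*?)_EQUATION), and the punctuation and lowercasing steps also alter both the placeholder and the equation inside it. In addition, json.load already decodes \uXXXX escapes into real characters, so unicode_regex has nothing to match. The sketch below stores each equation in a list and leaves a placeholder made only of lowercase letters and digits, so it survives the cleaning and can be swapped back afterwards. The "zzeq...zz" placeholder form and the choice to strip only ASCII punctuation (so non-ASCII symbols are kept) are assumptions of this sketch, not something from the original code.

```python
import re
import string

# Assumption: an equation is a run of non-space characters between two
# dollar signs, as in the sample data (e.g. $_{AB}$).
EQUATION_RE = re.compile(r'\$\S+?\$')
# Placeholder built only from lowercase letters and digits, so it survives
# punctuation stripping and lowercasing. The "zzeq<N>zz" form is arbitrary
# and assumed not to occur in the real text.
PLACEHOLDER_RE = re.compile(r'zzeq(\d+)zz')
# Only ASCII punctuation is removed, so non-ASCII symbols stay in place.
ASCII_PUNCT_RE = re.compile('[' + re.escape(string.punctuation) + ']')


def clean_text(text):
    equations = []

    def stash(match):
        # Remember the equation and leave an indexed placeholder behind.
        equations.append(match.group())
        return f'zzeq{len(equations) - 1}zz'

    text = EQUATION_RE.sub(stash, text)
    text = ASCII_PUNCT_RE.sub('', text)       # remove ASCII punctuation
    text = text.lower()                       # lowercase everything else
    text = re.sub(r'\s+', ' ', text).strip()  # collapse extra whitespace
    # Put the original, untouched equations back.
    return PLACEHOLDER_RE.sub(lambda m: equations[int(m.group(1))], text)


print(clean_text("Throughout, we use I$_{m}$ to represent an m-dimensional identity matrix."))
# throughout we use i$_{m}$ to represent an mdimensional identity matrix
```

With a function like this, the two restoring re.sub calls inside the file loop would no longer be needed, since the equations are put back before clean_text returns.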
