Convert text file data into a dataframe

Question

I have a dataset like this, stored in a .txt file:

للہ عمران خان کو ہماری بھی عمر لگائے ہم جیسے تو اس ملک میں اور بھی پچیس کروڑ ہیں مگر خان آپ جیسا تو یہاں دوسرا نہیں ۔۔۔ اللہ آپکی حفاظت فرمائے آمین

[Real,politics,sarcasm ,rise moral]

How can I convert this into a DataFrame with two columns, with the English text in column one and the Urdu text in column two?

Thanks!

It's hard to tell how the English and Urdu text is really laid out in your file and what would be your desired output for your example. Can you clarify? — AKX, Jun 18 '22 at 09:07
Actually, both the input and the output are in the text file, so I have to read the file the way I would read a CSV. When reading it, how can I map the contents into a DataFrame as input and output? The Urdu text must be mapped to the input column, while the text in square brackets must be mapped to the output column. — Ahmad Mahmood, Jun 18 '22 at 09:13
So would lines like that repeat in the input file? `Urdu, English-in-brackets, Urdu, English-in-brackets, Urdu, English-in-brackets, ...`? — AKX, Jun 18 '22 at 09:14
multiple text files each file having data like this. Urdu, English-in-brackets — Ahmad Mahmood, Jun 18 '22 at 09:16

score 1 · Accepted Answer · answered Jun 18 '22 at 09:22

"multiple text files each file having data like this. Urdu, English-in-brackets"

So start with a function that reads a single file of that type:

def read_single_file(filename: str) -> tuple[str, str]:
    urdu = ""
    english = ""
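    # Assumes the files are saved as UTF-8; without an explicit encoding,
    # open() uses the platform default, which may not decode Urdu text.
    with open(filename, encoding="utf-8") as f: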
        for line in f:
            line = line.strip()  # remove newlines etc.
            if not line:  # ignore empty lines
                continue
            if line.startswith("["):
                english = line.strip("[]")
            else:
                urdu = line
    return (urdu, english)
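
To check that the function behaves as intended, here is a minimal sketch that writes one sample file to a temporary directory and reads it back. The file name `sample.txt`, the shortened sample text, and the UTF-8 encoding are assumptions for illustration; the function is repeated so the snippet runs on its own.

```python
import tempfile
from pathlib import Path


def read_single_file(filename: str) -> tuple[str, str]:
    """Return (urdu, english) from a file with one Urdu line and one [bracketed] line."""
    urdu = ""
    english = ""
    # UTF-8 is an assumption about how the files were saved
    with open(filename, encoding="utf-8") as f:
        for line in f:
            line = line.strip()  # remove newlines etc.
            if not line:  # ignore empty lines
                continue
            if line.startswith("["):
                english = line.strip("[]")
            else:
                urdu = line
    return (urdu, english)


# Hypothetical sample mirroring the layout shown in the question
sample = "اللہ آپکی حفاظت فرمائے آمین\n\n[Real,politics,sarcasm ,rise moral]\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"
    path.write_text(sample, encoding="utf-8")
    urdu, english = read_single_file(str(path))

print(repr(english))
```

Note that if a file contains several Urdu lines or several bracketed lines, only the last of each kind is kept, because each assignment overwrites the previous one.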

Then, loop over your files; I'll assume they're just *.txt:

import glob

results = [read_single_file(filename) for filename in glob.glob("*.txt")]

Now that you have a list of 2-tuples, you can just create a dataframe out of it:

import pandas as pd

df = pd.DataFrame(results, columns=["urdu", "english"])
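
The question asks for the English text in the first column, and the sample labels contain stray spaces (`sarcasm ,`). As a hedged sketch of one possible follow-up step, the snippet below reorders the columns and splits the bracketed text into a list of trimmed labels. The rows in `results` are placeholders standing in for real file contents, and the `labels` column name is an assumption, not something from the question.

```python
import pandas as pd

# Placeholder rows standing in for the (urdu, english) tuples read from files
results = [
    ("urdu text 1", "Real,politics,sarcasm ,rise moral"),
    ("urdu text 2", "Fake,sports"),
]
df = pd.DataFrame(results, columns=["urdu", "english"])

# Split on commas and trim the whitespace around each label
df["labels"] = df["english"].apply(
    lambda s: [part.strip() for part in s.split(",") if part.strip()]
)

# Put the English column first, as the question requests
df = df[["english", "urdu", "labels"]]

print(df)
```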
