1

I am currently try to have the data like this (The ... just means there are more lines, no need to post the entire file here.)

376 932
noms sommets
0000 Abbesses
0001 Alexandre Dumas
0002 Alma Marceau
...
0375 Étienne Marcel
coord sommets
0000 308 536
0001 472 386
0002 193 404
...
0375 347 412
arcs values
0 238 41
0 159 46
1 12 36
1 235 44
...
367 366 120.0

The data should be like this when converted to csv, the data should has three columns

nom sommets coord sommets
0000 Abbesses 308 536

However, everything in the data is a straight line and hard to deal with. What is the solution for this. I try to convert it from txt to csv.

BdR
  • 2,770
  • 2
  • 17
  • 36
Beam291
  • 98
  • 7

3 Answers3

1
from pathlib import Path

import pandas as pd

f = Path("metro")

lines = [[], [], []]
file_num = -1

for line in f.read_text().split("\n"):
    if not line:
        continue
    cells = line.split(maxsplit=1)
    if cells[0] in ["noms", "coord", "arcs"]:
        file_num += 1
    if file_num >= 0:
        lines[file_num].append(cells)


def get_df(data):
    df1 = pd.DataFrame(data)
    df1.columns = df1.iloc[0]
    df1 = df1.drop(index=0)
    df1.columns.name = None
    return df1


df1 = get_df(lines[0])
df2 = get_df(lines[1])
df3 = get_df(lines[2])

df2.columns = [df1.columns[0], " ".join(df2.columns)]

res = pd.merge(df1, df2, how="outer", on="noms")
#    noms          sommets coord sommets
# 0  0000         Abbesses       308 536
# 1  0001  Alexandre Dumas       472 386
# 2  0002         NaN       193 404
res.to_csv("metro.csv")

Edit: to resolve the encoding issue pass the encoding you want to read_text().

for line in f.read_text(encoding="latin-1").split("\n"):
    ...

Edit: you don't say how you want to process the columns under "arcs values", so I left the df3 as is.

suvayu
  • 4,271
  • 2
  • 29
  • 35
  • what happen if the data is not utf-8, 'utf-8' codec can't decode byte 0xe9 in position 81: invalid continuation byte – Beam291 Nov 22 '22 at 15:21
  • Which line gives you this issue? The Pandas part should be independent. Difficult to say without looking at the full error. – suvayu Nov 22 '22 at 15:24
  • the problem start from 'for line in f.read_text().split("\n"):'. I think the problem come from data, normally I will encoding it to 'latin-1' – Beam291 Nov 22 '22 at 15:26
  • one workaround could be to read it as bytes: `f.read_bytes()`, but then you have to replace all the `str` in the for loop with bytes, e.g. `b'\n'` etc. But before you create the dataframe, you need to convert the parsed results from `bytes` to `str`. You should figure out why you have this kind of character in the data. – suvayu Nov 22 '22 at 15:31
  • for data sample: https://drive.google.com/file/d/18QGnzcJ4PYaVY_qnFZ1cwVlCNZc1zUSt/view?usp=sharing – Beam291 Nov 22 '22 at 15:49
  • @Beam291 note, your file has an extra line in the beginning which does not match your description in the question. So you would need to remove this line first. – suvayu Nov 22 '22 at 16:30
  • for the df3, I think I will deal with that base on your help – Beam291 Nov 23 '22 at 03:59
1

without imports you can do this.

There's some safety checks due to the noise in the data.

Also, I'm using a dict as they are extremely fast when trying to find key/value pairs.

with open("metro", encoding="latin-1") as infile:
    data = infile.read().splitlines()

nom_start = "noms sommets"
coord_start = "coord sommets"
end = "arcs values"
mode = None

# use a dict as lookups on dicts are stupidly fast.
result = {}

for line in data:
    # this one is needed due to the first line
    if mode == None:
        if line == nom_start:
            mode = nom_start
        continue
    line = line.strip()
    # safety check
    if line != "":
        if line == end:
            # skip the end data
            break
        key, value = line.split(maxsplit=1)
        if mode == nom_start:
            if line != coord_start:
                result[key] = {"sommets": value}
            else:
                mode = coord_start
        else:
            result[key]["coord sommets"] = value


# CSV separator
SEP = ";"
with open("output.csv", "w", encoding="latin-1") as outfile:
    # CSV header
    outfile.write(f"noms{SEP}sommets{SEP}coord sommets\n")
    for key, val in result.items():
        outfile.write(f'{key}{SEP}{val["sommets"]}{SEP}{val["coord sommets"]}\n')
Edo Akse
  • 4,051
  • 2
  • 10
  • 21
1

Quite an interesting problem. I'm assuming the file contains more columns, or sets of key/variables, than just in the example. So you wouldn't want to hard-code the column names.

I would create an new empty dataframe, then read the input file line-by-line, check if it is the next new column name (not starting with digits), build a dictionary with those new values, and then keep merging that dictionary as a new columns into the new dataframe.

So I would do something like this:

import pandas as pd

# create an Empty DataFrame object
df_new = pd.DataFrame({"recordkey": []})

# read all input lines
inputfilename = "inputfile.txt"
file1 = open(inputfilename, 'r')
Lines = file1.readlines()

tmpdict = {}
colname = ""

# iterate through all lines
for idx in range(len(Lines)):
    line = Lines[idx]
    # this is assuming all keys are exactly 4 digits
    iscolname = not (line[:4].isdigit())
    
    if not iscolname:
        # split on the first space for key and value
        tmp = line.split(" ", 1)
        getkey = tmp[0].strip()
        getvalue = tmp[1].strip()

        # add to dictionary
        tmpdict[getkey] = getvalue

    # new column or last line
    if iscolname or idx == len(Lines)-1:
        # new column (except skip for first line of file)
        if colname != "":
            # create new column from dictionary
            df_tmp = pd.DataFrame(tmpdict.items(), columns=["recordkey", colname])
            df_new = df_new.merge(df_tmp, how='outer', on='recordkey')

        # keep new column name
        colname = line.strip()
        tmpdict = {}

# display dataframe
print(df_new)

# write dataframe to csv
fileoutput = "outputfile.csv"
df_new.to_csv(fileoutput, sep=",", index=False)
BdR
  • 2,770
  • 2
  • 17
  • 36
  • yes, the file will be separated into two tables with 2 - 3 columns. The problem is the data contains everything in a straight line – Beam291 Nov 23 '22 at 00:16