2

I have a txt file that I read into a list of strings in which each item of the list is a data sample of 3 variables (A,B,C)

txt = [
     '001   0198110',
     '0020130198110',
     '0030132198110',
]

A separate support dataframe that looks like this

df = pd.DataFrame(data=[[1,3,"A"],[4,3,"B"],[7,6,"C"]],columns=["Position","Lenght","Name"])

df
       Position   Lenght    Name
    0   1         3         A
    1   4         3         B
    2   7         6         C

indicates how I should read the txt. For example the variable A starts at position 1 of each record and its length is 3.

Taking as an example only the first record

'001   0198110'

we have that A=001 B=empty->NaN C=0198110

What would be an efficient way (considering that the txt file is 50k lines and contains +600 variables) to create the expected parsed dataframe

    A   B       C
0   1   NaN     198110
1   2   13      198110
2   3   13     2198110

I can use a for loop, but perhaps there is a better way

gabboshow
  • 5,359
  • 12
  • 48
  • 98

2 Answers2

1

Try pd.read_fwf:

from io import StringIO

txt = ["001    198110", "0020130198110", "0030132198110"]

df = pd.DataFrame(
    data=[[1, 3, "A"], [4, 4, "B"], [7, 6, "C"]],
    columns=["Position", "Lenght", "Name"],
)


x = pd.read_fwf(
    StringIO("\n".join(txt)),
    widths=df.Lenght,
    header=None,
)
x.columns = df.Name.to_list()
print(x)

Prints:

   A      B       C
0  1    NaN  198110
1  2  130.0  198110
2  3  132.0  198110

Note: I changes the txt list/df dataframe:

  • First value of txt to "001 198110"
  • Length of B to 4 in df

Prints:

Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
  • 1
    great! thanks! you also spotted an error in my expected output... I corrected it in the question, but your answer is correct – gabboshow Nov 08 '22 at 12:45
0

It's hard to recommend optimizations without knowing what the data usually looks like. My first two approaches would be those:

If there are a lot of similarities like C, you could look into running a diff first, so you know the ranges where you can copy without rechecking over all the files.

If there is a lot of overlap between variables, you could sort them by starting position and write multiple at the same time, which should also be easier on memory than random accesses

Lukas Schmid
  • 1,895
  • 1
  • 6
  • 18