-1

I've been struggling on a problem for a few hours.

I've a data-set which looks like this :

References                                                         425 1451  259
/1/J.H.Gibbons,R.L.Macklin:Phys.Rev.B137,1508(1965)                425 1451  260
/2/P.R.Wrean,C.R.Brune,R.W.Kavanagh:Phys.Rev.C49,1205(1994)        425 1451  261
/3/L.Van der Zwan,K.W.Geiger:Nucl.Phys.A152,481(1970)              425 1451  262
/4/T.Murata:JAERI-Conf 98-003,p.215                                425 1451  263
 3.500000+6 3.844649-2 3.508375+6 3.783472-2 4.000000+6-2.064883-2 425 3  2    7
 4.014688+6-2.289045-2 4.403160+6-8.623264-2 4.500000+6-1.022847-1 425 3  2    8
 1.450000+7-2.039133-1 1.500000+7-1.930321-1                       425 3  2   17

and I wonder how I could read and have in an array or a txt object only the last rows where they're values in each columns. To be sure I would like to have an object like this :

 3.500000+6 3.844649-2 3.508375+6 3.783472-2 4.000000+6-2.064883-2 425 3  2    7
 4.014688+6-2.289045-2 4.403160+6-8.623264-2 4.500000+6-1.022847-1 425 3  2    8
 1.450000+7-2.039133-1 1.500000+7-1.930321-1                       425 3  2   17

Sincerely I've found no such thing on StackOverflow so I ask the question directly.

Any answers would be nice. Thank you ! hh

hhvni
  • 1
  • 1
  • Is the input a text file? Are you trying to extract lines with numeric values only? What did you try so far? – Ken Hung Jul 12 '22 at 12:17
  • Yep the input is a text file. I just wanna get the lines where all the columns values are numeric, they're all the time the last rows (the number depends on the file). To be honest I haven't tried much except using the `loadtxt` function from numpy. – hhvni Jul 12 '22 at 12:21
  • So what is the expected output? A list of strings? – Ken Hung Jul 13 '22 at 13:21

2 Answers2

0

You'll need two components: 1.) A file reader How to read files in python 2.) A filter. I'd recommend regex for this regex in python

Loop through each line using your file reader. Use your filter to look for and extract data on each line.

This example may work if 425 will be at the start of all your data:

import re

data = []
# Allows you to read from your file
with open('your.txt') as f:

    # Will loop through each line
    for line in f.readlines():

        # Looks for 425 and pulls till end of line:
        data_point = re.search('425.*', line).group()
        data.append( data_point )
0

The table columns appear to have fixed widths.

Without having more of the table, it is hard to reliably determine if a row is meaningful or not. I've dropped rows if the value in the first column doesn't appear to be a number like 3.500000+6 by looking for . at the exact position.

The numbers in the first 6 columns seem to be in scientific notation but without the e. Code fixing that is included at the bottom.

import pandas as pd

filename = r"C:\Users\Bobson Dugnutt\Desktop\table.txt"

column_widths = (11, 11, 11, 11, 11, 11, 4, 2, 3, 5)

df = pd.read_fwf(filename, widths=column_widths, header=None)

print(df, end="\n\n")

# Returns True if it looks like a number like "3.500000+6"
def is_meaningful_row(value):
    #print(value)
    if isinstance(value, str):
        if value[-9] == ".":
            return True
    
    return False

# Remove rows if the first column doesn't look like a number like "3.500000+6"
df = df[df[0].map(is_meaningful_row)]

# Reset the index so the first row is no longer index 5
df = df.reset_index(drop=True)

print(df, end="\n\n")

# Replace NA/NaN values with empty strings
df = df.fillna("")

print(df, end="\n\n")

# Transforms a string like "-2.039133-1" to "-2.039133e-1"
def fix_scientific_notation(value):
    if isinstance(value, str):
        return value[0:1] + value[1:].replace("+", "e+").replace("-", "e-")
    
    return value

# For the first 6 columns, fixes the strange scientific notation and converts 
# values from string to float64 or int64
for column in df.columns[:6]:
    df[column] = df[column].map(fix_scientific_notation)
    df[column] = pd.to_numeric(df[column])

print(df)

Output:

             0            1            2            3            4            5    6  7    8    9
0   References          NaN          NaN          NaN          NaN          NaN  425  1  451  259
1  /1/J.H.Gibb  ons,R.L.Mac  klin:Phys.R  ev.B137,150      8(1965)          NaN  425  1  451  260
2  /2/P.R.Wrea  n,C.R.Brune  ,R.W.Kavana  gh:Phys.Rev  .C49,1205(1         994)  425  1  451  261
3  /3/L.Van de  r Zwan,K.W.  Geiger:Nucl  .Phys.A152,    481(1970)          NaN  425  1  451  262
4  /4/T.Murata  :JAERI-Conf   98-003,p.2           15          NaN          NaN  425  1  451  263
5   3.500000+6   3.844649-2   3.508375+6   3.783472-2   4.000000+6  -2.064883-2  425  3    2    7
6   4.014688+6  -2.289045-2   4.403160+6  -8.623264-2   4.500000+6  -1.022847-1  425  3    2    8
7   1.450000+7  -2.039133-1   1.500000+7  -1.930321-1          NaN          NaN  425  3    2   17

            0            1           2            3           4            5    6  7  8   9
0  3.500000+6   3.844649-2  3.508375+6   3.783472-2  4.000000+6  -2.064883-2  425  3  2   7
1  4.014688+6  -2.289045-2  4.403160+6  -8.623264-2  4.500000+6  -1.022847-1  425  3  2   8
2  1.450000+7  -2.039133-1  1.500000+7  -1.930321-1         NaN          NaN  425  3  2  17

            0            1           2            3           4            5    6  7  8   9
0  3.500000+6   3.844649-2  3.508375+6   3.783472-2  4.000000+6  -2.064883-2  425  3  2   7
1  4.014688+6  -2.289045-2  4.403160+6  -8.623264-2  4.500000+6  -1.022847-1  425  3  2   8
2  1.450000+7  -2.039133-1  1.500000+7  -1.930321-1                           425  3  2  17

            0         1           2         3          4         5    6  7  8   9
0   3500000.0  0.038446   3508375.0  0.037835  4000000.0 -0.020649  425  3  2   7
1   4014688.0 -0.022890   4403160.0 -0.086233  4500000.0 -0.102285  425  3  2   8
2  14500000.0 -0.203913  15000000.0 -0.193032        NaN       NaN  425  3  2  17
GordonAitchJay
  • 4,640
  • 1
  • 14
  • 16