I have an assignment to export clean CSV files that contain only the headers and the data; everything else must be filtered out. There are 500+ text files.

Each text file must become a separate CSV file, named in the format "YEAR-MONTH-DAY (ORIGINAL_FILE_NAME)".

An example of this is: Original file: pm990902.b17

CSV file: 1999-09-02 (pm990902.b17).csv

I already have code for filtering the data:

import pandas as pd
import numpy as np
import glob
pred = lambda x: x  in np.arange(0, 192, 1)
inval = [99999.9, 999.0, 999.9900, 999.9]
# raw string, so the backslashes in the Windows path are not treated as escapes
files = glob.glob(r'C:\Users\Lenovo\Desktop\Python\Files\*')
for file in files:
    
    df = pd.read_csv(file, header = 0, delim_whitespace=True, skiprows=pred, 
                 engine='python', na_values=inval)
    
    df = df[1:]
    df.to_csv('Name of the new file.csv', index=False)

I still can't figure out how to build the new file name from the date, which is the actual problem for me.

This is what the file looks like with the date in the first line:

*AAAAAAAAAAAAAAAAAAAAAAAAAA          zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz       05-JAN-2000 12:21:0005-JAN-2000 14:00:300102
160  2160
           1.00     1.0   1.00   1.00  1.0000   1.0   1.0     1.0    1.0000    1.0000   1.00  1.000   1.0   1.0  1.0000  1.0000
        9999.90 99999.0 999.90 999.00 99.9900 999.0 999.9 99999.9  999.9900  999.9900 999.90 99.990 999.9 999.9 99.9900 99.9900
Pressure [hPa]
Geopotential height [gpm]
Temperature [K]
Relative humidity [%]
Ozone partial pressure [mPa]
Horizontal wind direction [decimal degrees]
Horizontal wind speed [m/s]
GPS geometric height [m]
GPS longitude [decimal degrees E]
GPS latitude [decimal degrees N]
Internal temperature [K]
Ozone raw current [microA]
Battery voltage [V]
Pump current [mA]
Ozone mixing ratio per volume [ppm]
Ozone partial pressure uncertainty estimate [mPa]*

I can't attach the whole text file, but this is an example of the beginning of every text file.

So how can I get the desired date for the file name out of this line?

Shradda
  • 1
    Please post text as text, not as a picture of text. If other people can copy and paste your data and your code, they can easily reproduce what you're trying to do. If you post a screenshot of it, they have to transcribe it by hand, and nobody wants to spend time doing that. – Samwise May 22 '22 at 21:27
  • 1
    [Why you should not upload images of code or data](https://meta.stackoverflow.com/questions/285551/why-should-i-not-upload-images-of-code-data-errors-when-asking-a-question) – Grismar May 22 '22 at 22:41
  • I'm sorry I tried to attach the text file to the question but I couldn't do it. I hope this works. Thank you for correcting me. – Shradda May 22 '22 at 22:49

1 Answer

If the input files always have the same format, with the date/time elements always at the end of the line, you can split the line, and just take the third element from the end.

You can do this with negative indexing, as described on w3schools:

line = "*AAAAAAAAAAAAAAAAAAAAAAAAAA          zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz       05-JAN-2000 12:21:0005-JAN-2000 14:00:300102"

# split() with no arguments splits on any run of whitespace
date_str = line.split()[-3]
print(date_str)

Output:

05-JAN-2000
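
The question asks for a YEAR-MONTH-DAY name, while the header stores the date as 05-JAN-2000. A minimal sketch of the conversion with datetime.strptime follows. The helper name to_iso_date is made up for illustration, and it assumes an English locale so that %b recognises JAN, FEB, and so on.

```python
from datetime import datetime


def to_iso_date(date_str: str) -> str:
    """Convert a header date such as '05-JAN-2000' to '2000-01-05'."""
    # %d = day, %b = abbreviated month name (matched case-insensitively,
    # English locale assumed), %Y = four-digit year
    parsed = datetime.strptime(date_str, "%d-%b-%Y")
    return parsed.strftime("%Y-%m-%d")


print(to_iso_date("05-JAN-2000"))  # 2000-01-05
```

If a header does not match this pattern, strptime raises ValueError, which is probably what you want for spotting malformed files among the 500+.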

As for applying this to your logic, you'll need to replace the line below with the code example further down:

    df.to_csv('Name of the new file.csv', index=False)

You also need to import os, since I use os.path and os.sep to build the resulting filename.

    filename_orig = os.path.basename(file)
    filedir = os.path.dirname(file)
    df.to_csv(f"{filedir}{os.sep}{date_str} ({filename_orig}).csv", index=False)

Note that this requires Python 3.6+ because it uses f-strings.
Also note that date_str has to come from the file itself: open each original file and read its first line before building the name. As extracted, date_str still has the form 05-JAN-2000, so it needs to be converted to get the YEAR-MONTH-DAY form asked for in the question.
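
To make that last point concrete, here is a rough sketch of reading the header line and building the output path. The helper names header_date and csv_path are hypothetical, and the sketch assumes every file starts with a header line laid out like the sample in the question and that an English locale is in effect for the month abbreviation.

```python
import os
from datetime import datetime


def header_date(path):
    """Return the first date in the file's header line as YYYY-MM-DD."""
    with open(path) as fh:
        first_line = fh.readline()
    # third whitespace-separated token from the end, e.g. '05-JAN-2000'
    raw = first_line.split()[-3]
    return datetime.strptime(raw, "%d-%b-%Y").strftime("%Y-%m-%d")


def csv_path(path):
    """Build '<folder>/<YYYY-MM-DD> (<original name>).csv' for an input file."""
    folder, name = os.path.split(path)
    return os.path.join(folder, f"{header_date(path)} ({name}).csv")
```

Inside the existing loop, the call would then become df.to_csv(csv_path(file), index=False).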

Edo Akse