Python CSV reader checking for complete data

Question

I'm trying to create a CSV reader which only includes the data with readings in all columns 6,7 and 8.

My data is of rainfall over days of the year. With my code there is an exception however that some of the data is recorded over a few days. The amount of days the data is recorded over is indicated in row[6], leaving the previous days with blanks in columns 6, 7 and 8 even though they are complete.

So for the reader I need to make a counter which checks firstly whether the data is complete (no blanks) or if it has blanks and is either part of another reading (recorded over a few days) or incomplete (no readings) what i have done so far is shown here:

datalist = []
def read_complete_data():
    ''' Reads the file'''
    filename = input("Enter file name:") #File must be in the same folder as the directory
    with open(filename, 'r') as fileobj:
    #open file for reading
        reader = csv.reader(fileobj, delimiter = ',')
        next(reader)
    tempList = []
    for row in reader:
        if row[5] == "" and row[6] == "" and row[7] == "" :
            tempList.append(row)
    #Checks if the row is complete
        elif row[5] != "" and row[6] != "" and row[7] != "":
            numDay = int(row[6])

    while numDay > 1:
        datalist.append(tempList[1-numDay])
        numDay -= 1

Example of the data:

Product code, Station number, Year, Month, Day, Rainfall, Period, Quality
IDCJAC0009, 70247, 1988, 12, 21, 0, , Y
IDCJAC0009, 70247, 1988, 12, 22, 0, , N
IDCJAC0009, 70247, 1988, 12, 23, 0.2, 1, Y 
IDCJAC0009, 70247, 1988, 12, 24, 0.4, 1, Y
IDCJAC0009, 70247, 1988, 12, 25, , Y
IDCJAC0009, 70247, 1988, 12, 26, 34.8, 2, Y 
IDCJAC0009, 70247, 1988, 12, 27, 30.8, 1, N

As seen above, the first two data samples are incomplete as there is no period of which they are measured over. It can be seen that the data sample on line 5 is incomplete, however the following sample has a period measured of 2 meaning that line 5 is in fact complete it is just measured over a 2 day span rather than a single day. This is an example for 2 day measured but there are larger examples where up to 5 days are grouped into one measurement. The last column is the quality of the data and whether it was a quality check. It needs to be Y to be complete data. As i added row 1 and 2 are still incomplete. Row 7 however is now incomplete.

Output: Basically what i am trying to achieve is the CSV file to be read through and the incomplete data lines to be removed from datalist. Using this temporary list i was trying to make datalist full of only the complete data sets.

Wanted output:

Product code, Station number, Year, Month, Day, Rainfall, Period, Quality
IDCJAC0009, 70247, 1988, 12, 23, 0.2, 1, Y 
IDCJAC0009, 70247, 1988, 12, 24, 0.4, 1, Y
IDCJAC0009, 70247, 1988, 12, 25, , Y
IDCJAC0009, 70247, 1988, 12, 26, 34.8, 2, Y

The next(reader) line is used since the top line of the data contains titles rather than actual data. I'm thinking the problem is arising for me with how ive written the for loop and while loop below using a temporary list which then copies back into the main list (called datalist). There could possibly be a line of code i am missing that is needed for it to work.

I know this is probably a very confusing question and might be tough to answer as the data is not given here but any help with what might be wrong within my code and reading CSV files is greatly appreciated. I thought i would put the question up here even with it being quite confusing to explain. Thanks

Your question would benefit greatly from a representative example of your input and the matching output you want to generate. Having a description in it in English is nice and well, but actually *seeing* it is better. — Tomalak, Oct 16 '17 at 12:06
Thanks for the sample, but none of the rows there has 8 columns. And your code shows that your input is comma-delimited, but your sample is not. The output you want to see for that input is also missing. — Tomalak, Oct 16 '17 at 12:28
Sorry i took out the rows which werent relevent for this part i was trying to find. They included product code and station number. Ill add the final column in which i forgot my bad — Peter Jeppo, Oct 16 '17 at 12:31
I can add the first 2 columns in if it makes it easier to understand. — Peter Jeppo, Oct 16 '17 at 12:35
Please show the actual data... Otherwise `delimiter = ','` is clearly incorrect... Where's the commas? — OneCricketeer, Oct 16 '17 at 12:38
You are *still* missing the expected output for that sample. I've been saying it three times now. If anyone wants to provide a solution, they need plausible input to work with - *something they can copy and paste to their computers* - and a wanted result to compare their solution against. — Tomalak, Oct 16 '17 at 12:39
No, now it's completely broken. Your text still speaks of a column 8, which you do not seem to have, your code speaks of comma-separated input, which you do not seem to have and you **still** have not given a simple list of values you want to produce from your input. Please, focus. — Tomalak, Oct 16 '17 at 12:52
I realize you mean well and have given all kinds of explanations, but natural language is imprecise and things that are perfectly clear to you, who has thought about this problem and seen all the data for quite some time now, are not perfectly clear to an uninvolved 3rd party. Text paragraphs are a great addition to your code and data, but they cannot *replace* your code and data. Also, post your actual data, not something that kind-of looks like it. I need to copy this stuff to my machine and it needs to work exactly the way you describe it, otherwise it's useless. — Tomalak, Oct 16 '17 at 13:04

score 3 · Accepted Answer · edited May 01 '20 at 15:48

CSV readers are generators in Python This means they do not read ahead more than required, which is efficient because it's not necessary to store the entire CSV in memory first.

It would make sense to keep the generator semantics and write a small filter function that wraps the CSV reader and makes changes to it's output on the fly.

The requirements are:

If there are no read-outs (Rainfall,Period) in the current row, hold it back until the next row that has read-outs.
If a row appears to be invalid, do not output it at all.

In the following, filter_rainfall_data is a generator function (generators use yield instead of return) that has a little buffer for requirement #1 and simply does not yield anything for requirement #2.

import csv

def filter_rainfall_data(filename):
    with open(filename, 'r', encoding='UTF-8', newline='') as rows:
        reader = csv.reader(rows, delimiter=',')

        # yield header row
        yield next(reader)
        buffer = []

        for row in reader:
            # strip whitespace from all values
            row = list(map(str.strip, row))

            # any row without read-outs is stored in a buffer
            if row[5] == "" and row[6] == "":
                buffer.append(row)
                continue

            # any row with proper read-outs is yielded
            if row[5] > "" and row[6] > "":
                days = int(row[6])

                # yield all previously buffered rows, if necessary
                yield from buffer[-days - 1:]

                # finally yield the current row itself
                yield row
                buffer = []

Usage would be like this:

filename = input("Enter file name:") 

for row in filter_rainfall_data(filename):
    print(row)

which prints this for me with your sample input

['Productcode', 'Stationnumber', 'Year', 'Month', 'Day', 'Rainfall', 'Period', 'Quality']
['IDCJAC0009', '70247', '1988', '12', '23', '0.2', '1', 'Y']
['IDCJAC0009', '70247', '1988', '12', '24', '0.4', '1', 'Y']
['IDCJAC0009', '70247', '1988', '12', '25', '', '', 'Y']
['IDCJAC0009', '70247', '1988', '12', '26', '34.8', '2', 'Y']
['IDCJAC0009', '70247', '1988', '12', '27', '30.8', '1', 'N']

You can use a CSV writer to turn that into a new CSV file if needed.

Notes:

buffer[-days - 1:] is list a slice of the last N rows from the buffer. When the Period is 5 days, buffer[-days - 1:] would get the last 4 buffered rows.
days = int(row[6]) will fail on non-numeric values.
yield from has been introduced in Python 3.3. See How to Pythonically yield all values from a list?
You should always give the encoding argument when opening text files.
You should always set newline='' when opening CSV files. See the Footnotes in the csv module documentation.

This is pretty much perfect! i might put a few little tweaks but this is what i was after! Thank you so much! — Peter Jeppo, Oct 17 '17 at 06:11
It follows closely the idea behind your code, it would be simple to make your code work with little more than a few changes to indentation. You do all the right things, but at the slightly wrong times. — Tomalak, Oct 17 '17 at 07:50

Python CSV reader checking for complete data

1 Answers1