Getting the filename with on_bad_line

Question

I am trying to get the name of the file in which a bad line occurs. I have a function that collects all the bad lines and writes them to a .txt file, but when I pass in a parameter for the filename, it just prints all the filenames.

This is the bad line function:

    def badlines_collect(self, bad_line: list[str]) -> None:
        badline_lst.append(bad_line)
        today = date.today()
        todaytime = datetime.datetime.now().strftime("%Y%m%d")
        with open("bad_line1_{}.txt".format(todaytime), 'w') as fp:
            for line in badline_lst:
                fp.write("Today's date: " + str(today) + currentfile + ": {}\n".format(line))
        fp.close()
        print(badline_lst)
        return None

This is the function where I am calling it and passing in a parameter to get the filename:

    def getCSV(self, cur_publisher):
        """
        :return:
        """
        print(bucket_name + '/' + cur_publisher)
        dfm = pd.DataFrame()
        filename = list(self.bucket.list_blobs(prefix=cur_publisher))
        print(filename)
        for file_name in filename:
            if '.csv' in str(file_name.name):
                print("Crawling on File {} ......\n".format(file_name.name))
                currentfile = file_name.name
                print(currentfile)
                blop = self.bucket.blob(blob_name = "{}".format(file_name.name))
                data = blop.download_as_string()
                df = pd.read_csv(io.BytesIO(data), encoding='utf-8', sep=",", engine='python',
                                 on_bad_lines=self.badlines_collect)
                if (df.count().sum()) > 0:
                    df.insert(0, "filename", file_name.name)
                    dfm = pd.concat([dfm, df], ignore_index=True)
                    dfm = pd.concat([dfm, df], ignore_index=True)
                    dfm = dfm.rename_axis(index='', columns="index")
                    print(dfm)
                else:
                    pass
                    print("{} is empty \n".format(file_name.name))
            else:
                pass
        return self.stack

The result I get is that all the filenames in the GCS bucket are printed into bad_line1.txt, and not the bad line errors.

I am not entirely sure how you want this to work. It seems a bit strange that you write the file on each call (in badlines_collect). You could probably use a lambda expression (or a local function) that forwards the call to badlines_collect together with the current file_name. — rich, Mar 14 '23 at 13:53
Please trim your code to make it easier to find your problem. Follow these guidelines to create a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). — Blue Robin, Mar 14 '23 at 13:59

score 0 · Answer 1 · answered Mar 14 '23 at 14:03

I assume badline_lst is a global variable. I would recommend the following: use the badlines_collect function only to collect the file names and bad lines, then write them all at once at the end of getCSV.

Try this:


    def getCSV(self, cur_publisher):
        print(bucket_name + '/' + cur_publisher)
        dfm = pd.DataFrame()
        filename = list(self.bucket.list_blobs(prefix=cur_publisher))
        # maybe you don't need this here (if badline_lst is already a global)?
        badline_lst = []
        print(filename)
        for file_name in filename:
            if '.csv' in str(file_name.name):
                print("Crawling on File {} ......\n".format(file_name.name))
                currentfile = file_name.name
                print(currentfile)
                blop = self.bucket.blob(blob_name="{}".format(file_name.name))
                data = blop.download_as_string()

                # define the function here, so you have access to the variables
                # of the current iteration (a plain closure, so no self parameter)
                def badlines_collect(bad_line: list[str]) -> None:
                    badline_lst.append((bad_line, currentfile))

                # pass the local function, not self.badlines_collect
                df = pd.read_csv(io.BytesIO(data), encoding='utf-8', sep=",", engine='python',
                                 on_bad_lines=badlines_collect)
                if (df.count().sum()) > 0:
                    df.insert(0, "filename", file_name.name)
                    dfm = pd.concat([dfm, df], ignore_index=True)
                    dfm = pd.concat([dfm, df], ignore_index=True)
                    dfm = dfm.rename_axis(index='', columns="index")
                    print(dfm)
                else:
                    print("{} is empty \n".format(file_name.name))
        today = date.today()
        todaytime = datetime.datetime.now().strftime("%Y%m%d")

        # write everything once, after all files have been read;
        # the with block closes the file, so no fp.close() is needed
        with open("bad_line1_{}.txt".format(todaytime), 'w') as fp:
            for line in badline_lst:
                fp.write("Today's date: " + str(today) + " " + line[1] + ": {}\n".format(line[0]))
        print(badline_lst)
        return self.stack
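
For anyone who wants to try the idea without a bucket, here is a minimal runnable sketch of the same approach. It assumes pandas 1.4 or later (where on_bad_lines accepts a callable with engine='python'). The `files` dict, `read_all`, and `collect` are hypothetical names standing in for the blobs and methods in the question, not part of the original code.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for the CSV blobs in the bucket.
files = {
    "good.csv": b"a,b\n1,2\n3,4\n",
    "bad.csv": b"a,b\n1,2\n5,6,7\n8,9\n",  # third line has one field too many
}


def read_all(files):
    bad_lines = []  # (file name, bad line) pairs collected across all files
    frames = []
    for name, data in files.items():
        # Local function: it can see `name`, the file currently being parsed.
        def collect(bad_line):
            bad_lines.append((name, bad_line))
            return None  # returning None tells pandas to skip the line

        df = pd.read_csv(io.BytesIO(data), engine="python", on_bad_lines=collect)
        frames.append(df)
    return frames, bad_lines


frames, bad_lines = read_all(files)

# Build the log lines once, after every file has been read.
log_lines = ["{}: {}".format(name, line) for name, line in bad_lines]
print(log_lines)
```

Because pandas calls the handler while read_csv is still running, `name` always refers to the file being parsed at that moment.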

I tried the code, and it does not print the correct filename. Yes, badline_lst is a global variable. — DarrenC, Mar 14 '23 at 15:15
Well, it is hard to debug it since it is not a runnable code fragment. I think the code above shows, in principle, how it can be done. Either you post some minimal runnable code example, or you will have to figure it out on your own ;) — rich, Mar 14 '23 at 21:11

DarrenC · Accepted Answer · 2023-03-23T14:13:20.423

    def badlines_collect(self, bad_line: list[str], filename: str) -> None:
        badline_lst.append(bad_line)
        today = date.today()
        todaytime = datetime.datetime.now().strftime("%Y%m%d")
        # the with block closes the file on exit, so no explicit close() is needed
        with open("C:\\badline_log_{}.txt".format(todaytime), 'w') as fp:
            for line in badline_lst:
                fp.write(filename + " " + "Today's date: " + str(today) + ": {}\n".format(line))

    # in getCSV, forward the current file name along with the bad line:
    df = pd.read_csv(io.BytesIO(data), encoding='utf-8', sep=",", engine='python',
                     on_bad_lines=lambda x: self.badlines_collect(x, file_name.name))

I was able to get the filename by doing it this way.
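
One thing to watch, as far as I can tell: the log is rewritten on every bad line with the current filename placed next to every collected line, so lines from earlier files can end up labelled with a later file's name. A possible variation, sketched below, stores the filename together with each line at the moment it is collected. It uses functools.partial instead of a lambda. `collect_bad_line`, `sink`, and the in-memory `files` dict are hypothetical names for illustration, and pandas 1.4 or later is assumed.

```python
import io
from functools import partial

import pandas as pd


def collect_bad_line(bad_line, filename, sink):
    # Record the file name with the line right away, so it cannot be
    # overwritten later by the name of another file.
    sink.append("{}: {}".format(filename, bad_line))
    return None  # skip the bad line in the resulting DataFrame


# Hypothetical in-memory stand-ins for the downloaded blobs.
files = {
    "first.csv": b"a,b\n0,0\n1,2,3\n",
    "second.csv": b"a,b\n0,0\nx,y,z\n",
}

log = []
for filename, data in files.items():
    # partial pre-fills filename and sink; pandas supplies bad_line.
    handler = partial(collect_bad_line, filename=filename, sink=log)
    pd.read_csv(io.BytesIO(data), engine="python", on_bad_lines=handler)

print(log)
```

The collected `log` entries can then be written to the text file once, after the loop, rather than on every call.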
