I'm trying to read large CSV files while working on other things at the same time. My idea is to add a progress bar: something that shows how far the read has come and gives me a sense of how much time is left before it completes. I have tried tqdm as well as hand-written while loops, but unfortunately neither worked. I also tried the approach from this thread, "How to see the progress bar of read_csv", without any luck. Maybe I can apply tqdm in a different way? Are there any other solutions?
Here's the important part of the code (the part I want to add a progress bar to):
import numpy as np
import pandas as pd
# `defs` (used below) is a module from the rest of the project and is not
# shown here.


def read_from_csv(filepath: str,
                  sep: str = ",",
                  header_line: int = 43,
                  skip_rows: int = 48) -> pd.DataFrame:
    """Reads a csv file at filepath containing the vehicle trip data and
    performs a number of formatting operations
    """
    # The first call of read_csv is used to get the column names, which allows
    # the typing to take place at the same time as the second read, which is
    # faster than forcing type afterwards
    df_names: pd.Index[str] = pd.read_csv(
        filepath,
        sep = sep,
        header = header_line,
        skip_blank_lines = False,
        skipinitialspace = True,
        index_col = False,
        engine = 'c',
        nrows = 0,
        encoding = 'iso-8859-1'
    ).columns

    # The "Time" and "Time_abs" columns have some inconsistent
    # "Storage group code" preceding the actual column name, so their
    # full column names are stored so they can be renamed later. Also, we want
    # to interpret "Time_abs" as a string, while the rest are floats. This is
    # stored in a dict to use in the next call to read_csv
    time_col = ""
    time_abs_col = ""
    names_dict = {}
    for name in df_names:
        if ": Time_abs" in name:
            names_dict[name] = 'str'
            time_abs_col = name
        elif ": Time" in name:
            time_col = name
        else:
            names_dict[name] = 'float'

    # A list of values that we want pandas to interpret as having no value.
    # "NOVALUE" is the only one of these that's actually used in the files,
    # the rest are copy-pasted defaults.
    na_vals = ['', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',
               '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'n/a',
               'nan', 'null', 'NOVALUE']

    # The whole file is parsed and put in a dataframe
    df: pd.DataFrame = pd.read_csv(filepath,
                                   sep = sep,
                                   skiprows = skip_rows,
                                   header = 0,
                                   names = df_names,
                                   skip_blank_lines = False,
                                   skipinitialspace = True,
                                   index_col = False,
                                   engine = 'c',
                                   na_values = na_vals,
                                   dtype = names_dict,
                                   encoding = 'iso-8859-1'
                                   )

    # Renames the "Time" and "Time_abs" columns so they don't include the
    # storage group part
    df.rename(columns = {time_col: "Time", time_abs_col: "Time_abs"},
              inplace = True)

    # Second retyping of this column (here from string to datetime).
    # Very rarely, the Time_abs column in the csv data only has the time and
    # not the date, in which case this line throws an error. We manage this by
    # simply letting it stay as a string
    try:
        df[defs.time_abs] = pd.to_datetime(df[defs.time_abs])
    except Exception:
        # Catch Exception rather than using a bare except, so that Ctrl+C
        # still interrupts a long-running read
        pass

    # Every row ends with an extra delimiter which pandas interprets as another
    # column, but it's empty so we remove it. This is not really necessary, but
    # is done to reduce confusion when debugging
    df.drop(df.columns[-1], axis=1, inplace=True)

    # Adding extra columns to the dataframe used later
    df[defs.lowest_gear] = np.nan
    df[defs.lowest_speed] = np.nan
    for i in list(defs.second_trailer_axles_dict.values()):
        df[i] = np.nan

    return df
It's the CSV read itself that takes most of the time, which is why that is where I want to add the progress bar.
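For reference, here is a minimal sketch of the kind of chunked read with a tqdm bar that would fit here. It assumes the file can be read in pieces with the `chunksize` option of `pd.read_csv`, and that a raw line count is a good enough estimate of the total. `read_csv_with_progress` is a made-up helper name, not something from pandas or tqdm.

```python
import pandas as pd
from tqdm import tqdm


def read_csv_with_progress(filepath, chunksize=100_000, **read_csv_kwargs):
    """Read a CSV in chunks, updating a tqdm bar after each chunk.

    Hypothetical helper for illustration. Extra keyword arguments are
    passed straight through to pd.read_csv.
    """
    # One cheap pass over the file to count lines, so tqdm has a total.
    # The count includes header and skipped lines, so the bar is only
    # approximate and may stop slightly short of 100%.
    with open(filepath, "rb") as f:
        total_lines = sum(1 for _ in f)

    chunks = []
    with tqdm(total=total_lines, unit="rows") as bar:
        # With chunksize set, read_csv returns an iterator of DataFrames.
        for chunk in pd.read_csv(filepath, chunksize=chunksize,
                                 **read_csv_kwargs):
            chunks.append(chunk)
            bar.update(len(chunk))

    # Stitch the pieces back into a single DataFrame.
    return pd.concat(chunks, ignore_index=True)
```

In my function this would presumably replace only the second `pd.read_csv` call, with the same keyword arguments (`sep`, `skiprows`, `names`, `dtype` and so on) passed through unchanged.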
Thank you in advance!