
I'm trying to read big csv files while also effectively working on other stuff at the same time. That's why my solution is to create a progress bar (something that shows how far I've got through the read and gives me a sense of how long remains until the read is complete). However, I have tried using tqdm as well as hand-made while loops, and to my misfortune I have not found a solution to this problem. I have tried using this thread: How to see the progress bar of read_csv, without any luck. Maybe I can apply tqdm in a different way? Are there any other solutions?

Here's the important part of the code (the part I want to add a progress bar to):

def read_from_csv(filepath: str,
                  sep: str = ",",
                  header_line: int = 43,
                  skip_rows: int = 48) -> pd.DataFrame:
    """Reads a csv file at filepath containing the vehicle trip data and 
    performs a number of formatting operations
    """
    
    # The first call of read_csv is used to get the column names, which allows
    # the typing to take place at the same time as the second read, which is
    # faster than forcing type afterwards
    df_names: pd.Index[str] = pd.read_csv(
                            filepath,
                            sep = sep,
                            header = header_line,
                            skip_blank_lines = False,
                            skipinitialspace = True,
                            index_col = False,
                            engine = 'c',
                            nrows = 0,
                            encoding = 'iso-8859-1'
    ).columns

    # The "Time" and "Time_abs" columns have some inconsistent 
    # "Storage group code" preceeding the actual column name, so their 
    # full column names are stored so they can be renamed later. Also, we want
    # to interpret "Time_abs" as a string, while the rest are floats. This is
    # stored in a dict to use in the next call to read_csv
    
    time_col =      ""
    time_abs_col =  ""
    names_dict = {}
    for name in df_names:
        if ": Time_abs" in name:
            names_dict[name] = 'str'
            time_abs_col = name
        elif ": Time" in name:
            time_col = name
        else:
            names_dict[name] = 'float'


    
    # A list of values that we want pandas to interpret as having no value.
    # "NOVALUE" is the only one of these that's actually used in the files,
    # the rest are copy-pasted defaults.
    na_vals = ['', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',
               '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'n/a',
               'nan', 'null', 'NOVALUE']

    # The whole file is parsed and put in a dataframe
    df: pd.DataFrame = pd.read_csv(filepath,
                                   sep = sep,
                                   skiprows = skip_rows,
                                   header = 0,
                                   names = df_names,
                                   skip_blank_lines = False,
                                   skipinitialspace = True,
                                   index_col = False,
                                   engine = 'c',
                                   na_values = na_vals,
                                   dtype = names_dict,
                                   encoding = 'iso-8859-1'
                                   )

    
    # Renames the "Time" and "Time_abs" columns so they don't include the 
    # storage group part
    df.rename(columns = {time_col: "Time", time_abs_col: "Time_abs"}, 
              inplace = True)

    # Second retyping of this column (here from string to datetime).
    # Very rarely, the Time_abs column in the csv data only has the time and 
    # not the date, in which case this line throws an error. We manage this by
    # simply letting it stay as a string
    try:
        df[defs.time_abs] = pd.to_datetime(df[defs.time_abs])
    except:
        pass

    
    # Every row ends with an extra delimiter which python interprets as another
    # column, but it's empty so we remove it. This is not really necessary, but
    # is done to reduce confusion when debugging
    df.drop(df.columns[-1], axis=1, inplace=True)

    # Adding extra columns to the dataframe used later
    df[defs.lowest_gear] = np.nan
    df[defs.lowest_speed] = np.nan
    for i in list(defs.second_trailer_axles_dict.values()):
        df[i] = np.nan

    return df

It's the reading of the csv that takes most of the time, which is why that's where I want to add a progress bar.

Thank you in advance!

3 Answers


You can easily do this with Dask. For example:

import dask.dataframe as dd
from dask.diagnostics import ProgressBar

ddf = dd.read_csv(path, blocksize=1e+6)

with ProgressBar():
    df = ddf.compute()
[########################################] | 100% Completed | 37.0s

You will then see the progress of reading the file. The blocksize parameter controls the size of the blocks the file is read in; by tuning it, you can achieve good performance. Plus, Dask uses several threads for reading by default, which speeds up the reading itself.
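For reference, a minimal sketch (untested) of what this could look like with the metadata from the question's read_from_csv passed through. filepath, sep, skip_rows, df_names, na_vals and names_dict are the variables from the question's function; note that dask does not support index_col, so it is left out, and the encoding must be given explicitly:

import dask.dataframe as dd
from dask.diagnostics import ProgressBar

# Sketch only: reuse the metadata from the question's second read_csv call.
ddf = dd.read_csv(filepath,
                  sep=sep,
                  skiprows=skip_rows,
                  header=0,
                  names=df_names,
                  skip_blank_lines=False,
                  skipinitialspace=True,
                  na_values=na_vals,
                  dtype=names_dict,
                  encoding='iso-8859-1',
                  blocksize=1e+6)

with ProgressBar():
    df = ddf.compute()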

padu
  • Thank you for the reply! However, I'm getting this: ValueError: Keywords 'index' and 'index_col' not supported. Use dd.read_csv(...).set_index('my-index') instead – Leopold Wahlbeck Sep 07 '21 at 12:32
  • Can't you follow this recommendation? – padu Sep 07 '21 at 13:58
  • Doesn't seem to work for me; I get the error stated above. And when following it I get another error. I'll write it in the next comment. – Leopold Wahlbeck Sep 07 '21 at 13:59
  • UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 9700: invalid start byte – Leopold Wahlbeck Sep 07 '21 at 14:06
  • Did you specify the encoding `'iso-8859-1'` as in your example? – padu Sep 07 '21 at 17:28
  • If you mean in: `ddf = dd.read_csv(filepath, blocksize=1e+6)` No, am I supposed to? – Leopold Wahlbeck Sep 08 '21 at 06:39
  • Yes, of course. I wrote that as an example. Specify all the metadata that you need. – padu Sep 08 '21 at 07:30
  • Seems to have fixed the problem, however (sorry), I have another problem. Now I get the error: `TypeError: '>' not supported between instances of 'str' and 'float'` It occurs after `with ProgressBar(): df = ddf.compute()` – Leopold Wahlbeck Sep 08 '21 at 08:49
  • Are there different types of data in your columns? If you know the data type in each column, try specifying them. For example: `dd.read_csv(path, meta, dtype={'column_1': 'float', 'column_2': 'category', 'column_3': 'int16'})` etc., where `column_*` should be the names of your columns – padu Sep 08 '21 at 09:17
  • The columns are separated by "," and not cells. Maybe that could cause a problem? – Leopold Wahlbeck Sep 08 '21 at 11:03
  • I have only worked with xlsx before, so I'm not familiar with this. – Leopold Wahlbeck Sep 08 '21 at 11:21
  • Ok. I understood. If you want, you can send me a piece of the document on geometryk@gmail.com. I'll try to help you – padu Sep 08 '21 at 11:25
  • I appreciate your dedication but I can't send the document. However, there are around 100 comma-separated values, so I can't really specify the data type for every column. – Leopold Wahlbeck Sep 08 '21 at 11:48
  • Hey! I just wanted to let you know I fixed the problem, thank you for the help! – Leopold Wahlbeck Sep 09 '21 at 09:05
  • Just one more thing... Is it possible to add a dask progress bar to this: `# Second retyping of this column (here from string to datetime). # Very rarely, the Time_abs column in the csv data only has the time and # not the date, in which case this line throws an error. We manage this by # simply letting it stay as a string try: df[defs.time_abs] = pd.to_datetime(df[defs.time_abs]) print("Done: Second time retyping CSV.") except: pass` – Leopold Wahlbeck Sep 09 '21 at 09:06
  • Of course, if `df` is a dask dataframe. If it is a pandas you can easily convert it to a dask dataframe: `dask_df = dd.from_pandas(df, npartitions=8)` and then you can call `with ProgressBar(): date_col=dd.to_datetime(dask_df[time_abs])` – padu Sep 09 '21 at 10:13
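A minimal sketch (untested) of that last suggestion, assuming the column is the question's "Time_abs" and that df is the pandas dataframe returned by read_from_csv; .compute() is added because dask is lazy and the progress bar only appears once the computation actually runs:

import dask.dataframe as dd
from dask.diagnostics import ProgressBar

# Sketch only: hand the column to dask, convert it with a progress bar,
# and write the result back into the pandas dataframe.
dask_df = dd.from_pandas(df, npartitions=8)

with ProgressBar():
    df["Time_abs"] = dd.to_datetime(dask_df["Time_abs"]).compute()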

You can use tqdm.

Somewhere in your function:

def read_from_csv(filepath: str,
                  sep: str = ",",
                  header_line: int = 43,
                  skip_rows: int = 48,
                  chunksize: int = 10000) -> pd.DataFrame:

    # Count the total lines of the file (binary mode, so the count can't
    # trip over the file's encoding)
    # Overhead: 3.73s for 10,000,000 lines / 4.2G on an SSD
    with open(filepath, 'rb') as f:
        length = sum(1 for _ in f)

    data = []
    with tqdm(total=1 + (length // chunksize)) as pbar:
        # Replace your 2nd pd.read_csv by this (substitute your other
        # read_csv arguments for the ...):
        for chunk in pd.read_csv(filepath, ..., chunksize=chunksize):
            data.append(chunk)
            pbar.update(1)  # one tick per chunk
    df = pd.concat(data)
Corralien
  • What happens here? ` # Replace your 2nd pd.read_csv by this: for chunk in pd.read_csv('large.csv', ..., chunksize=chunksize): data.append(chunk) pbar.update(chunksize) df = pd.concat(data) ` – Leopold Wahlbeck Sep 07 '21 at 13:08
  • Not exactly replace, but enclose your second call to pd.read_csv in this code. The code above splits the csv file into chunks instead of reading it entirely in one go. At the end, concat merges all chunks to get your original dataframe. – Corralien Sep 07 '21 at 13:23
  • If you mean replacing the code below, I get an error. `df: pd.DataFrame = pd.read_csv(filepath, sep = sep, skiprows = skip_rows, header = 0, names = df_names, skip_blank_lines = False, skipinitialspace = True, index_col = False, engine = 'c', na_values = na_vals, dtype = names_dict, encoding = 'iso-8859-1' )` TypeError: object of type 'ellipsis' has no len() – Leopold Wahlbeck Sep 07 '21 at 13:55
  • You have to replace `pd.read_csv('large.csv', ..., chunksize=chunksize)` by your `pd.read_csv(...)`. I didn't want to re-enter all the parameters, for readability, so I used the ellipsis. You just have to append `chunksize=chunksize`. Check also the function signature; I added a new parameter `chunksize`. – Corralien Sep 07 '21 at 14:01
  • Oh, I believe I get what you are saying now. However I get another error: TypeError: cannot concatenate object of type ''; only Series and DataFrame objs are valid – Leopold Wahlbeck Sep 07 '21 at 14:21
  • Hey! I used the solution posted by padu. Thank you anyways! – Leopold Wahlbeck Sep 14 '21 at 11:17
  • Then mark my answer as a solution. Thanks! – padu Sep 14 '21 at 11:27
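To tie the comments above together, here is a minimal sketch (untested) of the substitution inside the question's read_from_csv, with the real arguments written out in place of the ellipsis. df_names, names_dict and na_vals are the variables built earlier in that function, and chunksize is the extra parameter from this answer's signature:

from tqdm import tqdm
import pandas as pd

# Sketch only: count the lines in binary mode (so the iso-8859-1 encoding
# cannot break the count), then read the file chunk by chunk under tqdm.
with open(filepath, 'rb') as f:
    length = sum(1 for _ in f)

data = []
with tqdm(total=1 + (length // chunksize)) as pbar:
    for chunk in pd.read_csv(filepath,
                             sep=sep,
                             skiprows=skip_rows,
                             header=0,
                             names=df_names,
                             skip_blank_lines=False,
                             skipinitialspace=True,
                             index_col=False,
                             engine='c',
                             na_values=na_vals,
                             dtype=names_dict,
                             encoding='iso-8859-1',
                             chunksize=chunksize):
        data.append(chunk)
        pbar.update(1)  # one tick per chunk

df = pd.concat(data)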

Since there was a question in the comments about a progress bar for some pandas dataframe methods, I will note a solution for such cases. The parallelbar library lets you track progress for popular methods of the multiprocessing module's Pool class, such as map, imap and imap_unordered. It is easy to adapt it for parallel work with pandas dataframes (and track progress) as follows:

# pip install parallelbar
from parallelbar import progress_map
import pandas as pd
import numpy as np
from multiprocessing import cpu_count

def parallelize_dataframe(df, func, split_size=cpu_count() * 4, **kwargs):
    df_split = np.array_split(df, split_size)
    result_df = pd.concat(progress_map(func, df_split, **kwargs),
                          ignore_index=True)
    return result_df

Here df is your dataframe; func is the function to be applied to the dataframe; split_size is how many parts to split df into for parallelization (the default value is usually a good choice); **kwargs are optional keyword arguments for the progress_map function (see the documentation).

For example:

def foo(df):
    # e.g. convert the "Time_abs" column from the question
    df["Time_abs"] = pd.to_datetime(df["Time_abs"])
    return df

if __name__=='__main__':
    new_df = parallelize_dataframe(df, foo)

Not only will you see the progress of execution, but the execution of pd.to_datetime will also be parallelized, which can significantly speed up your work.

padu