I need to get the length of the CSV files in '/dir/', excluding empty rows. I tried this:

import os, csv, itertools, glob

#To filter the empty lines
def filterfalse(predicate, iterable):
    # filterfalse(lambda x: x%2, range(10)) --> 0 2 4 6 8
    if predicate is None:
        predicate = bool
    for x in iterable:
        if not predicate(x):
            yield x

#To read each file in '/dir/', compute the length and write the output 'count.csv'
with open('count.csv', 'w') as out:
    file_list = glob.glob('/dir/*')
    for file_name in file_list:
        with open(file_name, 'r') as f:
            filt_f1 = filterfalse(lambda line: line.startswith('\n'), f)
            count = sum(1 for line in f if (filt_f1))
            out.write('{c} {f}\n'.format(c = count, f = file_name))

I get the output I'd like, but unfortunately the length of each file (in '/dir/') includes empty rows.

To see where the empty rows are coming from, I opened file.csv as plain text and it looks like this:

text,favorited,favoriteCount,...
"Retweeted user (@user):...
(empty row)
Do Operators...

2 Answers

I would recommend using pandas.

import pandas

# Reads csv file and converts it to pandas dataframe.
df = pandas.read_csv('myfile.csv')

# Removes rows where every value is missing (i.e. completely empty rows).
df.dropna(how='all', inplace=True)

# Gets the number of remaining data rows and displays it
# (add 1 if you also want to count the header line).
df_length = len(df)
print('The length of the CSV file is', df_length)

Documentation: http://pandas.pydata.org/pandas-docs/version/0.18.0/
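
If you want to apply this to every file in '/dir/' and write the counts to 'count.csv' like in your original loop, a rough sketch could look like the following (skip_blank_lines and dropna(how='all') are my assumptions about what should count as an empty row; adjust to taste):

import glob

import pandas

with open('count.csv', 'w') as out:
    for file_name in glob.glob('/dir/*'):
        # skip_blank_lines=True (the default) already skips completely blank
        # lines while parsing.
        df = pandas.read_csv(file_name, skip_blank_lines=True)
        # Drop rows where every column is missing, e.g. lines like ",,,".
        df = df.dropna(how='all')
        # len(df) is the number of remaining data rows (the header is not
        # counted; add 1 if you want to include it).
        out.write('{c} {f}\n'.format(c=len(df), f=file_name))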

O. Edholm

Your filterfalse() function works correctly. In fact, it's exactly the same as itertools.filterfalse in the standard library (named ifilterfalse in Python 2), so there's no need to write your own; a major advantage of the library version is that it's already been tested and debugged, and the itertools functions are implemented in C, so they're usually faster as well. (A minimal example using it directly appears after the corrected code below.)

The problem is you're not using the generator function properly.

  1. filterfalse() returns a generator object, so you need to iterate over the values it yields, e.g. with for line in filt_f1. In your code, sum(1 for line in f if (filt_f1)) iterates over f itself, and the condition if (filt_f1) is always true because a generator object is truthy, so every line (including the empty ones) gets counted.

  2. The predicate you pass only catches lines that start with '\n'; it doesn't handle lines that contain nothing but other whitespace characters, such as spaces and tabs. The lambda needs to treat those as empty, too.

Here's your code with both of these changes applied:

import os, csv, itertools, glob

#To filter the empty lines
def filterfalse(predicate, iterable):
    # filterfalse(lambda x: x%2, range(10)) --> 0 2 4 6 8
    if predicate is None:
        predicate = bool
    for x in iterable:
        if not predicate(x):
            yield x

#To read each file in '/dir/', compute the length and write the output 'count.csv'
with open('count.csv', 'w') as out:
    file_list = glob.glob('/dir/*')
    for file_name in file_list:
        with open(file_name, 'r') as f:
            filt_f1 = filterfalse(lambda line: not line.strip(), f)  # CHANGED
            count = sum(1 for line in filt_f1)  # CHANGED
            out.write('{c} {f}\n'.format(c=count, f=file_name))
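
As mentioned above, you could also drop the homemade generator entirely and use the standard-library version. A minimal sketch, assuming Python 3 (where the function is itertools.filterfalse):

import glob
from itertools import filterfalse

with open('count.csv', 'w') as out:
    for file_name in glob.glob('/dir/*'):
        with open(file_name, 'r') as f:
            # Keep only the lines that are not blank or whitespace-only.
            non_blank = filterfalse(lambda line: not line.strip(), f)
            count = sum(1 for line in non_blank)
            out.write('{c} {f}\n'.format(c=count, f=file_name))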
martineau