I wrote this code to remove duplicates from a large (800,000-row) CSV file of tweets, but when I run it, the resulting file is larger than the original: the original is 1,580,307 KB and the result is 1,852,462 KB. I also tried a smaller file of 20 rows: the original is 45 KB and the result is 46 KB. I would appreciate it if someone could explain how this happens or what I'm doing wrong. I'm stuck!

import csv
import pandas as pd


geofile_input = r'GeoFile_20tweets.csv'
geofile_output = 'GeoFile_20tweets_output.csv'

# Count the rows of the input file (the with-block closes the file afterwards)
with open(geofile_input, encoding="utf8", newline='') as file1:
    lines_in = len(list(csv.reader(file1)))
print('row_count csv input file: ', lines_in)

print('start reading the file on pandas')
df = pd.read_csv(geofile_input, sep=',')

# Show the data type pandas inferred for each column
print('dataframe', df.dtypes)

print('dropping duplicates in pandas')
df.drop_duplicates(subset=None, keep='first', inplace=True)

print('saving the data frame in csv without duplicates')
df.to_csv(geofile_output, index=False, sep=',', header=True)

print('counting rows for the csv output')
with open(geofile_output, encoding="utf8", newline='') as file2:
    lines_out = len(list(csv.reader(file2)))

print('row_count csv output file: ', lines_out)
print('Process completed!')
Sofia
  • Use `wc -l filename` in bash to count the lines in each file; you will find they are the same. The file size probably changes because a column's data type can change when pandas reads it. For example, if some rows of an int column contain a float, the whole column becomes float, so '1' is written back as '1.0'. – Ferris Jan 20 '21 at 08:12
  • Try `df = pd.read_csv(file, dtype=str)`. – Ferris Jan 20 '21 at 08:20

1 Answer

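Demo data: only the last row has a float (`1.0`) in column c, yet after a round trip through pandas every value in that column is written as `1.0`, so the output file is larger.

import pandas as pd

# Demo data: seven rows with c = 1 and one row with c = 1.0
demo = '''a,b,c,d
1,2,1,1
1,2,1,1
1,2,1,1
1,2,1,1
1,2,1,1
1,2,1,1
1,2,1,1
1,2,1.0,1
'''

with open('test.csv', 'w') as f:
    f.write(demo)

# Column c is inferred as float64 because one row contains 1.0
df = pd.read_csv('test.csv')
df.to_csv('test_1.csv', index=False)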

# -rw-r--r--. 1 root     88 Jan 20 16:16 test_1.csv
# -rw-r--r--. 1 root     74 Jan 20 16:15 test.csv

!cat test_1.csv
a,b,c,d
1,2,1.0,1
1,2,1.0,1
1,2,1.0,1
1,2,1.0,1
1,2,1.0,1
1,2,1.0,1
1,2,1.0,1
1,2,1.0,1
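
Following the `dtype=str` suggestion from the comments, here is a minimal sketch of how reading every column as text avoids the rewrite shown above. It uses a shortened in-memory copy of the demo data (the `raw` string and the variable names are illustrative, not part of the original code), so treat it as an illustration rather than a drop-in fix.

```python
import io

import pandas as pd

# Shortened copy of the demo data (illustrative): one row has 1.0 in column c
raw = "a,b,c,d\n1,2,1,1\n1,2,1,1\n1,2,1.0,1\n"

# Default parsing: column c mixes 1 and 1.0, so pandas infers float64
df_default = pd.read_csv(io.StringIO(raw))
out_default = df_default.to_csv(index=False)

# dtype=str keeps every cell exactly as it appears in the file
df_text = pd.read_csv(io.StringIO(raw), dtype=str)
out_text = df_text.to_csv(index=False)

print(df_default.dtypes["c"])      # inferred type of column c
print(out_default.splitlines())    # every c value is now written as 1.0
print(out_text.splitlines())       # same lines as the input

# Deduplication differs between the two: as text, "1" and "1.0" are distinct
print(len(df_default.drop_duplicates()), len(df_text.drop_duplicates()))
```

One trade-off to keep in mind: with `dtype=str`, `drop_duplicates` compares the raw text, so rows that differ only in formatting (`1` vs `1.0`) are no longer treated as duplicates. Whether that is acceptable depends on the data.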
Ferris