I wrote this code to remove duplicates from a large CSV file of tweets (about 800,000 rows), but when I run it the output file is larger than the original: the original is 1,580,307 KB and the resulting file is 1,852,462 KB. I also tried with a smaller file of 20 rows: there the original is 45 KB and the resulting file is 46 KB. I would appreciate it if someone could explain how this happens or what I'm doing wrong. I'm stuck!
import csv
import pandas as pd

geofile_input = r'GeoFile_20tweets.csv'
geofile_output = 'GeoFile_20tweets_output.csv'

# Count the rows in the input file with the csv module
with open(geofile_input, encoding="utf8", newline='') as file1:
    reader1 = csv.reader(file1)
    lines_in = len(list(reader1))
print('row_count csv input file: ', lines_in)

print('start reading the file on pandas')
df = pd.read_csv(geofile_input, sep=',')
print('dataframe', df.dtypes)

print('dropping duplicates in pandas')
df.drop_duplicates(subset=None, keep='first', inplace=True)

print('saving the data frame in csv without duplicates')
df.to_csv(geofile_output, index=False, sep=',', header=True)

# Count the rows in the output file the same way
print('counting rows for the csv output')
with open(geofile_output, encoding="utf8", newline='') as file2:
    reader2 = csv.reader(file2)
    lines_out = len(list(reader2))
print('row_count csv output file: ', lines_out)

print('Process completed!')
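For reference, here is a minimal diagnostic sketch (not part of my script, just an illustration assuming the same file names as above) that prints the first few raw lines of the input and output files, to check whether the rewritten file differs in quoting, line endings, or number formatting:

# Hypothetical check: compare the first few raw lines of input vs. output.
# repr() makes quoting, '\r\n' vs '\n', and float formatting visible.
geofile_input = r'GeoFile_20tweets.csv'
geofile_output = 'GeoFile_20tweets_output.csv'

with open(geofile_input, encoding="utf8") as f_in, \
     open(geofile_output, encoding="utf8") as f_out:
    for line_in, line_out in list(zip(f_in, f_out))[:3]:
        print('input :', repr(line_in))
        print('output:', repr(line_out))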