
I have imported a .csv file using df=pandas.read_csv(.....). I calculated the number of rows in this dataframe df using print(len(df)), and its length was some 30 rows less than the originally imported file. BUT, when I exported df directly after import as .csv (without doing any operation on this dataframe df) using df.to_csv(....), the exported file had the same number of rows as the originally imported .csv file.

It's very hard for me to debug and explain the difference between the length of the dataframe on one hand and of both the imported and exported .csv files on the other, as there are more than half a million rows in the dataset. Can anyone provide some hints as to what can cause such bizarre behavior?
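One common cause of exactly this symptom (a minimal sketch with made-up data, not the asker's actual file): a quoted field containing an embedded newline. The file then has more physical lines than logical records, so `len(df)` is smaller than the raw line count; and because `to_csv` writes the embedded newline back inside quotes, the exported file regains the original physical line count.

```python
import io
import pandas as pd

# Hypothetical tab-separated data: the second record has a quoted
# field that contains an embedded newline, so it spans two physical
# lines but is parsed as a single logical row.
raw = 'Col1\tCol2\n"a"\t"first\nsecond"\n"b"\t"c"\n'

physical_lines = raw.count("\n")          # 4 physical lines (1 header + 3)
df = pd.read_csv(io.StringIO(raw), sep="\t")

print(physical_lines)  # 4
print(len(df))         # 2 -- only 2 logical data rows

# Round-tripping preserves the embedded newline inside quotes,
# so the exported text has the same physical line count as the input.
exported = df.to_csv(sep="\t", index=False)
print(exported.count("\n"))  # 4
```

If counting lines in the raw file (e.g. with `wc -l` or a text editor) disagrees with `len(df)` by a few dozen rows, embedded newlines in quoted fields are worth checking first.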

cph_sto
  • what version of pandas? – tsionyx Sep 13 '17 at 08:27
  • Well it's `0.19.2` – cph_sto Sep 13 '17 at 08:29
  • @OliverS Could you add the full pandas.read_csv parameters, the shape of the dataframe, and the actual length of the csv. – Mohamed Ali JAMAOUI Sep 13 '17 at 08:51
  • @MedAli - a bit in pseudocode with all relevant information `df=pandas.read_csv('file_path',sep='\t',encoding='latin-1',skiprows=1,decimal=',',na_values=[''],dtype=object,names=['Col1','Col2',....'Col43'])` and the shape of this dataframe `df` is `417059,43`. The original `.csv` file had `417085` rows and `43` columns. When this dataframe is exported as `.csv` the shape is again `417085` rows and `43` columns. – cph_sto Sep 13 '17 at 09:11
  • @OliverS Ok thanks. Better add the details to the question :) – Mohamed Ali JAMAOUI Sep 13 '17 at 09:14
  • @OliverS can you also try pd.read_csv with encoding='utf-8' and compare the number of rows? – Mohamed Ali JAMAOUI Sep 13 '17 at 09:16
  • @MedAli: Oh, the shape of the dataframe became `(414106,43)` and the exported file had `420033` rows and `43` columns. Maybe encoding is causing problems...?? – cph_sto Sep 13 '17 at 09:21
  • @OliverS It might be that there are some special characters in your file.. Also, pandas by default skips blank lines, i.e. by default `skip_blank_lines=True`, so try to read with `skip_blank_lines=False` with the first method and see if the number of rows is the same as the initial csv – Mohamed Ali JAMAOUI Sep 13 '17 at 09:30
  • @OliverS also try to read without specifying encoding and see if the number of rows stays the same as the csv. – Mohamed Ali JAMAOUI Sep 13 '17 at 09:32
  • @MedAli - Well, after removing the encoding, the result was exactly as with `utf-8` :( – cph_sto Sep 13 '17 at 09:42
  • What type of data are you working with? You could have encoding issues, and you may also have extraneous tabs or commas causing problems. Can you update the original post to include a few sample rows? – Lenwood Sep 13 '17 at 09:58
  • @Lenwood Well, that is something I cannot do, given data protection issues at my workplace, else I would have provided a small example dataset. As you also mentioned, I suppose it's either something to do with encoding (all my names are in German with sumptuous umlauts like ö,ä,ü,ß) and we had some issues with encoding in the past. It could well be an extraneous tab as well. Somehow I could upload the file successfully in `SAS`, and then I exported this `SAS` file to `.csv`. Now, when I import this new `.csv` file into Pandas/Python, all works perfectly. – cph_sto Sep 13 '17 at 11:09
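The `skip_blank_lines` suggestion from the comments can be checked in isolation (a minimal sketch with made-up data, not the asker's file): by default pandas silently drops blank lines, which shrinks `len(df)` relative to the file's line count, while `skip_blank_lines=False` keeps them as all-NaN rows.

```python
import io
import pandas as pd

# Hypothetical tab-separated data with one blank line in the middle.
raw = "Col1\tCol2\n1\t2\n\n3\t4\n"

# Default: the blank line is silently skipped.
df_default = pd.read_csv(io.StringIO(raw), sep="\t")

# skip_blank_lines=False: the blank line is kept as a row of NaN.
df_keep = pd.read_csv(io.StringIO(raw), sep="\t", skip_blank_lines=False)

print(len(df_default))  # 2
print(len(df_keep))     # 3
```

If the two lengths differ on the real file, blank lines account for at least part of the missing rows; any remaining gap would point to quoting or encoding issues instead.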

0 Answers