
I am building a CSV chunk by chunk using the csv module from the standard library.

This means that I am adding rows one by one in a loop. Each row that I add contains a value for each column of the file.

So, I have this CSV:

A     B      C     D

And I am adding rows one by one:

    A       B      C      D
  aaaaa   bbb    ccccc   ddddd
  a1a1a   b1b1   c1c1c1  d1d1d1
  a2a2a   b2b2   c2c2c2  d2d2d2

And so on.

My problem is that sometimes the row that I am adding contains MORE information (that is, a value that does not belong to any existing column). For example:

    A       B      C      D
  aaaaa   bbb    ccccc   ddddd
  a1a1a   b1b1   c1c1c1  d1d1d1
  a2a2a   b2b2   c2c2c2  d2d2d2
  a3a3a   b3b3   c3c3c3  d3d3d3   e3e3e3  #this row has extra information

My question is: Is there any way to make the CSV grow (during runtime) when that happens? (with 'grow' I mean to add the "extra" columns)

So basically I want this to happen:

    A       B      C       D        E    # this column was added because 
  aaaaa   bbb    ccccc   ddddd           # of the extra column found
  a1a1a   b1b1   c1c1c1  d1d1d1          # in the new row
  a2a2a   b2b2   c2c2c2  d2d2d2
  a3a3a   b3b3   c3c3c3  d3d3d3   e3e3e3

I am adding the rows using the csv module from the standard library, the with statement and a dictionary:

import csv

addThis = {'A': 'a3a3a', 'B': 'b3b3', 'C': 'c3c3c3', 'D': 'd3d3d3', 'E': 'e3e3e3'}

with open('csvFile', 'a', newline='') as f:
    # the file already has the header A,B,C,D from earlier writes
    writer = csv.DictWriter(f, fieldnames=['A', 'B', 'C', 'D'])
    writer.writerow(addThis)

As you can see, in the dictionary that I'm adding, I specify the name of the new column. What happens when I try that is that I get this exception:

ValueError: dict contains fields not in fieldnames: 'E'
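If I read the docs correctly, this comes from csv.DictWriter, whose default extrasaction='raise' rejects any key that is not in fieldnames. I could pass extrasaction='ignore', but then the extra value would simply be dropped instead of written, which is not what I want:

with open('csvFile', 'a', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['A', 'B', 'C', 'D'], extrasaction='ignore')
    writer.writerow(addThis)  # the row is written, but 'E' is silently discarded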

I have tried adding the "extra" fieldname to the writer before adding the row, like this:

fields = writer.fieldnames
writer.fieldnames = fields + ['E']

Note: It may seem from this example that I already know that E will be added, but that is not the case; I wrote it like this just for the example. I don't know what the "extra" data will be until I get the "extra" rows (which I receive over a period of time from a web scrape).

That manages to evade the exception, but it does not add the extra column, because the header row has already been written to the file. So I end up with something like this:

    A       B      C       D
  aaaaa   bbb    ccccc   ddddd
  a1a1a   b1b1   c1c1c1  d1d1d1
  a2a2a   b2b2   c2c2c2  d2d2d2
  a3a3a   b3b3   c3c3c3  d3d3d3   e3e3e3   # value is added but the column
                                           # name is not there

I am not using Pandas because, as I understand it, Pandas is designed to load fully populated DataFrames, but I am open to using something other than the csv module if you suggest it. Any ideas regarding that?

Thanks for your help and sorry for the long question, I tried to be as clear as possible.

Fabián Montero
  • Isn't E something you know about before opening the file, so you could write it right from the beginning, both in the header and in your rows, normally as an empty string or None and only sometimes a certain value? Then I think no library would complain about this file. – SpghttCd Jul 24 '18 at 18:21
  • @SpghttCd no, I don't know about E from the start. I just showed it like that for the example. I get each row from a web scrape I'm doing. – Fabián Montero Jul 24 '18 at 18:27
  • Why do you know about A B C D but not E...? However, this is futile as long as you're not getting more concrete. But if all this is as you describe, CSV row by row is not the way to go for you. In a CSV all rows must contain the same number of separators as the header; otherwise you won't be able to read it with CSV-capable libraries. – SpghttCd Jul 24 '18 at 18:36
  • @SpghttCd Because I'm getting the information from a web scrape. My problem is that I'm treating the CSV like a non-relational database, in which tables can grow horizontally. – Fabián Montero Jul 24 '18 at 18:38
  • Then you should consider using an in-memory data structure, e.g. a numpy array, in order to write this to a CSV after your scraping completes. Otherwise you should use a file format that does support dynamic growth in arbitrary dimensions, like the database you mentioned yourself, or netCDF, HDF5, perhaps .mat I could imagine... – SpghttCd Jul 24 '18 at 19:17

1 Answer


I think you would need to rewrite the entire file when that happens. Currently you are opening the file with mode 'a', so you can only append at the end, not insert something in the middle of the file. I don't think there is an easy way to insert data into the middle of a file.

The easiest solution would then be to read the entire file into memory, add the new column to the header row and then rewrite the complete file.

See this question for an example of how you could do that.
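Roughly, something like this (a minimal sketch; the function name and the choice of restval='' to pad the older rows are my own assumptions, not from the question):

import csv

def append_with_new_columns(path, new_row):
    # Read the existing file, including its header row.
    with open(path, newline='') as f:
        reader = csv.DictReader(f)
        fieldnames = list(reader.fieldnames)
        rows = list(reader)

    # Extend the header with any keys this file has not seen yet.
    for key in new_row:
        if key not in fieldnames:
            fieldnames.append(key)

    # Rewrite the whole file; restval='' pads the old rows
    # in the newly added columns.
    with open(path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval='')
        writer.writeheader()
        writer.writerows(rows)
        writer.writerow(new_row)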

Bob
  • The problem with this is that I add a new column every second (or so), and in total I'll have about 1000 columns by the end of the routine, so rewriting the whole file every time would be very slow and inefficient. – Fabián Montero Jul 24 '18 at 18:36
  • Wouldn't it then be better not to write everything to the file all the time, but instead build it up in memory and only write it to the file when you actually need to store it? – Bob Jul 24 '18 at 18:38
  • Yes, that's what I'll do, thank you! I'll accept your answer. – Fabián Montero Jul 24 '18 at 19:03
  • Cool, I'm glad I could help! – Bob Jul 24 '18 at 19:21
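For reference, a minimal sketch of the in-memory approach the comments converge on (the collect helper and variable names are illustrative, not from the discussion):

import csv

rows = []         # every scraped row, kept as a dict
fieldnames = []   # grows whenever a row introduces a new key

def collect(row):
    """Store a scraped row and register any new column names."""
    rows.append(row)
    for key in row:
        if key not in fieldnames:
            fieldnames.append(key)

# ... call collect(...) for each row as the scrape produces it ...

# Write everything once at the end; restval='' pads rows that
# are missing some of the later columns.
with open('csvFile', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames, restval='')
    writer.writeheader()
    writer.writerows(rows)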