Hi, I want to generate a list of numbers from 1000000 to 2000000, but I get a MemoryError. I was using random before and everything worked, except that I got duplicated numbers, which I can't have, so I switched to xrange.

data = []
total = 2000000
def resource_file(info):
    with open(info, "r") as data_file:
        reader = csv_reader(data_file, delimiter=",")
        for row in reader:
            try:
                for i in xrange(1000000,total):
                    new_row = [row[0], row[1], i]
                    data.append(new_row)
            except IndexError as error:
                print(error)
    with open(work_dir + "new_data.csv", "w") as new_data:
        writer = csv_writer(new_data, delimiter=",")
        for new_row in data:
            writer.writerow(new_row)
Willem Van Onsem
Mike
  • You're trying to store the whole thing in memory before writing any of it out. You could use less memory by only processing one row at a time, rather than trying to store the entire file in memory. – Tom Karzes Oct 29 '17 at 08:37
  • Are you sure you want to create 1000000 times more elements than there are in the input CSV file? What is the desired outcome? Can you give a small example of the CSV file, and what you expect the resulting CSV file to look like? – trincot Oct 29 '17 at 08:40
  • I want to add a number for every line in the csv file in row number 2 – Mike Oct 29 '17 at 09:43

1 Answer

Repeat every line with an extra column ranging 1M..2M

The problem is that you store all of these rows in memory first. Python's object model is not very memory-efficient (every list and every integer carries per-object overhead), and one million entries per input row is a lot in any case.

I propose not storing the data in a list at all, but simply writing each row to the file immediately:

total = 2000000
def resource_file(info):
    with open(info, "r") as data_file:
        reader = csv_reader(data_file, delimiter=",")
        with open(work_dir + "new_data.csv", "w") as new_data:
            writer = csv_writer(new_data, delimiter=",")
            for row in reader:
                rowa, rowb = row[0:2]
                for data in xrange(1000000,total):
                    writer.writerow([rowa,rowb,data])
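
For reference, here is a self-contained sketch of the same streaming idea, written for Python 3 (where `range` is lazy, like `xrange` in Python 2). The function name `expand_rows`, its parameters, and the use of the standard `csv` module in place of the `csv_reader`/`csv_writer` names above are assumptions made for illustration, not part of the original code:

```python
import csv


def expand_rows(src_path, dst_path, start=1000000, stop=2000000):
    """Write every input row (stop - start) times, each with one extra number.

    Hypothetical helper: rows go straight to the output file, so only one
    output row is built in memory at a time.
    """
    with open(src_path, "r", newline="") as src, \
            open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter=",")
        writer = csv.writer(dst, delimiter=",")
        for row in reader:
            if len(row) < 2:
                continue  # assumption: skip rows lacking the two expected columns
            first, second = row[:2]
            for number in range(start, stop):  # lazy, like xrange in Python 2
                writer.writerow([first, second, number])
```

Note that the output still contains one million lines per input row, so the file on disk grows quickly even though memory use stays flat.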

Take rows 1M-2M of the file

In case you want to take lines 1M to 2M of the original file, you can write it as:

from itertools import islice

total = 2000000
def resource_file(info):
    with open(info, "r") as data_file:
        reader = csv_reader(data_file, delimiter=",")
        with open(work_dir + "new_data.csv", "w") as new_data:
            writer = csv_writer(new_data, delimiter=",")
            for row in islice(reader,1000000,total):
                writer.writerow(row)

or you can simplify it, as @JonClements suggests, with:

from itertools import islice

total = 2000000
def resource_file(info):
    with open(info, "r") as data_file:
        reader = csv_reader(data_file, delimiter=",")
        with open(work_dir + "new_data.csv", "w") as new_data:
            writer = csv_writer(new_data, delimiter=",")
            writer.writerows(islice(reader,1000000,total))
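
A self-contained sketch of the slicing variant in Python 3 follows. The function name `copy_row_slice` and its parameters are hypothetical; `islice` skips the first `start` rows and stops before row `stop` (both counted from zero), without loading the whole file into memory:

```python
import csv
from itertools import islice


def copy_row_slice(src_path, dst_path, start=1000000, stop=2000000):
    """Copy rows start..stop-1 (zero-based) from src_path to dst_path."""
    with open(src_path, "r", newline="") as src, \
            open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter=",")
        writer = csv.writer(dst, delimiter=",")
        # islice pulls rows from the reader lazily, one at a time
        writer.writerows(islice(reader, start, stop))
```

If the input has fewer than `start` rows, the output file is simply empty.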
Willem Van Onsem
  • But I think this way rowa, rowb will be outside the loop, right? – Mike Oct 29 '17 at 09:53
  • Yes, but every line in the info file will be looped as many times as the total, 2000000. – Mike Oct 29 '17 at 10:01
  • @Mike: so you only want lines 1M-2M of the original file then? What is `data` doing here? A loop counter? – Willem Van Onsem Oct 29 '17 at 10:02
  • data is the number where the range starts and where it ends. As you can see, every line is now looped 2000000 times: ['fjc7kr1m92su8eljhb11484kwz.net', 'post dga (malware)', 1032067] ['fjc7kr1m92su8eljhb11484kwz.net', 'post dga (malware)', 1032068] ['fjc7kr1m92su8eljhb11484kwz.net', 'post dga (malware)', 1032069] ['fjc7kr1m92su8eljhb11484kwz.net', 'post dga (malware)', 1032070] ['fjc7kr1m92su8eljhb11484kwz.net', 'post dga (malware)', 1032071] – Mike Oct 29 '17 at 10:04
  • @Mike: then there is no reason to write such range function. See edit, second solution. – Willem Van Onsem Oct 29 '17 at 10:05
  • If you're not doing much else... `writer.writerows(islice(reader, 1000000, total))` covers it – Jon Clements Oct 29 '17 at 10:17