0

I have a CSV file of 1 GB with around 1 million records, each row is 171 columns, I did some research and came up with this code. I have reduced the size of file to 5 MB for testing purposes, but there are still 171 columns. The code works fine as long as the sorting column index is below 50. Even on 49 it works fine, but I have columns with index 151, 153. I want to sort the file with those columns.

Error:

When I give it index 50 or above it throws the error:

  data.sort(key=operator.itemgetter(*sort_key_columns))
IndexError: list index out of range

My Code:

def sort_csv(csv_filename, sort_key_columns):
    data = []
    with open(csv_filename, 'r') as f:
        for row in csv.reader(f):
            data.append(row)
    data.sort(key=operator.itemgetter(*sort_key_columns))
    with open(csv_filename, 'w', newline='') as f:
        csv.writer(f).writerows(data)

sort_csv('Huge_Complete_B2B_file_1_1.csv', [49])
martineau
  • 119,623
  • 25
  • 170
  • 301
Wcan
  • 840
  • 1
  • 10
  • 31

2 Answers2

1

It seems that one of your files contains a truncated row below 51 columns.

If you don't care about your input being corrupt, you could filter it out while reading the input & sort it, in one line:

def sort_csv(csv_filename, sort_key_columns):
    with open(csv_filename, 'r') as f:
        data = sorted([row for row in csv.reader(f) if len(row)>=171],key=operator.itemgetter(*sort_key_columns))
    # then write the file
Jean-François Fabre
  • 137,073
  • 23
  • 153
  • 219
  • Thankyou so very much!!! It worked 2 questions: (1) when i give it two columns why does it only sort with the first one. (2) how can i avoid first line from sorting – Wcan Oct 09 '17 at 20:26
  • like [155, 156] it only does the 155 part – Wcan Oct 09 '17 at 20:28
1

You can handle the short row by writing your own version of operator.itemgetter which you can base on the code equivalent to it which is shown in its online documentation.

The custom version below simply supplies a specified value for any that are missing. This will cause the row to be sorted as though it had that value in it at that indexed position.

Note: That this assumes all the missing items should use the same default MISSING value. If that's not the case, it could be enhanced to allow a different one to be specified for each index in the sequence of them passed to it. This would likely require an additional argument.

import csv
import operator

def my_itemgetter(*indexes, MISSING=''):
    if len(indexes) == 1:
        index = indexes[0]
        def getter(obj):
            try:
                return obj[index]
            except IndexError:
                return MISSING
    else:
        def getter(obj):
            try:
                return tuple(obj[index] for index in indexes)
            except IndexError:
                return tuple(obj[index] if index < len(obj) else MISSING
                                for index in indexes)
    return getter

def sort_csv(csv_filename, sort_key_columns):
    with open(csv_filename, 'r', newline='') as f:
        data = [row for row in csv.reader(f)]

    data.sort(key=my_itemgetter(*sort_key_columns))

    with open(csv_filename, 'w', newline='') as f:
        csv.writer(f).writerows(data)

sort_csv('Huge_Complete_B2B_file_1_1.csv', [0, 171])
martineau
  • 119,623
  • 25
  • 170
  • 301
  • Working great !!! Its sorting with both columns index now. Is there any way i can avoid reading the first row within this same code ? – Wcan Oct 10 '17 at 07:41
  • To skip the first row, just put a `next(f)` immediately _before_ the `data = [row for...` line. **Note** this will also effectively remove the header in the output file. Also see [_When processing CSV data, how do I ignore the first line of data?_](https://stackoverflow.com/questions/11349333/when-processing-csv-data-how-do-i-ignore-the-first-line-of-data). – martineau Oct 10 '17 at 14:59
  • P.S. You can preserve the first header row by using `header = next(f)` and then output it manually when overwriting the file at the end with a `f.write(header)` before the `writerows()` call. – martineau Oct 10 '17 at 15:14