0

I have utf-8 encoded comma-delimited csv file that one of the columns contains multiple commas however I need to import them as one column for further manipulation. The data frame looks like

C1 C2 C3 C4 C5 C6      C7.... C27
1, 2, 3, 4, 5, A,B,C,   2 .......
3, 5, 3, 4, 6, A,B,C,D, 8 .......
1, 2, 2, 5, 8, A,B,     7 .......
3, 5, 3, 4, 6, ABCDE,   8 .......
1, 2, 3, 4, 5, A,B,C,D  2 .......

So the column 6 contains some Chinese character as well as different number of commas. The columns 5 and 7 are all numeric. The data frame has 27 columns in total. I want the characters in the 6th columns treated as value in one cell instead of values for more than one variables.

I know that you can use quotation sign first but I'm wondering how exactly you would do it. I have more than 1000 files like this that I have to open.

Any suggestions would be appreciated!

A follow-up question: What if the number of columns are different for different files? Is it possible to use regular expression to define the pattern of columns and get the number of the columns first, and then decide how to split the columns?

I am thinking now to get the columns of each files first and save them to a csv file, and then use the method in the possible duplicate question. But any suggestions on a more efficient way would be appreciated!

RTian
  • 5
  • 5
  • 1
    Can't you just change the delimiter to a semicolon or tab? – chatax Feb 21 '20 at 17:42
  • 1
    Do you have control over the creation of these .CSV files, or do they come from a closed source? If the original source program followed [RFC 4180](https://tools.ietf.org/html/rfc4180) which defines .CSV files, they would have enclosed C6 in quotes, something like this: `1,2,3,4,5,"A,B,C",2,....` or as @chatax said while I was typing this, change the field delimiter to something else like this: `1;2;3;4;5;A,B,C;2.....` – Matthew Feb 21 '20 at 17:43
  • @chatax Hey, no I can't. It's downloaded from a database. So it's impossible to change the delimiter exclusively for those columns that do not have commas. – RTian Feb 21 '20 at 17:45
  • 3
    possible duplicate of [python-issue-with-delimiter-inside-column-data](https://stackoverflow.com/questions/43251122/python-issue-with-delimiter-inside-column-data) – effy Feb 21 '20 at 17:46
  • @Matthew Thanks for the comments. I'm not sure how they are created but I will try first and tell you how it goes later... – RTian Feb 21 '20 at 17:47
  • @Mattew Hi again, unfortunately values in C6 are not quoted... And I don't think the possible duplicate one works for me. As in that post, it has only 1 delimiter contained in the column, while in my case I have indefinite amount of delimiters. – RTian Feb 21 '20 at 18:05
  • @RTian if you know the columns before and in which one the problem occurs, the accepted answer in that question would resolve your problem, because it joins all the values after second column and before the forth column from the end – effy Feb 21 '20 at 18:22
  • @effy Hey, thanks for the clarification! Now I know how to proceed. – RTian Feb 21 '20 at 18:24
  • Are all columns that do not have commas in their values, guaranteed to be numeric? – trincot Feb 22 '20 at 20:42
  • @trincot Hi, all columns do not have commas except for C6 column, however not all of theme are numeric... Sorry for not being clear at the first place. – RTian Feb 24 '20 at 08:54

1 Answers1

0

Since you know what the desired number of rows are what you want to do is take the difference between the back of the row and the front using set(). You can just change the num_cols for other files.

import csv

filename = 'mycsv.csv'
num_cols = 26 # "The data frame has 27 columns in total"

with open(filename, newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        try:
            assert len(row) >= num_cols, f'The csv file does not contain at least {num_cols} columns.'
            after_sixth = row[-21:] # everything after the '6th' column
            before_sixth = row[:5]
            everything_else = after_six + before_sixth
            sixth_row = set(row)- set(everything_else)
            new_row = before_sixth + sixth_row + everything_else
            print(new_row)
        except AssertionError as e:
            print(e)
Doug
  • 234
  • 4
  • 11
  • Hi, thanks for the answer. This only extracts all the information after 5th column and save them as the sixth column. Then the similar issue remains that I need to split the real column 6 from this new column six. – RTian Feb 24 '20 at 09:55
  • Didn't see "the data frame has 27 columns in total". I updated the solution. – Doug Feb 24 '20 at 14:35
  • Hey, I managed to get the output I wanted. My method is a bit different from yours. I use a loop first to identify the number of columns before and after the problematic column for each file and then slice each csv-file into three pieces. Since you are the only one who answers the question, I set yours as accepted answer. – RTian Feb 24 '20 at 20:00