2

I want to concatenate 400 csv large files with same number of columns but the columns names are different

For example

File#1

id 101mean
a1 2
a2 6

File#2

id 202mean
a1 3
a2 2

Expected result

id mean
a1 2
a2 6
a1 3
a2 2

The rows are one below the other

So what should I do to merge the files based on index of columns or change the headers for all csv large files, with Python or R?

Also I can not use pandas because the size of files are very large

Thanks in advance

Jamiu S.
  • 5,257
  • 5
  • 12
  • 34
Joman
  • 21
  • 5
  • Did you try [Dask](https://docs.dask.org/en/stable/10-minutes-to-dask.html) which can handle more rows than Pandas? – Soumendra Dec 21 '22 at 11:40
  • 1
    if the columns have different names, but are in the same order, then in `R` you could use somthing like `data.table::rbindlist(lapply(my.files, data.table::fread), use.names = FALSE)` – Wimpel Dec 21 '22 at 11:47
  • yes, but I couldn't merge files so that the rows are under each other because the names of the columns are different @Soumendra how do I make the column names the same in all files so that I can concatenate them? – Joman Dec 21 '22 at 11:53
  • Check [this answer](https://stackoverflow.com/a/53813794/5014656) on renaming columns with Dask. – Soumendra Dec 21 '22 at 11:55

1 Answers1

1

If you only need the indexes, then you can just add all the rest of the data by ignoring the header row of each CSV file you want to add:

# Create the output header row
f = open('output.csv', 'w')
f.write("id,mean\n")

# Iterate for all the CSVs you want to merge together
csv_files = ['csv1.csv', 'csv2.csv']

for filename in csv_files:
    with open(filename) as open_csv:
        first_row = True
        for line in open_csv:
            # Ignore the header row
            if first_row:
                first_row = False
                continue
        
        # Add all the rest of the CSV data to the output file
        f.write(line)

# Close the output file
f.close()

First you add the header column for the new data, then you add the entirety of each CSV you want to append, but you ignore the first line of each one (since it would be the header line). As a diagram, it works like this:

id mean
id 101mean
a1 2
a2 6
id 202mean
a1 3
a2 2
Xiddoc
  • 3,369
  • 3
  • 11
  • 37