
I have an issue where a client needs to duplicate a column in a CSV file. The values are always going to be identical and unfortunately our API doesn't allow for duplicate columns to be specified in the JSON.

For example I have the following column structure and values:

Name, Surname, City, Age, Job
John, Doe, Johannesburg, 28, Technical Support

Now I need to duplicate City so the output should be:

Name, Surname, City, City Again, Age, Job
John, Doe, Johannesburg, Johannesburg, 28, Technical Support

The new column needs to be placed immediately after the one being duplicated, and its value depends on that first column.

  • The value in this case is Johannesburg. In the first example it appears once, under City; in the second it is duplicated under City Again. – Predator May 25 '17 at 09:47
  • So, in the first record, containing your column names, $3 != $4. How do you want to generate your headers? (I don't see the code you're using to try this in your question; please add it.) – ghoti May 25 '17 at 14:56

4 Answers


awk can handle this easily:

awk 'BEGIN{FS=OFS=", "} {$3 = $3 OFS $3} 1' file.csv

Name, Surname, City, City, Age, Job
John, Doe, Johannesburg, Johannesburg, 28, Technical Support
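
If the duplicated header should literally read City Again, as in the expected output in the question, the first line can be special-cased; a minimal variant of the same idea (assuming the header is always line 1):

awk 'BEGIN{FS=OFS=", "} NR==1{$3 = $3 OFS "City Again"} NR>1{$3 = $3 OFS $3} 1' file.csv

Name, Surname, City, City Again, Age, Job
John, Doe, Johannesburg, Johannesburg, 28, Technical Support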

Note that the awk approach does the job in a single, short command that is easy to read, and it is far more efficient than a pipeline that calls cut twice and then paste.

As @codeforester rightly commented below, cut doesn't let a column be repeated in its output; it is meant for extracting fields, not duplicating them.
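
This is easy to verify: with both GNU and BSD cut, asking for the same field twice still prints it once.

echo 'a,b,c' | cut -d',' -f2,2
b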

anubhava
  • Yep, that would be the way to do this if it were an awk question instead of a bash question. :-) – ghoti May 25 '17 at 14:54
  • OP may not be aware of the prowess of `awk` :) – anubhava May 25 '17 at 15:06
  • Very elegant and efficient. @anubhava - the point that `cut` doesn't let a column be repeated in its output could be noted in your answer. From `man cut`: **If a field or column is specified multiple times, it will appear only once in the output.** – codeforester May 25 '17 at 16:15
  • Thanks @codeforester for the kind words. I have added this to my answer as well. – anubhava May 25 '17 at 17:28
  • I'm not 100% familiar with the awk and cut tools. I'm mainly a PHP developer, but our current project demands bash scripts to keep consistency across all of them. There are going to be about 10k lines that run daily, so I'll try all the solutions with test files. Thanks so much guys! – Predator May 26 '17 at 06:25
  • `cut` and `paste` are also external utilities, the same as `awk`, and are found on every Unix/Linux system. – anubhava May 26 '17 at 08:47
  • That is because the `cut`/`paste` solution involves forking a subshell twice plus calling 3 external utilities, versus a single call to one external utility, `awk`. – anubhava May 26 '17 at 09:21
  • I created a csv file of `25000000` lines. The `awk` command took 58 sec to finish, and `cut+paste` has been running for the last 25 minutes. – anubhava May 26 '17 at 11:56
  • awk is the greatest. – jimh Jan 27 '20 at 22:14

Assuming that City will always be located at column 3, the cut and paste commands can be utilized. For example:

csv=path/to/somefile.csv
echo "$(paste -d',' <(cut -d',' -f1-3 "$csv") <(cut -d',' -f3- "$csv"))" > "$csv"

Notes:

  1. The path/to/somefile.csv part, which is assigned to the csv variable, should be replaced with the actual path to your .csv file.
  2. The new content, which includes the duplicated City column, is written back to the same source file (a temporary-file variant is sketched after these notes).
  3. The <(...) part is known as process substitution and is typically used when piping data is not possible.
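
As an aside, if overwriting the source file in place feels risky, the echo "$(...)" wrapper can be swapped for a temporary file; a minimal sketch of the same pipeline:

csv=path/to/somefile.csv
tmp=$(mktemp)    # temporary file to hold the result
paste -d',' <(cut -d',' -f1-3 "$csv") <(cut -d',' -f3- "$csv") > "$tmp" && mv "$tmp" "$csv"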
RobC
  • This is very inefficient for big files, as the whole file has to be read *three* times or slightly more than that! The [`awk`](https://stackoverflow.com/a/44182774/6862601) solution is far better as it reads the file only once. – codeforester May 25 '17 at 16:14
  • @codeforester - Thanks for commenting. Perhaps you can clarify what you mean by _"very inefficient for big files"_. The metrics I obtained, in terms of completion time, revealed that running the solution I provided across a `.csv` containing one million rows with five columns averaged at 12.2 seconds, whilst the _awk_ solution averaged at 13.7 seconds using the same data set. All times for both solutions were inclusive of the time taken to write the results back to disk. What metrics are you basing your comment on? Is the _awk_ solution particularly memory efficient or something? – RobC May 25 '17 at 22:36
  • In addition to the `awk` approach taking slightly longer to complete than the solution I provided, on average the `awk` solution also consumed more real memory too. The `awk` solution consumed 283Mb of memory, whilst this solution consumed 190Mb. These metrics were obtained whilst running the same test conditions mentioned in the previous comment. I can only conclude from these metrics that the comments regarding my solution being _"very inefficient for big files"_, are somewhat misleading and inaccurate. – RobC May 26 '17 at 09:04
  • If you use a big file as input, as @codeforester says, then the difference in behavior will be evident. I created a csv file with 25 million lines and ran this command. It didn't complete even after an hour, didn't write any line to the output file, and I had to kill the shell, whereas the `awk` command took 58 sec to complete. – anubhava May 26 '17 at 14:50

A while back, I wrote a bash function to insert array elements:

function array_insert {
    # usage: array_insert arrayname index [value]
    # Verify that the first argument names an array.
    if ! declare -p "$1" 2>/dev/null | grep -q '^declare -a'; then
        printf '%s: not an array: %s\n' "$0" "$1" >&2
        return 1
    fi
    local -n source="$1"                   # nameref to the caller's array (bash 4.3+)
    local -a indices=( "${!source[@]}" )
    # Shift elements up by one, from the end down to the insertion index.
    for ((i=${#indices[@]}-1; i>=$2; i--)) ; do
        source[$((i+1))]="${source[$i]}"
    done
    # Store the new value, if one was provided.
    if [ -n "$3" ]; then
        source[$2]="$3"
    fi
}

With this function loaded, you might do the following:

while read -r line; do
  IFS=, declare a=( $line )     # assign the line's fields to an array,
  array_insert a 3 "${a[2]}"    # insert the column in this line,
  o=$(printf '%s,' "${a[@]}")   # assemble your output,
  printf '%s\n' "${o%,}"        # remove the trailing comma.
done < input.txt

For me, this provides the following output using your input:

Name, Surname, City, City, Age, Job
John, Doe, Johannesburg, Johannesburg, 28, Technical Support

Note that in bash, arrays are indexed from 0, so ${a[2]} is the third column.
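
If the IFS=, declare line feels too subtle, the same comma split can be done with read -a; a sketch of the loop under the same assumptions (the array_insert call is unchanged):

while IFS= read -r line; do
  IFS=, read -r -a a <<< "$line"   # split the line on commas into array a
  array_insert a 3 "${a[2]}"       # duplicate the third column
  o=$(printf '%s,' "${a[@]}")      # assemble the output,
  printf '%s\n' "${o%,}"           # then remove the trailing comma
done < input.txt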

ghoti

Below is a small script which takes two inputs:

  1. the file name
  2. the name of the column to duplicate

The script:

echo "Enter  CSV File  name "
read fileName

echo "Enter Column name to be duplicated "
read columnName


columnNumber=$(head -1 "$fileName" | awk -v RS="," "/$columnName/{print NR}")   # identify the column number from the column name
totalNumberOfColumn=$(head -1 "$fileName" | awk -F',' '{print NF}')             # identify the total number of columns

str=""                                                                          # build up the list of fields for awk to print
for ((i=1; i<=totalNumberOfColumn; i++)); do
    str="$str \$$i\",\""
    if [ "$i" == "$columnNumber" ]; then
        str="$str \$$i\",\""                                                    # repeat the matched column
    fi
done

awk -F',' "{print ${str}}" "$fileName" | sed 's/,$//g'                          # print all columns, including the duplicate

Below is the content of the file; the file name is data.csv:

:cat data.csv
Name, Surname, City, Age, Job
John, Doe, Johannesburg, 28, Technical Support

Output 1: duplicating the column City

:bash script.sh
Enter  CSV File  name
data.csv
Enter Column name to be duplicated
City
Name, Surname, City, City, Age, Job
John, Doe, Johannesburg, Johannesburg, 28, Technical Support

Output 2: duplicating the column Name

:bash script.sh
Enter  CSV File  name
data.csv
Enter Column name to be duplicated
Name
Name,Name, Surname, City, Age, Job
John,John, Doe, Johannesburg, 28, Technical Support
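
For what it's worth, the column lookup and the print loop could be folded into one awk call that takes the column name as a variable; a minimal sketch (assuming the header is line 1 and the separator is ', ', as in data.csv):

awk -v col="City" 'BEGIN{FS=OFS=", "}
    NR==1 {for (i=1; i<=NF; i++) if ($i == col) c = i}   # find the column by name in the header
    c     {$c = $c OFS $c}                               # duplicate it on every line, header included
    1' data.csv

If the name never matches, c stays unset and the file prints unchanged.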
Sujit Dhamale