0

I have a very large excel file (150000 rows). For each row I have a string in column 1 that I need to find and remove from column 2.

Input column 1    Input column 2                     Output column3
-------------------------------------------------------------------------
dog3456           dog3456; cat2345; chicken3595      cat2345; chicken3595
sheep23349        bird2048; rhino2948; sheep23349    sheep23349 
frog248034        toad3956; frog248034               frog248034

Does anyone have any advice please?

TylerH
  • 20,799
  • 66
  • 75
  • 101
chives
  • 9
  • 1
  • Not my area, but fairly sure awk and other shell tools don't handle excel files (ie xls/xlsx). You may be better off using a tool like Python to process the file. – Gavin May 15 '18 at 10:06
  • I've also got the file in a tab delimited and csv format – chives May 15 '18 at 10:09
  • OK - probably possible with awk etc then. Other thing is your description and example don't match (First line of example matches description but line 2 and 3 show opposite behaviour). Which is it? – Gavin May 15 '18 at 10:18
  • Did you try anything? Also, please edit the data properly, preceed the data with four spaces or use the `{}` in the editor. Real tabs would be nice. – James Brown May 15 '18 at 10:31
  • Sorry, Iine 1 shows the correct formatting. Please ignore lines 2 and 3. – chives May 15 '18 at 10:35
  • @chives, please wrap up your code in CODE TAGS a button `{}` which you will see during your post edit and let us know then complete requirements of your question. – RavinderSingh13 May 15 '18 at 11:34
  • What's your question? Are you asking for help with how to write code to do what you want? Or do you have code and want it to be improved? – TylerH Sep 02 '22 at 15:28

1 Answers1

0

I decided you are exporting your two columns of spread sheet as tab delimited.

cat data.tab  
dog3456 dog3456; cat2345; chicken3595
sheep23349  bird2048; rhino2948; sheep23349
frog248034  toad3956; frog248034

and the third column is output by the awk script.

awk -F'\t' '{split($2,a,";"); for(i in a){gsub(" ","",a[i]); if(a[i]!=$1)printf("%s; ",a[i])}print ""}' data.tab 

outputs:

cat2345; chicken3595; 
bird2048; rhino2948; 
toad3956; 
TylerH
  • 20,799
  • 66
  • 75
  • 101
tomc
  • 1,146
  • 6
  • 10