How to delete duplicate rows based on certain conditions

Question

This is an example of what I am trying to do on a different dataset, but it is still not working.

  PORT      STATUS     VESSEL          DWT   IMP/EXP  QTY (Mts)

1 KANDLA    SAILED     CAPTAIN HAMADA  7938  EXP      4500
2 KAKINADA  EXPECTED   CELON BREEZE          IMP      30000
3 KAKINADA  BERTH      CELON BREEZE          IMP      3000
4 KAKINADA  SAILED     CELON BREEZE          IMP      30000
5 KANDLA    ANCHORAGE  CAPTAIN HAMADA        EXP      4500
6 KAKINADA  BERTH      CELON BREEZE          IMP      30000

I want to compare each row's PORT, VESSEL, IMP/EXP and QTY with those of every other row. When two rows match, one of them should be deleted according to a priority order on STATUS, and that order depends on IMP/EXP.

If IMP/EXP in the row is "IMP", the priority order of STATUS is:

     priority- sailed > berth > anchorage > expected

The row with the highest-priority STATUS is kept and the matching rows are deleted. For example, the 2nd row is deleted because it matches the 4th row on qty, port and vessel, and the 4th row has STATUS = sailed. In general:

  1) if one row has STATUS = sailed and a matching row has berth, the berth row is deleted
  2) if one row has sailed and a matching row has expected, the expected row is deleted
  3) if one row has berth and a matching row has anchorage, the anchorage row is deleted
  4) if one row has expected and a matching row has sailed, the expected row is deleted

and so on. A row is deleted only if it matches another row on qty, port and vessel.

If IMP/EXP is "EXP", rows must match on the same columns (qty, port, vessel), but the priority order of STATUS is:

     priority- sailed > anchorage > expected > berth

The OUTPUT should be

  PORT      STATUS     VESSEL          DWT   IMP/EXP  QTY (Mts)

1 KANDLA    SAILED     CAPTAIN HAMADA  7938  EXP      4500
3 KAKINADA  BERTH      CELON BREEZE          IMP      3000
4 KAKINADA  SAILED     CELON BREEZE          IMP      30000

In the desired output, the 2nd, 5th and 6th rows are deleted.

From your example, I am a little confused: are you trying to get unique rows? If yes, try `unique(hey)`. Otherwise, can you specify which two columns you are checking to see if they match in a given row? – Sal, Jul 21 '17 at 05:34
It's a sample; I don't need unique rows exactly. Can you fix the problem in the same way by editing the code? My actual problem uses different cases to delete rows, and this technique does not work for it. @Sal – Rishabh Kashyap, Jul 21 '17 at 05:38
@RishabhKashyap - so what do you want? Your code doesn't make much sense. I'm guessing some combination of `?duplicated` will get you there, but you're going to have to be clearer as to your criteria for deletion. – thelatemail, Jul 21 '17 at 05:41
If it is a specific column, use `duplicated`, i.e. `hey[!duplicated(hey[,1]),]` – akrun, Jul 21 '17 at 05:41
I want to delete duplicate rows, where a row is a duplicate if variables like qty, vessel and receiver match those of another row. Which row is deleted depends on the IMP/EXP attribute (import/export). For imports, STATUS priority is sailed > berth > anchorage > expected. If the 1st row has status = sailed and another row that matches it on qty, vessel and receiver has status = berth, then the berth row is deleted. Likewise, if the 1st row has status = sailed and the other has status = anchorage, the other row is deleted. @thelatemail – Rishabh Kashyap, Jul 21 '17 at 05:45
Your comments are not making any sense without a fully representative sample of the data. Also make your problem reproducible. – Sal, Jul 21 '17 at 05:47
Sure, will be happy to help. In the second line, is `7938 EXP` a single string? – akrun, Jul 21 '17 at 06:58
Were you able to import this data into the R environment as a data.frame? – tushaR, Jul 21 '17 at 07:01

tushaR · Answer 1 · 2017-07-21T16:18:54.777

First of all, you need to read the data into R as a data.frame. The data.frame `test` should look like this:

> test

#      PORT    STATUS         VESSEL  DWT IMPEXP   QTY
#1   KANDLA    SAILED CAPTAIN HAMADA 7938    EXP  4500
#2 KAKINADA  EXPECTED   CELON BREEZE   NA    IMP 30000
#3 KAKINADA     BERTH   CELON BREEZE   NA    IMP  3000
#4 KAKINADA    SAILED   CELON BREEZE   NA    IMP 30000
#5   KANDLA ANCHORAGE CAPTAIN HAMADA   NA    EXP  4500
#6 KAKINADA     BERTH   CELON BREEZE   NA    IMP 30000
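
For reproducibility, a data frame matching the printout above could be built by hand. This is only a sketch: the column names and values are copied from the printout, while `stringsAsFactors = FALSE` is an assumption made here so that STATUS stays a plain character column.

```r
# Sample data copied from the printout above.
# DWT is NA wherever the question gives no value.
test <- data.frame(
  PORT   = c("KANDLA", "KAKINADA", "KAKINADA", "KAKINADA", "KANDLA", "KAKINADA"),
  STATUS = c("SAILED", "EXPECTED", "BERTH", "SAILED", "ANCHORAGE", "BERTH"),
  VESSEL = c("CAPTAIN HAMADA", "CELON BREEZE", "CELON BREEZE",
             "CELON BREEZE", "CAPTAIN HAMADA", "CELON BREEZE"),
  DWT    = c(7938, NA, NA, NA, NA, NA),
  IMPEXP = c("EXP", "IMP", "IMP", "IMP", "EXP", "IMP"),
  QTY    = c(4500, 30000, 3000, 30000, 4500, 30000),
  stringsAsFactors = FALSE  # assumption: keep strings as character, not factor
)
```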

Using the `ddply` function from the plyr package, you should be able to get the desired output with the following function.

library(plyr)

# Priority orders, lowest to highest, for imports and exports
imp_order <- c("EXPECTED", "ANCHORAGE", "BERTH", "SAILED")
exp_order <- c("BERTH", "EXPECTED", "ANCHORAGE", "SAILED")

ddply(test, .variables = c("PORT", "VESSEL", "IMPEXP", "QTY"),
  function(t) {
    # Pick the priority order that applies to this group
    lev <- if (t$IMPEXP[1] == "IMP") imp_order else exp_order
    # Keep only the row whose STATUS ranks highest in that order
    t[which.max(match(t$STATUS, lev)), ]
  }
)

#      PORT STATUS         VESSEL  DWT IMPEXP   QTY
#1 KAKINADA  BERTH   CELON BREEZE   NA    IMP  3000
#2 KAKINADA SAILED   CELON BREEZE   NA    IMP 30000
#3   KANDLA SAILED CAPTAIN HAMADA 7938    EXP  4500
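
The same result can probably be reached in base R with `order` and `duplicated`, as hinted at in the comments on the question. The following is a sketch rather than a tested drop-in: the names `imp_order`, `exp_order`, `prio` and `result` are made up for illustration, and the sample data is rebuilt so the block runs on its own.

```r
test <- data.frame(
  PORT   = c("KANDLA", "KAKINADA", "KAKINADA", "KAKINADA", "KANDLA", "KAKINADA"),
  STATUS = c("SAILED", "EXPECTED", "BERTH", "SAILED", "ANCHORAGE", "BERTH"),
  VESSEL = c("CAPTAIN HAMADA", "CELON BREEZE", "CELON BREEZE",
             "CELON BREEZE", "CAPTAIN HAMADA", "CELON BREEZE"),
  DWT    = c(7938, NA, NA, NA, NA, NA),
  IMPEXP = c("EXP", "IMP", "IMP", "IMP", "EXP", "IMP"),
  QTY    = c(4500, 30000, 3000, 30000, 4500, 30000),
  stringsAsFactors = FALSE
)

# Priority orders, lowest to highest (hypothetical helper names)
imp_order <- c("EXPECTED", "ANCHORAGE", "BERTH", "SAILED")
exp_order <- c("BERTH", "EXPECTED", "ANCHORAGE", "SAILED")

# Priority of each row's STATUS within the order for its IMPEXP value
prio <- ifelse(test$IMPEXP == "IMP",
               match(test$STATUS, imp_order),
               match(test$STATUS, exp_order))

# Sort so the highest-priority row of each group comes first, then keep
# only the first row per (PORT, VESSEL, IMPEXP, QTY) combination
sorted <- test[order(-prio), ]
keys   <- sorted[, c("PORT", "VESSEL", "IMPEXP", "QTY")]
result <- sorted[!duplicated(keys), ]

# Restore the original row order
result <- result[order(as.integer(rownames(result))), ]
```

On the sample data this keeps rows 1, 3 and 4. A STATUS outside the four listed values gets an `NA` priority and sorts last, so such a row would still be dropped if it shares its key columns with a row that has a known STATUS.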

edited Jul 21 '17 at 16:18

answered Jul 21 '17 at 09:16

tushaR


The order is different for imports (sailed > berth > anchorage > expected) and for exports (sailed > anchorage > expected > berth). Also, there are some other STATUS values that are arbitrary strings. How can I handle them so that they remain unaffected? They become NAs, and I need those strings in the output as well. The values include: sailed, anchorage, expected, berth, VESSEL AT OUTER ANCHORAGE, VESSEL AT KUTUBDIA, VESSEL AT ANCHOARGE, VESSEL NOT ENTERING. How can I keep VESSEL AT OUTER ANCHORAGE, VESSEL AT KUTUBDIA etc. unaffected, since they are becoming NAs? @Tushar – Rishabh Kashyap Jul 21 '17 at 09:44
@RishabhKashyap Start from the basics. You should have asked these questions first. Whatever issues you are facing, solve them one at a time and post a new question related to these issues. You need to clean your data first before jumping to process your data. – tushaR Jul 21 '17 at 09:47
Sir, just one last query: how can I avoid NAs without listing every status that is present and should be shown? `dat$STATUS<-factor(x = dat$STATUS,levels =c("BERTH","EXPECTED","ANCHORAGE","SAILED"),ordered=T)` gives NAs wherever the value is not one of these four levels. I don't want to list the other values, because I don't want their duplicates deleted; I want them to stay in the STATUS column. How can I edit this so they don't become NAs? `dat$STATUS<-factor(x = dat$STATUS,levels =c("BERTH","EXPECTED","ANCHORAGE","SAILED",........),ordered=T)` @Tushar – Rishabh Kashyap Jul 21 '17 at 10:08
@RishabhKashyap : That is what I am saying, **start from basics**. You should look up what a `factor` is in R. Since your column `STATUS` has a lot of different strings, you need to either clean the data to reduce the factor to only 4 `levels`, like `BERTH,ANCHORAGE,SAILED,EXPECTED`, or give your factor more `levels`, like `BERTH,ANCHORAGE,SAILED,EXPECTED,VESSEL AT OUTER ANCHORAGE`. If you don't include "VESSEL AT OUTER ANCHORAGE" in the levels of the `factor`, it will be converted to `NA`. – tushaR Jul 21 '17 at 10:41
@RishabhKashyap Also, I have edited the answer for different orders based on IMPEXP value. – tushaR Jul 21 '17 at 10:48
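
The `NA` behaviour discussed in these comments can be reproduced on a tiny vector. The sketch below uses made-up variable names (`status`, `known`, `all_levels`), and the choice to rank unlisted statuses below the four known ones is an assumption, not something stated in the thread.

```r
status <- c("SAILED", "BERTH", "VESSEL AT KUTUBDIA")
known  <- c("BERTH", "EXPECTED", "ANCHORAGE", "SAILED")

# With a fixed level set, any string not listed in `levels` becomes NA
as_factor <- factor(status, levels = known, ordered = TRUE)
is.na(as_factor)
# [1] FALSE FALSE  TRUE

# Option 1: extend the levels with whatever else occurs in the data
# (assumption: unlisted statuses get the lowest priority)
all_levels <- c(setdiff(unique(status), known), known)
full <- factor(status, levels = all_levels, ordered = TRUE)
anyNA(full)
# [1] FALSE

# Option 2: leave STATUS as character and compute a priority with match();
# the original strings are untouched, only the priority is NA
prio <- match(status, known)
prio
# [1]  4  1 NA
```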
