I have a nlp datasets (about 300K samples) where there exits duplicate data. I want to split it to train test (70%-30%), and they should have no overlapping.
For instance:
|dataset: | train | test |
| a | a | c |
| a | a | c |
| b | b | c |
| b | b | |
| b | b | |
| c | d | |
| c | d | |
| c | | |
| d | | |
| d | | |
I have tired exhaustively random sample, but it too time consuming.