Removing duplicate lines and splitting a parallel corpus

Question

I have a parallel English-French corpus (text.en, text.fr). Each file contains around 500K lines (sentences in the source and target language). What I want to do is:

1. Remove the duplicated lines in both files using a Python command while avoiding any alignment problems. For example, if the command deletes line 32 in text.en, it must also delete line 32 in text.fr.
2. Split both files into train/dev/test data: only 1K lines for dev, 1K for test, and the rest for train. I need to split text.en and text.fr with the same command so that corresponding sentences stay aligned in both files. It would be better if the dev and test data could be extracted randomly, which should help get better results.

How can I do that? Please write the commands. I appreciate any help. Thank you!

Do you mean a text line or a grammatical sentence when you say duplicate lines? — Azhy, Jun 06 '18 at 15:52
I have tried Weka, KFold, cross-validation, and some Python scripts, but all they did was split a single file. I need to split both files and keep the alignment. Thank you. — lura.zanobia, Jun 06 '18 at 16:26

score -1 · Answer 1 · answered Jun 06 '18 at 16:41

If by "lines" you mean grammatical sentences, then you first need to split the text into sentences:

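Eng = "..."
Frn = "..."
GEngLines = Eng.split(".")
GFrnLines = Frn.split(".")

# Keep the first occurrence of each (English, French) pair, so both
# lists lose the same positions and stay aligned.
seen = set()
UniqueEng = []
UniqueFrn = []
for eng, frn in zip(GEngLines, GFrnLines):
    if (eng, frn) not in seen:
        seen.add((eng, frn))
        UniqueEng.append(eng)
        UniqueFrn.append(frn)
GEngLines = UniqueEng
GFrnLines = UniqueFrn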

# 1K lines each for dev and test, as asked in the question
DevLinesNumber = 1000
TestLinesNumber = 1000

EngDevLines = []
EngTestLines = []
EngTrainLines = []

FrnDevLines = []
FrnTestLines = []
FrnTrainLines = []

# The same index is used for both languages, so the pairs stay aligned.
for i in range(len(GEngLines)):
    if i < DevLinesNumber:
        EngDevLines.append(GEngLines[i])
        FrnDevLines.append(GFrnLines[i])
    elif i < DevLinesNumber + TestLinesNumber:
        EngTestLines.append(GEngLines[i])
        FrnTestLines.append(GFrnLines[i])
    else:
        EngTrainLines.append(GEngLines[i])
        FrnTrainLines.append(GFrnLines[i])
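
If "lines" means the physical lines of text.en and text.fr, one way to do the whole job is to treat each (English, French) line pair as a unit: drop repeated pairs, shuffle the pairs once, then cut dev, test and train from that single shuffled order. The following is only a sketch under some assumptions: both files are UTF-8 with one sentence per line and the same number of lines, a "duplicate" means the whole pair is repeated, and the helper names, the output names (train.en, dev.fr, ...) and the seed are made up for illustration.

```python
import random


def dedup_pairs(src_lines, tgt_lines):
    """Drop repeated (source, target) pairs, keeping the first occurrence."""
    seen = set()
    pairs = []
    for pair in zip(src_lines, tgt_lines):
        if pair not in seen:
            seen.add(pair)
            pairs.append(pair)
    return pairs


def split_pairs(pairs, dev_size=1000, test_size=1000, seed=13):
    """Shuffle the pairs once, then slice dev, test and train from that order."""
    if len(pairs) < dev_size + test_size:
        raise ValueError("corpus is smaller than dev + test")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # fixed seed: repeatable split
    dev = shuffled[:dev_size]
    test = shuffled[dev_size:dev_size + test_size]
    train = shuffled[dev_size + test_size:]
    return train, dev, test


def read_lines(path):
    """Read a UTF-8 text file into a list of lines without the newline."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def write_split(name, pairs, src_ext="en", tgt_ext="fr"):
    """Write one split to name.en / name.fr (hypothetical output names)."""
    with open(f"{name}.{src_ext}", "w", encoding="utf-8") as src_out, \
         open(f"{name}.{tgt_ext}", "w", encoding="utf-8") as tgt_out:
        for src, tgt in pairs:
            src_out.write(src + "\n")
            tgt_out.write(tgt + "\n")


def main():
    """Deduplicate text.en / text.fr and write train, dev and test files."""
    en = read_lines("text.en")
    fr = read_lines("text.fr")
    if len(en) != len(fr):
        raise ValueError("text.en and text.fr have different line counts")
    train, dev, test = split_pairs(dedup_pairs(en, fr))
    for name, pairs in (("train", train), ("dev", dev), ("test", test)):
        write_split(name, pairs)
```

Calling main() in the directory that holds text.en and text.fr would write six files. Because the two sides are only ever moved around as pairs, line N of dev.en still corresponds to line N of dev.fr. If a duplicate should instead mean "same English line regardless of the French side", the set could be keyed on pair[0] rather than on the whole pair.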

You are comparing English sentences to French sentences there. How is that supposed to work? — Vlad, Jun 06 '18 at 18:42
