I have a parallel English-French corpus (text.en, text.fr). Each file contains around 500K lines (sentences in the source and target language). What I want to do is:

1. Remove the duplicated lines from both files with a Python command, without causing any alignment problem between the files. E.g., if the command deletes line 32 in text.en, it must of course delete line 32 in text.fr as well.
2. Then split both files into train/dev/test data: only 1K lines for dev, 1K for test, and the rest for train. I need to split text.en and text.fr with the same command, so that the alignment is kept and the sentences in both files still correspond.

It would be better if the test and dev data could be extracted randomly, as that will help get better results. How can I do that? Please write the commands. I appreciate any help, thank you!
Viewed 468 times
-2
-
Do you mean a text line or a grammatical sentence when you say duplicate lines? – Azhy Jun 06 '18 at 15:52
-
You should probably hire a programmer for this task! – karakfa Jun 06 '18 at 15:55
-
I mean a text line; each sentence is a line in the file. – lura.zanobia Jun 06 '18 at 15:57
-
What have you tried so far? – jeremysprofile Jun 06 '18 at 16:20
-
I have tried Weka, KFold, cross-validation, and some Python scripts, but all they did was split a single file. I need to split both files and keep the alignment, thank you. – lura.zanobia Jun 06 '18 at 16:26
-
What have you tried so far? Where is your approach? – colidyre Jun 06 '18 at 16:50
1 Answer
-1
If by lines you mean grammatical sentences, then you first need to split the text into sentences:

    Eng = "..."  # full English text
    Frn = "..."  # full French text

    # Naive sentence split on full stops.
    GEngLines = Eng.split(".")
    GFrnLines = Frn.split(".")

    # Remove duplicates as (English, French) pairs, so a sentence dropped
    # from one side is dropped from the other and the alignment is kept.
    # zip() stops at the shorter list; dict.fromkeys() keeps the first
    # occurrence of each pair in its original order (Python 3.7+).
    Pairs = list(dict.fromkeys(zip(GEngLines, GFrnLines)))

    DevLinesNumber = 1000   # 1K for dev, as asked in the question
    TestLinesNumber = 1000  # 1K for test
    EngDevLines = []
    EngTestLines = []
    EngTrainLines = []
    FrnDevLines = []
    FrnTestLines = []
    FrnTrainLines = []

    for i, (EngLine, FrnLine) in enumerate(Pairs):
        if i < DevLinesNumber:
            EngDevLines.append(EngLine)
            FrnDevLines.append(FrnLine)
        elif i < DevLinesNumber + TestLinesNumber:
            EngTestLines.append(EngLine)
            FrnTestLines.append(FrnLine)
        else:
            EngTrainLines.append(EngLine)
            FrnTrainLines.append(FrnLine)

Don't forget the indentation: each nested level takes 4 spaces.
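
If each line of the files is already one sentence, as clarified in the comments, a line-based variant with a random split could look like the sketch below. It rests on a few assumptions that are not in the question: a "duplicate" is taken to mean a repeated (English, French) pair, the files are UTF-8, and the helper names (`read_lines`, `dedup_pairs`, `split_pairs`, `write_pairs`, `split_corpus`) and the output file names (`train.en`, `dev.fr`, and so on) are made up for illustration.

```python
import random


def read_lines(path):
    """Read a text file into a list of lines, without the trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def dedup_pairs(src_lines, tgt_lines):
    """Drop repeated (source, target) pairs, keeping first occurrences in order."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("files are not aligned: line counts differ")
    # dict.fromkeys() preserves insertion order (Python 3.7+).
    return list(dict.fromkeys(zip(src_lines, tgt_lines)))


def split_pairs(pairs, n_dev=1000, n_test=1000, seed=0):
    """Shuffle the pairs with a fixed seed and return (train, dev, test)."""
    if len(pairs) < n_dev + n_test:
        raise ValueError("not enough pairs for the requested dev/test sizes")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


def write_pairs(pairs, src_path, tgt_path):
    """Write the two sides of each pair to two files, one pair per line number."""
    with open(src_path, "w", encoding="utf-8") as fs, \
         open(tgt_path, "w", encoding="utf-8") as ft:
        for src, tgt in pairs:
            fs.write(src + "\n")
            ft.write(tgt + "\n")


def split_corpus(src_path, tgt_path, src_ext="en", tgt_ext="fr"):
    """Deduplicate two aligned files and write train/dev/test files for each."""
    pairs = dedup_pairs(read_lines(src_path), read_lines(tgt_path))
    train, dev, test = split_pairs(pairs)
    for name, part in (("train", train), ("dev", dev), ("test", test)):
        write_pairs(part, f"{name}.{src_ext}", f"{name}.{tgt_ext}")
```

Calling `split_corpus("text.en", "text.fr")` would then write six files in the current directory. Because every operation works on pairs rather than on the two files separately, line N of each output `.en` file still corresponds to line N of the matching `.fr` file.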

Azhy
-
You compare English sentences to French sentences there. How is that supposed to work? – Vlad Jun 06 '18 at 18:42