
I have two text files. I want to take the text that sits between a closing </sup> tag and the next <sup> tag in the first file, and insert it into the empty {} in the second file.

A better example (something like a dictionary):

Text1:

<sup>1</sup>dog
<sup>2</sup>cat
<sup>3</sup>lion
<sup>1</sup>flower
<sup>2</sup>tree
.
.

Text2:

\chapter1
\pkt{1}{}{labrador retirever is..}
\pkt{2}{}{home pets..}
\pkt{3}{}{wild cats..}
\chapter2
\pkt{1}{}{red rose}
\pkt{2}{}{lemon tree}
.
.

What I want:

Text3:

\chapter1
\pkt{1}{dog}{labrador retirever is..}
\pkt{2}{cat}{home pets..}
\pkt{3}{lion}{wild cats..}
\chapter2
\pkt{1}{flower}{red rose}
\pkt{2}{tree}{lemon tree}

The text is arbitrary, but you can see what I want. Perl would be best.

So get

</sup>**text**<sup>

and paste it to

\pkt{nr}{**here**}{this is translation of this word already stored in text2}.

Text A and Text B are in the same order, so it would be great if I could read the first </sup>text<sup> from Text A, save it in a temporary variable, delete that line from Text A, put the text into the first free {} slot in Text B, and then start over. The numbers will match because the order is preserved. Sorry for my English :) Thanks!

chuguruk
  • Is there always a and tag in text file 2, mapping a number to a string? – thomasa88 May 01 '11 at 14:05
  • Do you really need Regular Expressions? You should be able to just search for the patterns: Search for and {} – Steve Wellens May 01 '11 at 14:06
  • @thomasa88, @Steve Wellens: I've edited my post, so now you can see where my problem is. Regular expressions are not required; anything that works is fine. – chuguruk May 01 '11 at 17:55

1 Answer


This code puts all dict items into an array, in the order they appear. It then loops over the tex file line by line, and each time it hits \pkt{num}{} it inserts the next item from the array.

Newlines inside dict entries are replaced with spaces (remove the s/\n/ /g substitution in the map if you don't want this behavior). \pkt is found as long as the \pkt{num}{} part does not span multiple lines. If it does, I think the easiest solution is to undef $/ (the input record separator), read the whole file into a string, and run the replacement on that string (this could be a bit memory hungry, though).

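#!/usr/bin/perl -wT

use strict;

my $dict_filename = 'text1';
my $tex_filename = 'text2';
my $out_filename = 'text3';

open(my $dict_fh, '<', $dict_filename) or die "Cannot open $dict_filename: $!";
my @dict;
{
    # Set the input record separator to <sup>
    local $/ = '<sup>';
    # Throw away the first "line"; it is empty when the file starts with <sup>
    <$dict_fh>;
    # Extract the text after </sup>, skipping records that do not match
    # (otherwise a stale $1 would be reused), and replace newlines with spaces
    @dict = map { my $text = $_; $text =~ s/\n/ /g; $text }
            map { m@</sup>\s*(.*?)\s*(?:<sup>|$)@s ? $1 : () } <$dict_fh>;
}
close($dict_fh);

open(my $tex_fh, '<', $tex_filename) or die "Cannot open $tex_filename: $!";
open(my $out_fh, '>', $out_filename) or die "Cannot open $out_filename: $!";

my $dict_pos = 0;
while (my $tex_line = <$tex_fh>)
{
    # Replace any \pkt{num}{} with \pkt{num}{text}; leave the slot empty
    # once the dict entries run out
    $tex_line =~ s/(\\pkt\{\d+\}\{)(\})/$1 . ($dict_pos < @dict ? $dict[$dict_pos++] : '') . $2/ge;

    print $out_fh $tex_line;
}

close($tex_fh);
close($out_fh) or die "Cannot close $out_filename: $!";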
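
The whole-file alternative mentioned above (undef $/ and run the replacement on one string) could look roughly like the sketch below. It is only an illustration under assumptions: the helper names slurp, extract_words and fill_slots are made up for this example, the \s* in the pattern is an added guess at how a \pkt{num}{} split across lines would look, and the sample data is inlined so the sketch runs without any files.

```perl
use strict;
use warnings;

# Read a whole file into one string by undefining $/ locally.
sub slurp {
    my ($filename) = @_;
    open(my $fh, '<', $filename) or die "Cannot open $filename: $!";
    local $/;
    my $content = <$fh>;
    close($fh);
    return $content;
}

# Hypothetical helper: collect the text that follows each </sup>,
# with newlines inside an entry replaced by spaces.
sub extract_words {
    my ($dict) = @_;
    my @words = $dict =~ m@</sup>\s*(.*?)\s*(?=<sup>|\z)@gs;
    s/\n/ /g for @words;
    return @words;
}

# Hypothetical helper: fill each empty \pkt{num}{} slot with the next word.
# The \s* parts allow the pieces of \pkt{num}{} to sit on different lines.
sub fill_slots {
    my ($tex, $words) = @_;
    my $pos = 0;
    $tex =~ s/(\\pkt\{\d+\}\s*\{)\s*(\})/$1 . ($pos < @$words ? $words->[$pos++] : '') . $2/ge;
    return $tex;
}

# Inline sample data; with real files use slurp('text1') and slurp('text2').
my $dict = "<sup>1</sup>dog\n<sup>2</sup>big\ncat\n";
my $tex  = "\\pkt{1}\n{}{labrador}\n\\pkt{2}{}{home pets}\n";
print fill_slots($tex, [ extract_words($dict) ]);
```

Because the whole tex file is held in memory as one string, this trades memory for the ability to match across line breaks, which is the trade-off the paragraph above mentions.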
thomasa88
  • Perl would be better, I think, but the problem is that the numbers repeat (the text is unique). – chuguruk May 01 '11 at 16:19
  • @chuguruk: then you need to add an example with a repeated number and show what output you want in that case – ysth May 01 '11 at 16:55
  • @chuguruk I guess you could store a list of texts for each number in the dictionary if you are going to merge them somehow. Or merge them directly in the dictionary. Is it more complicated? Describe it in your question if you need help with that also. – thomasa88 May 01 '11 at 16:55
  • Thank you, but something isn't right. It does the job for only a few {}, and in some cases it produces \pkt{4}{\pkt{4}{}{original text}. Both loops behave the same way. Maybe it is because some texts to copy span multiple lines? – chuguruk May 01 '11 at 19:38
  • Hmm, strange. Do you mean that some texts in the "dict" file span multiple lines? Do you have multiple \pkt entries on the same line (I think this should be handled), or do some of the texts to be inserted contain "}"? Since the result you get contains $1 from the outer match, it seems the code is running out of texts in the dict file and therefore gets no match to update $1. – thomasa88 May 01 '11 at 19:48
  • Yes, texts in the "dict" file span 1 to 3 lines. Only the first line was copied. I don't see any "{" or "}" in the "dict" file, but there are other characters like "',:;?().". I don't have multiple \pkt entries on the same line in the input files, only in the output, which is wrong. It should be \pkt{4}{dict text}{original text}, but it is \pkt{4}{\pkt{4}{}{original text}, and most of the {} are left empty. – chuguruk May 01 '11 at 20:00
  • Works great, but how can I clean out junk tags before inserting into {}? I can easily clean up after inserting into "{}" by adding an extra "$_ =~ s/text/replace/g;" in the map, but I need to remove everything matching \*\[(.*) and \'\[(.*) before that, otherwise some {} will be left empty. – chuguruk May 02 '11 at 14:54
  • OK, I opened the file, cleaned it up with a while loop, saved it as a temp file, and then opened that temp file. But now I see that some {} lines to fill in are missing, so it would be good to check whether nr == \pkt{nr} and only then paste the text there. – chuguruk May 02 '11 at 16:50
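
The number check asked about in the last comment could be sketched as follows. This is a guess at the intended behavior, not something confirmed in the thread: it assumes that a dict entry should only be inserted when its <sup>nr</sup> number equals the nr in \pkt{nr}{}, that dict entries without a matching slot may be dropped, and that looking a few entries ahead is enough to realign. The names parse_entries, fill_matching and the $max_skip limit are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the dict text into a list of [number, text] pairs.
sub parse_entries {
    my ($dict) = @_;
    my @entries;
    while ($dict =~ m@<sup>(\d+)</sup>\s*(.*?)\s*(?=<sup>|\z)@gs) {
        my ($nr, $text) = ($1, $2);
        $text =~ s/\n/ /g;
        push @entries, [$nr, $text];
    }
    return \@entries;
}

# Hypothetical helper: fill \pkt{nr}{} only with an entry whose number is nr.
# The current entry and up to $max_skip entries after it are checked;
# if none matches, the slot stays empty and the position is not moved.
sub fill_matching {
    my ($tex, $entries, $max_skip) = @_;
    my $pos = 0;
    my $next_text = sub {
        my ($nr) = @_;
        my $last = $pos + $max_skip;
        $last = $#$entries if $last > $#$entries;
        for my $i ($pos .. $last) {
            next unless $entries->[$i][0] == $nr;
            $pos = $i + 1;    # entries skipped on the way are dropped
            return $entries->[$i][1];
        }
        return '';
    };
    $tex =~ s/(\\pkt\{(\d+)\}\{)(\})/$1 . $next_text->($2) . $3/ge;
    return $tex;
}

# Sample data based on the question, with the \pkt{2} line of chapter 1 missing.
my $dict = "<sup>1</sup>dog\n<sup>2</sup>cat\n<sup>3</sup>lion\n"
         . "<sup>1</sup>flower\n<sup>2</sup>tree\n";
my $tex  = "\\chapter1\n\\pkt{1}{}{labrador}\n\\pkt{3}{}{wild cats}\n"
         . "\\chapter2\n\\pkt{1}{}{red rose}\n\\pkt{2}{}{lemon tree}\n";
print fill_matching($tex, parse_entries($dict), 1);
```

Since the numbers repeat in every chapter, a large $max_skip could jump into the next chapter by mistake, so a small value is the safer choice here.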