
I have a list of phrases in a single file ("phrases"), each on its own line.

I also have another file, which contains a list of words, one per line ("words").

I wish to append an asterisk to the end of every phrase in "phrases" that begins with a word listed in "words".

For example:

File "phrases":

gone are the days
hello kitty
five and a half
these apples are green

File "words":

five
gone

Expected result in "phrases" after the operation:

gone are the days *
hello kitty
five and a half *
these apples are green

What I have done so far is this:

parallel -j0 -a words -q perl -i -ne 'print "$1 *" if /^({}\s.*)$/' phrases

But this truncates the file and sometimes (not always) gives me this error:

Can't remove phrases: No such file or directory, skipping file.

Because the edits will be made concurrently, my intention is for each job to search and replace ONLY those lines which start with its word, while leaving the other lines intact. Otherwise the concurrent executions would overwrite each other's changes.

I am open to other concurrent methods as well.


3 Answers


Why is this problem not a good fit for parallel process-scheduling?

Just imagine the internal dependency-chain of values that are required to appear in the output in a given, strictly [SEQ]-controlled order at the very end of the process.

Fact No. 1 )
While it is very easy to spin off more than one process using shell syntax, that does not mean at all that every such set of co-existing processes allows for smooth, "just"-by-coincidence-[CONCURRENT], let alone true-[PARALLEL], process-scheduling for free.

Fact No. 2 )
The file phrases must be processed in a pure-[SERIAL] manner, as the natural order ( [SEQ] ) matters and must be preserved, even in the resulting file-based output.

Fact No. 3 )
Every file-based operation is, by design, a pure-[SERIAL] process, neither "just"-[CONCURRENT] nor true-[PARALLEL], until someone invents a way to make hard-disk read heads be in multiple locations at one moment in time ( which is well beyond even quantum-entanglement and superposition tricks and magic on a subatomic scale ).

Fact No. 4 )
Sure, one can imagine some room for concurrent processing once a [SEQ]-read input line from phrases is known, where some speedup might occur if more than one lookup were processed at a time. But that holds only on two conditions: there must be enough resources for the multiple lookups to take place concurrently, without any adverse impact on the process-flow in case not all of the concurrent processes execute seamlessly, and every such process must have "pre-cached" the whole, known-to-be-static content of words ( otherwise it will not help ). Only then can they escape the next pure-[SERIAL], fileIO-[SEQ]-ordered, concurrency-capacity-restricted re-processing of the first-word match-finding, which the chosen syntax otherwise imperatively requires to happen from more than one words-crawling process. A sketch of such a layout appears below.
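A minimal sketch of that layout ( assuming a thread-enabled perl; the worker count and the queue-based design are illustrative choices, not the only possible ones ): the fileIO stays pure-[SERIAL], only the lookups run concurrently, and the mandatory [SEQ]-order gets re-established before the output is written:

#!/usr/bin/env perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Pre-cache the whole, known-to-be-static content of "words" once.
open my $words_fh, '<', 'words' or die $!;
chomp( my @words = <$words_fh> );
my $alt = join '|', map { quotemeta } @words;
my $re  = qr/^(?:$alt)\b/;    # cloned into every worker at spawn time

my $in  = Thread::Queue->new;
my $out = Thread::Queue->new;

# The lookups are the only concurrent part ...
my @workers = map {
    threads->create( sub {
        while ( defined( my $job = $in->dequeue ) ) {
            my ( $idx, $line ) = @$job;
            $line .= ' *' if $line =~ $re;
            $out->enqueue( [ $idx, $line ] );
        }
    } );
} 1 .. 4;

# ... while the fileIO stays pure-[SERIAL]: one reader ...
open my $phrases_fh, '<', 'phrases' or die $!;
my $n = 0;
while ( my $line = <$phrases_fh> ) {
    chomp $line;
    $in->enqueue( [ $n++, $line ] );
}
$in->enqueue(undef) for @workers;    # one stop-signal per worker
$_->join for @workers;

# ... and one writer, which re-establishes the [SEQ]-order.
my @result;
while ( defined( my $done = $out->dequeue_nb ) ) {
    $result[ $done->[0] ] = $done->[1];
}
print "$_\n" for @result;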


One could easily pay way more than one could ever receive:

Improper or even naive process-scheduling may and does introduce add-on costs that were never seen in pure-[SERIAL] code execution. Even the most lightweight concurrency frameworks carry add-on costs ( and these costs scale with N, if many concurrent code-executions have to be imperatively materialised, literally at any cost ).

Kindly read thoroughly the details of Amdahl's Law, best together with its modern criticism, including the re-formulation that adds both the strict add-on overhead costs and the in-divisibility of atomic units of code-execution, independent of the number of processors available. In spite of its initial formulation some 50 years ago, modern massively-parallel code-execution ecosystems still cannot escape this principal law's dependencies.
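A hedged numeric illustration of the overhead-strict re-formulation S( N ) = 1 / ( ( 1 - p ) + p / N + o ), where p is the parallelisable fraction and o the add-on overhead cost ( both figures below are assumed, not measured ):

#!/usr/bin/env perl
use strict;
use warnings;

# Overhead-strict Amdahl's Law: p = parallelisable fraction,
# o = add-on overhead per run ( both figures assumed here ).
my ( $p, $o ) = ( 0.95, 0.10 );

for my $N ( 1, 2, 4, 8, 16 ) {
    my $S = 1 / ( ( 1 - $p ) + $p / $N + $o );
    printf "N = %2d  =>  S = %.2f x\n", $N, $S;
}

With these assumed figures the speedup can never exceed 1 / ( ( 1 - p ) + o ) ≈ 6.7 x, no matter how many processors get thrown at the problem ( against ≈ 20 x promised by the overhead-free, classical formulation ).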

So, always check for all [SEQ]-dependencies in a problem's dependency-chains.
So, always check all [PAR] add-on overheads before even dreaming about performance.

user3666197
  • Did you want to make sure they won't miss that link? :D – simbabque Nov 27 '17 at 15:18
  • Absolutely :o) You confirm that you did not, did you? – user3666197 Nov 27 '17 at 15:19
  • So you are saying this can't be done in parallel? Because in both my files every line is unique, I thought that if every parallel process only modifies one (unique) line, they will not get in the way of each other. Maybe I got this wrong? – Ifa Nov 27 '17 at 16:06
  • My man, there is no file representation which would have one line independent of the following one -- just take a HEX-editor, open the disk file and see what happens to the following line if you start to modify the "current" line's tail by "adding" "white_space" + "*" characters at the physical-level representation of the file. It has nothing to do with any (non)-uniqueness ( which could be, but need not be, a file-content property; that was not the core of the objection -- the [SEQ]-nature of the rows-to-process, which was requested to be preserved, plus the fileIO-[SEQ]-process, was – user3666197 Nov 27 '17 at 16:22
  • If the physical representation of a file were a linked-list-of-chars, which is not the case, there might be some chance to "modify" the graph-of-chars in some other manner; but files are roughly a linked-list of { 4KB | other-sized } atomic storage containers, which take some BIOS-gymnastics to get put-to / read-from the HDD:CYL:HEAD:SECTOR 4D rotating **storage-device physical address space on the real spinning drive** ( not taking into account the { FileAllocationTable | other FS inventory-table } maintenance reads/writes during the flow of [SEQ]-time ). – user3666197 Nov 27 '17 at 16:28
  • Real device operations may remain hidden from the user perspective: HDD controllers may and do perform their lowest-level caching, read-aheads + deferred writes and latency-masking tricks; the O/S HDD device-drivers introduce many smarter tricks for additional system-level pre-fetches, caching and fileIO re-ordering; and last but not least, the O/S itself presents the highest level of abstraction ( way different from the hard life of the file representation on the actual physical device(s) ). So even though we have departed from magnetic tapes, files are still a [SEQ]-of-atomic-records. – user3666197 Nov 27 '17 at 16:36
perl -i -pe'
    BEGIN {
       # Runs once, before any line of "phrases" is read:
       # take the word-list filename off @ARGV and slurp it.
       my $words_qfn = shift(@ARGV);
       open(my $words_fh, "<", $words_qfn) or die $!;
       chomp( my @words = <$words_fh> );
       my $alt = join "|", map quotemeta, @words;
       # \K discards everything matched so far, so the s/// below
       # inserts " *" at the very end of a matching line.
       $re = qr/^(?:$alt)\b.*\K/;
    }
    s/$re/ */;
' words phrases
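For the sample files from the question, this should leave phrases as:

gone are the days *
hello kitty
five and a half *
these apples are green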
ikegami
  • Not sure the replace quite works - that'll replace the target word with a `*`, and the OP just seems to want the line to be suffixed by a `*` when the word is present. – Sobrique Nov 27 '17 at 17:39
  • @Sobrique Did you notice `\K` in regex? It drops all matches so `s///` just adds `*`. (It could only make use of an extra space as well?) – zdim Nov 27 '17 at 17:40
  • Ah, sorry yes. Was looking at the substitution regex. Never mind then :). – Sobrique Nov 27 '17 at 17:44

This is not a good fit for parallel processing, because by far the most expensive operation you can do - usually - is reading from disk. The CPU is much, much faster.

Your problem is not CPU-intensive, so you won't gain much advantage from running in parallel. And worse - as you've found - you introduce a race condition that can lead to file clobbering.

Practically speaking, disk IO is done in chunks - multiple kilobytes at a time - which the OS fetches into cache and then feeds to your process in such a way that you can pretend reads happen byte-by-byte.

If you read a file sequentially, predictive fetch allows the OS to be even more efficient about it, pulling the whole file into cache as fast as possible and massively speeding up the processing.

Trying to parallelise and interleave this process at best has no effect, and can make things worse.

So with that in mind, you'd be better off not trying to parallel, and instead:

#!/usr/bin/env perl

use strict;
use warnings;

open ( my $words_fh, '<', 'words' ) or die $!;
my $words = join '|', map { s/\n//r } <$words_fh>;
   $words = qr/^(?:$words)\b/;
close ( $words_fh );

print STDERR "Using match regex of: ", $words, "\n";

open ( my $phrases_fh, '<', 'phrases' ) or die $!;
while ( <$phrases_fh> ) { 
  if (m/$words/) {
      s/$/ */;
  }
  print;
} 

Redirect output to the desired location.
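For example, assuming the script above is saved as append_star.pl ( the filename is illustrative ):

perl append_star.pl > phrases.new && mv phrases.new phrases

The match regex is reported on STDERR, so it won't end up in the result.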

The most expensive bit is the reading of files - this does it once and once only. Invoking the regex engine repeatedly on the same line for each search term would also be expensive, because you'd be doing it N * M times, where N is the number of words and M is the number of lines.

So instead we compile a single regex, anchored to the start of the line, and match that once per line, using the zero-width \b word boundary marker (so it won't substring-match, e.g. "five" inside "fiver").

Note - we don't quote the contents of words - that may be a bug or a feature, because it means you can add regexes into the mix. (And that might break things when we compile our regex.)

If you want to ensure it's 'literal', then:

my $words = join '"', map { quotemeta } map { s/\n//r } <$words_fh>; 
Sobrique