
Given a file with content:

insert_job: J1
insert_job: J2
box_name: J1
insert_job: J3
box_name: J2
insert_job: J4
box_name: J1
insert_job: J5
box_name: J4
insert_job: J6
box_name: J4

I'd like to display it as follows (using tabs to show the child-parent relationship):

J1
    J2
        J3
    J4
        J5
        J6
test_data2 for Borodin:
------------------------------
insert_job: JS11-LR_BaselIII
insert_job: JS11-Check_Batch_Run_Numbers
box_name: JS11-LR_BaselIII
insert_job: 11000000-start
box_name: JS11-Check_Batch_Run_Numbers
insert_job: 11000000-runbox
box_name: JS11-Check_Batch_Run_Numbers
insert_job: JS11-Load_Session_Date
box_name: JS11-LR_BaselIII
insert_job: JS110000-start
box_name: JS11-Load_Session_Date
insert_job: JS110000-runbox
box_name: JS11-Load_Session_Date
insert_job: JS11-Start_RiskWatch
box_name: JS11-LR_BaselIII
insert_job: JS110004-start
box_name: JS11-Start_RiskWatch
insert_job: JS110004-runbox
box_name: JS11-Start_RiskWatch
insert_job: JS11-Start_UDS
box_name: JS11-LR_BaselIII
insert_job: JS110001-start
box_name: JS11-Start_UDS
insert_job: JS110001-runbox
box_name: JS11-Start_UDS
insert_job: JS11-Pool_Processing
box_name: JS11-LR_BaselIII
insert_job: JS110002-start
box_name: JS11-Pool_Processing

Syntax error when running Ed's solution:

sdpvvrsp810{alelai}: gawk -f tst.awk testjobs3
gawk: tst.awk:2: /^box_name/   { box = $2; jobs[box][job] }
gawk: tst.awk:2:                                    ^ syntax error
gawk: tst.awk:9:         for (job in jobs[box])
gawk: tst.awk:9:                         ^ syntax error
techie11
  • I don't get it. What's the relationship/difference between an "insert_job" and a "box_name"? Why is J4 indented at the same level as J2 instead of J3? This should have a very simple solution of maybe 10 to 12 lines once we get a solid problem statement, e.g. see http://stackoverflow.com/a/23767534/1745001. And no, the solution will NOT involve calling `getline` (see http://awk.info/?tip/getline)! – Ed Morton May 27 '14 at 03:19
  • Hi Ed, in my problem, each job is enclosed by 0 or 1 "box". The box that J4 is in is J1, hence its relative position. I don't see how awk code without getline can solve it... please advise if you come up with a simpler approach. – techie11 May 27 '14 at 17:46
  • I'm JUST not seeing it at all from your posted input. What is it in that input file that tells you J4 is in box J1 rather than box J2, for example? – Ed Morton May 27 '14 at 20:26
  • Ah, I see now - the job is associated with the box id that FOLLOWS it, not one that precedes it. Got it. I just posted a solution. – Ed Morton May 27 '14 at 20:43

3 Answers


Here is a somewhat shorter perl version that works with your sample data.

sub parse {
  local $/ = undef;
  my $text = <>;
  my ($root) = $text =~ /insert_job:\s*(\S+)/;
  my @links = $text =~ /insert_job:\s*(\S+)\s*box_name:\s*(\S+)/g;
  my $children = {}; 
  while (@links) {
    my $child = shift @links;
    my $parent = shift @links;
    push @{$children->{$parent}}, $child;
  }
  my $print = sub {
    my ($print, $parent, $indent) = @_;
    print "\t" x $indent, $parent, "\n";
    $print->($print, $_, $indent + 1) foreach (@{$children->{$parent} || []});
  };
  $print->($print, $root, 0);
}

parse;
Gene
  • This assumes that the root job is the first one in the file. And it may be shorter than the OP's code, but it could really do with several blank lines. The private subroutine serves only to obscure it further. – Borodin May 27 '14 at 03:21
  • @Borodin Thanks. True on the root assumption. Seemed reasonable that's what the OP intended. Blank lines and certainly the private subroutine are religious. To me this is the clearest way to express the algorithm. It's a special purpose routine that binds the `$children` variable. It uses a common Perl idiom for special purpose routines that bind internal variables. Other languages provide nested subroutines and classes to accomplish exactly the same thing. – Gene May 27 '14 at 15:45

This program does what you ask. It expects the path to the input file as a parameter on the command line.

It starts by building a hash relating the name of each job to all the jobs in that box. Jobs that aren't followed by a box name on the next line are pushed onto the list of root jobs. Finally, the recursive subroutine print_tree is called to dump the dependency trees starting at each of the roots.

use strict;
use warnings;

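my ($curr_job, %jobs, @roots);
while (<>) {
  # Match lines like "insert_job: J1" or "box_name: JS11-LR_BaselIII"
  next unless my ($op, $id) = /(\w+): ([\w-]+)/;
  if ($op eq 'insert_job') {
    # The previous job had no box_name line after it, so it is a root
    push @roots, $curr_job if defined $curr_job;
    $curr_job = $id;
    $jobs{$id} = [] unless $jobs{$id};
  }
  elsif ($op eq 'box_name') {
    # The box named here is the parent of the job just read
    push @{ $jobs{$id} }, $curr_job;
    $curr_job = undef;
  }
}
push @roots, $curr_job if defined $curr_job;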

print_tree($_) for @roots;

sub print_tree {
  my ($root, $indent) = (@_, 0);
  printf "%s%s\n", ' ' x 4 x $indent, $root;
  print_tree($_, $indent + 1) for @{ $jobs{$root} };
}

output

J1
    J2
        J3
    J4
        J5
        J6

output 2

JS11-LR_BaselIII
    JS11-Check_Batch_Run_Numbers
        11000000-runbox
        11000000-start
    JS11-Load_Session_Date
        JS110000-runbox
        JS110000-start
    JS11-Pool_Processing
        JS110002-start
    JS11-Start_RiskWatch
        JS110004-runbox
        JS110004-start
    JS11-Start_UDS
        JS110001-runbox
        JS110001-start
Borodin
  • Thanks. but the code has some bugs: Unmatched [ in regex; marked by <-- HERE in m/= [ <-- HERE ]; } elsif (/ at ./parsejobs.pl line 12. – techie11 May 27 '14 at 14:43
  • @user2723438: My solution works fine as it stands. If there are syntax errors then you must have put them in yourself. – Borodin May 27 '14 at 15:05
  • The tree() function is not a Perl built-in function... "Undefined subroutine &main::tree called at ./parsejobs.pl line 24, <> line 97." – techie11 May 27 '14 at 15:20
  • @user2723438: Yes, that should have been `print_tree` and I've fixed it. But there's no `Unmatched [ in regex` – Borodin May 27 '14 at 15:27
  • Ah, unless you have a *very* old version of Perl and it doesn't like `$jobs{$job} //= []`, in which case you can change that line to `$jobs{$job} = [] unless defined $jobs{$job}` – Borodin May 27 '14 at 15:29
  • my version is: v5.8.4. I replaced $jobs{$job} //= [] line with the alternative, but now the printing never stops... – techie11 May 27 '14 at 15:39
  • @user2723438: I've tested it again myself and it's absolutely fine with the data in your question. Copy the question again and try once more. I don't see how it can loop unless you have cyclic data, like `insert_job: J2`, `box_name: J3`, `insert_job: J3`, `box_name: J2`. Exactly what is it printing? – Borodin May 27 '14 at 17:41
  • Hi Borodin, I did that. The problem is that it prints lines non-stop. I redirected the output to a file; within a few seconds the file grew to 3G, and then I saw this error message: "Deep recursion on subroutine "main::print_tree" at ./parsejobs.pl line 24, <> line 97." – techie11 May 27 '14 at 17:55
  • I checked the content of the output: it keeps repeating "JS14 14000000 14000000" with indentation. The file grew to 21G in just a few seconds. I guess there must be some issue in the print_tree() recursive call. – techie11 May 27 '14 at 19:38
  • Please post your input data file so that I can debug it. You can either put it here if it's a reasonable size or upload it to pastebin.com and let me have a link. – Borodin May 27 '14 at 19:53
  • your code worked for the small test data I originally posted. the issue I was referring to occurred when I tested with another set of test data. I posted it above as test_data2. Thanks. – techie11 May 28 '14 at 02:03
  • Okay it's all fixed. Thanks for the data. The problem was that I wasn't expecting hyphens `-` in the job numbers. – Borodin May 28 '14 at 10:04
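
The runaway recursion discussed in these comments is what cyclic parent-child data produces, and a job name cut off at its first hyphen would end up as a child of itself in just that way. As a rough illustration only (it is not part of Borodin's program, and it is written in awk rather than Perl), the sketch below guards the recursion by remembering which boxes are on the current path. The names `kids`, `onPath` and `prtBox` are made up for this example, and the input is the cyclic case Borodin describes.

```shell
# Cyclic sample input: J2 is in box J3, and J3 is in box J2.
guarded=$(awk '
/^insert_job/ { job = $2; if (root == "") root = job }
/^box_name/   { kids[$2] = kids[$2] " " job }   # children as one string
END           { prtBox(root, "") }

function prtBox(box, pad,    n, i, list) {
    if (box in onPath) {            # box is its own ancestor: stop here
        printf "%s%s (cycle, skipped)\n", pad, box
        return
    }
    print pad box
    onPath[box] = 1                 # mark while its children are printed
    n = split(kids[box], list, " ")
    for (i = 1; i <= n; i++)
        prtBox(list[i], pad "    ")
    delete onPath[box]              # unmark on the way back up
}
' <<'EOF'
insert_job: J2
box_name: J3
insert_job: J3
box_name: J2
EOF
)
printf '%s\n' "$guarded"
```

With this input the script prints J2, then J3 indented beneath it, then a "J2 (cycle, skipped)" line, and stops instead of recursing until the disk fills.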

Using GNU awk for true multi-dimensional arrays:

$ cat tst.awk
/^insert_job/ { job = $2; if (root == "") root = job }
/^box_name/   { box = $2; jobs[box][job] }
END           { prtBox(root) }

function prtBox(box,    job) {
    printf "%*s%s\n", indent, "", box
    indent += 2
    if (box in jobs)
        for (job in jobs[box])
            prtBox(job)
    indent -= 2
}


$ awk -f tst.awk file
J1
  J2
    J3
  J4
    J5
    J6
Ed Morton
  • Thanks Ed. I ran the code and got the syntax error shown above. – techie11 May 28 '14 at 02:09
  • You're using an old version of gawk that doesn't support multi-dimensional arrays. I'm using gawk 4.1.0; `gawk --version` will tell you what you're using. – Ed Morton May 28 '14 at 05:23
  • My gawk version is GNU Awk 3.1.7. but nice to know gawk supports multi-dimensional arrays and user-defined functions. thanks. – techie11 May 28 '14 at 14:11
  • You can do the same in older gawks (or non-gawks) by saving the jobs as a string rather than a second array index, and then using split() on that string before the loop in prtBox() but you'd be better off just getting a newer version of gawk as you're missing a lot of great new functionality. – Ed Morton May 28 '14 at 14:31
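
The workaround described in the last comment could look roughly like the sketch below. It is one reading of that suggestion, not Ed's own code: each box's jobs are appended to a space-separated string, and `split()` turns that string back into a list inside `prtBox()`. The array name `kids` and the `pad` parameter are invented for this example, and the input is the sample from the question.

```shell
# Sample input from the question, fed to awk on stdin.
tree=$(awk '
/^insert_job/ { job = $2; if (root == "") root = job }
/^box_name/   { kids[$2] = kids[$2] " " job }   # one string per box
END           { prtBox(root, "") }

function prtBox(box, pad,    n, i, list) {
    print pad box
    n = split(kids[box], list, " ")   # back to a list; n is 0 for a leaf
    for (i = 1; i <= n; i++)
        prtBox(list[i], pad "  ")
}
' <<'EOF'
insert_job: J1
insert_job: J2
box_name: J1
insert_job: J3
box_name: J2
insert_job: J4
box_name: J1
insert_job: J5
box_name: J4
insert_job: J6
box_name: J4
EOF
)
printf '%s\n' "$tree"
```

This uses only one-dimensional arrays, so it should not need gawk 4. A side effect of the string approach is that children print in input order, whereas the order of `for (job in jobs[box])` is unspecified.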