0

I have data files in text format which have several rows. Now there are certain rows that have wrong data which I need to update with those that have the correct data. For example,

Col1  Col2  Col3  Col4 .......
A1?%     A     foo  fooo .......
B€(2     B     .................  
C&6     Z     .................
A?04     Y     .................
B++3     Q     .................
C!5     C     .................
D*9     D     .................

The actual data is different but this is a simplified version of it. As you can see there are certain Col1 where A1 is A but A4 is Y and so on. The rest of the columns Col3, Col4 ... depend on Col2. So, I need to check if Col2 is A when there is an A in Col1 (A1, A2, A3 etc). If not I have to update Col2, Col3 .... based on the row where it is A.

How may this be accomplished in Perl. I know this kind of operations can be done in an database with an update statement but I don't have that luxury here and have to do it programatically.

Edit: The files are tab delimited and the data are strings that can contain any alphanumeric or ascii character.

sfactor
  • 12,592
  • 32
  • 102
  • 152
  • Please clarify: Is the data stored in a space-separated file? Can the cells only contain alphanumeric characters? Does A1 have a higher precedence than A4 because its number is lower, or because its row is higher up? – Tim Feb 07 '11 at 16:39
  • I think you think you're being somewhat clear, but you say you could to this in a database, perhaps if you posted the analogous SQL we could get a better idea of 1) what you were trying to update, 2) with what, 3) in what conditions. – Axeman Feb 07 '11 at 17:20
  • @Tim I have added the info that I seem to have forgotten to add. The cells may contain any character and A1 has higher precendence because I know Col2 is A in that case, so its bit of a manual lookup. – sfactor Feb 08 '11 at 07:27
  • What happens if the data contain a TAB character? – reinierpost Feb 08 '11 at 08:49
  • So is B supposed to be B or Q? Is C supposed to be Z or C. You say "manual lookup" - how do you propose to do a manual lookup programatically? – Dennis Williamson Feb 08 '11 at 15:04

4 Answers4

2

The way I would do this is to open an input file handle and an output file handle, and go line by line through the file checking column one and, if its fine, just plop it into my output just as it is.

If it does need to change, I would make a new line with the necessary changes and put it into my output file as well.

This is a simple approach, that while not the greatest/elegant/whatever, would give you what you need quickly.

Weegee
  • 2,225
  • 1
  • 17
  • 16
  • You would need some way to store the correct values you want to insert as you go through the file and you may have not seen all the correct values. – Girish Rao Feb 07 '11 at 16:47
  • @Girish Sorta. It sounds like given col2 is X col3 and col4 are the same. So X1, X2, and X3 would all have the same col3 and col4. All he would need is to look at col1, then he could select the correct info from a hash data structure for col2, col3, and col4. – Weegee Feb 07 '11 at 17:03
1

Populate a hashmap where the key is Col2 (A,B,C, etc) and the value is the rest of the columns (Col3, Col4, etc). Only make Col2 the key if Col1 and Col2 match as you want.

Then when writing out the file if Col1 and Col2 do not match, do a lookup in the hash on the first character of Col1. This will get you the Col3, Col4... values to insert.

Girish Rao
  • 2,609
  • 1
  • 20
  • 24
1

Use a CSV processor!

At least Text::CSV or relatives like Text::CSV_XS (faster) or Text::CSV::Encoded (e.g. for UTF-8).

DBD::CSV provides SQL.

reinierpost
  • 8,425
  • 1
  • 38
  • 70
0

Below is a skeleton of a basic program structure to allow you to do this. If I knew what you wanted to do I could be a lot more helpful.

I had made the easiest guess possible, and I treated your input files as if they were fixed-column with widths=7,6,*. As you have since informed me that they are tab-delimited, I have changed the code that breaks up the data into fields.

use autodie;
use strict;
use warnings;
use English qw<$INPUT_LINE_NUMBER>;

my %data;
my $line_no;
open ( my $h, '<', 'good_file.dat' );

while ( <$h> ) {
    my ( $col1, $col2, $data ) = split( /\t+/, $_, 3 );
    # next unless index( $col1, 'A' ) == 0;
    $line_no = $INPUT_LINE_NUMBER;
    my $rec 
        = { col1 => $col1
          , col2 => $col2
          , data => $data
          , line => $line_no
          };
    push( @{ $data{"$col1-$col2"} }, $rec );
    $data{ $line_no } = $rec;
}
close $h;

open ( $h, '<', 'old_file.dat' );

while ( <$h> ) { 
    my ( $col1, $col2, $data ) = split( /\t+/, $_, 3 );
    ... 
}

The following is just a way you could print your values back into the file.

open ( $h, '>', 'old_file.dat' );
foreach my $rec ( grep {; defined } @data{ 1..$line_no } ) { 
    printf $h "%s\t%s\t%s\n", @$rec{qw<col1 col2 data>};
}

But you really haven't given anyone enough help to help you.

Axeman
  • 29,660
  • 2
  • 47
  • 102
  • I should have been more clear and have added some more info that may make it clearer. The columns can have any characters so for instance when I say looking for A in A1, A2, A3 it is more like searching for that substring in the longer string for a match in order to update. – sfactor Feb 08 '11 at 07:30
  • @sfactor, Okay, I can modify the code to be better able to capture this data, but you still have not given any sort of clear indication of conditions and modifications. – Axeman Feb 08 '11 at 19:05
  • I've actually created another question hoping to be more clear regarding the question. Still it seems it isn't clear. But I guess it'll make the question a bit more clear.http://stackoverflow.com/questions/4931783/updating-a-row-in-a-data-file-with-values-from-another-row – sfactor Feb 08 '11 at 20:59