
I am using a shell script to extract data from the 'extr' table. The extr table is very big: it has 410 columns and 61047 rows of data, and one record is around 5 KB.

The script is as follows:

#!/usr/bin/ksh

sqlplus -s \/ << rbb
set pages 0
set head on
set feed  off
set num 20
set linesize 32767
set colsep |
set trimspool on
spool extr.csv
select * from extr;
/
spool off
rbb
#-------- END ---------

One fine day the extr.csv file had 2 records with an incorrect number of columns (one record with too many columns and the other with too few). On investigation I found that two duplicate records were repeated in the file. The primary key should be unique in the file, but these 2 records appeared twice. The shift in the columns was also abrupt. A small example of the output file:

5001|A1A|AAB|190.00|105|A
5002|A2A|ABB|180.00|200|F
5003|A3A|AAB|153.33|205|R
5004|A4A|ABB|261.50|269|F
5005|A5A|AAB|243.00|258|G
5006|A6A|ABB|147.89|154|H
5003|A7A|AAB|249.67|AAB|153.33|205|R
5004|A8A|269|F
5009|A9A|AAB|368.00|358|S
5010|AAA|ABB|245.71|215|F

Here the records with primary keys 5003 and 5004 have reappeared in place of 5007 and 5008. The duplicate records have also shifted the records of 5007 and 5008 by appending to or cutting off their columns.

I need your help in analysing why this happened. Why were the 2 rows extracted multiple times? Why were the other 2 rows missing from the file? And why were the records shifted? Note: this script has been working fine for the last two years and has never failed except for this one time (mentioned above). It ran successfully during the next run. Recently we added one more program which accesses the extr table with a cursor (select only).

Sandy
  • Did somebody start the script while the first was running, using the same output file? Was the partition with extr.csv full? Was something restarted? Did anything strange happen in the environment? – Walter A Feb 27 '15 at 21:22
  • In this case, 5 scripts are run in parallel by a wrapper script. All the scripts are triggered in the background and the wrapper waits until all are complete. 3 out of 5 are shell scripts which extract data using a 'select' query and the remaining 2 are COBOL programs. One COBOL program reads the data from 'extr' by declaring a cursor on it. The wrapper script is scheduled using the TWS scheduler. No other script was run in parallel to the wrapper. There was enough space available in the file system. Nothing was restarted. The environment was normal and nothing strange was reported on the day of the incident. – Sandy Mar 02 '15 at 07:34
  • Hi @WalterA Could you please provide some inputs based on my response? – Sandy Mar 14 '15 at 07:50

1 Answer


I reproduced a similar behaviour.

;-> cat input
5001|A1A|AAB|190.00|105|A                                    
5002|A2A|ABB|180.00|200|F                                    
5003|A3A|AAB|153.33|205|R                                    
5004|A4A|ABB|261.50|269|F                                    
5005|A5A|AAB|243.00|258|G                                    
5006|A6A|ABB|147.89|154|H                                    
5009|A9A|AAB|368.00|358|S                                    
5010|AAA|ABB|245.71|215|F                                    

Think of the input file as your database.
Now I write a script that accesses "the database" and simulates some random freezes.

;-> cat writeout.sh
# Start this script twice                                    
while IFS=\| read a b c d e f; do
        # Write the first four columns without a newline
        # (echo "...\c" could also suppress the newline, but tr -d is used here)
        echo "$a|$b|$c|$d|" | tr -d "\n"
        # Random freeze of 0-4 seconds before the line is completed
        (( sleeptime = RANDOM % 5 ))
        sleep ${sleeptime}
        # Finish the line with the remaining two columns
        echo "$e|$f"
done < input >> output

EDIT: Removed 'cat input |' in the script above and replaced it with '< input'.

Start this script twice in the background

;-> ./writeout.sh &
;-> ./writeout.sh &

Wait until both jobs are finished and see the result

;-> cat output 
5001|A1A|AAB|190.00|105|A
5002|A2A|ABB|180.00|200|F
5003|A3A|AAB|153.33|5001|A1A|AAB|190.00|105|A
5002|A2A|ABB|180.00|205|R
5004|A4A|ABB|261.50|269|F
5005|A5A|AAB|243.00|200|F
5003|A3A|AAB|153.33|258|G
5006|A6A|ABB|147.89|154|H
5009|A9A|AAB|368.00|358|S
5010|AAA|ABB|245.71|205|R
5004|A4A|ABB|261.50|269|F
5005|A5A|AAB|243.00|258|G
5006|A6A|ABB|147.89|215|F
154|H
5009|A9A|AAB|368.00|358|S
5010|AAA|ABB|245.71|215|F

When I change the last line of writeout.sh into done > output I do not see the problem, but that might be due to buffering and the small amount of data. With >> both processes append to the current end of the file, so their partial lines interleave; with > each process starts writing at offset 0 and simply overwrites the other.

I still don't know exactly what happened in your case, but it really looks like 2 programs writing simultaneously to the same file.
A job in TWS could have been restarted manually, 2 scripts in your master script might write to the same file, or something else went wrong.
Preventing this in the future can be done with some locking / checks: when the output file already exists, quit and return an error code to TWS, as sketched below.
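
A minimal ksh sketch of that idea. The lock directory /tmp/extr_extract.lock and the exit code 12 are only examples, not part of your original setup:

#!/usr/bin/ksh
LOCKDIR=/tmp/extr_extract.lock     # example lock location, adjust as needed

# Refuse to run if a previous spool file is still lying around
if [ -f extr.csv ]; then
    echo "extr.csv already exists - previous run still active or not cleaned up" >&2
    exit 12                        # non-zero code so TWS marks the job as failed
fi

# mkdir is atomic, so it works as a simple lock against a second instance
if ! mkdir "${LOCKDIR}" 2>/dev/null; then
    echo "another extraction is already running (${LOCKDIR} present)" >&2
    exit 12
fi
trap 'rmdir "${LOCKDIR}"' 0        # remove the lock when the script exits

# ... the original sqlplus spool block goes here ...

With a guard like this, a second overlapping start fails fast with an error code instead of interleaving its output into extr.csv.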

Walter A
  • Thanks @Walter. There are no processes that write into the output file at the same time. What I feel is that sqlplus might have faced inconsistent buffering/locking/pipe-breaking issues while writing to the AIX file system. – Sandy Mar 16 '15 at 14:40