
I want to use an external table to load a CSV file, as it's very convenient, but the problem is: how do I make sure I don't load the same file twice in a row? I can't validate the loaded data, because it can legitimately be the same information as before; I need a way to make sure the user doesn't load the same file as, say, 2h ago. I thought about uploading the file with a different name each time and issuing an ALTER TABLE command to change the file name in the external table's definition, but that sounds kinda risky. I also thought about marking each row in the file with a sequence to help differentiate files, but I doubt the client would accept it, as they would need to do this manually (the file is exported from somewhere else).

Is there any better way to make sure I don't load the same file into the external table, other than changing the file's name and executing an ALTER on the table?

Thank you

mardeea
    How do you identify 'same file' based just on file name - not same content? Is [capturing the filename](http://stackoverflow.com/a/29106343/266304) useful? (You could do the same thing to add your sequence). Alternatively you could maybe use a preprocessor directive to find an unprocessed file and rename it to a standard name your external table expects. Then a subsequent query will look for a different file. Would need some work but might be feasible. – Alex Poole Nov 20 '15 at 10:26
  • The client (a bank) says they can load the same data in the morning and in the afternoon and that would be correct, but they need to make sure they aren't loading the same file by mistake, so they'd like to have different names for the files – mardeea Nov 20 '15 at 16:11
    If they are supplying files with different names, and you only want to look at each file once, then a preprocessor that renames a waiting file (maybe oldest/newest?) to what your external table expects might work. The second time you looked, it would find a different file, so you wouldn't see the same one again. The earlier answer might give a starting point. Unfortunately I don't have time to put something together at the moment... – Alex Poole Nov 20 '15 at 16:21
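
A minimal sketch of the preprocessor idea from these comments, assuming a directory object exec_dir, a placeholder file dummy.dat that exists in data_dir, and a hypothetical script next_file.sh that prints the oldest waiting CSV to stdout and then moves it to a processed folder; all names here are illustrative, not from the thread:

    create table contracts_ext (
        contract_id number,
        descr       varchar2(100)
    )
    organization external (
        type oracle_loader
        default directory data_dir
        access parameters (
            records delimited by newline
            -- next_file.sh ignores the location file it is handed,
            -- emits the oldest waiting CSV and renames it afterwards,
            -- so each query against the table sees a fresh file
            preprocessor exec_dir:'next_file.sh'
            fields terminated by ','
            missing field values are null
        )
        location ('dummy.dat')
    )
    reject limit unlimited;

The grant of EXECUTE on the exec_dir directory and the hardening of the script are left out for brevity.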

3 Answers


I can only think of a solution somewhat like this:

  1. Have a timestamp encoded in the data file name (like YYYYMMDDHHMISS-file.csv, where YYYYMMDDHHMISS is the timestamp).
  2. Create a control table with a timestamp column (same format as above).
  3. Create a shell script that:
    • extracts the timestamp from the data file name;
    • calls a SQL script with the timestamp as the parameter, which returns 0 if that timestamp does not exist and a non-zero value if it already exists, in which case the script exits with the error "File: YYYYMMDDHHMISS-file.csv already loaded" (a sketch of the SQL scripts follows this list);
    • copies YYYYMMDDHHMISS-file.csv to input-file.csv;
    • runs the SQL*Loader script that loads the input-file.csv file;
    • on success: runs a second SQL script, with the timestamp as the parameter, that inserts a record in the control table to indicate that the file is loaded, and moves the original file to a backup folder;
    • on failure: reports the failure of the load script.
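
A minimal sketch of the two SQL scripts the shell script would call, assuming a control table loaded_files (file_ts varchar2(14) primary key); the table and script names are illustrative, not prescribed by the answer:

    -- check_file.sql: called as sqlplus -s user/pwd @check_file.sql 20151120093000
    -- exits 0 if the timestamp is new, 1 if the file was already loaded
    whenever sqlerror exit 1
    declare
        l_count number;
    begin
        select count(*)
        into   l_count
        from   loaded_files
        where  file_ts = '&1';
        if l_count > 0 then
            raise_application_error(-20001,
                'File: ' || '&1' || '-file.csv already loaded');
        end if;
    end;
    /
    exit 0

    -- mark_loaded.sql: run after a successful load to record the timestamp
    insert into loaded_files (file_ts) values ('&1');
    commit;
    exit 0

The shell script then branches on the exit code ($?) of the first call before copying and loading the file.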
Rob Heusdens

When you bring the data from the external table into your database, you can use the MERGE command instead of INSERT. It means you don't have to worry about duplicate data.

See the blog post "The Oracle Merge Command":

What's more, we can wrap up the whole transformation process into this one Oracle MERGE command, referencing the external table and the table function in the one command as the source for the MERGED Oracle data.

    alter session enable parallel dml;

    merge /*+ parallel(contract_dim,10) append */
    into contract_dim d
    using table(trx.go(
        cursor(select /*+ parallel(contracts_file,10) full(contracts_file) */ *
               from contracts_file))) f
    on (d.contract_id = f.contract_id)
    when matched then
        -- descr rather than desc: desc is a reserved word in Oracle
        update set d.descr             = f.descr,
                   d.init_val_loc_curr = f.init_val_loc_curr,
                   d.init_val_adj_amt  = f.init_val_adj_amt
    when not matched then
        insert values (f.contract_id,
                       f.descr,
                       f.init_val_loc_curr,
                       f.init_val_adj_amt);

So there we have it - our complex ETL function all contained within a single Oracle MERGE statement. No separate SQL*Loader phase, no staging tables, and all piped through and loaded in parallel.

are
    Although you try to address an [xy problem](http://meta.stackexchange.com/q/66377/168509), I think this is not the OP's problem. It seems he is OK with duplicate records; only duplicate files are a problem. – Florin Ghita Nov 20 '15 at 11:20
  • Yes, Florin is right: duplicate data is not a problem, it might happen. But I need to make sure the user doesn't load the same file twice in a row by mistake – mardeea Nov 20 '15 at 16:07

Since it's not optimal to use the CSV-backed table (test.tbl, built over test.csv) directly, I made a process table (test_process.tbl) from it. A daily job checks whether the data is already in the process table (using MINUS syntax), and after the check I move the file to a /processed folder stamped with the current date (/processed/test_2023_03_27.csv). A sketch of the check is below.

I use a procedure, but note that for creating the table from the CSV inside PL/SQL one must use EXECUTE IMMEDIATE (DDL is not allowed as a static statement in PL/SQL), or the procedure is going to be invalid.
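
A minimal sketch of the daily MINUS check, assuming hypothetical tables test_ext (the external table over the CSV) and test_process with identical columns id and val; the names are mine, not the poster's:

    -- copy only the CSV rows that are not already in the process table
    insert into test_process (id, val)
    select id, val
    from   test_ext
    minus
    select id, val
    from   test_process;

    commit;

Note that MINUS compares whole rows, so a repeated row inside a genuinely new file would also be filtered out; whether that matters depends on the data.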

babicsp