
Context: I am a long-time MSSQL developer... What I would like to know is how to implement a read-only-once select from SAP HANA.

High-level pseudo-code:

  1. Collect request via db proc (query)
  2. Call API with request
  3. Store results of the request (response)

I have a table (A) that is the source of inputs to a process. Once a process has completed it will write results to another table (B).

Perhaps this is all solved if I just add a column to table A to avoid concurrent processors from selecting the same records from A?

I am wondering how to do this without adding the column to source table A.

What I have tried is a left outer join between tables A and B to get rows from A that have no corresponding rows (yet) in B. This doesn't work - or at least I haven't implemented it in a way that ensures rows are processed only once across the concurrent processors.

I have a stored proc to handle batch selection:

/*
 *      getBatch.sql
 *
 *      SYNOPSIS:  Retrieve the next set of criteria to be used in a search
 *                 request.  Use left outer join between input source table
 *                 and results table to determine the next set of inputs, and
 *                 provide support so that concurrent processes may call this
 *                 proc and get their inputs exclusively.
 */
alter procedure "ACOX"."getBatch" (
     in in_limit int
    ,in in_run_group_id varchar(36)
    ,out ot_result table (
         id bigint
        ,runGroupId varchar(36)
        ,sourceTableRefId integer
        ,name nvarchar(22)
        ,location nvarchar(13)
        ,regionCode nvarchar(3)
        ,countryCode nvarchar(3)
    )
) language sqlscript sql security definer as
begin       

    -- insert new records:
    insert into "ACOX"."search_result_v4" (
         "RUN_GROUP_ID"
        ,"BEGIN_DATE_TS"
        ,"SOURCE_TABLE"
        ,"SOURCE_TABLE_REFID"   
    )
    select
         in_run_group_id as "RUN_GROUP_ID"
        ,CURRENT_TIMESTAMP as "BEGIN_DATE_TS"
        ,'acox.searchCriteria' as "SOURCE_TABLE"
        ,fp.descriptor_id as "SOURCE_TABLE_REFID"
    from 
        acox.searchCriteria fp
    left join "ACOX"."us_state_codes" st
        on trim(fp.region) = trim(st.usps)
    left outer join "ACOX"."search_result_v4" r
        on fp.descriptor_id = r.source_table_refid
    where
        st.usps is not null
        and r.BEGIN_DATE_TS is null
    limit :in_limit;
    
    -- select records inserted for return:
    ot_result =
    select
         r.ID id
        ,r.RUN_GROUP_ID runGroupId
        ,fp.descriptor_id sourceTableRefId
        ,fp.merch_name name
        ,fp.Location location
        ,st.usps regionCode
        ,'USA' countryCode
    from 
        acox.searchCriteria fp
    left join "ACOX"."us_state_codes" st
        on trim(fp.region) = trim(st.usps)
    inner join "ACOX"."search_result_v4" r
        on fp.descriptor_id = r.source_table_refid
        and r.COMPLETE_DATE_TS is null
        and r.RUN_GROUP_ID = in_run_group_id
    where
        st.usps is not null
    limit :in_limit;

end;

When running 7 concurrent processors, I get a 35% overlap. That is to say that out of 5,000 input rows, the resulting row count is 6,755. Running time is about 7 mins.

Currently my solution includes adding a column to the source table. I wanted to avoid that, but it makes for a simpler implementation. I will update the code shortly; the change adds an update statement prior to the insert, roughly as sketched below.
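
For illustration, this is the shape of that update-first approach - a sketch only, assuming a new RUN_GROUP_ID claim column on acox.searchCriteria (names here may not match the final code):

-- Claim up to :in_limit unclaimed rows for this run group. Concurrent
-- claimers serialize on the row locks taken by the update, so a row is
-- only ever stamped with one run group id.
update acox.searchCriteria
set "RUN_GROUP_ID" = :in_run_group_id
where "RUN_GROUP_ID" is null
and descriptor_id in (
    select descriptor_id
    from acox.searchCriteria
    where "RUN_GROUP_ID" is null
    limit :in_limit
);

-- Then select only the rows this processor just claimed.
select descriptor_id, merch_name, Location
from acox.searchCriteria
where "RUN_GROUP_ID" = :in_run_group_id;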

  • 7 mins. to process 5000 records is way too slow. Not sure why there would be the need to actually store data in the "search_results_v4" table when this appears to be a transient list of records. – Lars Br. Nov 01 '20 at 23:40
  • @LarsBr. I updated my question to give context for this timing. The DB holds the source of requests for the API call; the response is stored for later analysis. The API call time is the greatest cost. – Adam Cox Nov 02 '20 at 03:37

1 Answer


First off: there is no "read-only-once" in any RDBMS, including MS SQL. Literally, this would mean that a given record can only be read once and would then "disappear" for all subsequent reads. That is effectively what a queue does, or the well-known special case of a queue: the pipe.

I assume that that is not what you are looking for.

Instead, I believe you want to implement a processing semantic analogous to "once-and-only-once" (aka "exactly-once") message delivery. While this is impossible to achieve across potentially partitioned networks, it is possible within the transaction context of a database.

This is a common requirement, e.g. with batch data loading jobs that should only load data that has not been loaded so far (i.e. the new data that was created after the last batch load job began).

Sorry for the long preamble, but any solution for this will depend on being clear about what we actually want to achieve. I will get to an approach for that now.

The major RDBMS have long figured out that blocking readers is generally a terrible idea if the goal is to enable high transaction throughput. Consequently, HANA does not block readers - ever (ok, not ever-ever, but in the normal operation setup). The main issue with the "exactly-once" processing requirement really is not the reading of the records, but the possibility of processing more than once or not at all.

Both of these potential issues can be addressed with the following approach:

  1. SELECT ... FOR UPDATE ... the records that should be processed (based on e.g. unprocessed records, up to N records, even-odd-IDs, zip-code, ...). With this, the current session has an UPDATE TRANSACTION context and exclusive locks on the selected records. Other transactions can still read those records, but no other transaction can lock those records - neither for UPDATE, DELETE, nor for SELECT ... FOR UPDATE ... .

  2. Now you do your processing - whatever this involves: merging, inserting, updating other tables, writing log-entries...

  3. As the final step of the processing, you want to "mark" the records as processed. How exactly this is implemented does not really matter. One could create a processed-column in the table and set it to TRUE when records have been processed. Or one could have a separate table that contains the primary keys of the processed records (and maybe a load-job-id to keep track of multiple load jobs). In whatever way this is implemented, this is the point in time where the processed status needs to be captured.

  4. COMMIT or ROLLBACK (in case something went wrong). This will COMMIT the records written to the target table, the processed-status information, and it will release the exclusive locks from the source table.

As you see, Step 1 takes care of the issue that records may be missed, by selecting all wanted records that can be processed (i.e. they are not exclusively locked by any other process). Step 3 takes care of the issue of records potentially being processed more than once, by keeping track of the processed records. Obviously, this tracking has to be checked in Step 1 - both steps are interconnected, which is why I point them out explicitly. Finally, all the processing occurs within the same DB transaction context, allowing for a guaranteed COMMIT or ROLLBACK across the whole transaction. That means that no "record marker" will ever be lost when the processing of the records was committed.
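
To make that flow concrete, here is a minimal sketch of the four steps in plain SQL. The SOURCE_TABLE name and the PROCESSED flag column are placeholders for whatever tracking mechanism you choose, not your actual schema, and autocommit is assumed to be off so that all statements share one transaction:

-- Step 1: select and exclusively lock up to 100 unprocessed rows.
-- Other sessions can still read them, but cannot lock or update them.
select "ID"
from "SOURCE_TABLE"
where "PROCESSED" = false
order by "ID"
limit 100
for update;

-- Step 2: do the processing here
-- (merge, insert into the target table, write log entries, ...).

-- Step 3: mark the locked rows as processed. In application code you would
-- reuse the IDs fetched in step 1; the ORDER BY keeps this predicate on the
-- same rows, for which this session still holds the locks.
update "SOURCE_TABLE"
set "PROCESSED" = true
where "ID" in (
    select "ID"
    from "SOURCE_TABLE"
    where "PROCESSED" = false
    order by "ID"
    limit 100
);

-- Step 4: COMMIT releases the locks and publishes the target-table rows
-- and the processed flags atomically; ROLLBACK undoes all of it.
commit;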

Now, why is this approach preferable to making records "un-readable"? Because of the other processes in the system.

Maybe the source records are still read by the transaction system but never updated. This transaction system should not have to wait for the data load to finish.

Or maybe, somebody wants to do some analytics on the source data and also needs to read those records.

Or maybe you want to parallelise the data loading: it's easily possible to skip locked records and only work on the ones that are "available for update" right now. See e.g. Load balancing SQL reads while batch-processing? for that.
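
For the parallel case, the sketch above only needs a different lock behaviour in Step 1. Note that this is an assumption to verify against your release: newer HANA versions document an IGNORE LOCKED option for FOR UPDATE, which returns only rows that are not currently locked:

-- Each worker locks a disjoint batch: rows already locked by another
-- worker are skipped instead of waited for. Check the SELECT statement
-- reference of your HANA version for IGNORE LOCKED availability.
select "ID"
from "SOURCE_TABLE"
where "PROCESSED" = false
limit 100
for update ignore locked;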

Ok, I guess you were hoping for something easier to consume; alas, that's my approach to this sort of requirement as I understood it.

  • Thanks Lars. This is a good read, and pretty close to my approach. I will come back and update the question with some code. One of the things I am struggling with is the syntax. – Adam Cox Oct 30 '20 at 03:03
  • `First off: there is no "read-only-once" in any RDBMS, including MS SQL` also curious how OP used/implemented read-only-once on MSSQL :) – Suncatcher Oct 30 '20 at 12:04
  • @Suncatcher I have not done this. However, I believe it is possible with the right tech. For example, using the kafka backed Confluent platform. https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/ – Adam Cox Oct 30 '20 at 13:49
  • @Lars I tried to find documentation on how exactly to use the `FOR UPDATE` clause, and found this SO post with a broken link. Can you point me to the syntax? I find the SAP HANA reference documentation sparse with regards to examples. Of course, I will go away and experiment, but I think the documentation needs to improve in order to short-cut my experimentation efforts, yes? Thank you. https://stackoverflow.com/questions/26525489/is-a-lock-obtained-with-select-for-update-released-when-the-stored-procedure-e – Adam Cox Oct 30 '20 at 14:33
  • Another approach that I've seen in ETL tools (IBM DataStage, for example, or SAP BW transformations) is to implement read partitioning: on a table partition, on an integer column by taking its value modulo num_readers, on rowid ranges, etc., depending on what the table has. This is another option to prevent reading the same record twice. You can then mark it as processed or not and consider it in the next batch session depending on status. – astentx Oct 30 '20 at 14:37
  • Hey @AdamCox: no worries - the `FOR UPDATE` clause is part of the `SELECT` documentation: https://help.sap.com/viewer/4fe29514fd584807ac9f2a04f6754767/2.0.04/en-US/20fcf24075191014a89e9dc7b8408b26.html And you're right: pub/sub systems like Kafka can support different message delivery semantics and, in fact, are used for data loading with HANA, see e.g. https://blogs.sap.com/2020/10/30/kafka-to-hana-cloud/ – Lars Br. Oct 31 '20 at 03:29