
I have 2 parts of code. Both of them process 1.5 million records, but the first part takes 20 minutes and the 2nd part takes 13.5 hours!
Here is the 1st part:

  loop at it_bkpf.
    select * from bseg into corresponding fields of itab
      where bukrs = it_bkpf-bukrs and
            belnr = it_bkpf-belnr and
            gjahr = it_bkpf-gjahr and
            hkont in s_hkont.

      if sy-subrc = 0 .
        itab-budat = it_bkpf-budat.

        clear *bseg .
        select single * from *bseg
          where bukrs = itab-bukrs and
            belnr = itab-belnr and
            gjahr = itab-gjahr and
            hkont = wtax .
        if sy-subrc <> 0 .
          itab-budat = '99991231'.
        endif.
      endif.
      append itab.
    endselect.
  endloop.

The 2nd part, which takes 13.5 hours, is the following:

sort itab by belnr.

  loop at itab where hkont(2) = '73'.
    move-corresponding itab to itab2.
    collect itab2.
  endloop.

  loop at itab2.
    lv_5per_total = con_5per_tax * itab2-dmbtr.
    lv_5per_upper = lv_5per_total + '0.02'.
    lv_5per_lower = lv_5per_total - '0.02'.

    read table itab with key belnr = itab2-belnr
                             hkont = wtax.

    if sy-subrc = 0.
      if itab-dmbtr between lv_5per_lower and lv_5per_upper.
        itab-budat = '99991231'.
        modify itab transporting budat where belnr = itab2-belnr.
      endif.
    endif.
  endloop.

Does anyone have an idea how to fix the 2nd part?
Some extra details:
it_bkpf has 1.5 million records.
After the 1st part, ITAB has 1.5 million records.
In the 2nd part, the 1st loop sums the amounts per belnr for the accounts that start with 73.
The 2nd loop compares the sum per belnr with the amount of the belnr/account and does what the code says.
Thanks

ADDITIONAL INFORMATION:
First of all, the initial code already existed and I added the new part. ITAB existed and ITAB2 is mine. The declarations of the tables were:

DATA : BEGIN OF itab OCCURS 0,
        bukrs LIKE bseg-bukrs,
        hkont LIKE bseg-hkont,
        belnr LIKE bkpf-belnr,
        gjahr LIKE bkpf-gjahr,
        dmbtr LIKE bseg-dmbtr,
        shkzg LIKE bseg-shkzg ,
        budat LIKE bkpf-budat,
        zzcode LIKE bseg-zzcode.
DATA END OF itab.
DATA : BEGIN OF itab2 OCCURS 0 ,
        belnr LIKE bkpf-belnr,
        dmbtr LIKE bseg-dmbtr,
       END OF itab2.

After your suggestion I made the following changes:

types: begin of ty_belnr_sums,
        belnr like bkpf-belnr,
        dmbtr like bseg-dmbtr,
       end of ty_belnr_sums.
data: git_belnr_sums type sorted table of ty_belnr_sums
                                    with unique key belnr.
data: gwa_belnr_sums type ty_belnr_sums.

  data: lv_5per_upper type p decimals 2,
        lv_5per_lower type p decimals 2,
        lv_5per_total type p decimals 2.

  sort itab by belnr hkont.

  loop at itab where hkont(2) = '73'.
    move-corresponding itab to gwa_belnr_sums.
    collect gwa_belnr_sums into git_belnr_sums .
  endloop.

  loop at git_belnr_sums into gwa_belnr_sums.
    lv_5per_total = con_5per_tax * gwa_belnr_sums-dmbtr.
    lv_5per_upper = lv_5per_total + '0.02'.
    lv_5per_lower = lv_5per_total - '0.02'.

    read table itab with key belnr = gwa_belnr_sums-belnr
                             hkont = wtax
                    binary search.

    if sy-subrc = 0.
      if itab-dmbtr between lv_5per_lower and lv_5per_upper.
        itab-budat = '99991231'.
        modify itab transporting budat
                        where belnr = gwa_belnr_sums-belnr.
      endif.
    endif.
  endloop.

I am now running it in the background for 1.5 million records, and it is still running after 1 hour.

Sandra Rossi
ekekakos
  • difficult to say why the second part takes longer; there are several possibilities depending on the data in itab. First you sort itab by belnr, then you filter itab by a substring. If the filter is effective and removes most of the records, I'd sort after filtering, not before. And if you can filter by field hkont, why don't you filter while building itab in the first place (1st part of your code), instead of first building itab with all records and then transferring only those you need into itab2? – Dirk Trilsbeek Mar 02 '18 at 08:00
  • and you should be able to reduce the running time of your first code snippet by avoiding select...endselect. That really kills performance when you process 1.5 million rows. – Dirk Trilsbeek Mar 02 '18 at 08:02
  • Curiously, the performance trouble is in the second half, where he's got no SELECT...ENDSELECT (although you're right about it) – VXLozano Mar 02 '18 at 10:53
  • Please, you should have clearly stated from the beginning that your question was only about part 2, and that part 1 is only here for our information. I see people answering about part 1 and spending time on it. – Sandra Rossi Mar 02 '18 at 20:09

5 Answers


There are several performance issues here:

1st part:

Never do a SELECT in a LOOP. I said never. Instead, do a SELECT ... FOR ALL ENTRIES or a SELECT ... JOIN (BSEG is a cluster table in ECC, so no JOIN is possible there, but it is a transparent table in S/4HANA, so there you can JOIN it with BKPF). As far as I can see from the code, you select G/L items, so check whether using BSIS/BSAS would be better (instead of BSEG). You can also check table BSET (it looks like you are interested in "tax" lines).
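As a rough sketch of the FOR ALL ENTRIES idea applied to part 1 (field names taken from the question; the exact target structure is an assumption, and FOR ALL ENTRIES requires a non-empty driver table):

  " Sketch: one array fetch replaces the SELECT inside the LOOP.
  IF it_bkpf[] IS NOT INITIAL.
    SELECT bukrs belnr gjahr hkont dmbtr shkzg
      FROM bseg
      INTO CORRESPONDING FIELDS OF TABLE itab
      FOR ALL ENTRIES IN it_bkpf
      WHERE bukrs = it_bkpf-bukrs
        AND belnr = it_bkpf-belnr
        AND gjahr = it_bkpf-gjahr
        AND hkont IN s_hkont.
  ENDIF.

Note that FOR ALL ENTRIES silently removes duplicate result rows, so make sure the field list contains enough key fields to keep all the lines you need.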

2nd part:

If you do a LOOP with a WHERE condition, you get the best performance if the internal table is TYPE SORTED (with a proper key). However, your WHERE condition uses an offset, so I am not sure the SORTED table will help, but it is worth a try.

However, the real issue is here: COLLECT itab2. COLLECT is not supported for TYPE STANDARD tables (this is clearly stated in the SAP Help); the internal table has to be TYPE HASHED (there are technical reasons behind this, which I won't go into).

After that, you LOOP over one internal table (itab2) and READ TABLE the other (itab). To get the best performance for the single-line read, itab has to be TYPE HASHED (or at least TYPE SORTED with the proper key).

Instead of using internal tables with HEADER LINES, use LOOP ... ASSIGNING FIELD-SYMBOL(<...>), so you don't need the MODIFY anymore; this will also slightly improve performance.
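For illustration, a minimal sketch of the field-symbol pattern (the variable names reuse those from the question; this is not the complete solution):

  " Sketch: update lines in place through a field symbol, no MODIFY needed.
  LOOP AT itab ASSIGNING FIELD-SYMBOL(<fs_line>) WHERE belnr = itab2-belnr.
    <fs_line>-budat = '99991231'.  " writes directly into the table line
  ENDLOOP.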

József Szikszai
  • I'm not sure about the last paragraph, because it READs the table with TWO conditions, but MODIFYes it just for ONE. So I assumed he wants to update all rows in that table for that single condition. – VXLozano Mar 02 '18 at 10:32
  • You could be right, I might have overlooked that one. He LOOPs itab2, then READs a single line in itab and (if the conditions apply) modifies several lines of itab. In that case a LOOP over a TYPE SORTED table will give much better performance (with a proper key). Another reason to use sensible internal table names instead of itab, itab2, etc. (Clean Code!) – József Szikszai Mar 02 '18 at 10:45
  • 1
    On the other hand, my advice of using FIELD-SYMBOLS instead of work areas gives better performance (roughly a 10% improvement in my experience). – József Szikszai Mar 02 '18 at 10:47
  • 1
    The answer starts quite well, but later there are several problems: COLLECT is absolutely supported for standard tables, only a _temporary_ hash is generated. Second, the type of itab2 does not matter in the LOOP, the slowness is caused by itab, and _that_ should be hashed. – András Mar 02 '18 at 16:54
  • @András totally agree, except for making `itab` **hashed**, because (after the latest edit by ekekakos) itab has a theoretical unique key by bukrs, belnr, gjahr and probably hkont, but the access is by the partial key bukrs, belnr and gjahr (no hkont). – Sandra Rossi Mar 02 '18 at 21:23
  • @András sorry, I am not a native English speaker and therefore my choice of words might be incorrect. When I wrote COLLECT is not supported for STANDARD tables, I did not mean it won't work at all, only that there will be huge performance issues (if the table is big). In the meantime I checked the SAP Help and this is how it is formulated there: "The statement COLLECT is not suitable for standard tables and should no longer be used for them." (https://help.sap.com/doc/abapdocu_751_index_htm/7.51/en-US/abapcollect.htm) – József Szikszai Mar 02 '18 at 21:42
  • @András "only a temporary hash is generated": this is correct, but only half of the truth. If you COLLECT into a STANDARD table and no matching line is found, ABAP does an APPEND (internally), the temporary hash is destroyed and has to be built up again by the next COLLECT - this is why there can be serious performance issues when COLLECTing into huge STANDARD tables – József Szikszai Mar 02 '18 at 21:55
  • @András "Second, the type of itab2 does not matter in the LOOP, the slowness is caused by itab, and that should be hashed" - thanks for pointing out, I corrected that (at the end I was confused with the code and the internal table names :) ) – József Szikszai Mar 02 '18 at 21:59
  • @JozsefSzikszai, I'm not English either :) – András Mar 02 '18 at 23:47
  • @JozsefSzikszai, the hash in the COLLECT is only destroyed if you _change_ the table. Appending is the essence of COLLECT, of course it works. Three paragraphs after the one you cite, it is explained that only edits must be avoided. – András Mar 02 '18 at 23:48
0

As said by JozsefSzikszai, it would probably be better to use BSAS/BSIS as they have indexes. If you really want to use BSEG: BSEG is a cluster table. AFAIK, for those tables it is preferable to perform a SELECT using the key fields into an internal table, and then work on the internal table for the additional fields:

 select * from bseg
      into corresponding fields of table itab
      for all entries in it_bkpf
      where bukrs = it_bkpf-bukrs and
            belnr = it_bkpf-belnr and
            gjahr = it_bkpf-gjahr.
 delete itab where hkont not in s_hkont.

In the second part, you're performing a READ TABLE on a standard table (itab). Using BINARY SEARCH would cut your runtime greatly... A READ TABLE on a standard table is a full table scan (the lines are read until the value is found). Sorting itab by belnr and hkont (instead of only belnr) and adding BINARY SEARCH turns it into a dichotomic (binary) search.
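A sketch of that change, assuming itab stays a STANDARD table and keeps the question's fields:

  " The sort order must match the READ key order for BINARY SEARCH.
  sort itab by belnr hkont.

  read table itab with key belnr = itab2-belnr
                           hkont = wtax
                  binary search.

Keep in mind the table must be re-sorted after any further APPENDs, or BINARY SEARCH may return wrong results.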

Jagger
PATRY Guillaume
  • 2
    Removing some restricting component from the WHERE condition just to delete these lines later is a _very_ bad idea. – András Mar 03 '18 at 19:47
  • I would normally agree, but in the case of a cluster table, we gained performance doing just this (for a field that was not in the key). – PATRY Guillaume Mar 04 '18 at 17:55

I guess people have already given you all the hints to optimize your code. Essentially, as you want to optimize only the 2nd part, the only issues seem to be with the itab operations (loop, read, modify), which you may improve by using an index or a hash table on the internal table itab. If these concepts are unclear, I recommend the ABAP documentation: Row-Based Administration Costs of Internal Tables and Internal Tables - Performance Notes.

Two more things:

  1. you should measure the performance of your code with transaction SAT, and you will see on which operation the time is spent.
  2. Not related to performance, but to the data model: the key of a financial document is made of 3 fields, company (BUKRS), document number (BELNR) and year (GJAHR). So you shouldn't group ("collect") by document number only, or you risk mixing documents of distinct companies or years. Instead, keep all 3 fields.
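Point 2 above would amount to extending the question's git_belnr_sums declaration with the full document key, e.g. (a sketch; the type and variable names are only suggestions):

  TYPES: BEGIN OF ty_doc_sums,
          bukrs LIKE bkpf-bukrs,
          belnr LIKE bkpf-belnr,
          gjahr LIKE bkpf-gjahr,
          dmbtr LIKE bseg-dmbtr,
         END OF ty_doc_sums.
  " COLLECT then aggregates per complete document key, not per belnr only.
  DATA: git_doc_sums TYPE SORTED TABLE OF ty_doc_sums
                     WITH UNIQUE KEY bukrs belnr gjahr.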

Now, if you accept adapting a few little things in the 1st part, then right after reading the lines of BSEG of each document you may simply loop over those lines; no index, binary search, or hash table is needed, and that's it. The only thing to do is to first store the lines of BSEG in a temporary itab_temp containing only the lines of one document at a time, so that you loop only over those lines, and then add them to itab.

If you want to update the BUDAT component later, not in the 1st part, then store the document key + BUDAT in a distinct internal table (say itab_budat), define a non-unique secondary index on itab with the document key fields (say by_doc_key), and you'll only need this straightforward code:

LOOP AT itab_budat ASSIGNING FIELD-SYMBOL(<line_budat>).
  itab-budat = <line_budat>-budat. " header line supplies the new value
  MODIFY itab FROM itab USING KEY by_doc_key TRANSPORTING budat
       WHERE bukrs = <line_budat>-bukrs
         AND belnr = <line_budat>-belnr
         AND gjahr = <line_budat>-gjahr.
ENDLOOP.

Note that I propose a (non-unique) secondary index, not a primary one, because since you don't want to change the 1st part (too much), this avoids potential issues that you might have with a primary index.

If you need more information on the terms, syntaxes, and so on, you'll find them in the ABAP documentation (inline or web).

Sandra Rossi

The first part is faster because it is merely linearly bad.

The second part is quadratic.
The first part looks quadratic too, but bukrs, belnr, gjahr form almost the complete primary key of BSEG, so for every line of it_bkpf the SELECT finds only a few lines.

First part

As JozsefSzikszai wrote, FOR ALL ENTRIES is much faster than nested SELECTs. Moreover, a huge amount of unnecessary data is moved from the DB to the application server: the SELECT SINGLE is just an existence check, yet it reads all 362 columns of BSEG.

So create a SORTED table before the LOOP:

TYPES: BEGIN OF tt_bseg,
         bukrs TYPE bukrs,
         belnr TYPE belnr_d,
         gjahr TYPE gjahr,
       END OF tt_bseg.

DATA: lt_bseg TYPE SORTED TABLE OF tt_bseg WITH NON-UNIQUE KEY bukrs belnr gjahr.

SELECT bukrs belnr gjahr FROM bseg
    INTO TABLE lt_bseg
    FOR ALL ENTRIES IN it_bkpf
    WHERE bukrs = it_bkpf-bukrs
      AND belnr = it_bkpf-belnr
      AND gjahr = it_bkpf-gjahr
      AND hkont = wtax.

Instead of the SELECT SINGLE, use READ TABLE:

READ TABLE lt_bseg TRANSPORTING NO FIELDS
    WITH KEY bukrs = itab-bukrs
             belnr = itab-belnr
             gjahr = itab-gjahr.

Second part

This is quadratic because for every iteration of itab2, all lines of itab are checked for the MODIFY.

Simplest solution: Instead of SORT itab BY belnr load the content of itab into a SORTED table, and use that later instead:

DATA: lt_itab LIKE SORTED TABLE OF itab
            WITH NON-UNIQUE KEY belnr hkont.
lt_itab = itab[].
FREE itab.
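The second part would then read against lt_itab instead, e.g. (a sketch reusing the question's variable names):

" The sorted key (belnr hkont) makes this an O(log n) key access.
READ TABLE lt_itab INTO itab
     WITH KEY belnr = itab2-belnr
              hkont = wtax.

MODIFY lt_itab ... WHERE belnr = ... benefits as well, because the sorted key limits the range of lines that have to be scanned.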
András

There were many reasonable recommendations from the others, especially from Sandra Rossi, but I'll just add my two cents about your latest variant that runs for 1 hour:

  • use field symbols as much as possible. They really can influence performance on large datasets (and 1.5M rows is large, yes!)
  • use secondary keys, as proposed by Sandra. It really matters a lot (see standard program DEMO_SECONDARY_KEYS)
  • do not use INTO CORRESPONDING FIELDS unless needed; it slows down the queries a bit
  • there is some double work in your snippet: you collect sums in the first loop over itab (BSEG) even though BUDAT is written only when the BSEG position sum fits the lower-upper range, so some of the collected sums are wasted effort.

All in all, I haven't fully understood your logic, since you write BUDAT not to the whole matched rowset but only to the first row of the group with the matched BELNR, which makes little sense to me.

Nevertheless, if one tries to repeat your logic without any alterations and applies the new GROUP BY syntax, here is what one might get.

* To omit `INTO CORRESPONDING` you should properly declare your structure
* and fetch only those fields which are needed. Also avoid obsolete `OCCURS` syntax.
TYPES: BEGIN OF ty_itab,
        bukrs TYPE bseg-bukrs,
        belnr TYPE bkpf-belnr,
        gjahr TYPE bkpf-gjahr,
        hkont TYPE bseg-hkont,
        dmbtr TYPE bseg-dmbtr,
        shkzg TYPE bseg-shkzg ,
        budat TYPE bkpf-budat,
      END OF ty_itab.

TYPES: BEGIN OF ty_belnr_sums,
        belnr TYPE bkpf-belnr,
        dmbtr TYPE bseg-dmbtr,
       END OF ty_belnr_sums.
DATA: itab TYPE TABLE OF ty_itab INITIAL SIZE 0,
      con_5per_tax type p decimals 2 value '0.03'.

SELECT g~bukrs g~belnr g~gjahr g~hkont g~dmbtr g~shkzg f~budat UP TO 1500000 ROWS
  INTO TABLE itab
  FROM bseg AS g
  JOIN bkpf AS f
    ON g~bukrs = f~bukrs
   AND g~belnr = f~belnr
   AND g~gjahr = f~gjahr.

DATA members LIKE itab.
FIELD-SYMBOLS <first_line> TYPE ty_itab.

LOOP AT itab ASSIGNING FIELD-SYMBOL(<fs_itab>)
                       GROUP BY ( belnr = <fs_itab>-belnr
                                  hkont = <fs_itab>-hkont )
                       ASCENDING
                       ASSIGNING FIELD-SYMBOL(<group>).
  CLEAR members.
  CHECK <group>-hkont(2) = '73'.

  UNASSIGN <first_line>.
  LOOP AT GROUP <group> ASSIGNING FIELD-SYMBOL(<group_line>).
    IF <first_line> IS NOT ASSIGNED.
      ASSIGN <group_line> TO <first_line>. " remember the group's first row
    ENDIF.
    members = VALUE #( BASE members ( <group_line> ) ).
  ENDLOOP.

  DATA(sum) = REDUCE dmbtr( INIT val TYPE dmbtr
                            FOR wa IN members
                            NEXT val = val + wa-dmbtr ).

  IF members[ 1 ]-dmbtr BETWEEN con_5per_tax * sum - '0.02'
                            AND con_5per_tax * sum + '0.02'.
    <first_line>-budat = '99991231'.
  ENDIF.
ENDLOOP.

During my test on a 1.5M dataset I measured the runtime with GET RUN TIME FIELD and got the following results.

Old snippet: (runtime screenshot omitted)

My snippet: (runtime screenshot omitted)

Suncatcher