
In our database (Oracle 12c on Exadata) we plan to store sales data. Input text files containing sales data arrive on a daily basis (~1000 files every day, each containing ~20,000 rows). The text files are read and transferred to the database as soon as possible. According to our calculations the data will grow to about 5 terabytes in one year.

Data format: [transaction date][category][sales_number][buyer_id][other columns]

Sales data comes in 10 different categories, all with the same fields. Logically the data can be stored in one single table, or it can be divided into 10 tables (one per category).

What is the best practice for storing this kind of big data in Oracle? What partitioning and indexing strategy should be applied?

Constraints: Data should be available for analysis by the marketing department within 2-3 days. Queries filter on [sales_number], or on [category] and [buyer_id], or on [buyer_id] alone.

fatihn

1 Answer


If the number of categories is known and fixed, then you can use a subpartition for each category. One approach could be this:

CREATE TABLE SALES_DATA
(
  TRANSACTION_DATE TIMESTAMP(0) NOT NULL,
  CATEGORY NUMBER NOT NULL,
  SALES_NUMBER NUMBER, 
  BUYER_ID NUMBER,
  [OTHER COLUMNS]
)
PARTITION BY RANGE (TRANSACTION_DATE) INTERVAL (INTERVAL '1' DAY)
SUBPARTITION BY LIST (CATEGORY)
SUBPARTITION TEMPLATE 
    (
        SUBPARTITION CAT_1 VALUES (1), 
        SUBPARTITION CAT_2 VALUES (2),
        SUBPARTITION CAT_3_AND_4 VALUES (3,4), 
        SUBPARTITION CAT_5 VALUES (5),
        ...
        SUBPARTITION CAT_10 VALUES (10),
        SUBPARTITION CAT_OTHERS VALUES (DEFAULT)        
    )
    (
        PARTITION P_INITIAL VALUES LESS THAN (TIMESTAMP '2018-01-01 00:00:00')  
    );

Local indexes would be needed on sales_number and buyer_id. You can put every (sub)partition into a separate tablespace if required.
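A minimal sketch of those local indexes could look like this (the index names are placeholders; note that a single local index on buyer_id serves both the [buyer_id] and the [category],[buyer_id] queries, because the category predicate is handled by subpartition pruning):

    -- Local index for lookups by sales_number;
    -- each (sub)partition gets its own index segment
    CREATE INDEX sales_data_sn_ix
      ON sales_data (sales_number) LOCAL;

    -- Local index for lookups by buyer_id; combined with a
    -- category predicate, Oracle prunes to the matching
    -- list subpartitions before probing this index
    CREATE INDEX sales_data_buyer_ix
      ON sales_data (buyer_id) LOCAL;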

Wernfried Domscheit
  • With many categories that would make queries on sales_number or buyer_id quite slow as Oracle would have to scan indexes for all partitions in the date range. If no date has been given all partitions must be processed. – ewramner Jan 11 '18 at 14:58
  • @user2612030, yes, that's right, but it depends on the number of categories. A global index on sales_number would be better. – Wernfried Domscheit Jan 11 '18 at 15:39
  • Without knowing the queries or the cardinality of the data, how do you know that you even need partitioning or indexes? – BobC Jan 18 '18 at 04:55
  • @BobC, the OP stated some numbers and type of queries. – Wernfried Domscheit Jan 18 '18 at 06:41
  • @WernfriedDomscheit Maybe I'm being dumb... where? – BobC Jan 18 '18 at 10:16
  • @BobC, *"Input text files containing sales data comes daily basis(~1000 files every day each containing ~20000 rows)."* and *"in 10 different categories"* and *"Queries based on [sales_number] or [category],[buyer_id] or [buyer_id] columns."* I think this information justifies my proposal. Maybe not by 100% but it should be a good starting point. – Wernfried Domscheit Jan 18 '18 at 10:35
  • But that does not really define the query nor the cardinalities. Maybe I'm just being a bit pedantic, since my focus is on performance ;) – BobC Jan 18 '18 at 16:55
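The global index on sales_number suggested in the comments could be sketched like this (hash partitioning of the index spreads concurrent inserts across segments; the partition count of 16 is an assumption to be tuned for the actual load):

    -- Global hash-partitioned index: queries on sales_number
    -- probe one index partition instead of one index per
    -- table (sub)partition
    CREATE INDEX sales_data_sn_gix
      ON sales_data (sales_number)
      GLOBAL PARTITION BY HASH (sales_number) PARTITIONS 16;

The trade-off is that dropping or truncating old date partitions then requires index maintenance (e.g. `UPDATE GLOBAL INDEXES`), whereas local indexes are maintained automatically.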