I have a job in Pentaho Data Integration that consists of a simple transformation involving a table input and a table output.
The query in the table input is as follows:
SELECT --TOP 1000
IL.Invoice_Date
,MONTH(IL.Invoice_Date) AS Invoice_Month
,YEAR(IL.Invoice_Date) AS Invoice_Year
,IH.Invoice_Status
,IH.Invoice_Type
,IL.Invoice_Number
,IH.Invoice_Category
,IH.Invoice_Site
,IH.Source_Document
,IH.Source_Date
,IL.Customer_ID
,CT.Old_Customer_ID
,CT.Customer_Name
,CT.Customer_Country_ISO AS Customer_Country
,CT.Market
,CT.Escalao
,CT.Rating_B2B_B2C
,PR.Product_Reference
,PR.Product
,PR.Colour_ID
,PR.Colour
,PR.Size
,PR.Design
,PR.STK_Product_Family
,PR.FAM1
,PR.FAM2
,PR.FAM3
,PR.Product_Line
,OH.Order_Date
,YEAR(OH.Order_Date) AS Order_Year
,CASE
WHEN OH.Order_Type IS NOT NULL THEN OH.Order_Type
WHEN IL.Order_ID IS NOT NULL THEN LEFT(IL.Order_ID,3)
ELSE NULL
END AS Order_Type
,IL.Order_ID
,CASE
WHEN OH.Order_Number IS NOT NULL THEN OH.Order_Number
WHEN IL.Order_ID IS NOT NULL AND IL.Order_ID <> '' THEN RIGHT(IL.Order_ID, LEN(IL.Order_ID) - 7)
ELSE NULL
END AS Order_Number
,OH.Origin
,OH.Delivery_Country
,OH.Stock_Site AS Warehouse
,OH.Customer_Order_Ref
,IL.Invoiced_Quantity
,CAST(OL.Max_Price_A * COALESCE(ERH.Currency_Rate,1) AS DECIMAL(15,4)) AS Ref_Price_A
,IL.Unit_Net_Price
,IL.Unit_Gross_Price
,IL.Cost_Price
,IL.Margin
,(CASE WHEN IH.Invoice_Category = 'Credit Note' THEN -1 ELSE 1 END) * IL.Line_Net_Price AS Line_Net_Price
,IL.Line_Gross_Price
,IL.Line_Cost_Price
,IL.Currency
,((CASE WHEN IH.Invoice_Category = 'Credit Note' THEN -1 ELSE 1 END) * IL.Unit_Net_Price * COALESCE(ER.Currency_Rate,1)) AS Unit_Net_Price_EUR
,((CASE WHEN IH.Invoice_Category = 'Credit Note' THEN -1 ELSE 1 END) * IL.Unit_Gross_Price * COALESCE(ER.Currency_Rate,1)) AS Unit_Gross_Price_EUR
--,((CASE WHEN IH.Invoice_Category = 'Credit Note' THEN -1 ELSE 1 END) * IL.Unit_Gross_Price * COALESCE(ER.Currency_Rate,1)) AS Natural_Price
,((CASE WHEN IH.Invoice_Category = 'Credit Note' THEN -1 ELSE 1 END) * IL.Cost_Price * COALESCE(ER.Currency_Rate,1)) AS Cost_Price_EUR
,((CASE WHEN IH.Invoice_Category = 'Credit Note' THEN -1 ELSE 1 END) * IL.Margin * COALESCE(ER.Currency_Rate,1)) AS Margin_EUR
,((CASE WHEN IH.Invoice_Category = 'Credit Note' THEN -1 ELSE 1 END) * IL.Line_Net_Price * COALESCE(ER.Currency_Rate,1)) AS Line_Net_Price_EUR
,((CASE WHEN IH.Invoice_Category = 'Credit Note' THEN -1 ELSE 1 END) * IL.Line_Gross_Price * COALESCE(ER.Currency_Rate,1)) AS Line_Gross_Price_EUR
--,((CASE WHEN IH.Invoice_Category = 'Credit Note' THEN -1 ELSE 1 END) * IL.Line_Gross_Price * COALESCE(ER.Currency_Rate,1)) AS Natural_Line
,((CASE WHEN IH.Invoice_Category = 'Credit Note' THEN -1 ELSE 1 END) * IL.Line_Cost_Price * COALESCE(ER.Currency_Rate,1)) AS Line_Cost_Price_EUR
,CT.Special_Customer
,PR.Reference_Type
,CT.Area_Comercial
,CT.Market_Comercial
,CT.Zone_Comercial
,CT.Account_Manager
,CT.Sales_Manager
,CT.Sales_Director
,IL.[ICMS_Base]
,IL.[ICMS_Value]
,IL.[IPI_Base]
,IL.[IPI_Value]
,IL.[ICMS_ST_Base]
,IL.[ICMS_ST_Value]
,IL.[ICMS_ST_FinalValue]
,IL.Single_Identifier
FROM [BD_BI].[dwm].[Invoice_Line] AS IL
LEFT JOIN [BD_BI].[dwm].[Invoice_Header] AS IH ON IL.Invoice_Number = IH.Invoice_Number
LEFT JOIN [BD_BI].[dwm].[Customer] AS CT ON IL.Customer_ID = CT.Customer_ID
LEFT JOIN [BD_BI].[dwm].[Product] AS PR ON IL.Sage_Prod_ID = PR.Sage_Prod_ID
LEFT JOIN [BD_BI].[dwm].[Order_Header] AS OH ON IL.Order_ID = OH.Order_ID
LEFT JOIN [BD_BI].[dwm].[Exchange_Rate] AS ER ON IL.Invoice_Date >= ER.Start_Date AND IL.Invoice_Date <= ER.End_Date AND IL.Currency = ER.Source_Currency AND ER.Target_Currency = 'EUR'
LEFT JOIN [BD_BI].[dwm].[Exchange_Rate] AS ERH ON OH.Order_Date >= ERH.Start_Date AND OH.Order_Date <= ERH.End_Date AND OH.Currency = ERH.Source_Currency AND ERH.Target_Currency = 'EUR'
LEFT JOIN (
SELECT
Order_ID AS Join_Order_ID
,Sage_Prod_ID AS _Join_Sage_Prod_ID
,MAX(Ref_Price_A) AS Max_Price_A
FROM [BD_BI].[dwm].[Order_Line]
GROUP BY
Order_ID
,Sage_Prod_ID
) AS OL ON IL.Order_ID = OL.Join_Order_ID AND IL.Sage_Prod_ID = OL._Join_Sage_Prod_ID
This query takes around 3 minutes to run in SQL Server Management Studio (SSMS), but the job in Pentaho takes around 2 hours to finish running. The table output step does a truncate table every time the job runs, and the query outputs 2,663,588 rows.
I have already attempted the following optimizations:
- Creating indexes, which made the query run in 4 minutes in SSMS, but with no noticeable improvement in Pentaho.
- Replacing the existing sub-query with a Common Table Expression (CTE), but this did not improve performance.