My question deals with selecting items from a series that were not present in a previous observation. I am trying to adapt a technique from SQL: selecting rows where column value changed from previous row.
I have an application that includes a table that stores a list of apps installed on devices. Devices submit this data daily, and the full list is saved each day. I have been exporting new records each day for import into a reporting system but the data is growing very large because my reporting dataset has to retain a much larger time range that the actual application. And the way the app stores this data is not optimal for reporting. I would like to export only records that indicate that a new app has been installed on a device since the last time I performed a data export. The report_ids of last export and for the most recent report from the device are known. The data structure is not ideal. Essentially I am trying to convert a time series of data into a change log.
I don't own the source application so I can't change the database's data structures, triggers, etc.
*** Please see fiddle for data structure and sample data: http://sqlfiddle.com/#!9/ec6040/17
SQL:
-- Sample data:
CREATE TABLE last_exported_reports
(report_id int, mobile_device_id int);
INSERT INTO last_exported_reports
(mobile_device_id, report_id)
VALUES
(1, 1), (2, 6), (3, 7);
CREATE TABLE mobile_devices_denormalized
(mobile_device_id int, last_report_id int);
INSERT INTO mobile_devices_denormalized
(mobile_device_id, last_report_id)
VALUES
(1, 4), (2, 6), (3, 8);
CREATE TABLE reports
(`report_id` int, `mobile_device_id` int, `date_entered_epoch` bigint(32));
INSERT INTO reports
(`report_id`, `mobile_device_id`, `date_entered_epoch`)
VALUES
(1, 1, 1529981397691),
(2, 2, 1529981397692),
(3, 3, 1529981397693),
(4, 1, 1529981397694),
(5, 2, 1529981397695),
(6, 2, 1529981397696),
(7, 3, 1529981397697),
(8, 3, 1529981397698);
CREATE TABLE mobile_device_installed_applications
(`report_id` int, `identifier` varchar(8), `application_short_version` varchar(5));
INSERT INTO mobile_device_installed_applications
(`report_id`, `identifier`, `application_short_version`)
VALUES
(1, 'Chrome', 'c1.1'), -- device 1
(1, 'Word', 'w2.1'), -- device 1
(2, 'Skype', 's1.0'), -- device 2
(3, 'Excel', 'e3.0'), -- device 3
(4, 'Chrome', 'c2.1u'), -- device 1
(4, 'Word', 'w2.1n'), -- device 1
(4, 'Excel', 'w2.0'), -- device 1
(5, 'Skype', 's1.0'), -- device 2
(6, 'Skype', 's1.9'), -- device 2
(7, 'Excel', 'e3.0'), -- device 3
(8, 'Excel', 'e3.0'); -- device 3
SELECT
mdd.mobile_device_id AS md_id, mdia.report_id, mdia.identifier, mdia.application_short_version AS ver
FROM
-- List of all devices
mobile_devices_denormalized mdd
INNER JOIN
-- Add in apps
mobile_device_installed_applications mdia
ON
-- Only if they are from the last report from a device
mdia.report_id = mdd.last_report_id
AND
-- And only if the latest report has not already been exported
mdia.report_id NOT IN (select report_id FROM last_exported_reports)
AND
-- And only if the app in the new report was not in the last exported report
NOT EXISTS (
SELECT *
FROM mobile_device_installed_applications exported
WHERE exported.identifier = mdia.identifier
AND exported.report_id = mdd.last_report_id
)
;
If I run the above without the NOT EXISTS clause, it works fine in that I get all app records for new reports as expected. When I add in the final step, the subquery to get rid of repetative apps, I get no results.
RESULT IF RUN WITHOUT THE SUBQUERY CLAUSE:
md_id report_id identifier ver
1 4 Chrome c2.1u
1 4 Word w2.1n
1 4 Excel w2.0
3 8 Excel e3.0
- This is correct for device 1 because report 4 is most recent and is not the previously exported report (4 != 1)
- This is correct for device 2 because report 6 is most recent but it was already exported so no records are exported.
- This is correct for device 3 because report 8 is most recent and is not the previously exported report (8 != 7)
RESULT WHEN I ADD IN THE FINAL SUBQUERY CLAUSE:
No records.
EXPECTED RESULT:
md_id report_id identifier ver
1 4 Excel w2.0
The above record should have been printed because it is in the most recent report (report_id=4) for device #1 and the app was not on the device the last time it was exported (report_id=1).