My dataset consists of daily timeseries for different companies and I work with PostgreSQL. I have a indicator variable in my dataset, taking values 1, -1 and most of the time 0. If the indicator variable is not 0, and a company has a missing value in another column on that day (indicator day) or the next day, the company shall be excluded from the dataset entirely.
We can think of the following example data:
date company indicator value
2012-01-02 A 0 2
2012-01-02 B 0 9
2012-01-02 C 0 1
2012-01-02 D 0 3
2012-01-03 A 1 NULL
2012-01-03 B 0 NULL
2012-01-03 C -1 1
2012-01-03 D 0 2
2012-01-04 A 0 1
2012-01-04 B 0 1
2012-01-04 C 0 NULL
2012-01-04 D 1 4
2012-01-05 A 0 4
2012-01-05 B 0 2
2012-01-05 C 0 1
2012-01-05 D 0 7
So A has to be excluded because it has a missing value on the indicator day, and C because it has a missing value on the day following the indicator day.
I tried the following:
CREATE TABLE to_delete
AS SELECT * FROM mytable
WHERE company IN(
SELECT company
FROM mytable
WHERE date BETWEEN (SELECT date FROM mytable WHERE indicator != 0)
AND (SELECT date+1 FROM mytable WHERE indicator != 0)
AND indicator != 0)
AND date BETWEEN (SELECT date FROM mytable WHERE indicator != 0)
AND (SELECT date+1 FROM mytable WHERE indicator != 0)
DELETE FROM mytable WHERE company in (SELECT DISTINCT company FROM to_delete);
It works if there exists only one indicator value that is unequal to zero in the example dataset. With more than one, PostgreSQL returns an error, saying that my subquery returned more than one row.
I am really struggling to fix this problem. Do you know a solution, or may be a completely other approach to achieve the wanted result?