2

I'm trying to analyze tweets using manually assigned categories. Everything's stored in a MySQL database. I can add and remove tweets, categories, and the relationships between them without any problems.

Including categories using OR logic works as expected. If I want to find tweets categorized as "Venezuela" or "Maduro," I send those two terms over in an array called $include with $include_logic set to "or". Tweets categorized under either category are returned. Great!

The problems start when I try to using AND logic (that is, tweets categorized under all included terms, e.g., both Venezuela and Maduro) or when I try excluding categories.

Here's the code:

function filter_tweets($db, $user_id, $from_utc, $to_utc, $include = null, $include_logic = null, $exclude = null) {

    $include_sql = '';
    if (isset($include)) {
        $include_sql = 'AND (';
        $logic_op = '';
        foreach ($include as $cat) {
            $include_sql .= "{$logic_op}cats.name = '$cat' ";
            $logic_op = ($include_logic != 'and') ? 'OR ' : 'AND '; # AND doesn't work here
        }
        $include_sql .= ')';
    }
    $exclude_sql = ''; # Nothing I've tried with this works.

    $sql = "
        SELECT DISTINCT tweets.id FROM tweets
            LEFT OUTER JOIN tweets_cats ON tweets.id = tweets_cats.tweet_id
            LEFT OUTER JOIN cats ON tweets_cats.cat_id = cats.id 
        WHERE tweets.user_id = $user_id
            AND created_at
                BETWEEN '{$from_utc->format('Y-m-d H:i:s')}'
                    AND '{$to_utc->format('Y-m-d H:i:s')}'
            $include_sql
            $exclude_sql
        ORDER BY tweets.created_at ASC;";

    return db_fetch_all($db, $sql);   
}

where db_fetch_all() is

function db_fetch_all($con, $sql) {

    if ($result = mysqli_query($con, $sql)) {
        $rows = mysqli_fetch_all($result);
        mysqli_free_result($result);
        return $rows;
    }
    die("Failed: " . mysqli_error($con)); 
}

and tweets_cats is the junction table between the tweets and cats tables.

After reading up on joins and junction tables, I understand why my code doesn't work in the two cases mentioned. It can only look at one tweet and corresponding category at a time. So asking it to omit a tweet categorized as "X" is moot because it won't omit it when the same tweet is encountered and categorized as "Y".

What I don't understand is how to modify the code so that it does work. I haven't found any examples of people trying to do anything similar. Perhaps I'm not searching for the right terms. I'd be grateful if someone could point me to a good resource for working with junction tables in MySQL similar to how I'm using them.


Edit: Here's working SQL created by the function using the example mentioned above including "Venezuela" OR "Maduro" on the VP twitter account with the date range set to tweets so far this month (EST converted to UTC).
SELECT DISTINCT tweets.id FROM tweets
    LEFT OUTER JOIN tweets_cats ON tweets.id = tweets_cats.tweet_id
    LEFT OUTER JOIN cats ON tweets_cats.cat_id = cats.id
WHERE tweets.user_id = 818910970567344128
    AND created_at BETWEEN '2019-02-01 05:00:00' AND '2019-03-01 05:00:00'
    AND (cats.name = 'Venezuela' OR cats.name = 'Maduro' )
ORDER BY tweets.created_at ASC;


Update: Here's working SQL that adheres to AND logic for included categories. Many thanks to @Strawberry for the suggestion!
SELECT tweets.id FROM tweets
    LEFT OUTER JOIN tweets_cats ON tweets.id = tweets_cats.tweet_id
    LEFT OUTER JOIN cats ON tweets_cats.cat_id = cats.id 
WHERE tweets.user_id = 818910970567344128
    AND created_at BETWEEN '2019-02-01 05:00:00' AND '2019-03-01 05:00:00'
    AND cats.name IN ('Venezuela', 'Maduro')
GROUP BY tweets.id
HAVING COUNT(*) = 2
ORDER BY tweets.created_at ASC;

This is a bit beyond my SQL comprehension, though. I'm glad it works. I just wish I understood how.


Update 2: Here's working SQL that excludes categories. I realized that the AND/OR logic that applies to included categories also applies to excluded ones. This example uses OR logic. The syntax is essentially Q1 NOT IN (Q2), where Q2 is what's excluded, which is basically the same query used for inclusion.
SELECT id FROM tweets
WHERE user_id = 818910970567344128
    AND created_at BETWEEN '2019-02-01 05:00:00' AND '2019-03-01 05:00:00'
    AND id NOT IN (
        SELECT tweets.id FROM tweets
            LEFT OUTER JOIN tweets_cats ON tweets.id = tweets_cats.tweet_id
            LEFT OUTER JOIN cats ON tweets_cats.cat_id = cats.id 
        WHERE tweets.user_id = 818910970567344128
            AND created_at BETWEEN '2019-02-01 05:00:00' AND '2019-03-01 05:00:00'
            AND cats.name IN ('Venezuela','Maduro')
    )
ORDER BY created_at ASC;


Update 3: Here's the working code.
function filter_tweets($db, $user_id, $from_utc, $to_utc,
                       $include = null, $include_logic = null,
                       $exclude = null, $exclude_logic = null) {

    if (isset($exclude)) {
        $exclude_sql = "
              AND tweets.id NOT IN (\n"
            . include_tweets($user_id, $from_utc, $to_utc, $exclude, $exclude_logic)
            . "\n)";
    } else {
        $exclude_sql = '';
    }
    if (isset($include)) {
        $sql = include_tweets($user_id, $from_utc, $to_utc, $include, $include_logic, $exclude_sql);
    } else {
        $sql = "
            SELECT id FROM tweets
            WHERE user_id = $user_id
              AND created_at
                BETWEEN '{$from_utc->format('Y-m-d H:i:s')}'
                    AND '{$to_utc  ->format('Y-m-d H:i:s')}'
              $exclude_sql";
    }
    $sql .= "\nORDER BY tweets.created_at ASC;";

    return db_fetch_all($db, $sql);   
}

which relies on this additional function for generating SQL:

function include_tweets($user_id, $from_utc, $to_utc, $include, $logic, $exclude_sql = '') {

    $group_sql = '';
    $include_sql = 'AND cats.name IN (';
    $comma = '';
    foreach ($include as $cat) {
        $include_sql .= "$comma'$cat'";
        $comma = ',';
    }
    $include_sql .= ')';
    if ($logic == 'and')
        $group_sql = 'GROUP BY tweets.id HAVING COUNT(*) = ' . count($include);
    return "
        SELECT tweets.id FROM tweets
          LEFT OUTER JOIN tweets_cats ON tweets.id = tweets_cats.tweet_id
          LEFT OUTER JOIN cats ON tweets_cats.cat_id = cats.id 
        WHERE tweets.user_id = $user_id
          AND created_at
            BETWEEN '{$from_utc->format('Y-m-d H:i:s')}'
                AND '{$to_utc  ->format('Y-m-d H:i:s')}'
          $include_sql
        $group_sql
        $exclude_sql";
}
svadhisthana
  • 163
  • 2
  • 16
  • 1
    you'll probably want to use some parenthesis here, just a guess as I don't see none, but there is a difference between `AND this OR that` and `AND (this OR that)` ... without seeing the full sql, who knows. One requires whats before the and, and then this OR that, the second one requires whats before the and and Either this or that. – ArtisticPhoenix Feb 10 '19 at 00:49
  • @ArtisticPhoenix: Good observation. I'll fix that and edit my question. Thank you! – svadhisthana Feb 10 '19 at 00:55
  • 2
    well, here's one idea to start the ball rolling: `WHERE ... IN(...) GROUP BY ... HAVING COUNT([DISTINCT]...) = 2` (where '2' is equal to the number or arguments in IN) – Strawberry Feb 10 '19 at 01:00
  • @Strawberry: That works perfectly for AND-type inclusion. Thank you so much! – svadhisthana Feb 10 '19 at 04:53

1 Answers1

0

One way to do this is to join your tweets table against the junction table multiple times, e.g. like this:

SELECT tweets.*
FROM tweets
  JOIN tweet_cats AS tweet_cats_foo
    ON tweet_cats_foo.tweet_id = tweets.id
  JOIN tweet_cats AS tweet_cats_bar
    ON tweet_cats_bar.tweet_id = tweets.id
WHERE
  tweet_cats_foo.name = 'foo' AND tweet_cats_bar.name = 'bar'

or, equivalently, like this:

SELECT tweets.*
FROM tweets
  JOIN tweet_cats AS tweet_cats_foo
    ON tweet_cats_foo.tweet_id = tweets.id
    AND tweet_cats_foo.name = 'foo'
  JOIN tweet_cats AS tweet_cats_bar
    ON tweet_cats_bar.tweet_id = tweets.id
    AND tweet_cats_bar.name = 'bar'

Note that, for simplicity, I'm assuming above that your junction table directly contains the category names. If you insist on using numeric category IDs but searching for categories by name, I'd recommend creating a view that joins the category and junction tables together using the numeric category ID and using that view instead of the actual junction table in your query. This saves you from having to include a whole bunch of unnecessary boilerplate code in the query just for finding the numeric category IDs.

For exclusion queries, you can use a LEFT JOIN and check that no matching record exists in the junction table (in which case all the columns from that table will be NULL), like this:

SELECT tweets.*
FROM tweets
  LEFT JOIN tweet_cats AS tweet_cats_foo
    ON tweet_cats_foo.tweet_id = tweets.id
    AND tweet_cats_foo.name = 'foo'
WHERE
  tweet_cats_foo.tweet_id IS NULL  -- could use any non-null column here

(Using this method, you do need to include the tweet_cats_foo.name = 'foo' condition in the LEFT JOIN clause instead of the WHERE clause.)

Of course, you can also combine these. For example, to find tweets in category foo but not in bar, you could do:

SELECT tweets.*
FROM tweets
  JOIN tweet_cats AS tweet_cats_foo
    ON tweet_cats_foo.tweet_id = tweets.id
    AND tweet_cats_foo.name = 'foo'
  LEFT JOIN tweet_cats AS tweet_cats_bar
    ON tweet_cats_bar.tweet_id = tweets.id
    AND tweet_cats_bar.name = 'bar'
WHERE
  tweet_cats_bar.tweet_id IS NULL

or, again equivalently:

SELECT tweets.*
FROM tweets
  LEFT JOIN tweet_cats AS tweet_cats_foo
    ON tweet_cats_foo.tweet_id = tweets.id
    AND tweet_cats_foo.name = 'foo'
  LEFT JOIN tweet_cats AS tweet_cats_bar
    ON tweet_cats_bar.tweet_id = tweets.id
    AND tweet_cats_bar.name = 'bar'
WHERE
  tweet_cats_foo.tweet_id IS NOT NULL
  AND tweet_cats_bar.tweet_id IS NULL

Ps. Another way to find category intersections, as suggested by Strawberry in the comments above, is to do a single join against the junction table, group the results by the tweet ID, and use a HAVING clause to count how many matching categories were found for each tweet:

SELECT tweets.*
FROM tweets
  JOIN tweet_cats ON tweet_cats.tweet_id = tweets.id
WHERE
   tweet_cats.name IN ('foo', 'bar')
GROUP BY tweets.id
HAVING COUNT(DISTINCT tweet_cats.name) = 2

This method could also be generalized to handle exclusions by using a second (left) join, e.g. like this:

SELECT tweets.*
FROM tweets
  JOIN tweet_cats AS tweet_cats_wanted
    ON tweet_cats_wanted.tweet_id = tweets.id
    AND tweet_cats_wanted.name IN ('foo', 'bar')
  LEFT JOIN tweet_cats AS tweet_cats_unwanted
    ON tweet_cats_unwanted.tweet_id = tweets.id
    AND tweet_cats_unwanted.name IN ('baz', 'blorgh', 'xyzzy')
WHERE
  tweet_cats_unwanted.tweet_id IS NULL
GROUP BY tweets.id
HAVING COUNT(DISTINCT tweet_cats_wanted.name) = 2

I have not benchmarked these two approaches to see which one is more efficient, and I'd strongly recommend doing so before deciding which one to go with. In principle, I would expect the multiple-join method to be easier for the database engine to optimize, since it clearly maps to an intersection of joins, whereas for the GROUP BY ... HAVING method a naive database might end up wasting a lot of effort first finding all the tweets that match any of the categories, and only afterwards applying the HAVING clause to filter out everything but those that match all the categories. A simple test case for this could be the intersection of several very large categories with one very small one, which I would expect to be more efficient using the multiple-join method. But of course, one should always test such things instead of relying only on intuition.

Ilmari Karonen
  • 49,047
  • 9
  • 93
  • 153
  • Thank you very much for your answer. That's an interesting approach. It didn't occur to me to look for matches in the JOIN clause. – svadhisthana Feb 10 '19 at 22:21