I'm trying to analyze tweets using manually assigned categories. Everything's stored in a MySQL database. I can add and remove tweets, categories, and the relationships between them without any problems.
Including categories using OR logic works as expected. If I want to find tweets categorized as "Venezuela" or "Maduro," I send those two terms over in an array called $include
with $include_logic
set to "or"
. Tweets categorized under either category are returned. Great!
The problems start when I try to using AND logic (that is, tweets categorized under all included terms, e.g., both Venezuela and Maduro) or when I try excluding categories.
Here's the code:
function filter_tweets($db, $user_id, $from_utc, $to_utc, $include = null, $include_logic = null, $exclude = null) {
$include_sql = '';
if (isset($include)) {
$include_sql = 'AND (';
$logic_op = '';
foreach ($include as $cat) {
$include_sql .= "{$logic_op}cats.name = '$cat' ";
$logic_op = ($include_logic != 'and') ? 'OR ' : 'AND '; # AND doesn't work here
}
$include_sql .= ')';
}
$exclude_sql = ''; # Nothing I've tried with this works.
$sql = "
SELECT DISTINCT tweets.id FROM tweets
LEFT OUTER JOIN tweets_cats ON tweets.id = tweets_cats.tweet_id
LEFT OUTER JOIN cats ON tweets_cats.cat_id = cats.id
WHERE tweets.user_id = $user_id
AND created_at
BETWEEN '{$from_utc->format('Y-m-d H:i:s')}'
AND '{$to_utc->format('Y-m-d H:i:s')}'
$include_sql
$exclude_sql
ORDER BY tweets.created_at ASC;";
return db_fetch_all($db, $sql);
}
where db_fetch_all()
is
function db_fetch_all($con, $sql) {
if ($result = mysqli_query($con, $sql)) {
$rows = mysqli_fetch_all($result);
mysqli_free_result($result);
return $rows;
}
die("Failed: " . mysqli_error($con));
}
and tweets_cats
is the junction table between the tweets
and cats
tables.
After reading up on joins and junction tables, I understand why my code doesn't work in the two cases mentioned. It can only look at one tweet and corresponding category at a time. So asking it to omit a tweet categorized as "X" is moot because it won't omit it when the same tweet is encountered and categorized as "Y".
What I don't understand is how to modify the code so that it does work. I haven't found any examples of people trying to do anything similar. Perhaps I'm not searching for the right terms. I'd be grateful if someone could point me to a good resource for working with junction tables in MySQL similar to how I'm using them.
Edit: Here's working SQL created by the function using the example mentioned above including "Venezuela" OR "Maduro" on the VP twitter account with the date range set to tweets so far this month (EST converted to UTC).
SELECT DISTINCT tweets.id FROM tweets
LEFT OUTER JOIN tweets_cats ON tweets.id = tweets_cats.tweet_id
LEFT OUTER JOIN cats ON tweets_cats.cat_id = cats.id
WHERE tweets.user_id = 818910970567344128
AND created_at BETWEEN '2019-02-01 05:00:00' AND '2019-03-01 05:00:00'
AND (cats.name = 'Venezuela' OR cats.name = 'Maduro' )
ORDER BY tweets.created_at ASC;
Update: Here's working SQL that adheres to AND logic for included categories. Many thanks to @Strawberry for the suggestion!
SELECT tweets.id FROM tweets
LEFT OUTER JOIN tweets_cats ON tweets.id = tweets_cats.tweet_id
LEFT OUTER JOIN cats ON tweets_cats.cat_id = cats.id
WHERE tweets.user_id = 818910970567344128
AND created_at BETWEEN '2019-02-01 05:00:00' AND '2019-03-01 05:00:00'
AND cats.name IN ('Venezuela', 'Maduro')
GROUP BY tweets.id
HAVING COUNT(*) = 2
ORDER BY tweets.created_at ASC;
This is a bit beyond my SQL comprehension, though. I'm glad it works. I just wish I understood how.
Update 2: Here's working SQL that excludes categories. I realized that the AND/OR logic that applies to included categories also applies to excluded ones. This example uses OR logic. The syntax is essentially Q1 NOT IN (Q2), where Q2 is what's excluded, which is basically the same query used for inclusion.
SELECT id FROM tweets
WHERE user_id = 818910970567344128
AND created_at BETWEEN '2019-02-01 05:00:00' AND '2019-03-01 05:00:00'
AND id NOT IN (
SELECT tweets.id FROM tweets
LEFT OUTER JOIN tweets_cats ON tweets.id = tweets_cats.tweet_id
LEFT OUTER JOIN cats ON tweets_cats.cat_id = cats.id
WHERE tweets.user_id = 818910970567344128
AND created_at BETWEEN '2019-02-01 05:00:00' AND '2019-03-01 05:00:00'
AND cats.name IN ('Venezuela','Maduro')
)
ORDER BY created_at ASC;
Update 3: Here's the working code.
function filter_tweets($db, $user_id, $from_utc, $to_utc,
$include = null, $include_logic = null,
$exclude = null, $exclude_logic = null) {
if (isset($exclude)) {
$exclude_sql = "
AND tweets.id NOT IN (\n"
. include_tweets($user_id, $from_utc, $to_utc, $exclude, $exclude_logic)
. "\n)";
} else {
$exclude_sql = '';
}
if (isset($include)) {
$sql = include_tweets($user_id, $from_utc, $to_utc, $include, $include_logic, $exclude_sql);
} else {
$sql = "
SELECT id FROM tweets
WHERE user_id = $user_id
AND created_at
BETWEEN '{$from_utc->format('Y-m-d H:i:s')}'
AND '{$to_utc ->format('Y-m-d H:i:s')}'
$exclude_sql";
}
$sql .= "\nORDER BY tweets.created_at ASC;";
return db_fetch_all($db, $sql);
}
which relies on this additional function for generating SQL:
function include_tweets($user_id, $from_utc, $to_utc, $include, $logic, $exclude_sql = '') {
$group_sql = '';
$include_sql = 'AND cats.name IN (';
$comma = '';
foreach ($include as $cat) {
$include_sql .= "$comma'$cat'";
$comma = ',';
}
$include_sql .= ')';
if ($logic == 'and')
$group_sql = 'GROUP BY tweets.id HAVING COUNT(*) = ' . count($include);
return "
SELECT tweets.id FROM tweets
LEFT OUTER JOIN tweets_cats ON tweets.id = tweets_cats.tweet_id
LEFT OUTER JOIN cats ON tweets_cats.cat_id = cats.id
WHERE tweets.user_id = $user_id
AND created_at
BETWEEN '{$from_utc->format('Y-m-d H:i:s')}'
AND '{$to_utc ->format('Y-m-d H:i:s')}'
$include_sql
$group_sql
$exclude_sql";
}