2

This is a rather complicated question, but I will try my best to describe it as well as possible.

I currently have 2 tables that are joined on an insta_id.

The first table is called Soc_stat and it contains a lot of entries of Instagram posts. The columns I use in my SQL query are insta_id and main_tag.

The other table is called tags and includes an insta_id and all the tag_names that has been used with that insta_id.

fThe tags table looks like this.

enter image description here

My current SQL query looks like this

SELECT tag_names
FROM soc_stat
JOIN tags ON soc_stat.insta_id = tags.insta_id
WHERE main_tag = $hashtag

Now the issue is that currently I have 110000 rows in the tags table, and the query above is taking such a long time that it often times out. But I think I might have found a solution: The main_tag from the soc_stat table is always represented under tag_names in the tags table.

So maybe I don't have to join the two tables? Wouldn't it be possible to just check the tags table for all the insta_id's that contains a tag_name equal to $hashtagand then fetch all the tag_names that are linked to those insta_id's?

Edit

I just used added an explain to my Query on a hashtag I know there is problems with and the result is as follows:

enter image description here

And here I have done it on a much less used hashtag where it is still slow, but not nearly as slow

enter image description here

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459

2 Answers2

1

When dealing with JOIN queries in nontrivially sized tables, it's always a good idea to write the query showing the column names. That's because it matters to performance which table provides each column. It's also easier to understand the intent of the query by reading it.

I am guessing about your query; please verify that both your hashtags and your main_tag come from your soc_stat table. Is this query correct?

SELECT soc_stat.hashtags
  FROM soc_stat
  JOIN tags on soc_stat.insta_id = tags.insta_id
 WHERE soc_stat.main_tag = $hashtag

Next we can sort out the intent of the query. It returns the hashtags column from each entry in your soc_stat table with the main_tag value you provide in $hashtag. But it also does a JOIN to tags, and never uses the contents of that table for either a WHERE clause or a SELECT clause.

I guess your query's intent is this:

Give the set of all tags associated with entries for which I provide the main_tag value.

But your query doesn't do that. I believe what you want is this:

SELECT DISTINCT tags.tag_names
  FROM soc_stat
  JOIN tags on soc_stat.insta_id = tags.insta_id
 WHERE soc_stat.main_tag = $hashtag

You should verify that this query yields something useful to you.

To get this query to work well, you need a couple of compound indexes. On your soc_stat table you should index the columns (main_tag, insta_id) into a single instance. To satisfy your query you need to find a particular value of main_tag in that table, and then you need to retrieve all the related values of insta_id. This compound index makes that efficient.

On tags you should index (insta_id, tag_names). That's because the query will come in to that table knowing insta_id and retrieve all the associated tag_names values.

DDL commands like this create those indexes:

CREATE INDEX index_ss_mt_ii ON soc_stat (main_tag, insta_id)
CREATE INDEX index_tg_ii_tn ON tags     (insta_id, tag_names)

Finally, I wrote SELECT DISTINCT in the query to eliminate duplicates. In large result sets, that will take a bit of time.

With respect, it's time for you to add an understanding of SQL indexing to your set of skills. The entire point of SQL is quick retrieval of data from large datasets, This online publication is excellent for that purpose. http://use-the-index-luke.com/

O. Jones
  • 103,626
  • 17
  • 118
  • 172
  • Thanks for the response. And yeah I need to get better at SQL. I have to admit that I find SQL less interesting than other aspects of programming. But I am trying and hopefully I will get it eventually. I know it is important to know for optimal usage of databases. – Andreas Uldall Leonhard Sep 23 '15 at 11:14
  • As far as the hashtags that's a typo. It was supposed to be called tag_names. I've corrected it now – Andreas Uldall Leonhard Sep 23 '15 at 11:17
  • 1
    SQL is probably the first *declarative* rather than *procedural* language you've had to actually use. The questions of how you translate your intent into such a language are, obviously, worth working out. And, a warning: the world is full of programmers who don't want to know about SQL performance. Those programmers are the lawful prey of unscrupulous, expensive, and unhelpful database administrators. – O. Jones Sep 23 '15 at 11:17
  • Yeah. I am done with my computer science education in january and I have no doubts how beneficial it will be to become good at SQL. Especially when adding value as an employee to an employer. It is just very taxing doing something you're struggeling with. I hope to have some time after my exams to sit and just explore SQL in depth. – Andreas Uldall Leonhard Sep 23 '15 at 11:39
0

Your query -

SELECT hashtags FROM soc_stat 
JOIN tags on tags.insta_id = soc_stat.insta_id
WHERE soc_stat.main_tag = $hashtag

Now you can see the query health by using explain select ...

Also when large data involved its often missing indices makes the query slow

For the above query if not already added you need to add the following indices

alter table soc_stat add index main_tag_idx(main_tag);
alter table tags add index insta_id_idx(insta_id);

Make sure to take a backup of the table before adding the indices.

Abhik Chakraborty
  • 44,654
  • 6
  • 52
  • 63
  • Hey unfortunately I am very novice when it comes to SQL. I think this is probably the issue. Because I am not even sure what these indices are. But i'll research it and see if it can be done in phpadmin. I've backed up everything. I'll let you know how it goes :) – Andreas Uldall Leonhard Sep 23 '15 at 08:59
  • Here is the official doc on index...https://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html – Abhik Chakraborty Sep 23 '15 at 09:01
  • I tried adding the above but did not seem to solve my issue. Damn. I feel like I am in over my head. Need to figure out how to optimise this thing so the queries are not so slow. – Andreas Uldall Leonhard Sep 23 '15 at 09:53
  • Use `expalin SELECT hashtags FROM soc_stat JOIN tags on tags.insta_id = soc_stat.insta_id WHERE soc_stat.main_tag = $hashtag` and share the result into the question. Make sure to replace `$hashtag` with some real value. – Abhik Chakraborty Sep 23 '15 at 10:14
  • Looks like the index `main_tag` is not getting picked up in the query. What is the data type of `main_tag` and what value you are sending for `main_tag` in the query i.e `$hashtag` ? – Abhik Chakraborty Sep 23 '15 at 10:25
  • main_tag's datatype is varchar(65) utf8_unicode_ci The $hashtag is just a string I have added manually and then I run the SQL query when I open the php file in my browser. Something interesting just happened though. In phpadmin I wrote in the query manually and it returned the results in 0.0029 sec. Which is fast. Hmm this is very odd that it times out when I run it through my php file. – Andreas Uldall Leonhard Sep 23 '15 at 10:33
  • Remeber even if the column is indexed the optimizer may still choose to scan the entire table if the where condition is returning data close to the total in the table. Now there are several factors associated, for example the RAM since the indexes are loaded into real memory and if you have less memory then it needs to read from disk resulting entire table scan, the other factor is query cache, its always better to cache the query you can check out how to cache query in mysql. – Abhik Chakraborty Sep 23 '15 at 10:37
  • Actually maybe I should delete this question. It might actually be the sorting that is done in my php document that is slowing it all down. I am trying to sort an array of 103000 where I find the 10 tags that is used the most. This involves 2 loops that sorts all of the array and removes the input that occours the most and then does the whole thing over until it has removed the top 10 most used tags. – Andreas Uldall Leonhard Sep 23 '15 at 10:38