19

So I'm trying to count the number of parts, number of tasks, the quantity in each job and the time that it took to manufacture each job but I'm getting some funky results. If I run this:

SELECT
  j.id, 
    mf.special_instructions,
  count(distinct p.id) as number_of_different_parts,
  count(distinct t.id) as number_of_tasks,
  SUM(distinct j.quantity) as number_of_assemblies,
  SUM(l.time_elapsed) as time_elapsed

FROM
  sugarcrm2.mf_job mf
INNER JOIN ramses.jobs j on
  mf.id = j.mf_job_id
INNER JOIN ramses.parts p on
  j.id = p.job_id
INNER JOIN ramses.tasks t on
  p.id = t.part_id
INNER JOIN ramses.batch_log l on
  t.batch_id = l.batch_id

WHERE 
  mf.job_description                LIKE "%BACKBLAZE%" OR
  mf.customer_name                  LIKE "%BACKBLAZE%" OR
  mf.customer_ref                   LIKE "%BACKBLAZE%" OR
  mf.technical_company_name LIKE "%BACKBLAZE%" OR
  mf.description                        LIKE "%BACKBLAZE%" OR
  mf.name                                   LIKE "%BACKBLAZE%" OR
  mf.enclosure_style                LIKE "%BACKBLAZE%" OR 
    mf.special_instructions     LIKE "%BACKBLAZE%"
Group by j.id

and I now get accurate parts and tasks numbers but the time_elapsed sum isn't correct. What could the problem be?

When I try it with distinct I get a veeeeery low number (like something between 1 and 30 when I'm looking for something closer to 10,000.)

UPDATE: here is the create code:

http://pastebin.com/nbhU9rYh

http://pastebin.com/tdmAkNr4

http://pastebin.com/0TFCUaeQ

http://pastebin.com/fugr8C9U

http://pastebin.com/Zq0bKG2L

http://pastebin.com/k5rESUrq

The relationships are like this:

  • mf_job info is linked to a job
  • jobs have parts
  • parts have tasks
  • tasks are in batches
  • batch_log is a table with all of the starts and stops for the batches of tasks, it has a start_time and a stop_time and a time_elapsed.

I am trying to get all of the time_elapsed from the batch_log for each mf_job with the word backblaze in one of it's fields along with the number of parts, tasks and assemblies. This all needs to be grouped by job.id or mf_job.id

APC
  • 144,005
  • 19
  • 170
  • 281
davidahines
  • 3,976
  • 16
  • 53
  • 87
  • Depending on your data distribution, you could be counting the same `l.time_elapsed` more than once in the `SUM` operation on it (if any given `t.batch_id` shows up more than once based on your join). Is this desirable? – Ben Mosher Aug 29 '11 at 17:38
  • What exactly is stored in the time_elapsed column? secs, minutes, hours, days? hh:mm:ss? – Icarus Aug 29 '11 at 17:38
  • @Icarus: It's stored as hh:mm:ss. – davidahines Aug 29 '11 at 17:55
  • @Ben Mosher: I think you're right. It's not desirable, no. I've heard that you can replace the join aggregate with a nested select but I have no idea how to do that so that it will still end up on the same row as the id. – davidahines Aug 29 '11 at 17:55
  • @dah try changing your elapsed_time sum to: select SEC_TO_TIME(SUM(SECOND(elapsed_time))). Have a look at this for reference: http://onlamp.com/pub/a/onlamp/excerpt/mysqlckbk/index3.html?page=3 – Icarus Aug 29 '11 at 19:22
  • 1
    @Icarus, very close, `select SEC_TO_TIME(SUM(TIME_TO_SEC(elapsed_time)))` is actually correct, the function `SECOND()` will only ever give a maximum result of 59 seconds. – Johan Aug 30 '11 at 09:44
  • I tried this but all it did was change the formatting, it didn't get me accurate results. – davidahines Aug 31 '11 at 14:24
  • 1
    Please include table schema. And clarify the elapsed_time meant to be a SUM based on batch_id or batch_id + job_id – ajreal Sep 07 '11 at 16:11
  • 1
    @dah can you give the column ddl from show create table? – J.D. Fitz.Gerald Sep 07 '11 at 16:12
  • I can definitely do that, do you mean from like, every table or just batch_log? – davidahines Sep 07 '11 at 16:14
  • 1
    What is relevant here is knowing the behaviour of the relationships between the tables. Basically, what joins can cause duplication. You may then need different levels of aggregation with nested sub-queries. The more you can desribe about the tables and their relationships the better. – MatBailie Sep 07 '11 at 16:22
  • 11
    +1 for putting up almost all your rep on a bounty. – Matthew Sep 07 '11 at 16:43
  • Honestly, what I mainly care about is fixing this, it's really important to my boss. – davidahines Sep 07 '11 at 16:52
  • @Dems, I put up the table info and I think it's the batch_log causing duplication? – davidahines Sep 07 '11 at 16:52
  • @J.D. Fitz Gerald: I put up the info, is this what you're asking for? – davidahines Sep 07 '11 at 17:25
  • Could you post some data (both source and current result + expected result) for a single job? – StevieG Sep 08 '11 at 08:53
  • 1
    +1 for putting the create code. Some test inserts would be nice also. – daniloquio Sep 12 '11 at 18:08

4 Answers4

3

Try and rewrite the query to:

SELECT
  j.id, 
  mf.special_instructions,
  count(p.id) as number_of_different_parts,
  count(t.id) as number_of_tasks,
  SUM(j.quantity) as number_of_assemblies,
  SEC_TO_TIME(SUM(l.seconds_elapsed)) as time_elapsed

FROM
  sugarcrm2.mf_job mf
INNER JOIN ramses.jobs j on
  mf.id = j.mf_job_id
INNER JOIN ramses.parts p on
  j.id = p.job_id
INNER JOIN ramses.tasks t on
  p.id = t.part_id
INNER JOIN (
            SELECT rl.batch_id
                  , SUM(TIME_TO_SEC(rl.time_elapsed)) as seconds_elapsed
            FROM ramses.batch_log rl 
            GROUP BY rl.batch_id
            ) l ON (t.batch_id = l.batch_id)

WHERE 
  mf.job_description                LIKE "%BACKBLAZE%" OR
  mf.customer_name                  LIKE "%BACKBLAZE%" OR
  mf.customer_ref                   LIKE "%BACKBLAZE%" OR
  mf.technical_company_name         LIKE "%BACKBLAZE%" OR
  mf.description                    LIKE "%BACKBLAZE%" OR
  mf.name                           LIKE "%BACKBLAZE%" OR
  mf.enclosure_style                LIKE "%BACKBLAZE%" OR 
  mf.special_instructions           LIKE "%BACKBLAZE%"
GROUP BY j.id WITH ROLLUP
Johan
  • 74,508
  • 24
  • 191
  • 319
2

You need to change the query to:

SELECT
  ...
  SEC_TO_TIME(SUM(TIME_TO_SEC(l.time_elapsed))) as time_elapsed

Also, the line of LIKE '%...' will make the query uber slow, because no indexes on this can be used.

If you are able to use MyISAM, you can use a fulltext index on those columns and use code like:

WHERE MATCH(mf.job_description,mf.customer_name,mf.customer_name,...) 
      AGAINST ('BACKBLAZE' IN NATURAL LANGUAGE MODE)

See:
http://dev.mysql.com/doc/refman/5.5/en/fulltext-search.html
http://www.petefreitag.com/item/477.cfm
http://dev.mysql.com/doc/refman/5.0/en/date-and-time-functions.html#function_time-to-sec

Johan
  • 74,508
  • 24
  • 191
  • 319
  • We're using innodb actually and I tried changing the query and now I'm still getting the wrong number of seconds. – davidahines Aug 31 '11 at 17:39
-1

It sounds like the problem is that multiple tasks can be in the same batch, and/or multiple parts can be in the same task. Say for example that your job has 3 parts, and each part has a task, and all 3 tasks are in the same batch. You'll add the time for that batch three times. But distinct won't work either, since if you have 5 different batches that all took 300 seconds they won't be considered distinct.

In a situation like this, a subquery is usually the way to go. Instead of joining directly with batch_log, you would join with a subquery that selected distinct j.id (or p.job_id), l.batch_id, and l.time_elapsed (the first is for joining, the second just for proper calculation of distinct, and the third the actual value to use). Then you can sum l.time_elapsed from there. This way each batch is counted exactly once.

Anomie
  • 92,546
  • 13
  • 126
  • 145
-1

batches(l) table doesn't have a field called time_elapsed tasks does so it's either

SUM(t.time_elapsed) as time_elapsed

-or-

SUM(l.actual_time) as time_elapsed
Charlie
  • 1,062
  • 6
  • 9