
The problem:

I want to move the category links from the companies_1 table into the company_categories table. The company_id in the company_categories table needs to equal the id in the companies_2 table. The records of the companies_1 and companies_2 tables are linked by the "name" column.

  • The current code below ran for more than a night and still had not finished. I want to learn to be more efficient and speed this process up. I suspect there is a lot to optimize, because there are A LOT of company records.
  • Another issue was that I found no way to check how far my query had got while looping, so I could not monitor progress. Because it took so long, I killed the query, and I'm now looking for a better way to solve this.
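
One way to watch a long-running loop, sketched under assumptions: the procedure writes its counter to a small progress table that a second connection can read. The names job_progress and progress_demo are hypothetical, and this assumes autocommit is enabled (the MySQL default) so each write becomes visible to other sessions as the loop runs.

```sql
-- Hypothetical progress table; name and columns are illustrative only.
CREATE TABLE IF NOT EXISTS job_progress (
  job_name   VARCHAR(64) PRIMARY KEY,
  rows_done  INT NOT NULL,
  updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);

DELIMITER $$
DROP PROCEDURE IF EXISTS progress_demo$$
CREATE PROCEDURE progress_demo(IN total INT)
BEGIN
  DECLARE i INT DEFAULT 0;
  WHILE i < total DO
    -- ... the real per-row work would go here ...
    SET i = i + 1;
    -- Record how far the loop has got (one row per job, overwritten each time).
    INSERT INTO job_progress (job_name, rows_done)
    VALUES ('progress_demo', i)
    ON DUPLICATE KEY UPDATE rows_done = i;
  END WHILE;
END$$
DELIMITER ;

CALL progress_demo(1000);

-- From a second connection, while the procedure is still running:
SELECT rows_done, updated_at FROM job_progress WHERE job_name = 'progress_demo';
```

Writing progress on every iteration adds overhead of its own, so on a large table it is probably better to update it only every few thousand rows.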

The information:

There is a table with companies like:

----------------------------------------
| companies_1                          |
----------------------------------------
| id   |  category_id   | name         |
----------------------------------------
| 1    |  1             | example-1    |
| 2    |  2             | example-1    |
| 3    |  1             | example-2    |
| 4    |  2             | example-2    |
| 5    |  3             | example-2    |
| 6    |  1             | example-3    |
----------------------------------------

A table with the DISTINCT company names:

-------------------------
| companies_2           |
-------------------------
| id   |   name         |
-------------------------
| 1    |   example-1    |
| 2    |   example-2    |
| 3    |   example-3    |
-------------------------

A categories table, like:

-------------------------
| categories            |
-------------------------
| id   |  name          |
-------------------------

And a junction table, like:

---------------------------------
| company_categories            |
---------------------------------
| company_id   |  category_id   |
---------------------------------

The current code:

This code works, but is far from efficient.

DELIMITER $$
DROP PROCEDURE IF EXISTS fill_junc_table$$
CREATE PROCEDURE fill_junc_table()
BEGIN
  -- company_old / companies appear to be the real names of companies_1 / companies_2;
  -- href / site_href play the role of the "name" column.
  DECLARE r INT DEFAULT 0;                -- rows inserted so far
  DECLARE i INT DEFAULT 0;                -- offset into companies
  DECLARE i2 INT;                         -- offset into the matching company_old rows
  DECLARE loop_length INT;
  DECLARE company_old_len INT DEFAULT 0;
  DECLARE _href VARCHAR(255);
  DECLARE cat_id INT;
  DECLARE comp_id INT;

  SELECT COUNT(*) INTO loop_length FROM companies;

  WHILE i < loop_length DO
    -- Take the i-th distinct company (not the i-th row of company_old, where hrefs repeat).
    SELECT id, site_href INTO comp_id, _href FROM companies ORDER BY id LIMIT i, 1;
    SELECT COUNT(*) INTO company_old_len FROM company_old WHERE href = _href;
    SET i2 = 0;
    WHILE i2 < company_old_len DO
      SELECT category_id INTO cat_id FROM company_old WHERE href = _href ORDER BY id LIMIT i2, 1;
      INSERT INTO company_categories (company_id, category_id) VALUES (comp_id, cat_id);
      SET r = r + 1;
      SET i2 = i2 + 1;
    END WHILE;
    SET i = i + 1;
  END WHILE;

  SELECT r;
END$$
DELIMITER ;

CALL fill_junc_table();

Edit (new idea):

I am going to test another way to solve this problem by fully copying the companies_1 table with the following columns (company_id empty on copy):

---------------------------------------------
| company_id   | category_id  |  name       |
---------------------------------------------

Then, I will loop through the companies_2 table to fill the correct company_id related to the name-column.

I hope you can give your thoughts about this. When I finish my test I will leave the result over here for others.
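
A rough sketch of this idea, assuming the sample data from the tables above and an empty scratch schema. The staging table name companies_1_copy is hypothetical, and the "loop" through companies_2 is replaced here by a single UPDATE with a JOIN, which is a swapped-in technique rather than the loop described above.

```sql
-- Sample tables and data, copied from the question.
CREATE TABLE companies_1 (id INT PRIMARY KEY, category_id INT NOT NULL, name VARCHAR(255) NOT NULL);
CREATE TABLE companies_2 (id INT PRIMARY KEY, name VARCHAR(255) NOT NULL);

INSERT INTO companies_1 (id, category_id, name) VALUES
  (1, 1, 'example-1'), (2, 2, 'example-1'),
  (3, 1, 'example-2'), (4, 2, 'example-2'), (5, 3, 'example-2'),
  (6, 1, 'example-3');
INSERT INTO companies_2 (id, name) VALUES
  (1, 'example-1'), (2, 'example-2'), (3, 'example-3');

-- Hypothetical staging copy; company_id starts out empty (NULL).
CREATE TABLE companies_1_copy (
  company_id  INT NULL,
  category_id INT NOT NULL,
  name        VARCHAR(255) NOT NULL
);
INSERT INTO companies_1_copy (category_id, name)
SELECT category_id, name FROM companies_1;

-- Fill company_id for every row in one set-based statement.
UPDATE companies_1_copy c
JOIN companies_2 c2 ON c2.name = c.name
SET c.company_id = c2.id;

-- Rows whose name has no match in companies_2 would still be NULL here.
SELECT company_id, category_id, name FROM companies_1_copy ORDER BY company_id, category_id;
```

Because companies_1 itself is never modified, the original data stays available as a backup.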

Justin La France

2 Answers


To clarify, I don't see any PIVOT transformation in company_categories. What I see is that you want a JUNCTION TABLE, because the companies and categories tables have a many-to-many relationship.

In your case, a company can have multiple categories, and a category can be assigned to multiple companies.

Now, based on your requirement:

I want to move the category links from the companies_1 table into the company_categories table. The company_id in the company_categories table needs to equal the id in the companies_2 table. The records of the companies_1 and companies_2 tables are linked by the "name" column.

I arrived at this query:

INSERT INTO company_categories (company_id, category_id)
SELECT C2.id
     , C1.category_id
FROM companies_1 C1
INNER JOIN companies_2 C2 ON C2.name = C1.name;

Let me know if this works. The nested loops that you created will really take a while.

As @DanielE pointed out, this query works on the assumption that company_categories is empty. Otherwise we will need to use UPDATE.
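
If company_categories might not be empty, one possible alternative (a sketch, not part of the query above) is to let a composite primary key reject repeats and use INSERT IGNORE. The index names below are hypothetical, the primary key on the junction table is an assumption about the schema, and the sketch assumes an empty scratch schema loaded with the sample data from the question. Indexing the name columns is included because the join compares them for every row.

```sql
CREATE TABLE companies_1 (id INT PRIMARY KEY, category_id INT NOT NULL, name VARCHAR(255) NOT NULL);
CREATE TABLE companies_2 (id INT PRIMARY KEY, name VARCHAR(255) NOT NULL);
-- Assumption: each (company_id, category_id) pair should appear at most once.
CREATE TABLE company_categories (
  company_id  INT NOT NULL,
  category_id INT NOT NULL,
  PRIMARY KEY (company_id, category_id)
);

INSERT INTO companies_1 (id, category_id, name) VALUES
  (1, 1, 'example-1'), (2, 2, 'example-1'),
  (3, 1, 'example-2'), (4, 2, 'example-2'), (5, 3, 'example-2'),
  (6, 1, 'example-3');
INSERT INTO companies_2 (id, name) VALUES
  (1, 'example-1'), (2, 'example-2'), (3, 'example-3');

-- Hypothetical index names; these let the join look names up instead of scanning.
CREATE INDEX idx_companies_1_name ON companies_1 (name);
CREATE UNIQUE INDEX idx_companies_2_name ON companies_2 (name);

-- Pairs that already exist are skipped instead of raising a duplicate-key error.
INSERT IGNORE INTO company_categories (company_id, category_id)
SELECT c2.id, c1.category_id
FROM companies_1 c1
INNER JOIN companies_2 c2 ON c2.name = c1.name;

-- Running the same INSERT IGNORE a second time adds no further rows.
SELECT COUNT(*) FROM company_categories;
```

Note that INSERT IGNORE also downgrades some other errors to warnings, so it is worth checking SHOW WARNINGS after a large run.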

KaeL
  • There is a risk of duplicates here, no? – Daniel E. Apr 04 '18 at 07:46
  • Which part can have duplicates? I am assuming that `companies_1` is the main table. :) – KaeL Apr 04 '18 at 07:48
  • company_categories ;) – Daniel E. Apr 04 '18 at 07:48
  • @DanielE. I mean which column will be duplicated? – KaeL Apr 04 '18 at 07:52
  • As you are doing an insert, if there are already some rows in company_categories you may duplicate rows. I would rather use an update to avoid that – Daniel E. Apr 04 '18 at 07:54
  • I see. Let's truncate the `company_categories` then HAHA! – KaeL Apr 04 '18 at 07:55
  • Your approach is really clean! Glad to learn. My thoughts were way too difficult with the loops and the huge number of queries. – Justin La France Apr 04 '18 at 14:39
  • @JustDevelop thanks! Just keep on exploring and you'll learn a lot! If my answer solved your problem, you can accept it ;) – KaeL Apr 05 '18 at 01:49
  • Actually, I chunked the data and placed it in many temporary tables to optimize performance, since I'm dealing with millions of records. I used partial code from @Nick since his code was easier to chunk (the select statement felt/is easier to limit). I tried your answer on a non-indexed database and it took over 24 hours to complete LOL. Nonetheless your answer was amazing, and I will accept it since it is the cleanest way to solve the problem I stated. – Justin La France Apr 06 '18 at 13:52
  • I see, there are many factors affecting the execution plan of each query. Your schema design, normalization and indexes are just a few :) Thanks for accepting! – KaeL Apr 07 '18 at 08:33
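
The chunking mentioned in the comments is not shown in the thread. A minimal sketch of one way to do it, assuming companies_1.id is an integer key and an empty scratch schema with the sample data from the question, is to run the same INSERT ... SELECT over successive id ranges. The procedure name fill_in_chunks and its chunk_size parameter are hypothetical.

```sql
CREATE TABLE companies_1 (id INT PRIMARY KEY, category_id INT NOT NULL, name VARCHAR(255) NOT NULL);
CREATE TABLE companies_2 (id INT PRIMARY KEY, name VARCHAR(255) NOT NULL);
CREATE TABLE company_categories (company_id INT NOT NULL, category_id INT NOT NULL);

INSERT INTO companies_1 (id, category_id, name) VALUES
  (1, 1, 'example-1'), (2, 2, 'example-1'),
  (3, 1, 'example-2'), (4, 2, 'example-2'), (5, 3, 'example-2'),
  (6, 1, 'example-3');
INSERT INTO companies_2 (id, name) VALUES
  (1, 'example-1'), (2, 'example-2'), (3, 'example-3');

DELIMITER $$
DROP PROCEDURE IF EXISTS fill_in_chunks$$
CREATE PROCEDURE fill_in_chunks(IN chunk_size INT)
BEGIN
  DECLARE last_id INT DEFAULT 0;   -- upper bound of the range already processed
  DECLARE max_id INT;

  SELECT COALESCE(MAX(id), 0) INTO max_id FROM companies_1;

  WHILE last_id < max_id DO
    -- One set-based insert per id range (last_id, last_id + chunk_size].
    INSERT INTO company_categories (company_id, category_id)
    SELECT c2.id, c1.category_id
    FROM companies_1 c1
    INNER JOIN companies_2 c2 ON c2.name = c1.name
    WHERE c1.id > last_id AND c1.id <= last_id + chunk_size;

    SET last_id = last_id + chunk_size;
  END WHILE;
END$$
DELIMITER ;

CALL fill_in_chunks(2);
SELECT company_id, category_id FROM company_categories ORDER BY company_id, category_id;
```

Each chunk is a separate statement, so a killed run loses at most one chunk of work, and last_id gives a natural progress marker.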

Why not just update companies_1?

ALTER TABLE companies_1 ADD (company_id INT);
UPDATE companies_1 SET company_id = (SELECT id FROM companies_2 WHERE name = companies_1.name);
ALTER TABLE companies_1 DROP name, RENAME TO company_categories;
SELECT * FROM `company_categories`;

Output

id  category_id company_id  
1   1           1
2   2           1
3   1           2
4   2           2
5   3           2
6   1           3
Nick
  • Great solution, but the downside is that the old data is gone (not too bad, but since I was experimenting with loads of data I prefer a backup). Nonetheless I used parts of your code to accomplish my goal. Thank you for your great help! – Justin La France Apr 06 '18 at 13:54
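
The backup concern raised in the comment could be handled by copying the table before altering it. This is only a sketch: companies_1_backup is a hypothetical name, and it assumes an empty scratch schema (in particular, no existing company_categories table) with the sample data from the question.

```sql
CREATE TABLE companies_1 (id INT PRIMARY KEY, category_id INT NOT NULL, name VARCHAR(255) NOT NULL);
CREATE TABLE companies_2 (id INT PRIMARY KEY, name VARCHAR(255) NOT NULL);

INSERT INTO companies_1 (id, category_id, name) VALUES
  (1, 1, 'example-1'), (2, 2, 'example-1'),
  (3, 1, 'example-2'), (4, 2, 'example-2'), (5, 3, 'example-2'),
  (6, 1, 'example-3');
INSERT INTO companies_2 (id, name) VALUES
  (1, 'example-1'), (2, 'example-2'), (3, 'example-3');

-- Keep an untouched copy of the original rows (same columns and indexes).
CREATE TABLE companies_1_backup LIKE companies_1;
INSERT INTO companies_1_backup SELECT * FROM companies_1;

-- Then apply the in-place transformation from the answer above.
ALTER TABLE companies_1 ADD (company_id INT);
UPDATE companies_1 SET company_id = (SELECT id FROM companies_2 WHERE name = companies_1.name);
ALTER TABLE companies_1 DROP name, RENAME TO company_categories;

SELECT * FROM company_categories;
SELECT * FROM companies_1_backup;
```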