-2

I've got a PHP+MySQL website that shows data for 300,000 products: name, description, how to fix it, where to buy it, etc.

Initially, the MySQL table is designed in the following way:

- product_id: MEDIUMINT
- category_id: SMALLINT
- weight_kg: SMALLINT
- height_cm: SMALLINT
- width_cm: SMALLINT
- depth_cm: SMALLINT
- name: VARCHAR(100)
- label: VARCHAR(100)
- short_description: VARCHAR(200)
- long_description: VARCHAR(9999)
- how_to_fix_it: VARCHAR(9999)
- where_to_buy_it: VARCHAR(9999)
- similar_products: VARCHAR(9999) <--- it stores "ACME 12|acme-12#ACME 555|acme-555" to link directly to other products, avoiding subsequent queries to find the names and labels of the similar products.

I've found that the size of the table is huge, mainly due to the text fields (VARCHAR(9999)), and this can slow down queries and website speed. I think these VARCHAR(9999) fields do not need to be inside the table, since I do not perform SQL operations on them. I only query them when displaying information for a single product.

I'm considering creating 300,000 text files (product_id.txt) to be read by PHP, each one storing the text information (one field per line):

long_description: blah blah
how_to_fix_it: blah blah
where_to_buy_it: blah blah
similar_products: blah blah

Each time I display information for a single product, PHP would read the text file from disk, and I think this would be faster.

I would appreciate any similar experience dealing with these issues. Is there any smart way to store the 300,000 files? I was thinking about 300 subdirectories ('000', '001', ..., '299'), with 1,000 files in each one, to make reads faster.

Hookstark
  • Things do not become "slow" due to sheer database size alone; this is often much more a question of whether your data is properly _indexed_ to begin with. – CBroe Feb 24 '23 at 12:35
  • Storing comma- (or in this case pipe-) separated data was never a good idea to begin with; this should have been properly normalized in the first place. – CBroe Feb 24 '23 at 12:36
  • For `similar_products`, it's better to use a relation table. – Simone Rossaini Feb 24 '23 at 12:37
  • As you haven't shared a single line of code (neither PHP nor MySQL), it's impossible to provide help. You could start by sharing one of the slow queries and the corresponding execution plan. – Nico Haase Feb 24 '23 at 13:11
  • The main points have already been made, so let me point you in the direction of caching. Make a sane database model first, no delimited data. Then make your queries and data model using that. If you run into speed issues, don't change your storage, come up with a caching strategy. Symfony has an awesome [caching component](https://symfony.com/doc/current/components/cache.html) that you can use without the full framework (we even use it on WordPress sites). You can cache to Redis, a database, or disk as a generally worst-case scenario, and it will take care of organization for you. – Chris Haas Feb 24 '23 at 14:13
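
As the comments above suggest, the pipe-delimited `similar_products` column would normally be replaced by a separate relation table. A minimal sketch of what that could look like (the `similar_product` table and the `product` table name are illustrative, not from the original post):

```sql
-- One row per "product A is similar to product B" pair (illustrative schema)
CREATE TABLE similar_product (
    product_id         MEDIUMINT UNSIGNED NOT NULL,
    similar_product_id MEDIUMINT UNSIGNED NOT NULL,
    PRIMARY KEY (product_id, similar_product_id),
    KEY (similar_product_id)
) ENGINE=InnoDB;

-- Names and labels of the products similar to product 123, fetched in one query
SELECT p.name, p.label
FROM similar_product s
JOIN product p ON p.product_id = s.similar_product_id
WHERE s.product_id = 123;
```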

2 Answers

4

I've found that the size of the table is huge, mainly due to the text fields (VARCHAR(9999))

Unless you are padding those strings out to 9999 characters, your VARCHAR(9999) does not use more space to store the 33-character string 'ACME 12|acme-12#ACME 555|acme-555' than a VARCHAR(256) would use. That's the VAR part: it stores a variable-length string in a variable amount of space, but no more space than needed. The 9999 only allows longer strings; it doesn't use that much space for every short string.

You can verify the average row length using `SHOW TABLE STATUS LIKE 'mytable'\G` in the MySQL client. That reports an estimate of the row count and the average row size, based on a sample of the pages in your table. It's approximate, but it's usually pretty close. I predict it will show that your rows are not as large as you think they are.
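
If you prefer a query you can run from PHP rather than the `\G` client command, the same estimates are exposed in `information_schema` (a sketch; the table name `products` is an assumption):

```sql
-- Approximate row count and average row length (the same InnoDB estimates
-- that SHOW TABLE STATUS reports)
SELECT table_rows, avg_row_length, data_length, index_length
FROM information_schema.tables
WHERE table_schema = DATABASE()
  AND table_name = 'products';
```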

I would PHP-read the text file from disk, and I think it would be faster.

That solution will not be faster for what I would expect your typical queries are. Reading files from disk is slower than reading from RAM by several orders of magnitude, and MySQL caches pages of your database in RAM as much as possible. It's pretty good at this, and represents many years of fine-tuning by hundreds of engineers. I doubt you can do better in your lifetime (I'm not insulting you, I don't think any single developer can do that much work alone).

That said, it might be faster for certain specific uses of the data (for example counting products). You haven't described any queries, so we can't guess how you will use the data.

Any optimization strategy depends on which queries you will be executing, because every optimization strategy optimizes for one type of query at the expense of making others less optimal.

By all means, it's a good learning experience to implement your own database, and make it as reliable and flexible as an RDBMS. You will learn a lot.

Is there any smart way to store the 300,000 files?

Yes — in a database! :-)

I was thinking about 300 subdirectories ('000', '001', ..., '299') to store 1,000 files in each one, to make the read faster.

It depends on the filesystem. Some have a big performance problem with hundreds of thousands of files. I did once implement a system to collate thousands of files into a hierarchy of subdirectories ... that was in 1992, but I dare say filesystems have advanced since then! Modern filesystems should be able to handle much greater numbers of files.

It's not necessarily a win to separate your data into 300,000 files. What if you need to query the average size of your products? You'd have to open 300,000 files and read all of them. Have you measured the overhead of opening that many file descriptors on your operating system? Would it not be better to store them all in a single file if that's a query you need to execute, and opening that many files is too time-consuming?
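
For instance, with the data in the table, that kind of aggregate is a single statement; with one file per product it means opening and reading 300,000 files. A sketch, using the column names from the question and assuming the table is called `products`:

```sql
-- Average product size straight from the table, no files to open
SELECT AVG(weight_kg) AS avg_weight_kg,
       AVG(height_cm) AS avg_height_cm,
       AVG(width_cm)  AS avg_width_cm,
       AVG(depth_cm)  AS avg_depth_cm
FROM products;
```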

Can you even open that many files in a single process? For example in Linux this is limited by the ulimit setting on your operating system. What is your ulimit value?

This is what I mean by the optimization depends on which queries you need.

Bill Karwin
2

Look at it this way. Fetching stuff from disk involves these steps:

  1. Wander through the directories to get to the desired directory.
  2. Read it.
  3. Wander through the directory to find the individual file.
  4. Read it.
  5. Format it and send it to the user.

For a file on disk, steps 1, 2, 3, 4, and 5 are needed. Steps 1 and 2 might be cached by the OS.

For a row in a database table, the table is already open, so steps 1 and 2 are already done (except when you first start the system). Steps 3 and 4 become "drill down the BTree to the desired row". Step 3 is likely to be cached.

Using a database is a slam-dunk winner.
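
In concrete terms, the per-product fetch described above is a single indexed lookup against a table MySQL already has open. A sketch, using the question's column names and an assumed `products` table:

```sql
-- One BTree descent on the primary key; no directory walking, no file opens
SELECT name, label, long_description, how_to_fix_it, where_to_buy_it
FROM products
WHERE product_id = 12345;
```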

If you are having performance problems, we really need to see the queries that are so slow. Perhaps a FULLTEXT index would be very beneficial.
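
A FULLTEXT index would look something like the sketch below; whether it actually helps depends on the real slow queries. The table name and search term are illustrative:

```sql
-- Index the long text columns for word search (InnoDB supports FULLTEXT since MySQL 5.6)
ALTER TABLE products
  ADD FULLTEXT INDEX ft_product_text (long_description, how_to_fix_it);

-- Search with MATCH ... AGAINST instead of LIKE '%...%'
SELECT product_id, name
FROM products
WHERE MATCH(long_description, how_to_fix_it) AGAINST ('water pump' IN NATURAL LANGUAGE MODE);
```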

Rick James