23

I have a number of different objects with a varying number of attributes. Until now I have saved the data in XML files, which easily allow for an ever-changing number of attributes. But I am trying to move it to a database.

What would be your preferred way to store this data?

A few strategies I have identified so far:

  • Having a single field named "attributes" in the objects table and storing the data there serialized or as JSON.
  • Storing the data in two tables (objects, attributes) and using a third to save the relations, making it a true n:m relation. A very clean solution, but possibly very expensive to fetch an entire object and all its attributes.
  • Identifying attributes all objects have in common and creating fields for these in the objects table, storing the remaining attributes as serialized data in another field. This has an advantage over the first strategy: it makes searches easier.

Any ideas?

Jörg
  • What's the motivation for moving to a db solution? You said below that speed is your main concern. Do you think a db solution would be faster than your current XML approach? – Corey Trager Sep 18 '09 at 13:23
  • A fourth strategy came to mind, with DVK proposing the same: storing common attributes in the objects table and storing all others in a second table using a 1:n relation. Seems like the best compromise between speed, flexibility and a clean solution (@Tobiask) – Jörg Sep 18 '09 at 13:28
  • @Corey, no I don't. The XML solution is incredibly fast at the moment and I don't think a DB could keep up with that. It is more of an exercise for me, trying to make the system more flexible in regards to storage options and at the same time improve my MySQL skills. – Jörg Sep 18 '09 at 13:30

8 Answers

31

If you ever plan on searching for specific attributes, it's a bad idea to serialize them into a single column, since you'll have to use per-row functions to get the information out - this rarely scales well.

I would opt for your second choice. Have a list of attributes in an attributes table, the objects in their own table, and a many-to-many relationship table called object_attributes.

For example:

objects:
    object_id    integer
    object_name  varchar(20)
    primary key  (object_id)
attributes:
    attr_id      integer
    attr_name    varchar(20)
    primary key  (attr_id)
object_attributes:
    object_id    integer  references (objects.object_id)
    attr_id      integer  references (attributes.attr_id)
    oa_value     varchar(20)
    primary key (object_id,attr_id)
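
For illustration, a minimal sketch (assuming MySQL and the table/column names above) of fetching one object together with all of its attributes:

SELECT  o.object_name,
        a.attr_name,
        oa.oa_value
FROM    objects o
JOIN    object_attributes oa ON oa.object_id = o.object_id
JOIN    attributes a ON a.attr_id = oa.attr_id
WHERE   o.object_id = 42    -- hypothetical object id

Since object_attributes is keyed on (object_id, attr_id), pulling a single object this way stays a cheap pair of primary-key joins.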

Your concern about performance is noted but, in my experience, it's always more costly to split a column than to combine multiple columns. If it turns out that there are performance problems, it's perfectly acceptable to break 3NF for performance reasons.

In that case I would store it the same way but also have a column with the raw serialized data. Provided you use insert/update triggers to keep the columnar and combined data in sync, you won't have any problems. But you shouldn't worry about that until an actual problem surfaces.

By using those triggers, you limit the extra work to the times when the data actually changes. By trying to extract sub-column information, you do unnecessary work on every select.
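
As an illustration only, a minimal sketch of such a trigger, assuming MySQL 5.7.22+ and a hypothetical attributes_json column added to objects (matching AFTER UPDATE and AFTER DELETE triggers would be needed as well):

DELIMITER //
CREATE TRIGGER object_attributes_ai
AFTER INSERT ON object_attributes
FOR EACH ROW
BEGIN
    -- rebuild the denormalized JSON copy for the affected object only
    UPDATE objects
    SET    attributes_json = (SELECT JSON_OBJECTAGG(a.attr_name, oa.oa_value)
                              FROM   object_attributes oa
                              JOIN   attributes a ON a.attr_id = oa.attr_id
                              WHERE  oa.object_id = NEW.object_id)
    WHERE  object_id = NEW.object_id;
END//
DELIMITER ;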

paxdiablo
  • Exactly my concern with the first strategy. – Jörg Sep 18 '09 at 13:22
  • The question is which is better for performance: your method is EAV; what is your opinion about modelling/storing the data as JSON? – babak faghihian Aug 31 '15 at 06:53
  • @babakfaghihian, I think I cover that in my final two paragraphs, yes? It's okay to break 3NF for performance provided you understand and mitigate the risks (of data elements "disagreeing" with each other). Storing the original data (XML, JSON or whatever) is one approach to this. – paxdiablo Sep 01 '15 at 01:37
  • @paxdiablo where can we save a value for an attribute in this model? – Fadi Dec 23 '18 at 11:12
  • 1
    @fadi, most likely as an extra value column in the `object_attributes` table. I didn't add *all* of the possible columns since my intent was really just to show the relationship fields. I've now added that in for this specific use case. – paxdiablo Dec 23 '18 at 13:22
  • @paxdiablo: what if the attributes are of different types? Will we add one more column "attribute_type", store the attribute value always as varchar(20), and then type-cast in the app layer? – j10 Apr 24 '19 at 08:17
  • 1
    @j10, that's certainly one way to do it, and the schema will support it, with effort from the application (such as your type casting). I really just designed this for string attributes and, if there's going to be lots of types, you may want to consider other approaches. – paxdiablo Apr 24 '19 at 12:37
  • @paxdiablo even though this is fairly old now, I found it similar to a challenge I've been facing, but I couldn't make the exact analogy to my current problem. From what I understand, if the uniqueness is only guaranteed by the dimensions, the objects table would contain duplicate data, which makes me wonder if this is the way to go. I posted a question here (https://stackoverflow.com/questions/72560317/dealing-with-dynamic-dimensions-for-metrics-data-analytics), any inputs are appreciated, thanks! – lowercase00 Jun 13 '22 at 21:14
7

A variation on your 2nd solution is just two tables (assuming all attributes are of a single type):

T1: |Object data columns|Object_id|

T2: |Object id|attribute_name|attribute value| (unique index on first 2 columns)

This is even more efficient when combined with the 3rd solution, i.e. all of the common fields go into T1.

Stuffing more than one attribute into the same blob is not recommended: you cannot filter by attributes, and you cannot update them efficiently.
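
For illustration, a sketch of such a filter, spelling out the layout above with assumed column names (object_id, attribute_name, attribute_value):

SELECT  T1.*
FROM    T1
JOIN    T2 ON T2.object_id = T1.object_id
WHERE   T2.attribute_name  = 'color'
        AND T2.attribute_value = 'blue'

Given the unique index on (object_id, attribute_name) plus, presumably, an index on (attribute_name, attribute_value), this stays an indexed lookup rather than a scan-and-parse over serialized blobs.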

DVK
  • Actually, that's exactly what came to my mind after reading through my three strategies again. Sounds like the best way to go. – Jörg Sep 18 '09 at 13:26
  • 1
    Hi. This is called an Entity-Attribute-Value table, and it is bad design: http://programmers.stackexchange.com/questions/93124/eav-is-it-really-bad-in-all-scenarios – Gabri Botha Sep 13 '16 at 09:04
  • 1
    @GabriBotha - the answers on the linked question in no way support your flat and unnuanced assertion that it's a "bad" design. It's a design with specific flaws - like ALL designs are - and specific situations where it's the best approach. – DVK Sep 13 '16 at 11:26
3

Let me give some concreteness to what DVK was saying.

Assuming the values are all of the same type, the table would look like this (good luck, I feel you're going to need it):

dynamic_attribute_table
------------------------
id         NUMBER
key        VARCHAR
value      SOMETYPE?

example (cars):

|id|    key   |   value   |
---------------------------
| 1|'Make'    |'Ford'     |
| 1|'Model'   |'Edge'     |
| 1|'Color'   |'Blue'     |
| 2|'Make'    |'Chevrolet'|
| 2|'Model'   |'Malibu'   |
| 2|'MaxSpeed'|'110mph'   |

Thus,
entity 1 = { ('Make', 'Ford'), ('Model', 'Edge'), ('Color', 'Blue') }
and,
entity 2 = { ('Make', 'Chevrolet'), ('Model', 'Malibu'), ('MaxSpeed', '110mph') }.
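
If you later need each entity back as a single row, a pivot along these lines works (just a sketch against the table above; the column names are backtick-quoted because KEY is a reserved word in MySQL):

SELECT  id,
        MAX(CASE WHEN `key` = 'Make'  THEN `value` END) AS make,
        MAX(CASE WHEN `key` = 'Model' THEN `value` END) AS model,
        MAX(CASE WHEN `key` = 'Color' THEN `value` END) AS color
FROM    dynamic_attribute_table
GROUP BY id

Every attribute you want as a column has to be listed explicitly (entity 2's 'MaxSpeed' would need its own line), and attributes an entity lacks come back as NULL.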

Chris Schmitz
Travis Heseman
2

If you are using a relational db, then I think you did a good job listing the options. They each have their pros and cons. YOU are in the best position to decide what works best for your circumstances.

The serialized approach is probably the fastest (depending on your code for de-serializing), but it means that you won't be able to query the data with SQL. If you say that you don't need to query the data with SQL, then I agree with @longneck, maybe you should use a key/value style db instead of a relational db.

EDIT - reading more of your comments: WHY are you switching to a db if speed is your main concern? What's BAD about your current XML implementation?

Corey Trager
  • I guess that the `current XML implementation` doesn't cope with a growing amount of data. XML files can accumulate up to a few GB in size for a large number of objects/entities, and XML isn't a structure that can be searched in an optimized way. So it's either a custom in-memory DB with indexes, or abuse of an existing relational database. – kravemir Aug 11 '19 at 11:18
2

I used to implement this scheme:

t_class (id RAW(16), parent RAW(16)) -- holds the class hierarchy.
t_property (class RAW(16), property VARCHAR) -- holds class members.
t_declaration (id RAW(16), class RAW(16)) -- holds GUIDs and types of all class instances
t_instance (id RAW(16), class RAW(16), property VARCHAR2(100), textvalue VARCHAR2(200), intvalue INT, doublevalue DOUBLE, datevalue DATE) -- holds 'common' properties

t_class1 (id RAW(16), amount DOUBLE, source RAW(16), destination RAW(16)) -- holds 'fast' properties for class1.
t_class2 (id RAW(16), comment VARCHAR2(200)) -- holds 'fast' properties for class2
--- etc.

RAW(16) is the type Oracle uses to hold GUIDs.

If you want to select all properties for an object, you issue:

SELECT  i.*
FROM    (
        SELECT  id 
        FROM    t_class
        START WITH
                id = (SELECT class FROM t_declaration WHERE id = :object_id)
        CONNECT BY
                parent = PRIOR id
        ) c
JOIN    t_property p
ON      p.class = c.id
LEFT JOIN
        t_instance i
ON      i.id = :object_id
        AND i.class = p.class
        AND i.property = p.property

t_property holds stuff you normally don't search on (like text descriptions, etc.)

Fast properties are in fact normal tables you have in the database, to make the queries efficient. They hold values only for the instances of a certain class or its descendants. This is to avoid extra joins.

You don't have to use the fast tables; you can keep all your data within the four basic tables.
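
For example (a sketch using the t_class1 columns above; the bind variables are placeholders), a search on class-specific properties stays an ordinary indexed query instead of several self-joins against t_instance:

-- find class1 instances by two of their 'fast' properties;
-- a composite index on (source, destination) is what makes this cheap
SELECT  id
FROM    t_class1
WHERE   source = :src
        AND destination = :dst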

Quassnoi
  • 1
    Wow, that's even one step further. But wouldn't that lead to a large number of tables if you have one for each class type? And your last SQL statement really makes me hope the MySQL book I ordered will arrive sooner rather than later.. – Jörg Sep 18 '09 at 13:25
  • 1
    `@Jörg`: this was in `Oracle` and this is `Oracle` syntax. In `MySQL`, you'll need to implement this in a slightly different way: http://explainextended.com/2009/03/17/hierarchical-queries-in-mysql/ – Quassnoi Sep 18 '09 at 13:28
  • You need to create tables only for "fast properties", i.e. when you need a composite index on two or more properties. Otherwise, you can get by with just the `4` basic tables. – Quassnoi Sep 18 '09 at 13:29
1

Sounds like you need something like CouchDB, not an RDBMS.

longneck
  • That sounds like an ideal solution. Unfortunately I am mostly dealing with scenarios where I don't have the possibility to use much beyond MySQL, let alone install another DB on the server. – Jörg Sep 18 '09 at 13:17
1

If you are going to edit/manipulate/delete the attributes at a later point, the true n:m relation (second option) is the one I would go for. (Or make it two tables where the attribute names repeat, but the data size will be higher.)

If you are not manipulating the attributes (just capturing and showing the data), then you can go ahead and store them in one field with some separator (make sure the separator won't occur in the attribute values).

Umesh
1

I am assuming you do not have a digital attribute soup, but that there is some order to your data.

Otherwise, an RDBMS might not be the best fit. Something along the lines of NoSQL might work better.

If your objects are of different types, you should generally have one table per type.

Especially if you want to connect them using primary keys. It also helps to bring order and sanity if you have Products, Orders, Customers, etc. tables, instead of just an Object and an Attribute table.

Then look at your attributes. Anything that exists for more than, say, 50% of the objects in that type category, make it a column in the object's table and use NULL when it's not being used.

Anything that is mandatory, should, of course, be defined as a NOT NULL column.

The rest, you can either have one or several "extra attributes" tables for.

You could put the attribute names into the table with the values, or normalize them out in a separate table and only use the primary key in the value table.

You may also find that you have combinations of data. For instance, a variant of an object type always has a certain set of attributes while another variant of the same object type has another set of attributes.

In that case, you might want to do something like:

MainObjectTable:
  mainObjectId: PRIMARY KEY
  columns...
MainObjectVariant1Table:
  mainObjectId: FOREIGN KEY TO MainObjectTable
  variant1Columns...
MainObjectVariant2Table:
  mainObjectId: FOREIGN KEY TO MainObjectTable
  variant2Columns...
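
Loading one object together with whichever variant row it has could then look like this (a sketch against the layout above, with a hypothetical id; at most one of the outer joins will match):

SELECT  m.*, v1.*, v2.*
FROM    MainObjectTable m
LEFT JOIN MainObjectVariant1Table v1 ON v1.mainObjectId = m.mainObjectId
LEFT JOIN MainObjectVariant2Table v2 ON v2.mainObjectId = m.mainObjectId
WHERE   m.mainObjectId = 42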

I think the hard work that will pay off in the long run is to analyze the data, find the objects and the commonly used attributes, and turn it into a good "object/ERD/DB" model.

Erk