3

I have several m x n matrices of gene expression data that I want to store in MySQL.

m is approx 30,000 genes (uniquely identifiable)
n is approx 3,000 samples (mostly uniquely identifiable)

I'm not sure what the best way is to store these data. I initially read the matrices directly into MySQL tables, but I have since been told that this is not a great way to do things, since the number of columns (samples) is a variable quantity. I cannot transpose the matrices and store them that way because there are more genes than MySQL allows for when creating columns.

I've since been told that 'junction tables' might represent a better way to do this. After watching several YouTube videos on these, however, I'm none the wiser. I've also searched Google and there doesn't seem to be a tutorial on storing gene expression data in MySQL using junction tables. So, does anyone have any advice on how best to store these data? I honestly expected that there would be a huge literature on this, so if you have useful links that would also be much appreciated.

  • 2
    I guess mysql is not ready for this things, unless you find or implement some custom storage engine. Did you look at github? something like this, maybe - https://github.com/phenotips/phenotips . Also I hope you plan to implement some genetical totaliaristic new world order, so we can fight against your reign of terror – strangeqargo May 03 '16 at 13:30
  • 1
    Medical information is usually stored in EAV data model when using relational databases. That model basically "swaps" columns for rows. This model is extremely difficult to use for regular relational operations (joins, searches, etc) but it's extremely efficient for storing lots of data with unknown number of attributes (columns). Googling for "EAV MySQL" will yield sufficient info. – Mjh May 03 '16 at 13:40
  • @Mjh thanks, but this doesn't sound quite right. According to the wikipedia article, this EAV model is best for sparse matrices, but the matrices I have are dense (indeed, they have no empty values). Moreover, I can't make the genes the columns, since there are too many of them for MySQL's self-imposed limits. –  May 03 '16 at 13:51
  • 1
    @strangeqargo I think you may be right. I will check out this resource though, I didn't actually think to check github. This is the new world order v1.0, but it might not even get that far if I can't store all the data. At the moment it's all in excel :o –  May 03 '16 at 13:54
  • 1
    Well yes, you can't make the genes as columns, but you can make them as rows. This is where EAV kicks in, instead of adding a column to a table, you add a row assigned to an entity. If you have an entity called `Gene 001` and you require an additional "column", you'd add a row in `attributes` table, reference it with `gene_id` and enter its description + value. This is how you achieve the effect of adding attributes using relational database. Accepted answer actually uses this schema, by connecting genes to samples. – Mjh May 03 '16 at 14:02

1 Answers1

3

You need just a few tables for this, I am using mysql syntax:

CREATE TABLE genes (
`gene_id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
`gene_name` varchar(99) not null
)ENGINE=InnoDB;

CREATE TABLE samples (
`sample_id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
`sample_name` varchar(99) not null
)ENGINE=InnoDB;

CREATE TABLE gene_sample (
`gene_id` INT NOT NULL,
`sample_id` INT NOT NULL,
FOREIGN KEY(`gene_id`) REFERENCES genes (`gene_id`),
FOREIGN KEY(`sample_id`) REFERENCES sample (`sample_id`),
)ENGINE=InnoDB;

For every gene that occurs in a sample, insert the pair of gene_id and sample_id into the gene_sample table.

Use two JOIN expressions in a SELECT to reconstruct the full data:

SELECT genes.*, samples.* 
FROM gene_sample 
LEFT JOIN genes USING (gene_id) 
LEFT JOIN samples USING (sample_id);
Adder
  • 5,708
  • 1
  • 28
  • 56
  • thanks I think I understand now! So the gene_sample table has 3 columns, but m x n rows, corresponding to each unique gene-sample pair? Ingenious. –  May 03 '16 at 13:57
  • You can do the gene_sample table with 2 or 3 columns, the 3 column variant having an id that you can use to delete it. You can also attach the value in the table as a 3rd column. In the code above I was just assuming you have just 1s and 0s in the table, making a value field unnecessary. – Adder May 03 '16 at 14:05
  • A quick follow up, I note that the join expressions recreate the full data, but not in the original m x n form. Is it possible to do this? –  May 03 '16 at 14:58
  • 1
    Yes with a programming language, and some performance killing ordering: (ORDER BY sample_id, gene_id) – Adder May 03 '16 at 15:00