0

I am accessing a chado structured mysql database. I search by the gene product, for this example the product is "bifunctional GDP-fucose synthetase: GDP-4-dehydro-6-deoxy-D-mannose epimerase and GDP-4-dehydro-6-L-deoxygalactose reductase".

I can then use JOIN statements to find on what assembly this gene is located and what its coordinates are. The below SQL statement is valid and will return the assembly's sequence (not just the gene's sequence), and the start and stop positions for the gene of interest on the assembly.

SELECT f.uniquename AS protein_accession, product.value AS protein_name, srcfeature.residues AS residue_sequence, srcassembly.name AS source_type, location.fmin AS location_min, location.fmax AS location_max, location.strand
FROM feature f
JOIN cvterm polypeptide ON f.type_id=polypeptide.cvterm_id
JOIN featureprop product ON f.feature_id=product.feature_id
JOIN cvterm productprop ON product.type_id=productprop.cvterm_id
JOIN featureloc location ON f.feature_id=location.feature_id
JOIN feature srcfeature ON location.srcfeature_id=srcfeature.feature_id
JOIN cvterm srcassembly ON srcfeature.type_id=srcassembly.cvterm_id
WHERE polypeptide.name = 'polypeptide'
AND productprop.name = 'gene_product_name'
AND product.value LIKE '%bifunctional GDP-fucose synthetase: GDP-4-dehydro-6-deoxy-D-mannose epimerase and GDP-4-dehydro-6-L-deoxygalactose reductase%';

The assembly sequence is very very long and I definitely don't need all of it. Is it better to extract the part that I need using a MySQL's SUBSTRING method to save retrieving the whole sequence, or to use a programming language's substring method after retrieval? The below query is my attempt at the SUBSTRING method using values obtained during the query for position and length. It does not work, my guess is that it needs multiple SELECT statements to work. The SQL is getting really ugly, and I'm not even sure a working end result would be better.

What are your thoughts, is it better to do this with SQL SUBSTRING, or just use a programming language and a substring method to display what I want even though I have retrieved the whole thing?

SELECT f.uniquename AS protein_accession, product.value AS protein_name, SUBSTRING(srcfeature.residues AS residue_sequence, location_min, location_max - location_min), srcassembly.name AS source_type, location.fmin AS location_min, location.fmax AS location_max, location.strand
FROM feature f
JOIN cvterm polypeptide ON f.type_id=polypeptide.cvterm_id
JOIN featureprop product ON f.feature_id=product.feature_id
JOIN cvterm productprop ON product.type_id=productprop.cvterm_id
JOIN featureloc location ON f.feature_id=location.feature_id
JOIN feature srcfeature ON location.srcfeature_id=srcfeature.feature_id
JOIN cvterm srcassembly ON srcfeature.type_id=srcassembly.cvterm_id
WHERE polypeptide.name = 'polypeptide'
AND productprop.name = 'gene_product_name'
AND product.value LIKE '%bifunctional GDP-fucose synthetase: GDP-4-dehydro-6-deoxy-D-mannose epimerase and GDP-4-dehydro-6-L-deoxygalactose reductase%';

EDIT Here's an example result for a different gene (shorter name). I have omitted the portion on the queried sequence as that part is thousands of characters long. I would have to use the values of location_min and location_max shown here to SUBSTRING properly.

+-------------------+---------------------------------------------------+-------------+--------------+--------------+--------+
| protein_accession | protein_name                                      | source_type | location_min | location_max | strand |
+-------------------+---------------------------------------------------+-------------+--------------+--------------+--------+
| ECDH10B_0026      | bifunctional riboflavin kinase and FAD synthetase | assembly    |        21406 |        22348 |      1 |
+-------------------+---------------------------------------------------+-------------+--------------+--------------+--------+
sage88
  • 4,104
  • 4
  • 31
  • 41

2 Answers2

1

Your as was in the wrong place. It needs to go after the closing paren for the substring():

SELECT f.uniquename AS protein_accession, product.value AS protein_name,
       SUBSTRING(srcfeature.residues, location_min, location_max - location_min)  AS residue_sequence,
       srcassembly.name AS source_type, location.fmin AS location_min, location.fmax AS location_max, location.strand
FROM feature f
JOIN cvterm polypeptide ON f.type_id=polypeptide.cvterm_id
JOIN featureprop product ON f.feature_id=product.feature_id
JOIN cvterm productprop ON product.type_id=productprop.cvterm_id
JOIN featureloc location ON f.feature_id=location.feature_id
JOIN feature srcfeature ON location.srcfeature_id=srcfeature.feature_id
JOIN cvterm srcassembly ON srcfeature.type_id=srcassembly.cvterm_id
WHERE polypeptide.name = 'polypeptide'
AND productprop.name = 'gene_product_name'
AND product.value LIKE '%bifunctional GDP-fucose synthetase: GDP-4-dehydro-6-deoxy-D-mannose epimerase and GDP-4-dehydro-6-L-deoxygalactose reductase%';

As for your other question, I think it makes much more sense to extract the data you want in the query, rather than passing back unnecessary data to the application. This saves on communication overhead. Plus, the database has the opportunity to run in parallel, if it is using multiple threads/processors.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • That change makes sense. Unfortunately the query still doesn't work. I think because location_min and location_max are values that are generated from that query. This is why I figured it would probably need multiple SELECT statements. One to get the values of location_min and location_max, and then one to use them to SUBSTRING srcfeature.residues. Here's the error MySQL gives me if I use that query as above: ERROR 1054 (42S22): Unknown column 'location_min' in 'field list' – sage88 Apr 23 '13 at 21:02
  • So after trying to do this passing the whole sequence back and editing it client side in javascript, using a SQL SUBSTRING is not only the better way to do this, it is the only way to do this. The sequence is so long that it fails to load properly when you try to retrieve the whole thing (and takes a good 15 seconds to display anything as well). So I think you have me pointed in the right direction. – sage88 Apr 24 '13 at 19:22
  • 1
    @sage88 . . . I think yuou can just use `location.fmin` and `location.fmax` instead of the aliases. – Gordon Linoff Apr 24 '13 at 20:01
  • Sigh, that's all it was in the end. Can't use the aliases at that point. For anyone who has a similar problem, @Gordon Linoff 's comment on using the non-aliased values will work. You don't need multiple select statements if you do it that way. – sage88 Apr 24 '13 at 20:19
0

If something like this will work for you:

SELECT f.uniquename AS protein_accession, 
       product.value AS protein_name, 
       SUBSTRING(
                   srcfeature.residues, 
                   patindex('%SOMPATTERN%',srcfeature.residues), 
                   LEN(srcfeature.residues) - patindex('%SOMPATTERN%',srcfeature.residues)
                ) AS residue_sequence, 
      srcassembly.name AS source_type, 

then try that in the SQL. If not, use the application programming language.

outis nihil
  • 736
  • 6
  • 16
  • I'm not looking for a specific pattern. I need to substring it by index value is all. I will post what a query looks like above without the sequence included to give you an idea of what I'm talking about. – sage88 Apr 23 '13 at 21:05
  • So how are you calculating your index value? What is location_min and what is location_max? If they are functions of something in the row, you need to include that function in the SUBSTRING command, as I did with PATINDEX(). In other words, if location_min = f(srcfeature.residues), and location_max = g(srcfeature.residues), you need to put SUBSTRING(srcfeature.residues, f(srcfeature.residues), g(srcfeature.residues)) in the query. They must be functions of srcfeature.residues, otherwise your location_min and location_max might be out of range for the length of srcefeature.residues. – outis nihil Apr 24 '13 at 19:00
  • If location_min and location_max are NOT functions of something in the row, you can just define them as parameters on a stored procedure or as scalar values `code`DECLARE @Location_Min int `code`SET @Location_Min = 10 – outis nihil Apr 24 '13 at 19:07
  • They're part of the SELECT statement in the query location.fmin AS location_min, location.fmax AS location_max. So they aren't functions or known before the query is performed. This is why I was thinking I would need to do multiple SELECT statements. One to retrieve the location values, and one to SUBSTRING the srcfeature.residues using the values obtained from the previous query. Currently I have bypassed the problem by just executing two separate sequential queries. – sage88 Apr 24 '13 at 19:18