
I want to build a neural network to classify splice junctions in DNA sequences in Python. Right now, I just have my data in strings (for example, "GTAACTGC").

I am wondering about the best way to encode this so I can process it with a neural network. My first thought was to just assign each letter an integer value, but that feels pretty naive. Another thought was to have four binary indicators at each position, corresponding to A, T, G, or C. I'm not sure how well that would work, either.
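For concreteness, here is roughly what I had in mind for each option (the integer mapping is arbitrary):

```python
import numpy as np

seq = "GTAACTGC"

# Idea 1: map each base to an integer
base_to_int = {"A": 0, "C": 1, "G": 2, "T": 3}
as_ints = [base_to_int[b] for b in seq]
# e.g. G -> 2, T -> 3, ...

# Idea 2: four binary indicators per position (one row per base)
as_indicators = np.zeros((len(seq), 4))
for i, b in enumerate(seq):
    as_indicators[i, base_to_int[b]] = 1.0
# shape (len(seq), 4), exactly one 1 per row
```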

What is the best way to approach this problem? Until now I have only worked with numerical data and have never dealt with strings like DNA sequences.

Edit: For now, I will stick with a simple mapping. For anyone reading this, try looking at this paper; it definitely helped give me some pointers.

blvejay
  • Welcome to Stack Overflow! At this site you are expected to try to **write the code yourself**. After [doing more research](https://meta.stackoverflow.com/questions/261592) if you have a problem you can **post what you've tried** with a **clear explanation of what isn't working** and providing a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve). I suggest reading [How to Ask](https://stackoverflow.com/questions/how-to-ask) a good question. Also, be sure to take the [tour](https://stackoverflow.com/tour). – Spoody Jan 17 '18 at 23:51

2 Answers


Why not learn the numerical representations for each base?

This is a common problem in Neural Machine Translation, where we seek to encode "words" that carry meaning as (naively) numbers. The core idea is that different words should not be represented with arbitrary scalar numbers, but with learned dense vectors. The process of finding this vector representation is called embedding.

With this approach, bases that are more closely related to one another may end up with vector representations closer together in n-dimensional space (where n is the size of the vector). This is a simple concept that can be difficult to visualize the first time. Your embedding size (a hyperparameter) should likely be small, since your vocabulary contains only four symbols (try a size of 2-5).

As an example of some embedding mappings with size 4 (the numerical values are illustrative, not meaningful):

G -> [1.0, 0.2, 0.1, 0.2]
A -> [0.2, 0.5, 0.7, 0.1]
T -> [0.1, 0.2, 1.0, 0.5]
C -> [0.4, 0.4, 0.5, 0.8]

The exact technique of generating and optimizing embeddings is a topic in itself; hopefully the concept is useful to you.
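A minimal sketch of the lookup mechanism in plain NumPy (in a real model the table would be a trainable layer, e.g. Keras's `Embedding` or PyTorch's `nn.Embedding`; the dimension of 3 and the random initialization here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = {"A": 0, "C": 1, "G": 2, "T": 3}
embedding_dim = 3                            # hyperparameter: keep it small
table = rng.normal(size=(4, embedding_dim))  # trainable parameters in a real model

def embed(seq):
    """Map each base to its id, then look up its dense vector."""
    ids = [vocab[b] for b in seq]
    return table[ids]                        # shape (len(seq), embedding_dim)

vectors = embed("GTAACTGC")                  # shape (8, 3)
```

During training, gradients would flow back into `table`, so the vectors for each base are learned rather than fixed.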

Alternative

If you want to avoid embeddings (since your "vocabulary" is limited to 4), you can assign a scalar value to each base. If you do this, you should normalize your mappings to the range [-1, 1].

Evan Weissburg
  • That's a really interesting concept. I think that for now, I'll try a simpler method, but in the future I'll definitely try embedding! Just one more question: what do you think about using dummy variables as opposed to mapping scalar values to each base? Which method would be cleaner/get better results for DNA? – blvejay Jan 18 '18 at 00:48
  • I'm not quite sure what you mean by dummy variables, but you can definitely just assign decimal values to each base and statically convert. Does this help? – Evan Weissburg Jan 18 '18 at 01:03
  • Okay, that helps a ton! I was trying to use the pandas method get_dummies, but I think that assigning decimal values is definitely better. Do you have any pointers to where I can learn how to do this? – blvejay Jan 18 '18 at 01:11
  • Hmm..not really. Learning to effectively Google and read papers on topics you have questions about is really important. Since you're new, I won't feel bad about telling you that if an answer helps, you can upvote and accept it (with the checkmark). This gives it higher visibility for people who stumble upon it later and also gives the poster internet points :) – Evan Weissburg Jan 18 '18 at 01:47
  • Okay, thank you! I'll look into it. And I accepted your post :) – blvejay Jan 18 '18 at 19:56

An idea building on the binary indicators you mentioned would be to encode the data as a bitmap image of, say, 1024x1024 pixels, which represents 1,048,576 bits. Within each row of the image, take the bits in pairs and map them to bases as follows:

0 0 -> G
0 1 -> A
1 0 -> T
1 1 -> C

which yields 512 bases per row, or 524,288 per image.
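A sketch of this packing in Python (the pair-to-base table follows the mapping above; the row width and the choice to pad the final row with zero bits, which decode as G, are my own assumptions):

```python
import numpy as np

# 2 bits per base, following the mapping in the answer
bit_pairs = {"G": (0, 0), "A": (0, 1), "T": (1, 0), "C": (1, 1)}

def sequence_to_bitmap(seq, width=1024):
    """Pack a DNA string into rows of a width-bit bitmap, 2 bits per base."""
    bits = [bit for base in seq for bit in bit_pairs[base]]
    pad = (-len(bits)) % width          # zero-pad so the last row is full
    bits += [0] * pad
    return np.array(bits, dtype=np.uint8).reshape(-1, width)

# 8 bases * 2 bits = 16 bits -> a 2x8 bitmap with width=8
img = sequence_to_bitmap("GTAACTGC", width=8)
```

With the default `width=1024`, each full image would hold 524,288 bases, matching the counts above.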

For the neural network, I think a convolutional neural network (CNN) would be a good fit. Another idea is to avoid encoding data along the outer edges of the image, so that any kernels used in the process have full coverage over the active (central) data area. It may not be the most significant change, but it could be worth comparing against a mapping that encodes right up to the edges and corners of the image.

Snohdo