-1

I am pretty new to R. So I apologize for asking maybe a very basic question. Let's say I have a fasta file with sequence below:

>sequence_1
ACCTGC--A
>sequence_2
ACC-GCTTA
>sequence_3
ACCTGCTTA

Is there a function or method to calculate entropy for each column in R? I did some research and found a function that calculate entropy for Amino Acids but not for DNA nucleotides.

I would like to have the output be stored in a vector or in a list.

Thanks for the help in advance!

Xiaoxixi
  • 91
  • 8
  • 1
    what is entropy for this kind of data? I am not very good with biology or bioinformatics, so you can consider also posting it in https://bioinformatics.stackexchange.com/ if this is something specific – StupidWolf Jun 26 '20 at 21:10

1 Answers1

0

The package HDMD has a MolecularEntropy function that also works on DNA in a very simple manner. It ignores non-ACGT characters or gaps and seems to use the formula H = − ∑ p*log(p) p= frequence of the base to calculate entropy. So I am not sure this is what you are after.

Your simple example would yield something like this (frequencies are always 1, so entropies are all 0):

library(HDMD)
#> Loading required package: psych
#> Loading required package: MASS
fa <- c(">sequence_1", "ACCTGC--A", ">sequence_2", "ACC-GCTTA", ">sequence_3", 
        "ACCTGCTTA")
fa <- grep(">",fa, invert = TRUE, value = TRUE)
MolecularEntropy(fa, type="DNA")
#> [1] "Warning: Data set contains non-nucleotide elements"
#> $counts
#>   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
#> A    3    0    0    0    0    0    0    0    3
#> C    0    3    3    0    0    3    0    0    0
#> G    0    0    0    0    3    0    0    0    0
#> T    0    0    0    2    0    0    2    2    0
#> 
#> $freq
#>   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
#> A    1    0    0    0    0    0    0    0    1
#> C    0    1    1    0    0    1    0    0    0
#> G    0    0    0    0    1    0    0    0    0
#> T    0    0    0    1    0    0    1    1    0
#> 
#> $H
#> [1] 0 0 0 0 0 0 0 0 0

Created on 2020-06-26 by the reprex package (v0.3.0)

user12728748
  • 8,106
  • 2
  • 9
  • 14
  • thank you so much for your help! I have an additional basic question, the MolecularEntropy function requires a matrix as input argument. I tried the read.DNA and read.FASTA function from ape. But either worked for MolecularEntropy function. Do you have a good recommendation? Thank you so much! – Xiaoxixi Jun 27 '20 at 02:13
  • I got it to work! thank you so much for answering my questions! – Xiaoxixi Jun 27 '20 at 15:24
  • As you can see from my example, vectors (of same length, I assume), also work. So given your example file is called testfasta.fa you could read it in like this: `HDMD::MolecularEntropy(grep(">", readLines("testfasta.fa"), invert = TRUE, value = TRUE), type="DNA")`. – user12728748 Jun 27 '20 at 15:28