2

i am a beginner and want to make a barcode out of this DNA sequence by using pyhton code. it's supposed to read each 1024 nucleotide and checks for mers (a combination of 4 nucleotides i.g. AAAA, AAAC, AAAG, AAAT ..... TTTT). each mer holds an index in an array (size = 256) if it found AAAA within the first 1024 it stores its count in it's index and so on, then to the next 1024 until it's done with the whole sequence. that will create a 2D array which will be turned into a png in gray scale.

my problem is that it took only the first 1024 and displayed it on the entire 1024X256 image.

DNA: https://1drv.ms/f/s!AuXxv7yqjA_FlS_ujYOMUvikWg8E

#read the DNA sequence
fasta_file = open(r'C:path\Escherichia_coli_ATCC_10798.fasta','r')
SE =fasta_file.read()
fasta_file.close()
seq = SE[177:]
dna_sequence = seq.replace("\n","")

# Sample size and mer length
#sample is the window that will go thorugh the whole sequance
sample_size = 1024
mer_length = 4

# Array to store the counts of each mer
barcode = [0] * 256

# Generate all possible 4-mers
mers = []
for i in range(256):
    mer = ""
    for j in range(4):
        mer += "ACGT"[i % 4]
        i //= 4
    mers.append(mer)


# Loop through the sample and count the occurrences of each mer
for w in range(sample_size):
    mer = dna_sequence[w:w+mer_length]
    barcode[mers.index(mer)] += 2

# Print the counts of each mer
    #print(mers[i], ":", barcode[i])
print(barcode)



# image
    # Python program to convert numpy array to image
    # import pillow library
from PIL import Image
import numpy as np

    # define a main function
    # Create the barcode array with the same shape as the desired image
code = np.array(barcode, dtype=np.uint8)

    # Create an Image object from the barcode array
image = Image.fromarray(code)

    # Reshape the image to the desired size (1024x4000)
image = image.resize((1024, 4000))

    # Save the image
image.save('Escherichia_coli.png')

    # Display the image (optional)
image.show()

the dark image is what i got the other one is what i was supposed to getenter image description here my output

i don't know how to attach the DNA sequence. some info: genome_id="531534bd23a542ae" atcc_catalog_number="ATCC 10798" species="Escherichia coli"

link to similar genome:

https://www.ncbi.nlm.nih.gov/assembly/GCA_028644645.1

Dharman
  • 30,962
  • 25
  • 85
  • 135
  • 1
    We can't see your FASTA file. If its contents are relevant, you need to share it. If not, the code that reads it is irrelevant and needs deleting. Most folk on here have no idea what a FASTA file is, or a nucleotide, or a mer, so you need to make that part easy for us so someone who understands PIL and barcodes can help you. Otherwise you are looking for 1% of folks who know FASTA within 20% of folks who know Python within 5% of folks who know PIL within 3% of folks who know barcodes and that likely means nobody will help you... – Mark Setchell May 23 '23 at 19:23
  • is your image 256 of heigh and widdth genome size/1024 ? seems like each pixel (1024 windows of genome) has the same frequency of mers ? could it be true ? – pippo1980 May 23 '23 at 19:23
  • Mr. Mark i couldn't find a way to attach my fasta file it's on my desktop. If you can help it would be much appreciated. –  May 23 '23 at 20:15
  • Hay pippo1980! Thanks for the reply, So image = image.resize((1024, 4000)) is what i'm supposed to modify, correct? I think i also need some work on my for loops. What else do you think. –  May 23 '23 at 20:22
  • You can share via Google Drive or Dropbox or similar. – Mark Setchell May 23 '23 at 20:23
  • https://1drv.ms/f/s!AuXxv7yqjA_FlS_ujYOMUvikWg8E –  May 23 '23 at 20:29

1 Answers1

2

I'm seeing two questions here:

my problem is that it took only the first 1024 and displayed it on the entire 1024X256 image.

You'll need nested for-loops to cover the full sequence for each 4-mer. The for-loop in your code covers the first window (1024 nucleotides) only.

the dark image is what i got the other one is what i was supposed to get

The solution is to scale your count array by some large value. This will increase the brightness. I'm using 20 as an example here.

I'd also suggest using Biopython's SeqIO interface to handle fasta sequences as SeqRecord objects. Here's the updated code:

from Bio import SeqIO
from PIL import Image
import numpy as np

#read the DNA sequence
record = SeqIO.read(r'C:path\Escherichia_coli_ATCC_10798.fasta', "fasta")

# Sample size and mer length
#sample is the window that will go thorugh the whole sequance
window      = 1024
mer_length  = 4
n_windows   = int(np.ceil(len(record.seq)/window))
   
mers = []
for i in range(256):
    mer = ""
    for j in range(4):
        mer += "ACGT"[i % 4]
        i //= 4
    mers.append(mer)

# Array to store the counts of each mer 
# Each row represents 1024 nucleotides, last row counts are handled by SeqIO 
# even if the remaining window length is less than 1024
# 256 cols (each col represents a 4-mer)
image_array = np.zeros((n_windows, len(mers)), dtype=int)

# Loop through the sample and count the occurrences of each mer
for w in range(0,n_windows-1):
    for i in range(len(mers)):
        image_array[w,i] = 20*record.seq[1024*w:1024*(w+1)].count(mers[i])

# image
# Create an Image object from the image_array
# image = Image.fromarray(image_array)

image = Image.fromarray(image_array.astype(np.uint8))  # modded by me because of TypeError: Cannot handle this data type: (1, 1), <i8


# Reshape the image to the desired size (1024x4000)
image = image.resize((1024, 4000))
# Save the image
image.save('Escherichia_coli.png')
# Display the image (optional)
image.show()

The above code should give you an output similar to this:

E. coli barcode image from genome

tdy
  • 36,675
  • 19
  • 86
  • 83
Writa H
  • 46
  • 6
  • mers list could be created faster using python libraries, just in case you want bigger k-mers : https://stackoverflow.com/questions/45990454/generating-all-possible-combinations-of-characters-in-a-string – pippo1980 May 24 '23 at 10:24
  • also can put mer_length in mer list generating loop , hust in case you awant to create a function taking in fast file path and mer & window lenght to generate image – pippo1980 May 24 '23 at 10:29
  • Yes, I agree. Here's a way to get the mers list using `itertools.product`: `import itertools`; `l = [list('ACGT')]*mer_length`; `mers = ["".join(i) for i in itertools.product(*l)]` – Writa H May 24 '23 at 10:29
  • I left the mers list as-is since it isn't clear if the order of the 4-mers are fixed or can be changed; can OP clarify? – Writa H May 24 '23 at 10:33
  • using barcodes to identify lets say e.coli strains works better with long or short k-mer and how long could a kmer be (at maximun the genome size itself) ?? – pippo1980 May 24 '23 at 10:36
  • 1
    Looks like k-mer length for DNA barcoding is an interesting optimization problem: What is the shortest length a k-mer can be while maximizing differences between two or more E. coli strains? Off the top of my head, it could depend on the number of variants. Fewer variants would require a longer k-mer. – Writa H May 24 '23 at 10:54