
I have 2 tab-separated text files. One is called major and the other is called minor. Here are 2 small examples of the files:

major:

chr1    +   1071396 1271396 LOC
chr12   +   1101483 1121483 MIR200B

minor:

chr1    1071496 1071536 1
chr1    1071536 1071566 0
chr1    1073566 1073366 1
chr12   1101487 1101516 0
chr12   1101625 1101671 1

I want to build a new file from these 2 files by following these steps:

step1: split the interval between columns 3 and 4 of each row in the major file into 100 equal parts. This produces a new file with 100 times as many rows as the major file. The new file differs from the major file in 2 ways:

1st: columns 3 and 4 are changed
2nd: a new column called part is added (part 1 to part 100 for each row of the major file)



(1271396 − 1071396) ÷ 100 = 2000 ----> this is the new difference between columns 3 and 4

chr1    +   1071396 1073396 LOC LOC_part1
chr1    +   1073396 1075396 LOC LOC_part2
.
.
.
chr1    +   1269396 1271396 LOC LOC_part100
chr12   +   1101483 1101683 MIR200B MIR200B_part1
chr12   +   1101683 1101883 MIR200B MIR200B_part2
.
.
.
chr12   +   1121283 1121483 MIR200B MIR200B_part100

From now on, this new file plays the role of the major file in the next step. I call it new_major.
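For illustration, step 1 can be sketched as below. This is only a minimal sketch: the helper name `split_into_parts` is hypothetical, and it assumes the span between columns 3 and 4 divides evenly by 100, as it does in both example rows.

```python
def split_into_parts(row, n_parts=100):
    """Split one major row (chrom, strand, start, end, name) into n_parts rows."""
    chrom, strand, start, end, name = row
    start, end = int(start), int(end)
    # Assumption: (end - start) is a multiple of n_parts, as in the examples.
    step = (end - start) // n_parts
    parts = []
    for k in range(n_parts):
        lo = start + k * step
        hi = lo + step
        parts.append((chrom, strand, lo, hi, name, "%s_part%d" % (name, k + 1)))
    return parts


rows = split_into_parts(("chr1", "+", "1071396", "1271396", "LOC"))
print(rows[0])   # first part of LOC
print(rows[-1])  # last part of LOC
```

Calling the helper once per row of the major file and concatenating the results would give the rows of new_major.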

step2: count the number of lines in the minor file that match each line in the new_major file, using the following conditions:

A) (column 1 of minor file) == (column 1 of new_major)
and
B) (column 3 of new_major) <= (column 2 of minor file) <= (column 4 of new_major)
and
C) (column 3 of new_major) <= (column 3 of minor file) <= (column 4 of new_major)
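Conditions A, B and C can be sketched as a single counting function. The name `count_matches` is hypothetical, and the sketch assumes each new_major row is a tuple whose start and end are already integers, while minor rows still hold strings as read from the file.

```python
def count_matches(part, minor_rows):
    """Count minor rows on the same chromosome that lie fully inside the part."""
    chrom, _strand, lo, hi = part[:4]
    return sum(
        1
        for m_chrom, m_start, m_end, _flag in minor_rows
        if m_chrom == chrom               # condition A
        and lo <= int(m_start) <= hi      # condition B
        and lo <= int(m_end) <= hi        # condition C
    )


# The example minor file from above, as lists of string fields.
minor_rows = [
    ("chr1", "1071496", "1071536", "1"),
    ("chr1", "1071536", "1071566", "0"),
    ("chr1", "1073566", "1073366", "1"),
    ("chr12", "1101487", "1101516", "0"),
    ("chr12", "1101625", "1101671", "1"),
]
part = ("chr1", "+", 1071396, 1073396, "LOC", "LOC_part1")
print(count_matches(part, minor_rows))  # 2
```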

step3: make the final tab-separated file with 7 columns. The first 6 columns are the same as in the new_major file, and the 7th column holds the counts from step 2.
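As a small sketch of step 3, one output line can be built by joining the six new_major fields and the count with tabs. The helper name `format_row` is hypothetical.

```python
def format_row(part, count):
    """Join the six new_major fields and the count into one tab-separated line."""
    return "\t".join(str(field) for field in part) + "\t" + str(count)


line = format_row(("chr1", "+", 1071396, 1073396, "LOC", "LOC_part1"), 2)
print(line)
```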

The expected output would look like this:

chr1    +   1071396 1073396 LOC LOC_part1   2
chr1    +   1073396 1075396 LOC LOC_part2   1
.
.
.
chr1    +   1269396 1271396 LOC LOC_part100 0
chr12   +   1101483 1101683 MIR200B MIR200B_part1   2
chr12   +   1101683 1101883 MIR200B MIR200B_part2   0
.
.
.
chr12   +   1121283 1121483 MIR200B MIR200B_part100 0

I wrote the following code to get the expected output, but it raises an error, which is shown after the code.

major = open('major.txt', 'rb')
minor = open('minor.txt', 'rb')

minor = []
for line in minor:
    minor.append(line)

major = []
for line in major:
    major.append(line)


new_major = []
for i in major:
    percent = (i[3]-i[2])/100
    for j in percent:
        new_major.append(i[0], i[1], i[2], i[2]+percent, i[4], i[4]_'part'percent[j])


new_major, minor = ([l.split() for l in d.splitlines()] for d in (new_major, minor))

for name_major, sign, low, high, note in major:
    parts = list(range(int(low), int(high) + 1, (int(high) - int(low)) // 100))
    for part, (low, high) in enumerate(zip(parts, parts[1:]), 1):
        count = sum(1 for name_minor, n1, n2, _ in minor if name_major == name_minor and all(low <= int(n) <= high for n in (n1, n2)))
        print('\t'.join((name_major, sign, str(low), str(high), note, '%s_part%d' % (note, part), str(count))))

Here is the error I got:

>>> for name_major, sign, low, high, note in major:
...     parts = list(range(int(low), int(high) + 1, (int(high) - int(low)) // 100))
...     for part, (low, high) in enumerate(zip(parts, parts[1:]), 1):
...         count = sum(1 for name_minor, n1, n2, _ in minor if name_major == name_minor and all(low <= int(n) <= high for n in (n1, n2)))
...         gg = ('\t'.join((name_major, sign, str(low), str(high), note, '%s_part%d' % (note, part), str(count))))
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: too many values to unpack

Do you know how to fix the problem?

Cris Luengo
elly
  • Who approved this edit? – Dinko Pehar Nov 18 '18 at 08:36
  • @DinkoPehar: OP can edit without review. Thanks for catching this. – Cris Luengo Nov 18 '18 at 15:30
  • @elly: This is considered vandalism here. When you post a question, it belongs to the site. If someone put effort into answering it, you cannot just throw away that person’s efforts. If there is an IP or similar issue as to why this cannot remain here, contact the company; they can remove posts from the system. – Cris Luengo Nov 18 '18 at 15:34

1 Answer

I think you meant to unpack new_major rather than major, which at the top of the Python file is just the file object. Iterating over that yields whole lines, and a whole line cannot be unpacked into five names.

for name_major, sign, low, high, note in new_major:
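To illustrate why the unpacking fails, here is a small sketch. It assumes each item being looped over is still a raw tab-separated line; the sample line is the first row of the major file in the question.

```python
line = "chr1\t+\t1071396\t1271396\tLOC\n"

# Unpacking the raw string iterates over its characters, so there are far
# more than five values and Python raises ValueError.
try:
    name_major, sign, low, high, note = line
    error = None
except ValueError as exc:
    error = exc

# Splitting on tabs first yields exactly five fields, which unpack cleanly.
name_major, sign, low, high, note = line.rstrip("\n").split("\t")
print(repr(error))
print(name_major, sign, low, high, note)
```

So whichever list is looped over, each item needs to be a sequence of exactly five fields before it is unpacked.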

Also be sure to close both files with file_object.close() when you are done, to release resources.
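As an alternative to calling close() by hand, a with block closes the file automatically. The sketch below is only an illustration: the helper `read_rows` is hypothetical, and the demo writes a temporary stand-in for major.txt so that it runs on its own.

```python
import os
import tempfile


def read_rows(path):
    """Return the non-empty lines of a tab-separated file as lists of fields."""
    with open(path) as handle:  # closed automatically, even if an error occurs
        return [line.split() for line in handle if line.strip()]


# Demo with a temporary stand-in for major.txt.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "major.txt")
    with open(path, "w") as out:
        out.write("chr1\t+\t1071396\t1271396\tLOC\n")
        out.write("chr12\t+\t1101483\t1121483\tMIR200B\n")
    major_rows = read_rows(path)

print(major_rows)
```

Each returned row is a list of five strings, so it can be unpacked directly in the for loop.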

Dinko Pehar