-2

I reckon my question wasn't all clear. So, time for another approach in explaining my quest.

  • I've got a single data file that contains about 27,500 data points. Every datapoint has a unique integers between 0 and 46041637 in the first column and a description in column 2 - 6.
  • I would like to know how the integers are distributed across the 46M possibilities. For instance, how many (and which) datapoints fall within the range 1-1000, how many between 1001-2000 and so on until 46M is reached.
  • In the previous example, roughly 46K (46M/1000) smaller datasets (or bins) will be created. Some bins will contain several datapoints while a lot of them won't contain any datapoints at all.
  • In the example above, the size/length of the bin (I would call it windowsize) is 1000. I do not know yet what the best windowsize is to suit my purposes, so, I would like to be able to have a script that has an 'adjustable' windowsize.
  • Furthermore, the example 1-1000, 1001, 2001, [...], clearly doesn't show any overlap in the bins. However, that means that I lose a lot of 'sensitivity' and therefore knowledge/information. So I would like to be able to create windows/bins that overlap. For instance, 1-1000,501-1500, 1501-2500, 2001-3000. These bins have an overlap of 500. I would to be able to set the amount of overlap between the bins.
  • I would like to write every bin to its own file, even if the bin doesn't contain any datapoints.

Here is the explanation I gave before, which wasn't very good apparently


Using AWK I'm trying to 'slide a window' over a list of integers. If I would split up this dataset how many datapoints would there be in every (possible overlapping) bin? I like to set the binsize (or windowsize) and overlap between the bins. This approach enables me to get an idea of the local datapoint density. --> I have a little bit of AWK experience and I've been told tha AWK should be able to do the job, I prefer to use AWK. However, I'm also open to other ideas (Python for example).

  • I've got a single data file that contains about 27,500 integers between 0 and 46041637 (46 million) and a description for every data point.
  • As I would like to have an idea of the effects of 'changing' the resolution I would like to play a little with different window-sizes and the overlap between individual windows.
  • I like to write all the contents of every single 'window' to a separate file and name the file according to the 'windows-startpoint'.

I've prepared some example files which are attached below. However, to make it easier to get an idea here's another very very simplified example:

Datarange 1 - 10
Integers in dataset: 2,4,5,6,9
####
####Example 1: Windowsize=5,overlap=2####
####
file name ="1" contents are: (range is 1 - 5)
2
4
5
file name="3" contents are: (range is 4 - 8, that is, two overlap with previous range)
4
5
6
file name="7" contents are: (range is 7 - 10, if the range was larger, it would be 7 - 11)
9
####
####Example 2: Windowsize=3,overlap=0####
####
file name="1" contents are (range 1 - 3)
2
file name="4" contents are (range 4 - 6)
4
5
6
file name="7" contents are (range 7 - 9)
9
file name="9" contents are (range 10 - 10)
<none>

Example input file

    3579                    
    3661                    
    3752    EXON    3706    4407    +   Solyc06g005000.2.1.1Solyc06g005000.2.1
    3947    EXON    3706    4407    +   Solyc06g005000.2.1.1Solyc06g005000.2.1
    6734    INTRON  5605    7662    +   Solyc06g005000.2.1.2Solyc06g005000.2.1
    6865    INTRON  5605    7662    +   Solyc06g005000.2.1.2Solyc06g005000.2.1
    6915    INTRON  5605    7662    +   Solyc06g005000.2.1.2Solyc06g005000.2.1
    8961                    
    13471   EXON    13449   13532   +   Solyc06g005020.1.1.2Solyc06g005020.1.1
    13561   INTRON  13533   13710   +   Solyc06g005020.1.1.2Solyc06g005020.1.1
    22226   EXON    22106   22261   +   Solyc06g005030.1.1.1Solyc06g005030.1.1
    22516                   
    22556                   
    36903   INTRON  36836   36915   +   Solyc06g005060.2.1.1Solyc06g005060.2.1
    37377   EXON    36916   37800   +   Solyc06g005060.2.1.2Solyc06g005060.2.1
    37605   EXON    36916   37800   +   Solyc06g005060.2.1.2Solyc06g005060.2.1
    37935   3P_UTR  37801   38132   +   Solyc06g005060.2.1.0Solyc06g005060.2.1
    167942  5P_UTR  167930  167956  -   Solyc06g005140.2.1.0Solyc06g005140.2.1
    168020  INTRON  167957  169025  -   Solyc06g005140.2.1.2Solyc06g005140.2.1
    168153  INTRON  167957  169025  -   Solyc06g005140.2.1.2Solyc06g005140.2.1

Example output file with different windowsizes and overlap

> AWK -v windowsize=50000 -v overlap=0 -f awkscript input.file
> ls
1      50001
100001 150001
> cat 1
    3579                    
    3661                                        
    3752    EXON    3706    4407    +   Solyc06g005000.2.1.1Solyc06g005000.2.1
    3947    EXON    3706    4407    +   Solyc06g005000.2.1.1Solyc06g005000.2.1
    6734    INTRON  5605    7662    +   Solyc06g005000.2.1.2Solyc06g005000.2.1
    6865    INTRON  5605    7662    +   Solyc06g005000.2.1.2Solyc06g005000.2.1
    6915    INTRON  5605    7662    +   Solyc06g005000.2.1.2Solyc06g005000.2.1
    8961                    
    13471   EXON    13449   13532   +   Solyc06g005020.1.1.2Solyc06g005020.1.1
    13561   INTRON  13533   13710   +   Solyc06g005020.1.1.2Solyc06g005020.1.1
    22226   EXON    22106   22261   +   Solyc06g005030.1.1.1Solyc06g005030.1.1
    22516                   
    22556                   
    36903   INTRON  36836   36915   +   Solyc06g005060.2.1.1Solyc06g005060.2.1
    37377   EXON    36916   37800   +   Solyc06g005060.2.1.2Solyc06g005060.2.1
    37605   EXON    36916   37800   +   Solyc06g005060.2.1.2Solyc06g005060.2.1
    37935   3P_UTR  37801   38132   +   Solyc06g005060.2.1.0Solyc06g005060.2.1                  
    3752    EXON    3706    4407    +   Solyc06g005000.2.1.1Solyc06g005000.2.1
    3947    EXON    3706    4407    +   Solyc06g005000.2.1.1Solyc06g005000.2.1
    6734    INTRON  5605    7662    +   Solyc06g005000.2.1.2Solyc06g005000.2.1
    6865    INTRON  5605    7662    +   Solyc06g005000.2.1.2Solyc06g005000.2.1
    6915    INTRON  5605    7662    +   Solyc06g005000.2.1.2Solyc06g005000.2.1
    8961                    
    13471   EXON    13449   13532   +   Solyc06g005020.1.1.2Solyc06g005020.1.1
    13561   INTRON  13533   13710   +   Solyc06g005020.1.1.2Solyc06g005020.1.1
    22226   EXON    22106   22261   +   Solyc06g005030.1.1.1Solyc06g005030.1.1
    22516                   
    22556                   
    36903   INTRON  36836   36915   +   Solyc06g005060.2.1.1Solyc06g005060.2.1
    37377   EXON    36916   37800   +   Solyc06g005060.2.1.2Solyc06g005060.2.1
    37605   EXON    36916   37800   +   Solyc06g005060.2.1.2Solyc06g005060.2.1
    37935   3P_UTR  37801   38132   +   Solyc06g005060.2.1.0Solyc06g005060.2.1
 > cat 50001

 > cat 100001

 > cat 150001
    167942  5P_UTR  167930  167956  -   Solyc06g005140.2.1.0Solyc06g005140.2.1
    168020  INTRON  167957  169025  -   Solyc06g005140.2.1.2Solyc06g005140.2.1
    168153  INTRON  167957  169025  -   Solyc06g005140.2.1.2Solyc06g005140.2.1

> #And with some different paramenters
> AWK -v windowsize=160000 -v overlap=10000 -f awkscript input.file
> ls
1    10001
> cat 1
    3579                    
    3661                    
    3752    EXON    3706    4407    +   Solyc06g005000.2.1.1Solyc06g005000.2.1
    3947    EXON    3706    4407    +   Solyc06g005000.2.1.1Solyc06g005000.2.1
    6734    INTRON  5605    7662    +   Solyc06g005000.2.1.2Solyc06g005000.2.1
    6865    INTRON  5605    7662    +   Solyc06g005000.2.1.2Solyc06g005000.2.1
    6915    INTRON  5605    7662    +   Solyc06g005000.2.1.2Solyc06g005000.2.1
    8961                    
    13471   EXON    13449   13532   +   Solyc06g005020.1.1.2Solyc06g005020.1.1
    13561   INTRON  13533   13710   +   Solyc06g005020.1.1.2Solyc06g005020.1.1
    22226   EXON    22106   22261   +   Solyc06g005030.1.1.1Solyc06g005030.1.1
    22516                   
    22556                   
    36903   INTRON  36836   36915   +   Solyc06g005060.2.1.1Solyc06g005060.2.1
    37377   EXON    36916   37800   +   Solyc06g005060.2.1.2Solyc06g005060.2.1
    37605   EXON    36916   37800   +   Solyc06g005060.2.1.2Solyc06g005060.2.1
    37935   3P_UTR  37801   38132   +   Solyc06g005060.2.1.0Solyc06g005060.2.1
> cat 10001
    13471   EXON    13449   13532   +   Solyc06g005020.1.1.2Solyc06g005020.1.1
    13561   INTRON  13533   13710   +   Solyc06g005020.1.1.2Solyc06g005020.1.1
    22226   EXON    22106   22261   +   Solyc06g005030.1.1.1Solyc06g005030.1.1
    22516                   
    22556                   
    36903   INTRON  36836   36915   +   Solyc06g005060.2.1.1Solyc06g005060.2.1
    37377   EXON    36916   37800   +   Solyc06g005060.2.1.2Solyc06g005060.2.1
    37605   EXON    36916   37800   +   Solyc06g005060.2.1.2Solyc06g005060.2.1
    37935   3P_UTR  37801   38132   +   Solyc06g005060.2.1.0Solyc06g005060.2.1
    167942  5P_UTR  167930  167956  -   Solyc06g005140.2.1.0Solyc06g005140.2.1
    168020  INTRON  167957  169025  -   Solyc06g005140.2.1.2Solyc06g005140.2.1
    168153  INTRON  167957  169025  -   Solyc06g005140.2.1.2Solyc06g005140.2.1

Thank you so much for all your help!


Little adjustment of my initial question because I requires way more computing time than I anticipated for.

Is it possible that, instead of writing all the records that fall in a particular window to its own file, write the 'statistics' of each window to a row in a table? With statistics I mean, how many records does a particular window contain and how many of each type. Applied to the example above this would look like this:

> python script.py 160000 10000 file (using the script from sidharth c nadhan)
> cat result
window | total | exons | intron | 3P_UTR | 5P_UTR
1      |  17   |   6   |   5    |    1   |   0
10001  |  12   |   4   |   4    |    1   |   1 
Elmer
  • 255
  • 1
  • 2
  • 10
  • Changed my question, hope it's more 'understandable' now? Please let me know! Thanks a lot all of you! – Elmer May 20 '13 at 14:08

1 Answers1

1

Try this :

import sys,os,collections
list1,set1=list(),set()
dict1 = collections.defaultdict(list)
dict2 = collections.defaultdict(int)
wind , overl, maxim= int(sys.argv[1]),int(sys.argv[2]),int(sys.argv[4])
for line in open(sys.argv[3]):
  try : set1.add(line.split()[1])
  except : pass
  for i in xrange(1,maxim,wind-overl):
    if int(line.split()[0]) in xrange(i,wind+i): dict1[i].append(line)
print "Window\tTotal\t",
for entri in set1 : print entri,"\t",
print "\n",
for j in xrange(1,maxim,wind-overl):
  dict2.clear()
  for line in dict1[j]:
    try : dict2[line.split()[1]] +=1
    except: pass
  print j,"\t",len(dict1[j]),"\t",
  for entri in set1:
    print dict2[entri],"\t",
  print "\n",

Usage : python script.py window overlap file maxim

Where maxim refers to the largest integer in the input file. maxim = 168153 in the sample input file. Giving it as a command line argument improves computational speed.

Sidharth C. Nadhan
  • 2,191
  • 2
  • 17
  • 16
  • Awesome! Thanks a lot, I tried to express my question better, it's difficult for me to explain but it's a good lesson. Your script is running right now, with a windowsize of 1M and a overlap of 0. However, it's taking a lot of CPU power and time so I can't give you any feedback but nevertheless, thanks a lot already! – Elmer May 20 '13 at 14:18
  • I thinks it's not working as supposed, my CPU is running at >80 degrees C right now and there's zero output... – Elmer May 20 '13 at 14:25
  • Can you try out with a small sample file and test first. http://ideone.com/8eyyBn Maybe you can modify this to fit as ur needs – Sidharth C. Nadhan May 20 '13 at 15:22
  • it's working now! I split my up datafile in a couple of smaller pieces and now it's working all good! Thanks again! I actually do have another question you might be able to help me with, a little modification of the script above: do think it is possible to, instead of printing the whole contents of the window in a new file, write the 'statistics' of this file to a row in a table? With statistics do I mean, how many integers does this window contain, and how many of each category (ie. intron, exon, 3P_UTR, 5P_UTR) I have added an example below my original question. – Elmer May 21 '13 at 17:04
  • By the way, you've totally understood my question in the beginning. Bravo! I've got respect for that because it was a difficult was of asking a question. – Elmer May 21 '13 at 17:21
  • 1
    I have modified the script to print the statistics. Hope it works. Is it possible to pass maxim as command line argument ? By doing that speed can be increased considerably. – Sidharth C. Nadhan May 22 '13 at 08:53
  • Thank you very very much! Yes, that's certainly possible, I already modified the previous script to pass the maxim as command line argument, so no problem that you did it here. Actually much better! Thanks a lot! You totally make my day! – Elmer May 22 '13 at 09:54