
I want to build a large text file from the data in two large text files (around 2 or 3 GB each), using Java. I have to merge the two files into one, matching records by the numbers they contain. The first file contains lines such as this:

    chr1  100  200  abcd  +
    chr2  150  227  abba  +
    .......................
    .......................

This is a BED file (used in bioinformatics). The second file contains data such as this:

    >chr1:
    AATTTATTTATTTTATTTTTTTATTTACCCACCCCCCCATTATTTACCAGGGGAGGGATTT
    ATTTTTTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCCCCCCAATTTTTT...........
    .............................................................
    >chr2:
    ATTTTTTTATTTACCCACCCCCCCATTATTTACCAGGGGAGGGATTTCCCCCCCCCCCCCC
    ATTTTTTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCCCCCCAATTTTTT...........
    .............................................................
    >chr3:
    AATTTATTTATTTTATTTTTTTATTTACCCACCCCCCCATTATTTACCAGGGGAGGGATTT
    ATTTTTTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCCCCCCAATTTTTT...........
    .............................................................

This is a FASTA file (also used in bioinformatics). For each line of the BED file, I need to extract from the FASTA file the sequence of that chromosome between the start and end positions (given in the second and third columns of the BED file), and write a file like the following:

    chr1  100  200  abcd  +  ATTTATCC.....ATTT
    chr2  150  227  abba  +  TTATCC.....ATTTCC
    ..........................................
    ..........................................
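
The extraction step described above can be sketched for a single record as follows. This is only an illustration under stated assumptions: the class `BedSlice` and its methods `slice` and `annotate` are hypothetical names, the chromosome is assumed to be already joined into one string with the FASTA line breaks removed, and the coordinates are assumed to follow the usual BED convention of a 0-based start and an exclusive end.

```java
class BedSlice {

    /**
     * Returns the bases in [start, end) of a chromosome sequence.
     * Assumption: BED-style coordinates (0-based start, exclusive end).
     */
    public static String slice(CharSequence chromosome, int start, int end) {
        if (start < 0 || end > chromosome.length() || start > end) {
            throw new IllegalArgumentException(
                "Region " + start + "-" + end + " is outside the sequence");
        }
        return chromosome.subSequence(start, end).toString();
    }

    /** Appends the extracted bases to the BED line as one more tab-separated column. */
    public static String annotate(String bedLine, CharSequence chromosome) {
        String trimmed = bedLine.trim();
        String[] cols = trimmed.split("\\s+");
        int start = Integer.parseInt(cols[1]); // second BED column
        int end = Integer.parseInt(cols[2]);   // third BED column
        return trimmed + "\t" + slice(chromosome, start, end);
    }

    public static void main(String[] args) {
        // Toy chromosome, far shorter than real data.
        String chr1 = "AATTTATTTACCCACC";
        System.out.println(annotate("chr1\t2\t6\tabcd\t+", chr1));
    }
}
```

If the files use 1-based inclusive positions instead, the call would need to be `slice(chromosome, start - 1, end)`.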

I can do it with small files and it works. I split the lines of each input file and store them in two ArrayLists. Then, I compare elements of the two ArrayLists. If the elements match, I merge the particular line of the two files.

Here is my code that works for small files:

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Scanner;

public class RetrieveFromTwoFile{
    private static ArrayList<String> store(String f1) throws FileNotFoundException{
        // Collect every whitespace-separated token of the file, in order.
        ArrayList<String> list =new ArrayList<String>();
        try(Scanner read=new Scanner(new File(f1))){ // closes the file when done
            while(read.hasNext()){
                String temp=read.nextLine();
                String[] sts=temp.split("\\s+");
                for(int i=0;i<sts.length;i++){
                    if(!sts[i].isEmpty()){ // split() can yield a leading empty token
                        list.add(sts[i]);
                    }
                }
            }
        }
        return list;
    }

    private static ArrayList<String> storeLine(String f1) throws FileNotFoundException{
        // Collect every line of the file, unsplit.
        ArrayList<String> list1 =new ArrayList<String>();
        try(Scanner read=new Scanner(new File(f1))){ // closes the file when done
            while(read.hasNext()){
                list1.add(read.nextLine());
            }
        }
        return list1;
    }

    private static void writer(ArrayList<String> out,String fname) throws IOException{

        FileWriter writr= new FileWriter(new File(fname));
        for(int i=0;i<out.size();i++){
            writr.write(out.get(i)+"\n");

        }
        writr.close();

    }

    public static void main(String [] args) throws Exception{


            ArrayList<String> file1;
            ArrayList<String> file2;
            ArrayList<String> file3;
            ArrayList<String> finl=new ArrayList<String>();
            file1=store("region.txt");//every whitespace-separated token of region.txt
            file2=store("specific.txt");//every whitespace-separated token of specific.txt
            file3=storeLine("specific.txt");//every whole line of specific.txt

            for(int i=0;i<file1.size();i=i+6){//each region.txt record spans 6 tokens; token i is the chromosome name
                long initial=Long.parseLong(file1.get(i+1));
                long end=Long.parseLong(file1.get(i+2));
                String chrom=""+file1.get(i);
                System.out.println("chrome for file1 : "+chrom);
                String region=""+file1.get(i+3);
                System.out.println("region for file1 : "+region);
                //finl.add(region);
                //finl.add(file1.get(j));
                for(int x=0,z=0;x<file2.size() && z<file3.size();x=x+6,z=z+1){
                    long res=Long.parseLong(file2.get(x+1));//resultant number in specific.txt.this number is there after 6 more elements
                    String match=file2.get(x);
                    //boo
                    System.out.println("chrom type : "+chrom+" "+match);
                    //int index=x/6;

                    if(match.equals(chrom)){
                        System.out.println("hi");
                        if(res>=initial && res<=end){
                            System.out.println("hi1");
                            String ress=file3.get(x/6);
                            String finress=""+region+"\t"+ress+"";//merging line from region.txt and specific.txt
                            System.out.println("Initial : "+initial+" end : "+end+" item :"+res);

                            System.out.println("The item is :"+ress);

                            finl.add(finress);//adding the mergedline in another arraylist
                            System.out.println("The item is :"+finress);
                            //System.out.println("The item is :" +finl.get(z));
                            //flag=1;
                        }
                    }


                }
                System.out.println("h2i");

            }

            for(int i=0;i<finl.size();i++){
                System.out.println("******* item is**** :"+finl.get(i));
            }
            writer(finl,"result.txt");//writing result.txt with the arraylist finl


        //}



    }
}
  • What's the format of the files? You say you have to compare numbers; is there one per line? Separated by a special char? How many of these "numbers" can you have in your file? Do you need to consider duplicate values too? – BigMike Sep 09 '15 at 07:10
  • Compiling for 64 bit might work, but it could become slow if you have less RAM than 6.5 GB (file sizes + 512 MB for Java + 1 GB for OS, roughly). – Thomas Weller Sep 09 '15 at 07:17
  • If it's working for small files, but not those large ones, it seems you're lacking memory (or more succinctly overusing the memory you have). Don't read the entire files into memory, or split them up into manageable chunks. – Tim S. Sep 09 '15 at 07:18
  • By "comparing", do you mean "sorting"? There are many sort algorithms. Some of them work well with [Streams](http://docs.oracle.com/javase/7/docs/api/java/io/FileInputStream.html) – Thomas Weller Sep 09 '15 at 07:21
  • Both are text files, of 2 GB and 3 GB respectively. I ran the program on a server machine and it was still running after 22 hours, so I stopped it. @BigMike: each line of the first file contains two numbers (say, a region from 114456 to 255566), and each line of the second file contains a single number. I had to check whether that single number falls within the region given in the first file, and if so merge part of the line from the first file with the line from the second. The words on each line are separated by tabs, and all lines have the same number of words. – Surachit Sarkar Sep 09 '15 at 12:56
  • @BigMike: the input files are in BED and FASTA format, both of which are plain text. – Surachit Sarkar Feb 10 '16 at 09:11
  • @SurachitSarkar: the main issue is the file size. Can you split your program into 2 phases, one loading the files into a database and another performing the cross-correlation? – BigMike Feb 10 '16 at 09:18
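
The suggestions in the comments about not holding both files in memory could look roughly like the sketch below. It is not the asker's program and rests on several assumptions: the class `StreamingBedFastaMerge` and its methods are hypothetical names, the BED lines are grouped by chromosome, FASTA headers look like `>chr1:` as in the question, coordinates are 0-based with an exclusive end, and a single chromosome fits in the Java heap. Only one chromosome is kept in memory at a time, and the BED file and the output are processed line by line.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class StreamingBedFastaMerge {

    /** Strips the leading '>' and an optional trailing ':' from a FASTA header line. */
    static String headerName(String header) {
        String h = header.substring(1).trim();
        return h.endsWith(":") ? h.substring(0, h.length() - 1) : h;
    }

    /** Scans the FASTA file and returns the joined sequence of one chromosome, or null if absent. */
    public static StringBuilder loadChromosome(Path fasta, String name) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(fasta, StandardCharsets.ISO_8859_1)) {
            StringBuilder seq = null;
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(">")) {
                    if (seq != null) break; // next record starts, ours is complete
                    if (headerName(line).equals(name)) seq = new StringBuilder();
                } else if (seq != null) {
                    seq.append(line.trim()); // drop the FASTA line wrapping
                }
            }
            return seq;
        }
    }

    /** Streams the BED file and writes each line followed by its extracted sequence. */
    public static void merge(Path bed, Path fasta, Path out) throws IOException {
        String currentName = null;
        StringBuilder currentSeq = null;
        try (BufferedReader in = Files.newBufferedReader(bed, StandardCharsets.ISO_8859_1);
             BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.ISO_8859_1)) {
            String line;
            while ((line = in.readLine()) != null) {
                String trimmed = line.trim();
                if (trimmed.isEmpty()) continue;
                String[] cols = trimmed.split("\\s+");
                if (!cols[0].equals(currentName)) { // chromosome changed: load the new one
                    currentName = cols[0];
                    currentSeq = loadChromosome(fasta, currentName);
                }
                if (currentSeq == null) continue; // chromosome missing from the FASTA file
                int start = Integer.parseInt(cols[1]);
                int end = Integer.parseInt(cols[2]);
                if (start < 0 || end > currentSeq.length() || start > end) continue; // out of range
                w.write(trimmed);
                w.write('\t');
                w.append(currentSeq, start, end); // bases in [start, end)
                w.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        merge(Paths.get(args[0]), Paths.get(args[1]), Paths.get(args[2]));
    }
}
```

It could be run as `java StreamingBedFastaMerge regions.bed genome.fa result.txt`, where the three file names are placeholders. If the BED file is not grouped by chromosome, the FASTA file is rescanned on every change of chromosome, so sorting the BED file first would likely pay off.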

1 Answer


You might want to try increasing the size of your virtual memory. If it's working for small files and not for large ones, then you are probably running out of memory. 2-3 GB per file is really huge for a text file.
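
For a Java program, the limit that is usually reached first is the JVM's maximum heap rather than the operating system's virtual memory; reading the advice above that way is an assumption. A small check of how much heap the JVM will use might look like this, where the class name `HeapCheck` and the method `maxHeapMiB` are made up for the example:

```java
class HeapCheck {

    /** Maximum heap the JVM will attempt to use, in MiB. */
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: " + maxHeapMiB() + " MiB");
    }
}
```

The limit can be raised with the standard `-Xmx` launcher option, for example `java -Xmx8g RetrieveFromTwoFile` on a 64-bit JVM with enough physical RAM. The value `8g` is only an example.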

Jobin Jose