I constructed a class for dealing with a certain file format, and its constructor goes through the file and searches for the key information I need. The idea is that characters are written on multiple lines, and I want to read the first character of every line, then the second character of every line, and so on.

I've got the class definition and constructor below (possibly horrible; this is my first time writing anything serious in C++):

class AlignmentStream{
private:
    const char* FileName;
    std::ifstream FileStream;
    std::vector<int> NamesStart;
    std::vector<int> SequencesStart;
    std::vector<int> SequenceLengths;
    int CurrentPosition;
    int SequenceNum;


public:
    AlignmentStream(const char* Filename);
    std::vector<int> tellSeqBegins();
    std::vector<int> tellNamesStart();
    std::vector<int> tellSequenceLengths();
    int getSequenceNum();
    AlignedPosition get();
};


AlignmentStream::AlignmentStream(const char* Filename)
{
    FileName = Filename;
    FileStream.open(FileName);
    std::cout << "Filestream is open: " << FileStream.is_open() << std::endl;
    std::cout << "Profiling the alignment file..." << std::endl;
    if (FileStream.is_open() == false)
        throw StreamClosed(); // Make sure the stream is indeed open else throw an exception.
    if (FileStream.eof())
        throw FileEnd();
    char currentchar;
    // Let's check that the file starts out in the correct fasta format.
    currentchar = FileStream.get();
    if (FileStream.eof())
        throw FileEnd();
    if (currentchar != '>')
        throw FormatError();
    NamesStart.push_back(FileStream.tellg());
    bool inName = true;
    bool inSeq = false;
    int currentLength = 0;
    while(!FileStream.eof()){
        while (!FileStream.eof() && inName == true) {
            if (currentchar == '\n') {
                inName = false;
                inSeq = true;
                SequencesStart.push_back(FileStream.tellg());
            } else {
                currentchar = FileStream.get();
            }
        }
        while (!FileStream.eof() && inSeq == true) {
            if (currentchar == '>') {
                inName = true;
                inSeq = false;
                NamesStart.push_back(FileStream.tellg());
            } else {
                if (currentchar != '\n') {
                    currentLength++;
                }
                currentchar = FileStream.get();
            }
        }
        SequenceLengths.push_back(currentLength); // Sequence lengths is built up here - (answer to comment)
        currentLength = 0;
    }
    SequenceNum = (int)SequencesStart.size();
    // Now let's make sure all the sequence lengths are the same.
    std::sort(SequenceLengths.begin(), SequenceLengths.end());
    //Establish an iterator.
    std::vector<int>::iterator it;
    //Use unique algorithm to get the unique values.
    it = std::unique(SequenceLengths.begin(), SequenceLengths.end());
    SequenceLengths.resize(std::distance(SequenceLengths.begin(),it));
    if (SequenceLengths.size() > 1) {
        throw FormatError();
    }
    std::cout << "All sequences are of the same length - good!" << std::endl;
    CurrentPosition = 1;
    FileStream.close();
}

Apologies for it being quite the chunk. Anyway, the constructor goes through the file char by char and records the starting position of each line to be read. The get function (shown below) then seeks to the start of each line plus however many characters are needed to reach the right one, given by the member variable CurrentPosition. It then constructs another custom object of mine called AlignedPosition and returns it.

AlignedPosition AlignmentStream::get()
{
    std::vector<char> bases;
    for (std::vector<int>::iterator i = SequencesStart.begin(); i != SequencesStart.end(); i++) {
        // cout messages are for debugging purposes.
        std::cout << "The current filestream position is " << FileStream.tellg() << std::endl;
        std::cout << "The start of the sequence is " << *i << std::endl;
        std::cout << "The position is " << CurrentPosition << std::endl;
        FileStream.seekg((int)(*i) + (CurrentPosition - 1) );
        std::cout << "The Filestream has been moved to " << FileStream.tellg() << std::endl;
        bases.push_back(FileStream.get());
    }
    CurrentPosition++;
    //this for loop is just to print the chars read in for debugging purposes.
    for (std::vector<char>::iterator i = bases.begin(); i != bases.end(); i++) {
        std::cout << *i << std::endl;
    }
    return AlignedPosition(CurrentPosition, bases);
}

As you can see, the first loop iterates through the start position of each line, adds CurrentPosition, gets the char at that offset, and pushes it onto a vector. This vector is passed to my AlignedPosition constructor; everything else is debugging output. However, upon execution I see this:

eduroam-180-37:libHybRIDS wardb$ ./a.out
Filestream is open: 1
Profiling the alignment file...
All sequences are of the same length - good!
SeqNum: 3
Let's try getting an aligned position
The current filestream position is -1
The start of the sequence is 6
The position is 1
The Filestream has been moved to -1
The current filestream position is -1
The start of the sequence is 398521
The position is 1
The Filestream has been moved to -1
The current filestream position is -1
The start of the sequence is 797036
The position is 1
The Filestream has been moved to -1
?
?
?
Error, an invalid character was present
Couldn't get the base, caught a format error! 

In short, the file stream position is -1 and does not change when seekg is used, which leads to invalid characters and an exception being thrown in my AlignedPosition constructor. Is this something to do with already having navigated through the file to the end in my constructor? Why does my position in the input stream remain at -1 all the time?

Thanks, Ben.

SJWard
  • If you get an end of file on a stream, `seekg` may not clear it. You need to call `clear()` on the stream first. Since you read until EOF, you probably need to call `clear`. (Ref: http://en.wikipedia.org/wiki/Seekg ) – Joe Z Nov 29 '13 at 17:26
  • Thanks, I hope basically it is clear what I'm trying to accomplish, the input file has a series of character strings on lines, what I first do in the constructor is find the start positions of each one (they consist of a line beginning with '>' and then a name, and then the line beneath is the actual string). So then when I call get it should get the first char of every sequence (aka the first position), and then when I call get() a second time - the second char of every sequence, and so on. – SJWard Nov 30 '13 at 00:16
  • Hi, your clear() suggestion works! If you put it as an answer I'll up vote and set it as the answer. – SJWard Nov 30 '13 at 15:00
  • PS if anyone has comments on how I'm moving through the file or if there's an easier way for me to do what I'm doing I'm happy to hear - this is my first go at doing stuff with file streams in any serious manner beyond just reading in Hello World! – SJWard Nov 30 '13 at 15:01
  • Done! As for doing what you're doing more easily or more efficiently: If you have enough memory, then what I'd do is read the data into memory, probably storing it in `vector`s or `array`s. Since all the records are supposed to be the same size, you know what size they're supposed to be after the first record, which simplifies things. It means you can detect an error immediately, rather than at the end, and it can also be used to simplify your read-in code to read in fixed-length records, just verifying the newline and `>` are where you expect. – Joe Z Nov 30 '13 at 16:53
  • Hi, thanks for the suggestion. Part of my challenge is anticipating very big datasets - I don't know what passes for a big structure in computer science and programming as much as big data in a biological context. Perhaps I'll play with both approaches and see how it goes. – SJWard Nov 30 '13 at 17:30
  • If you're using a 64-bit compiler and 64-bit OS, then really it's a matter of looking at how big the data is relative to the RAM in your system (i.e. if you have 8GB of RAM, you can probably work with files up to 7.5GB comfortably). A different issue is accessing the data in an order that's friendly to the CPU and caches. Your current "column wise" approach will run slowly even if everything is in RAM. But, like any optimization problem, the first step is to get something that works, and _then_ start analyzing the bottlenecks. Premature optimization is the root of all evil. – Joe Z Nov 30 '13 at 20:30
  • In a biological context the sequences I'm dealing with are aligned, so it makes intuitive sense to store the columns in a class (my AlignedPosition class you see in the code). I figured it would be best to read them in as such. I wonder, say I pre-allocate a load of empty AlignedPositions in a vector and then just read in the file sequences as strings, and then looped over each string and added the char to the right AlignedPosition in the vector, would this be faster? – SJWard Nov 30 '13 at 21:30
  • I guess it really depends on the processing you'll be doing after. Experiment and measure! – Joe Z Dec 01 '13 at 00:22
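
For reference, the in-memory approach suggested in the comments above could look roughly like the sketch below. It is only an illustration under assumptions: the helper names `readSequences`, `column` and `demoColumns` are hypothetical and not part of the question's code, `std::runtime_error` stands in for the asker's own exception types, and the input is assumed to be FASTA-style text where a line starting with `>` begins a record and the following lines hold its sequence. Each finished record is compared against the first one, so a length mismatch is reported as soon as the next record begins rather than after the whole file has been profiled.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: read every sequence of a FASTA-style stream into memory.
// A line starting with '>' begins a new record; other lines are appended to it.
std::vector<std::string> readSequences(std::istream& in)
{
    std::vector<std::string> seqs;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line[line.size() - 1] == '\r')
            line.erase(line.size() - 1);   // tolerate Windows line endings
        if (line.empty())
            continue;
        if (line[0] == '>') {
            // The previous record is complete: compare it with the first one.
            if (seqs.size() > 1 && seqs.back().size() != seqs.front().size())
                throw std::runtime_error("sequences differ in length");
            seqs.push_back(std::string());
        } else {
            if (seqs.empty())
                throw std::runtime_error("input does not start with '>'");
            seqs.back() += line;
        }
    }
    // The last record has no following '>' line, so check it here.
    if (seqs.size() > 1 && seqs.back().size() != seqs.front().size())
        throw std::runtime_error("sequences differ in length");
    return seqs;
}

// Column `pos` (0-based) of the alignment: one character from each sequence.
std::vector<char> column(const std::vector<std::string>& seqs, std::size_t pos)
{
    std::vector<char> bases;
    bases.reserve(seqs.size());
    for (std::size_t i = 0; i < seqs.size(); ++i)
        bases.push_back(seqs[i].at(pos));  // at() throws if pos is out of range
    return bases;
}

// Usage sketch on an in-memory stream standing in for a real file.
void demoColumns()
{
    std::istringstream in(">s1\nACGT\n>s2\nAC\nTT\n");
    const std::vector<std::string> seqs = readSequences(in);
    assert(seqs.size() == 2 && seqs[1] == "ACTT");
    const std::vector<char> col = column(seqs, 1);
    assert(col[0] == 'C' && col[1] == 'C');
}
```

With the sequences held as strings, each column is built by indexing rather than seeking, so no stream state is involved after the initial read. Whether this is faster than seeking in the file for very large alignments would need measuring, as noted above.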

1 Answer

If you get an end of file on a stream, seekg may not clear it. You need to call clear() on the stream first. Since you read until EOF, you probably need to call clear(). (Ref: en.wikipedia.org/wiki/Seekg )
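
A minimal sketch of the fix, assuming nothing about the rest of the class: the helper names `charAt`, `drain` and `demoClearThenSeek` are hypothetical, and a `std::istringstream` stands in for the alignment file. Once a read has failed at the end of the stream, `failbit` is set as well as `eofbit`, and while `fail()` is true `tellg()` returns -1 and `seekg()` has no effect. Calling `clear()` resets the flags so the seek can go through. Note that the constructor in the question also calls `FileStream.close()`; a closed `ifstream` cannot seek either, so the stream would presumably need to stay open (or be reopened) as well.

```cpp
#include <cassert>
#include <ios>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: return the character at absolute offset `pos`,
// resetting the stream's error flags before seeking.
char charAt(std::istream& in, std::streamoff pos)
{
    in.clear();                     // drop eofbit/failbit left by earlier reads
    in.seekg(pos, std::ios::beg);   // with the flags cleared, the seek takes effect
    return static_cast<char>(in.get());
}

// Read the stream to the end, as the constructor's profiling loop does.
void drain(std::istream& in)
{
    while (in.get() != std::char_traits<char>::eof()) {
        // discard the character
    }
}

// Usage sketch on an in-memory stream standing in for the alignment file.
void demoClearThenSeek()
{
    std::istringstream in(">a\nACGT\n");
    drain(in);
    assert(in.eof() && in.fail());             // both flags set by the failed read
    assert(in.tellg() == std::streampos(-1));  // tellg() reports -1 while fail() is true
    assert(charAt(in, 3) == 'A');              // offset 3 is the first base
}
```

In the question's `get()`, the equivalent change would be a `FileStream.clear();` before the `FileStream.seekg(...)` call.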

Joe Z