Reading enzyme acronym and recognition sequence from file

I'm having problems getting the Acronym and Recognition sequence from a file.

Let's say I have these lines in my file:
BsaJI/C'CNNGG//
BsaWI/W'CCGGW//
BsaXI/ACNNNNNCTCCNNNNNNNNNN'/NNN'NNNNNNNGGAGNNNNNGT//
BsaXI/GGAGNNNNNGTNNNNNNNNNNNN'/NNN'NNNNNNNNNACNNNNNCTCC//
BsbI/CAACACNNNNNNNNNNNNNNNNNNNNN'/NN'NNNNNNNNNNNNNNNNNNNGTGTTG//
Bsc4I/CCNNNNN'NNGG//
BscAI/GCATCNNNN'NN/'NNNNNNGATGC//
BscGI/CCCGT/ACGGG//
Bse1I/ACTGGN'/NC'CAGT//


For example, BsaXI is the acronym.
The sequences are:
ACNNNNNCTCCNNNNNNNNNN'
NNN'NNNNNNNGGAGNNNNNGT
GGAGNNNNNGTNNNNNNNNNNNN'
NNN'NNNNNNNNNACNNNNNCTCC

Here's what I'm attempting to do: read up to the first '/' character, which gives the acronym. Then I take the rest of the line and extract each substring up to the next '/' as a recognition sequence for that acronym. I keep parsing through the string this way and insert each sequence into my data structure.
  ifstream theFile;
	  theFile.open(db_filename);
	  if(theFile.is_open()){
	    string aFileLine;
	    while(getline(theFile, aFileLine)){
	   	  // Sets Enzyme Acronym
	      string anEnzymeAcronym = aFileLine.substr(0,aFileLine.find('/'));
	      string aRegoSequence;
	      // Keeps track of the starting position of the Recognition Sequence
	      int tracker = 0;
	      int wholeStringLength = aFileLine.length();
	      string test= "";
	      // While the tracker is not up to the 2nd last '/'
	      while(tracker != wholeStringLength - 2){
	      	string a = aFileLine.substr(tracker, aFileLine.find('/'));
	      	// Updates tracker position to read the next recognition sequence
	      	tracker += a.length()+2;
	      	aRegoSequence = aFileLine.substr(tracker, aFileLine.find('/'));
	      	// Creates a SequenceMap object with the recognition sequence and enzyme acronym
	      	SequenceMap aSequenceMap(aRegoSequence, anEnzymeAcronym);
	      	// Inserts the sequence and acronym into the tree
	      	a_tree.insert(aSequenceMap);
	      }
	    }
	  }
	  else{
	    cout << "No file exists!\n";
	  }
Since you haven't posted your whole code, I can only suggest how to do it.

Prepare a vector to store the recognition sequences.

1. Read each line with std::getline().
2. Pass the line you have read to a std::istringstream.
3. Call std::getline() on that stream with the delimiter '/' to get the enzyme name.
4. If the enzyme is the acronym you need:
    4.1. Keep calling std::getline() with the delimiter '/' to get the recognition sequences, and push each one into the vector.


If you follow these instructions closely you will never fail. Be sure to know what a std::istringstream is.
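
The steps above can be sketched roughly as follows. This is only an illustration, assuming every line has the form acronym/sequence/.../ followed by "//" as in the sample data. The names `EnzymeRecord` and `parse_enzyme_line` are made up for this example and do not come from the original code.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper type for this sketch: the contents of one parsed line.
struct EnzymeRecord {
    std::string acronym;
    std::vector<std::string> sequences;
};

// Parses a line such as "BsaJI/C'CNNGG//" (format assumed from the sample data).
EnzymeRecord parse_enzyme_line(const std::string& line)
{
    EnzymeRecord record;
    std::istringstream in(line);   // step 2: wrap the line in a stream

    // Step 3: the first '/'-delimited token is the enzyme acronym.
    std::getline(in, record.acronym, '/');

    // Step 4.1: every following non-empty token is a recognition sequence.
    // The trailing "//" yields an empty token, which is skipped.
    std::string token;
    while (std::getline(in, token, '/')) {
        if (!token.empty()) record.sequences.push_back(token);
    }
    return record;
}
```

Calling `parse_enzyme_line` once per line read with `std::getline(theFile, aFileLine)` would give the acronym and its sequences for that line.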
I was planning to use std::getline with '/' as the delimiter, but I don't know how to skip the enzyme acronym, because if I do something like
 
getline(theFile, aRegoSequence, '/');

It's going to read up to the first '/', which would give the enzyme acronym, so I opted to parse through the string with a counter that keeps track of the position. I don't think storing the sequences in a vector is necessary, since I create a SequenceMap with the sequence and the acronym and use my own insert function to place it into my tree.
If you follow these instructions closely you will never fail.

What I was asking is: do you know std::istringstream? After you read a whole line with std::getline(), pass the string to a std::istringstream and let it do the job.

getline(theFile, aRegoSequence, '/');

I never told you to do that. I want you to read a whole line, pass it to a std::istringstream, and then use std::getline() with the delimiter '/' on that stream.
For your domain, it would be worthwhile to learn to use the regular expressions library.
http://en.cppreference.com/w/cpp/regex
It would come in very handy, over and over again.

Here's an example of using regular expressions to do this particular task:

#include <iostream>
#include <regex>
#include <string>
#include <map>
#include <set>
#include <sstream>

// rec_seq_map_type maps an acronym (key) to a set of all its recognition sequences
using rec_seq_map_type = std::map< std::string, std::set<std::string> > ;

rec_seq_map_type get_rec_seq_from( std::istream& stm )
{
    rec_seq_map_type map ;

    std::string line ;
    while( std::getline( stm, line ) )
    {
        // parse the line into / delimited tokens
        // a token is a sequence of one or more characters other than /
        static const std::regex re( "[^/]+" ) ; // constructed once, not once per line
        std::sregex_iterator iter( line.begin(), line.end(), re ), end ;

        if( iter != end ) // if there is at least one match
        {
            // the first token, the key
            auto& set = map[ iter->str() ] ; // reference to the set associated with this key

            // the remaining tokens are the recognition sequences; insert them into the set
            for( ++iter ; iter != end ; ++iter ) set.insert( iter->str() ) ;
        }
    }

    return map ;
}

int main()
{
    std::istringstream file(
                "BsaJI/C'CNNGG//\n"
                "BsaWI/W'CCGGW//\n"
                "BsaXI/ACNNNNNCTCCNNNNNNNNNN'/NNN'NNNNNNNGGAGNNNNNGT//\n"
                "BsaXI/GGAGNNNNNGTNNNNNNNNNNNN'/NNN'NNNNNNNNNACNNNNNCTCC//\n"
                "BsbI/CAACACNNNNNNNNNNNNNNNNNNNNN'/NN'NNNNNNNNNNNNNNNNNNNGTGTTG//\n"
                "Bsc4I/CCNNNNN'NNGG//\n"
                "BscAI/GCATCNNNN'NN/'NNNNNNGATGC//\n"
                "BscGI/CCCGT/ACGGG//\n"
                "Bse1I/ACTGGN'/NC'CAGT//\n" ) ;

    const auto rec_seq_map = get_rec_seq_from(file) ;

    for( const auto& pair : rec_seq_map )
    {
        std::cout << "recognition sequences for acronym: " << pair.first << '\n' ;
        for( const auto& str : pair.second ) std::cout << str << '\n' ;
        std::cout << '\n' ;
    }
}

http://coliru.stacked-crooked.com/a/270d027939e06c39
I attempted this another way, using an integer to keep track of how far I have parsed, but I'm getting an out-of-bounds error.
ifstream theFile;
	  theFile.open(db_filename);
	  if(theFile.is_open()){
	    string aFileLine, randomLine;
	    // Skip the first 10 lines of the files
	    for(int i = 0; i < 10; i++){
	    	getline(theFile, randomLine, '\n');
	    }
	    while(getline(theFile, aFileLine)){
	   	  // Sets Enzyme Acronym
	      string anEnzymeAcronym = aFileLine.substr(0,aFileLine.find('/'));
	      string aRegoSequence;
	      // Keeps track of the starting position of the Recognition Sequence
	      int tracker = anEnzymeAcronym.length() +1;
	      // While the tracker is not up to the 2nd last '/'
	      while(tracker != aFileLine.length() - 1){
	      	string remainingString = aFileLine.substr(tracker);
	      	aRegoSequence = aFileLine.substr(tracker, remainingString.find('/'));
	      	// Updates tracker position to read the next recognition sequence
	      	tracker += aRegoSequence.length()+1;
	      	// Creates a SequenceMap object with the recognition sequence and enzyme acronym
	      	SequenceMap aSequenceMap(aRegoSequence, anEnzymeAcronym);
	      	// Inserts the sequence and acronym into the tree
	      	a_tree.insert(aSequenceMap);
	      }
	    }
	  }
	  else{
	    cout << "No file exists!\n";
	  }


I'm not sure why this doesn't work, because I think my logic is correct. I tried using the same logic in cpp shell and it works perfectly.

#include <iostream>
#include <string>
using namespace std;

int main()
{
  string wholeSequence = "AnSi/XDSACSD'CSADASDCm/SDASDASD'SDASXCZXm//";
  string ac = wholeSequence.substr(0, wholeSequence.find('/'));
  int tracker = ac.length() +1;
  cout << "Tracker length: " << tracker << endl;
  while( tracker != wholeSequence.length() -1){
    string remainingString = wholeSequence.substr(tracker);
    string en = wholeSequence.substr(tracker, remainingString.find('/'));
    tracker += en.length()+1;
    cout << en << endl;
    cout << "Tracker length: " << tracker << endl;
  }
  cout << "Whole Sequence Length: " << wholeSequence.length() << endl;
  cout << "end" << endl;
  
}


This is my SequenceMap class constructor:
public:
SequenceMap(const std::string &a_rec_seq, const std::string &an_enz_acro) {

		recognition_sequence_ = a_rec_seq;
		enzyme_acronyms_.push_back(an_enz_acro);
	}
private:

	std::string recognition_sequence_;
	std::vector<std::string> enzyme_acronyms_;
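
One possible explanation for the out-of-bounds error, assuming the real file contains something the cpp shell test string does not: a blank line, a line without any '/', or a line that does not end in "//". In those cases `tracker` steps past `length() - 1` without ever being equal to it (and for an empty line `length() - 1` wraps around, because `length()` is unsigned), so `substr(tracker)` is eventually called with a position beyond the end of the string and throws `std::out_of_range`. Below is a sketch of the same position-tracking idea that stops on a bounds check instead of an exact match, using the two-argument form of `find`. The name `split_sequences` is hypothetical.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns the recognition sequences of one line,
// i.e. every non-empty '/'-separated field after the acronym.
std::vector<std::string> split_sequences(const std::string& line)
{
    std::vector<std::string> sequences;

    // Position of the '/' that ends the acronym; no '/' means nothing to parse.
    std::string::size_type slash = line.find('/');
    if (slash == std::string::npos) return sequences;

    std::string::size_type start = slash + 1;   // first character after the acronym
    while (start < line.size()) {               // bounds check, not an exact match
        // Search from 'start', so no temporary "remaining" string is needed.
        std::string::size_type end = line.find('/', start);
        if (end == std::string::npos) end = line.size();   // tolerate a missing final '/'

        if (end > start) sequences.push_back(line.substr(start, end - start));
        start = end + 1;
    }
    return sequences;
}
```

Because the loop condition is `start < line.size()`, malformed or empty lines simply yield no sequences instead of throwing.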