Genome Work

Hello all, I have been working on this for a while and cannot seem to get the program to work. In short, the problems I am running into are these:
1. If the read function is in play and reads the file containing the genome (the same genome that is currently saved to the variable), then when I run it all I get is numbers.
2. When the variable genome is simply assigned the entire genome sequence, the program prints the first gene (an incorrect one at that) 3 times and I cannot find out why.
3. The point of this program is to read the genome sequence from a text file and list the genes located within it (I would also like it to save the output to another text file). The rules are: a gene's length is divisible by 3; a gene starts with ATG and ends with TAG, TAA, or TGA; and it cannot contain any of the 4 sequences just mentioned.

The code is as follows:

#include <iostream>
#include <string>
#include <cctype>
#include <cmath>
#include <algorithm>
#include <fstream>
#include "Signature.h"
using namespace std;

int main() 
{
	DoSign();

char str[256];
string genome;
string iBuffer;

cout<<"Enter the name of an existing text file: ";
cin.get (str,256);
ifstream is(str);

if(is.fail());
exit(1);

genome.clear();
while(!is.eof())
{

is>>iBuffer;
genome += iBuffer;
}
is.close();

cout<<genome<<endl;
cout<<"past the file retrieve"<<endl;

^^^^^^^This section returns a string of numbers >.<^^^^^^^


/*string genome = "ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACC
GCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAG
CGGCAGGAATAAGGAAAAGCAGCCTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCT
CCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCA
GGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTCT
TCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAGTTTAA
TTACAGACCTGAA";*/

	//If the string is NOT empty, then enter this loop
    while(!genome.empty())  
    {
    	
    	 //If ATG is not found then NPOS is returned
   		if(genome.find("ATG",0) == string::npos)
    	        {                                
       		 genome.clear(); 
   		}
    		else
    		{
    			//Locates the beginning ATG sequence
        		int startGene = genome.find("ATG",0);
        		//Locates ending sequence
        		int endGene = min(min(genome.find("TAA"), genome.find("TAG")), genome.find("TGA")); 
        		
               //Places located preliminary gene into a substring
        		string currentGene = genome.substr(startGene + 3, endGene - (startGene +3));

				//Validates that the preliminary gene is divisble by 3 and outputs it if true
        		if((currentGene.length() % 3) == 0)
        		{
        			cout<<""<<endl;
            		cout << currentGene << endl;
        		}
				
				
        		endGene += 3;
        		
        		//Erases current Gene including starting ATG and ending sequence.
        		genome.erase(0, (endGene));
        		

    		}
    }



    return 0;
}
1) What happens if endGene becomes npos (not found in the string)?
2) What happens if the first stop codon found is before startGene in the string?
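
A minimal sketch of the two cases asked about above, assuming the same unanchored find calls as the posted code. The names SearchResult and classify are hypothetical and exist only for illustration. Because std::string::npos is the largest value of the size type, std::min yields npos only when none of the three stop codons is present.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: reports which situation the unanchored searches produce.
enum class SearchResult { NoStart, NoStop, StopBeforeStart, Ok };

SearchResult classify(const std::string& genome) {
    const std::string::size_type npos = std::string::npos;

    const std::string::size_type start = genome.find("ATG");
    if (start == npos) {
        return SearchResult::NoStart;
    }

    // Same idea as the nested min() calls in the posted code: earliest stop codon
    // anywhere in the string, regardless of where the start codon is.
    const std::string::size_type stop =
        std::min({genome.find("TAA"), genome.find("TAG"), genome.find("TGA")});
    if (stop == npos) {
        return SearchResult::NoStop;           // question 1
    }
    if (stop < start) {
        return SearchResult::StopBeforeStart;  // question 2
    }
    return SearchResult::Ok;
}
```

In the posted code, a stop codon that lies before the start codon makes endGene - (startGene + 3) negative. Passed to substr as a length, that value converts to a very large unsigned number, so the "gene" runs to the end of the string.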
how long is a gene?
and this:
a Gene starts with ATG and ends with TAG, TAA, TGA and cannot contain any of the 4 sequences just mentioned

seems completely contradictory to me.
The sequence starts with the codon "ATG", and ends with a "TAG", "TAA", or "TGA". The inside of the sequence (between the start and end codons) cannot contain additional start or end codons.

I think that is what the TC was trying to say. I could guess answers to MiiNiPaa's questions, but I'd hope the TC has some specification they are working towards.
First, it would be really nice if the code were in code tags (and had intuitive indentation). See http://www.cplusplus.com/articles/jEywvCM9/

char str[256];
cout<<"Enter the name of an existing text file: ";
cin.get( str, 256 );  // bad style.  Use std::string rather than char array
std::ifstream is(str);
std::string genome = str; // Error. This does not read the contents of a file into "genome"
is.close();


Then some biological problems. A codon is a triplet: three characters. You could scan in any of these three ways:
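// Frame 1: codons start at offsets 0, 3, 6, ...
std::size_t pos = 0;
while ( 2 + pos < genome.size() ) {
  if ( genome.compare( pos, 3, "ATG" ) == 0 ) { /* start codon in this frame ... */ }
  pos += 3;
}

// Frame 2: codons start at offsets 1, 4, 7, ...
pos = 1;
while ( 2 + pos < genome.size() ) {
  if ( genome.compare( pos, 3, "ATG" ) == 0 ) { /* start codon in this frame ... */ }
  pos += 3;
}

// Frame 3: codons start at offsets 2, 5, 8, ...
pos = 2;
while ( 2 + pos < genome.size() ) {
  if ( genome.compare( pos, 3, "ATG" ) == 0 ) { /* start codon in this frame ... */ }
  pos += 3;
}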

Three different reading frames. http://en.wikipedia.org/wiki/Reading_frame

It is true that a gene can be in any of the three frames. The use of find seeks them all at once. However, once you have chosen a frame, you should look for a stop codon only in that frame, rather than find something and check its frame-correctness later.

It is possible that there are overlapping genes. E.g.
ATGAATGTATAGCTAA
=>
ATG-AAT-GTA-TAG
     ATG-TAT-AGC-TAA

If you have found a start-codon and then find a new start-codon in same reading frame, then your "cannot contain" rule says that a possible gene starts from the latter.
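
The frame-by-frame idea could be sketched like this. It is one possible reading of the rules in this thread, not a reference implementation: the function names are invented for the example, a gene is returned with its start and stop codons included, and a second ATG in the same frame is taken to restart the gene, as described above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// True if the three characters at pos form one of the stop codons.
bool is_stop(const std::string& g, std::size_t pos) {
    return g.compare(pos, 3, "TAA") == 0 ||
           g.compare(pos, 3, "TAG") == 0 ||
           g.compare(pos, 3, "TGA") == 0;
}

// Collects the genes of a single reading frame (frame is 0, 1 or 2).
std::vector<std::string> genes_in_frame(const std::string& genome, std::size_t frame) {
    std::vector<std::string> genes;
    const std::size_t none = std::string::npos;
    std::size_t start = none;  // offset of the pending start codon, if any

    for (std::size_t pos = frame; pos + 3 <= genome.size(); pos += 3) {
        if (genome.compare(pos, 3, "ATG") == 0) {
            start = pos;  // a later ATG in the same frame replaces the earlier one
        } else if (start != none && is_stop(genome, pos)) {
            genes.push_back(genome.substr(start, pos + 3 - start));
            start = none;  // look for the next start codon
        }
    }
    return genes;
}

// Runs the scan over all three frames, frame 1 first.
std::vector<std::string> find_genes(const std::string& genome) {
    std::vector<std::string> all;
    for (std::size_t frame = 0; frame < 3; ++frame) {
        const std::vector<std::string> found = genes_in_frame(genome, frame);
        all.insert(all.end(), found.begin(), found.end());
    }
    return all;
}
```

Because every step moves by exactly three characters, any gene found this way has a length divisible by 3 without a separate check. For the overlapping example ATGAATGTATAGCTAA it reports ATGAATGTATAG from the first frame and ATGTATAGCTAA from the second.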
Zhuge and Keskiverto were right about the codons: the gene begins with ATG, ends with TAA/TGA/TAG, and cannot contain any of the ending triplets or the beginning triplet.

The gene can be as long as it wants; the only stipulation on length is that it is divisible by 3 (simply because it is made of triplet codons).

Also, the string::npos check is there to detect the case where the genome does not contain any starting codon; that statement is for validation purposes.
Also, I did update my code a little bit and it is giving me a blank output. Any ideas on how to fix this?
What do you mean by "blank output"?
When I run the program it simply prints my signature at the top, and afterwards shows "Press any key to continue". I hadn't made your suggested change in my code yet because I did alter the input statement so that it reads the contents of the file into a string properly. However, it appears now that when the program is run, it simply prints my signature and nothing else. The code above has been altered to reflect this change.
What does the DoSign do?
DoSign prints my signature on the output. It is done this way so that with a single short line I can add my signature (which is 4 lines) to whatever piece of work I need to. On the version on my computer it prints it out. If you're troubleshooting it here then it can be removed.

So, any further help with getting this code to work? Or do I have to erase the selection section and replace it with the code presented to me (the frames)?
When i run the program it simply states my signature at the top, and afterwards shows the "Press any key to continue".
DoSign();
char str[256];
string genome;
string iBuffer;
cout<<"Enter the name of an existing text file: ";

You state that line 5 (the cout prompt) does not execute. The string constructors could in principle throw exceptions, but line 1 (the DoSign() call) is still the first thing to check.
Yes, I commented out the signature; however, at this point it is hitting exit 1, which in the code is line 26. So I am correctly entering the name of the text file containing the genome that needs to be read into the string, but the program exits right after, as if opening the file had failed.
This code?
cin.get (str,256);
ifstream is(str);

if(is.fail());
exit(1);

Now that we pay attention to it, you have in effect written:
if( is.fail() ) {
}

exit(1);

Using braces even with one-liners is more than just "style".


Besides, you could write return 1; rather than exit(1);
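
To make the effect of the stray semicolon concrete, here is a small sketch with two made-up functions that stand in for the failure check. The bool parameter plays the role of is.fail(); nothing here touches a real file.

```cpp
// Mirrors "if(is.fail()); exit(1);": the semicolon is an empty statement that
// forms the entire body of the if, so the next line runs unconditionally.
int exit_code_with_stray_semicolon(bool failed) {
    if (failed);
        return 1;  // indentation is misleading: this is not part of the if
}

// With braces the intent and the behaviour agree.
int exit_code_with_braces(bool failed) {
    if (failed) {
        return 1;
    }
    return 0;
}
```

Many compilers can warn about the first form when warnings are enabled, which is another reason to build with them switched on.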
Oh awesome, I missed that, thank you! Just 1 more thing: I have been working on this code for the better part of the day and it finally reads the file and outputs the genes found. The only problem is that it repeats the first gene 3 times (which is also a false gene, since it contains "TAA"). Any clues as to why? The main section I can think this would refer to is the section starting with:

while(!genome.empty())
{
........
}

This is the section that sorts and prints out the genes.

After a manual search there are indeed only 2 genes in the whole genome that qualify, so why is it reading this false gene and printing it 3 times?

#include <iostream>
#include <string>
#include <cctype>
#include <cmath>
#include <algorithm>
#include <fstream>
#include "Signature.h"
using namespace std;

int main() 
{
	DoSign();
string genome;
string iBuffer;	

char filename[500];
ifstream file;
cout<<"Please enter the file containing the genome that requires analysis: "<<endl;
cin.get (filename,500);
file.open(filename);

if(!file.is_open())
{
	exit (EXIT_FAILURE);
}

file>>genome;
while(file.good())
{
	cout<<genome<<" ";
	file>>genome;
}

//genome.clear();
while(!file.eof())
{

file>>iBuffer;
genome += iBuffer;
}

cout<<""<<endl;
cout<<"The following genome has been located and analyzed for genes."<<endl;
cout<<""<<endl;
cout<<genome<<endl;

cout<< ""<<endl;
cout<<"The following genes have been identified."<<endl;

	//If the string is NOT empty, then enter this loop
    while(!genome.empty())  
    {
    	
    	 //If ATG is not found then NPOS is returned
   		if(genome.find("ATG",0) == string::npos)
    	{                                
       		 genome.clear(); 
   		}
    		else
    		{
    			//Locates the beginning ATG sequence
        		int startGene = genome.find("ATG",0);
        		//Locates ending sequence
        		int endGene = min(min(genome.find("TAA"), genome.find("TAG")), genome.find("TGA")); 
        		
               //Places located preliminary gene into a substring
        		string currentGene = genome.substr(startGene + 3, endGene - (startGene +3));

				//Validates that the preliminary gene is divisble by 3 and outputs it if true
        		if((currentGene.length() % 3) == 0)
        		{
        			cout<<""<<endl;
            		cout << currentGene << endl;
        		}
				
				
        		endGene += 3;
        		
        		//Erases current Gene including starting ATG and ending sequence.
        		genome.erase(0, (endGene));
        		

    		}
    }


file.close();

DoSign();
system("pause");
    return 0;
}
For example, the find(s) on line 64 (the int endGene = min(...) line) start from position 0 rather than from after startGene.
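
A minimal sketch of that fix, assuming the rest of the loop stays as posted. The helper name first_stop_from is hypothetical. It passes a starting offset to each find call, so stop codons that occur before the start codon are ignored. Note that it does not restrict the search to the start codon's reading frame.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper: offset of the earliest stop codon at or after `from`,
// or std::string::npos if there is none.
std::size_t first_stop_from(const std::string& genome, std::size_t from) {
    return std::min({genome.find("TAA", from),
                     genome.find("TAG", from),
                     genome.find("TGA", from)});
}
```

In the posted loop this would be called as first_stop_from(genome, startGene + 3), and the result should be compared against string::npos before it is used in substr or erase.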