### Reading a .txt file into a 3D array

I am trying to create an array for calculating allele frequencies from a text file with nucleotide data for multiple individuals from several different populations. Basically what I want to do is count the number of each allele (A, T, C, or G) that occur at each locus (position) for each population so that I can then calculate the frequency of each allele in each population.
Here is an example of a truncated version of the text file I want to read into the array:
>1-1_Sample 1
GCCCATGGCT
>2-1_Sample 1
GAGTGTATGT
>3-1_Sample 1
TGTTCTATCT
>1-1_Sample 2
GCTTAGCCAT
>2-1_Sample 2
TGTAGTCAGT
>3-1_Sample 2
GGGAACCAAG
>1-1_Sample 3
TGGAAGCGGT
>2-1_Sample 3
CGGGAGGAGA
>3-1_Sample 3
CTTCAGTTTT

Here the first letter after the ">" denotes an individual in the population, so here each population has 3 individuals. The sample number denotes which population the individual belongs to. So there are three populations here.
The string of letters after each header line is the nucleotide sequence. I want to read in the first letter from each sequence (G, G, and T for the first population) and then place them in an array that will calculate a total number of each allele (A, T, C, or G) in each position. So for example, population 1 has 2Gs and 1T in the first position, population 2 has 2Gs and 1T in the first position, and the 3rd population has 1T and 2 Cs in the first position.

How can I make an array to contain this data? I want to change the nucleotide data to numerical data, where A=0, C=1, G=2, and T=3, and then I need to total the number of 0s (As) in the first position for each population. I understand that this should be a 3D array with columns for each site, rows for each of the 4 nucleotides, and "stacked" for each population. I am not at all sure how to do this, I'm just starting out in C++ and this seems a bit difficult to read into a 3D array AND count the number of each nucleotide for each cell.

I need intelectual help getting over this obstacle :/
Homework is for students

http://www.cplusplus.com/forum/articles/7459/
Yes, I've seen that post, my question is not how to make an array, but how to read in the nucleotide data I discussed above in an additive fashion. I'm having a hard time connecting the dots between setting up an array where you just read a text file directly into it for storage purposes, and being able to count the nucleotides into the correct cell of the array. I didn't bother writing out all of my code which I used to create an array of the proper size, etc. because that's not my question. I want someone to help me undertand a way to add nucleotides to either the A, T, C, or G cell of the array for each population. I've spent a great deal of time searching posts on this forum, and none of them address this issue, that's why I made this post.
Also, I'm not a student, but someone interested in learning more about C++ for research related reasons, I have no one in my working group I can ask, so I turned to the internet. I do know how to google.
You need to associate each letter with an index of an array:

 ``123456789101112`` ``````const char *phrase = "this is a phrase that does not mean anything"; size_t length = strlen(phrase); // size_t is of unsigned integral type int letterCount[26]; for (int i = 0; i < 26; i++) letterCount[i] = 0; for (size_t i = 0; i < length; i++) letterCount[ phrase[i] - 'a' ]++; // phrase must be all lowercase letters for this to work for (int i = 0; i < 26; i++) cout << (char)(i + 'a') << " count = " << letterCount[i] << endl;``````

For you, I might do:
 ``12345`` ``````switch (code[i]) { case 'A': count[0]++; break; case 'C': count[1]++; break; //... ``````
Last edited on
 I'm not a student

Ok good, I was teasing about google, but you'd be surprised how many people want you to do it for them. especially kids.... Mine are the worst.

I was interested in your project, so I put this together while waiting for you to reply. I do not claim it is the best way, there are probably many better ways, but this seems to work with your sample data but don't think it will work with a larger file. without some edits, not sure, haven't tried it.

Try it with the sample file, try to understand what it does and ask if you have questions.

Useage: Command <FileName.txt>

I noticed if you put in a improper file name, it runs anyway, but with zero data. should probably fix that.

 ``123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126`` ``````#include #include #include #include #include #include #include #include #include #include using namespace std; ofstream outputfile; string line; // input line string // >1-1_Sample 1 // ^ is x string x; // >1-1_Sample 1 // ^^^^^^^^ is y string y; // >1-1_Sample 1 // ^ is z string z; char allele [3][3][10]; // 3D array //ex:allele [i][j][k] int i=0; int j=0; int k=0; // nucleotide int a=0; int c=0; int g=0; int t=0; int counter=0; // used to count number of Rows int main (int argc, char *argv[]) { if (argc!=2) { cout << "Useage: Command \n"; } else { // open file argv[1] ifstream inputfile (argv[1]); while ( inputfile.good() ) { getline(inputfile,line); if (line[0]=='>') { // cout << line << endl; // cout << line.size() << endl; x=line; y=line; z=line; x.erase(0,1); x.erase(1,12); y.erase(0,3); y.erase(8,2); z.erase(0,12); // cout << x << endl; // cout << y << endl; // cout << z << endl; if (x=="1") {i=0;} else if (x=="2") {i=1;} else if (x=="3") {i=2;} if (z=="1") {j=0;} else if (z=="2") {j=1;} else if (z=="3") {j=2;} // cout << "I = " << i << endl; // cout << "J = " << j << endl; } else { for(size_t k = 0; k < line.length(); k++) { // cout << line[k] << " = "; // cout << line << endl; allele[i][j][k]=line[k]; // cout << i << j << k << " = " << line[k] << " = "; // cout << allele[i][j][k] << endl; } } } // everything above populates the array // everything below reads data from the array and counts the A, C, G, T for (k=0;k<=9;k++) { counter++; for (j=0;j<=2;j++) { for (i=0;i<=2;i++) { cout << i << j << k << " = " << allele[i][j][k] << endl; if (allele[i][j][k]=='A') {a++;} else if (allele[i][j][k]=='C') {c++;} else if (allele[i][j][k]=='G') {g++;} else if (allele[i][j][k]=='T') {t++;} } } cout << "---" << endl; cout << "Column " << counter << " A= " << a << " C= " << c << " G= " << g << " T= " << t << endl<< endl; // Reset a=0; c=0; g=0; t=0; } } return 0; }``````
Last edited on
This is a bit sloppy, but perhaps it'll give you an idea or two.

 ``123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106`` ``````#include #include #include #include #include #include enum Allele { A, C, G, T } ; struct NucleotideData { unsigned identity ; std::vector alleles ; }; struct AlleleCount { std::string populationID ; std::vector count ; AlleleCount( const std::string& s, std::vector && v ) : populationID(s), count(std::move(v)) {} }; typedef std::pair record_type ; typedef std::map> map_type ; std::istream& readRecord(std::istream& in, record_type &) ; std::vector getAlleleCounts( const std::vector&, unsigned locus ) ; int main() { map_type populationMap ; try { std::ifstream in("test.txt") ; record_type record ; while ( readRecord(in, record) ) populationMap[record.first].push_back(record.second) ; } catch (std::exception& ex){ std::cerr << "ERROR: " << ex.what() << '\n' ; return 0 ; } for ( unsigned locus = 0; locus < 5 ; ++locus ) { std::vector alleleCount ; for ( auto & i : populationMap ) alleleCount.emplace_back(AlleleCount(i.first, getAlleleCounts(i.second, locus))) ; for ( const auto & i : alleleCount ) { std::cout << i.populationID << ", locus " << locus << '\n' ; std::cout << '\t' << "A: " << i.count[A] << " C: " << i.count[C] << " G: " << i.count[G] << " T: " << i.count[T] << '\n' ; } std::cout << '\n' ; } } std::istream & readRecord(std::istream& in, record_type & rec) { rec.second.alleles.clear() ; char ch ; if ( !(in >> ch ) ) // '>' return in ; in >> rec.second.identity ; // individual id in.ignore() ; // '-' std::getline(in, rec.first) ; // population id std::string alleles ; std::getline(in, alleles) ; if ( !in ) throw std::runtime_error("readRecord: Unable to read full record") ; const std::string alleleMap = "ACGT" ; for ( unsigned i=0; i getAlleleCounts( const std::vector& data, unsigned locus ) { std::vector counts(4) ; for ( auto & i : data ) ++counts[ i.alleles[locus] ] ; return counts ; }``````