Reading a .txt file into a 3D array

I am trying to create an array for calculating allele frequencies from a text file with nucleotide data for multiple individuals from several different populations. Basically what I want to do is count the number of each allele (A, T, C, or G) that occur at each locus (position) for each population so that I can then calculate the frequency of each allele in each population.
Here is an example of a truncated version of the text file I want to read into the array:
>1-1_Sample 1
GCCCATGGCT
>2-1_Sample 1
GAGTGTATGT
>3-1_Sample 1
TGTTCTATCT
>1-1_Sample 2
GCTTAGCCAT
>2-1_Sample 2
TGTAGTCAGT
>3-1_Sample 2
GGGAACCAAG
>1-1_Sample 3
TGGAAGCGGT
>2-1_Sample 3
CGGGAGGAGA
>3-1_Sample 3
CTTCAGTTTT

Here the first letter after the ">" denotes an individual in the population, so here each population has 3 individuals. The sample number denotes which population the individual belongs to. So there are three populations here.
The string of letters after each header line is the nucleotide sequence. I want to read in the first letter from each sequence (G, G, and T for the first population) and then place them in an array that will calculate a total number of each allele (A, T, C, or G) in each position. So for example, population 1 has 2Gs and 1T in the first position, population 2 has 2Gs and 1T in the first position, and the 3rd population has 1T and 2 Cs in the first position.

How can I make an array to contain this data? I want to change the nucleotide data to numerical data, where A=0, C=1, G=2, and T=3, and then I need to total the number of 0s (As) in the first position for each population. I understand that this should be a 3D array with columns for each site, rows for each of the 4 nucleotides, and "stacked" for each population. I am not at all sure how to do this, I'm just starting out in C++ and this seems a bit difficult to read into a 3D array AND count the number of each nucleotide for each cell.


Please please please help me!

I need intelectual help getting over this obstacle :/
Homework is for students

Google is not a bad thing to learn;

http://www.cplusplus.com/forum/articles/7459/
Yes, I've seen that post, my question is not how to make an array, but how to read in the nucleotide data I discussed above in an additive fashion. I'm having a hard time connecting the dots between setting up an array where you just read a text file directly into it for storage purposes, and being able to count the nucleotides into the correct cell of the array. I didn't bother writing out all of my code which I used to create an array of the proper size, etc. because that's not my question. I want someone to help me undertand a way to add nucleotides to either the A, T, C, or G cell of the array for each population. I've spent a great deal of time searching posts on this forum, and none of them address this issue, that's why I made this post.
Also, I'm not a student, but someone interested in learning more about C++ for research related reasons, I have no one in my working group I can ask, so I turned to the internet. I do know how to google.
You need to associate each letter with an index of an array:

1
2
3
4
5
6
7
8
9
10
11
12
const char *phrase = "this is a phrase that does not mean anything";
size_t length = strlen(phrase);  // size_t is of unsigned integral type
int letterCount[26];

for (int i = 0; i < 26; i++)
  letterCount[i] = 0;

for (size_t i = 0; i < length; i++)
  letterCount[ phrase[i] - 'a' ]++;  // phrase must be all lowercase letters for this to work

for (int i = 0; i < 26; i++)
  cout << (char)(i + 'a') << " count = " << letterCount[i] << endl;


For you, I might do:
1
2
3
4
5
switch (code[i])
{
  case 'A':  count[0]++; break;
  case 'C':  count[1]++; break;
//... 
Last edited on
I'm not a student


Ok good, I was teasing about google, but you'd be surprised how many people want you to do it for them. especially kids.... Mine are the worst.

I was interested in your project, so I put this together while waiting for you to reply. I do not claim it is the best way, there are probably many better ways, but this seems to work with your sample data but don't think it will work with a larger file. without some edits, not sure, haven't tried it.

Try it with the sample file, try to understand what it does and ask if you have questions.

Comments were from my testing

Useage: Command <FileName.txt>

I noticed if you put in a improper file name, it runs anyway, but with zero data. should probably fix that.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
#include <conio.h>
#include <cstdlib>
#include <dos.h>
#include <fstream>
#include <iostream>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <windows.h>
using namespace std;

ofstream outputfile;
string line;  // input line string
// >1-1_Sample 1
//  ^ is x
string x;
// >1-1_Sample 1
//    ^^^^^^^^ is y
string y;
// >1-1_Sample 1
//             ^ is z
string z;
char allele [3][3][10]; // 3D array
//ex:allele [i][j][k]
int i=0;
int j=0;
int k=0;
//  nucleotide
int a=0;
int c=0;
int g=0;
int t=0;
int counter=0;  // used to count number of Rows


int main (int argc, char *argv[])
{
    if (argc!=2)
    {
        cout << "Useage: Command <FileName.txt> \n";
    }
	else
	{
		// open file argv[1]
		ifstream inputfile (argv[1]);
	
		while ( inputfile.good() )
		{
			getline(inputfile,line);
			if (line[0]=='>')
				{
// 				cout << line << endl;
// 				cout << line.size() << endl;
				x=line;
				y=line;
				z=line;
				x.erase(0,1);
				x.erase(1,12);
				y.erase(0,3);
				y.erase(8,2);
				z.erase(0,12);
//				cout << x << endl;
//				cout << y << endl;
//				cout << z << endl;
				
				if (x=="1")
				{i=0;}
				else if (x=="2")
				{i=1;}
				else if (x=="3")
				{i=2;}
				
				if (z=="1")
				{j=0;}
				else if (z=="2")
				{j=1;}
				else if (z=="3")
				{j=2;}
//				cout << "I = " << i << endl;
//				cout << "J = " << j << endl;
				}
			else
			{
				for(size_t k = 0; k < line.length(); k++)
					{
//					cout << line[k] << " = ";
// 					cout << line << endl;
					allele[i][j][k]=line[k];
//					cout << i << j << k << " = " << line[k] << " = ";
//					cout << allele[i][j][k] << endl;
					}
			}
		}
// everything above populates the array
// everything below reads data from the array and counts the A, C, G, T

		for (k=0;k<=9;k++)
		{
		counter++;
		for (j=0;j<=2;j++)
		{
		for (i=0;i<=2;i++)
		{
		cout << i << j << k << " = " << allele[i][j][k] << endl;
		if (allele[i][j][k]=='A')
		{a++;}
		else if (allele[i][j][k]=='C')
		{c++;}
		else if (allele[i][j][k]=='G')
		{g++;}
		else if (allele[i][j][k]=='T')
		{t++;}
		}
		}
		cout << "---" << endl;
		cout << "Column " << counter << " A= " << a << " C= " << c << " G= " << g << " T= " << t << endl<< endl;
		// Reset
		a=0;
		c=0;
		g=0;
		t=0;
		}
	}
return 0;
}
Last edited on
This is a bit sloppy, but perhaps it'll give you an idea or two.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
#include <map>
#include <iostream>
#include <fstream>
#include <vector>
#include <string>
#include <stdexcept>

enum Allele { A, C, G, T } ;

struct NucleotideData
{
    unsigned identity ;
    std::vector<Allele> alleles ;
};

struct AlleleCount
{
    std::string populationID ;
    std::vector<unsigned> count ;

    AlleleCount( const std::string& s, std::vector<unsigned> && v )
        : populationID(s), count(std::move(v)) {}
};

typedef std::pair<std::string, NucleotideData> record_type ;
typedef std::map<std::string, std::vector<NucleotideData>> map_type ;

std::istream& readRecord(std::istream& in, record_type &) ;
std::vector<unsigned> getAlleleCounts( const std::vector<NucleotideData>&, unsigned locus ) ;

int main()
{
    map_type populationMap ;
    
    try {
        std::ifstream in("test.txt") ;

        record_type record ;
        while ( readRecord(in, record) )
            populationMap[record.first].push_back(record.second) ;
    }

    catch (std::exception& ex){
        std::cerr << "ERROR: " << ex.what() << '\n' ;
        return 0 ;
    }

    for ( unsigned locus = 0; locus < 5 ; ++locus )
    {
        std::vector<AlleleCount> alleleCount ;
        for ( auto & i : populationMap )
            alleleCount.emplace_back(AlleleCount(i.first, getAlleleCounts(i.second, locus))) ;

        for ( const auto & i : alleleCount )
        {
            std::cout << i.populationID << ", locus " << locus << '\n' ;
            std::cout << '\t' << "A: " << i.count[A] << "   C: " << i.count[C]
                      << "   G: " << i.count[G] << "   T: " << i.count[T] << '\n' ;
        }
        std::cout << '\n' ;
    }
}

std::istream & readRecord(std::istream& in, record_type & rec)
{
    rec.second.alleles.clear() ;

    char ch ;
    if ( !(in >> ch ) )             // '>'
        return in ;

    in >> rec.second.identity ;     // individual id
    in.ignore() ;                   // '-'
    std::getline(in, rec.first) ;   // population id

    std::string alleles ;
    std::getline(in, alleles) ;

    if ( !in )
        throw std::runtime_error("readRecord: Unable to read full record") ;

    const std::string alleleMap = "ACGT" ;
    for ( unsigned i=0; i<alleles.size() ; ++i )
    {
        std::size_t pos ; 
        if ( (pos = alleleMap.find(alleles[i])) == std::string::npos )
        {
            throw std::runtime_error( "readRecord: Unexpected input found in alleles string: \'" 
                + std::string(alleles[i],1) + '\'' ) ;
        }

        rec.second.alleles.push_back(Allele(pos)) ;
    }

    return in ;
}

std::vector<unsigned> getAlleleCounts( const std::vector<NucleotideData>& data, unsigned locus )
{
    std::vector<unsigned> counts(4) ;

    for ( auto & i : data )
        ++counts[ i.alleles[locus] ] ;

    return counts ;
}
Interested in how your coming.

You have private msgs turned off, but if you want to send me a msg I'd like to know who you work for and what this project is about. not necessary, I'm just curious if this is work for a DNA project or a personal interest.
Topic archived. No new replies allowed.