I guess it is a long shot; it's a complex problem and I don't know if it is possible for anyone to help, but here goes. The problem is as follows.
For an arbitrary number N of DNA chains (strings composed of 'A,C,G,T'), of arbitrary and variable length, determine the size of the largest substring common to K chains.
I've written some code as follows.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
|
int main(int argc, char** argv) {
int n,k;
int i,j;
int b,c,d,e;
int v,y;
int q;
int p=1;
int a=1;
char ** x;
cin >> n >> k;
const int cols = 100;
// allocation
x = new char*[n];
for(int i = 0; i < n; i++)
x[i] = new char[cols];
// initialization
for(int j = 0; j < n; j++)
for(int i = 0; i < n; i++)
x[i][j] = 32;
for(i=0;i<n;i++) {
cin >> x[i];
}
for(e=0;e<n;e++){
p=a;
for(a=p;a<=100;a++) {
for(b=0;b<101-a;b++) {
v=0;
q=1;
if(x[e][b+a-1]!='A' && x[e][b+a-1]!='C' && x[e][b+a-1]!='G' && x[e][b+a-1]!='T') //(-1)??
goto stj;
for(i=b;i<b+a;i++) { //
v=v+x[e][i]*(7^(i-b)); //hash
} //
for(c=e+1;c<n;c++) {
for(d=0;d<100;d++) {
y=0;
if(x[c][d+a-1]!='A' && x[c][d+a-1]!='C' && x[c][d+a-1]!='G' && x[c][d+a-1]!='T') //(-1)??
break;
for(j=d;j<d+a;j++) { //
y=y+x[c][j]*(7^(j-d)); //hash
} //
if(v==y) {
q++;
break;
}
}
if(q==k) {
//p=a;
goto stf;
}
if(q+n-c-1<k) {
//p=a;
goto stg;
}
if(n-e<k) {
// p=a;
goto sth;
}
}
stg:;
}
stf:;
}
stj:;
}
sth:;
cout << a-1 << endl;
return 0;
}
|
Basically, my code uses brute force, with hashing to accelerate string comparison, and with some limits in place to avoid unnecessary iterations.
After an afternoon of debugging, though. It only gives the correct answer for some inputs. I completely abandoned all hope of finding the problem. I've made a few tweaks since the original code, but now I'm stuck as to what can be making the problem to fail.
As I said before, it's a long shot. But if someone spots the bug, please let me know. Thanks.
EDIT: For clarification.
The program iterates through all possible values of a,b,c,d,e, with only some exceptions to exclude useless iterations.
a - length to check
b - position on masterString
c - # of slaveString
d - position on slaveString
e - # of masterString
It checks all possible combinations, starting at lenght a=1 and incrementing when it finds that the requirements are fulfilled (K==number of strings that have that substr in common). When, for a=n, it cannot find any combination that fulfills the requirements, it exits the loop and concludes that the maximum size is n-1.