Driving me nuts.

I guess it is a long shot; it's a complex problem and I don't know if it is possible for anyone to help, but here goes. The problem is as follows.

For an arbitrary number N of DNA chains (strings composed of 'A,C,G,T'), of arbitrary and variable length, determine the size of the largest substring common to K chains.

I've written some code as follows.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
int main(int argc, char** argv) {
    
    int n,k;
    int i,j;
    int b,c,d,e;
    int v,y;
    int q;
    int p=1;
    int a=1;
    char ** x;

    cin >> n >> k;

    const int cols = 100;

    // allocation
    x = new char*[n];
    for(int i = 0; i < n; i++)
            x[i] = new char[cols];

    // initialization
    for(int j = 0; j < n; j++)
            for(int i = 0; i < n; i++)
                    x[i][j] = 32;
    
    for(i=0;i<n;i++) {
        cin >> x[i];
    }
    
    for(e=0;e<n;e++){
        p=a;
    for(a=p;a<=100;a++) {
        for(b=0;b<101-a;b++) {
            v=0;
            q=1;
            if(x[e][b+a-1]!='A' && x[e][b+a-1]!='C' && x[e][b+a-1]!='G' && x[e][b+a-1]!='T')  //(-1)??
                   goto stj;
            for(i=b;i<b+a;i++) {      //
                v=v+x[e][i]*(7^(i-b));  //hash
            }                         //           
            for(c=e+1;c<n;c++) {
                for(d=0;d<100;d++) {
                    y=0;
                    if(x[c][d+a-1]!='A' && x[c][d+a-1]!='C' && x[c][d+a-1]!='G' && x[c][d+a-1]!='T')  //(-1)??
                        break;
                    for(j=d;j<d+a;j++) {     //
                        y=y+x[c][j]*(7^(j-d));  //hash
                    }                         //     
                    if(v==y) {
                        q++;
                        break;
                    }
                }
                if(q==k) {
                    //p=a;
                    goto stf;
                }
                if(q+n-c-1<k) {  
                    //p=a;
                    goto stg;
                }
                if(n-e<k) {
//                        p=a;
                        goto sth;
                }
            }
            
            stg:;            
        }
        stf:;                  
    }
    stj:;
    }
    sth:;
    
    cout << a-1 << endl;
    
    return 0;
}


Basically, my code uses brute force, with hashing to accelerate string comparison, and with some limits in place to avoid unnecessary iterations.

After an afternoon of debugging, though. It only gives the correct answer for some inputs. I completely abandoned all hope of finding the problem. I've made a few tweaks since the original code, but now I'm stuck as to what can be making the problem to fail.

As I said before, it's a long shot. But if someone spots the bug, please let me know. Thanks.


EDIT: For clarification.
The program iterates through all possible values of a,b,c,d,e, with only some exceptions to exclude useless iterations.

a - length to check
b - position on masterString
c - # of slaveString
d - position on slaveString
e - # of masterString

It checks all possible combinations, starting at lenght a=1 and incrementing when it finds that the requirements are fulfilled (K==number of strings that have that substr in common). When, for a=n, it cannot find any combination that fulfills the requirements, it exits the loop and concludes that the maximum size is n-1.
Last edited on
UPDATE: I've managed to find the bug. The factor I used for hashing (7) was too low and occasionally had problems. I've replaced it with 43, a higher prime that is still to low to cause overflow.
closed account (zvRX92yv)
Here's a prime number for you.

1222253171

But I have no idea what you are trying to do.
Absolutely no idea.

EDIT: Wait, that's not a prime number, I'm changing a lot of stuff on my prime search program.
Last edited on
Topic archived. No new replies allowed.