Align text and check whether letters match on every line

Hello guys, first of all I am really sorry for any inconvenience; I am very new around here.
I have a huge text file that looks like this:

Sequence1: ABCDEFG......
Sequence2: AZQDZHG......
Sequence3: AMNDKLG..... . .

I would like to compare Seq1 with the rest of the sequences: if a letter matches, nothing happens to it, but if it doesn't, a "-" is put in its place, so

the output is A--D--G, since only A, D and G match. I have no idea how to do it. I am not familiar with programming, so I need suggestions on how I can do it.

Thank you so much, and forgive me for my bad English.
Hi, this should be a good example:

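#include <iostream>
#include <fstream>
#include <string>
#include <vector>

using namespace std;

int main()
{
    // Open the input file; stop early if it cannot be opened.
    ifstream file;
    file.open("file.txt");
    if(!file.is_open()) {
        cerr << "Could not open file.txt" << endl;
        return 1;
    }

    // Read every line of the file into a vector.
    vector<string> sequences;

    string line;
    while(getline(file, line)) {
        sequences.push_back(line);
    }

    // Echo the sequences as they were read.
    for(unsigned int i=0; i<sequences.size(); ++i) {
        cout << sequences[i] << endl;
    }

    cout << endl << "Comparing the sequences..." << endl << endl;

    // Compare each later sequence with the first one, character by character.
    // Note: this assumes that no line is longer than the first line.
    for(unsigned int i=1; i<sequences.size(); ++i) {
        string s = sequences[i];
        for(unsigned int j=0; j<s.length(); ++j) {
            if(sequences[0][j] == s[j])
                cout << s[j];   // same letter as in the first sequence
            else
                cout << '-';    // different letter
        }
        cout << endl;
    }

    file.close();
    return 0;
}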


Let me know if you have any other problem.
In other words, you have a table (aka a 2D array) and you want to highlight the conserved columns. A column is conserved if the same character is on every line.
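
As a rough sketch of that table view: the function name conserved_columns is made up here, and treating a row that is too short as a mismatch is an assumption, not something stated in the thread.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the first row with every non-conserved
// column replaced by '-'. A column counts as conserved only if every row
// has the same character at that position. A row that is too short to
// have the column is treated as a mismatch (an assumption).
std::string conserved_columns(const std::vector<std::string>& rows)
{
    if (rows.empty()) return "";
    std::string result = rows[0];
    for (std::size_t col = 0; col < result.size(); ++col) {
        for (std::size_t row = 1; row < rows.size(); ++row) {
            if (col >= rows[row].size() || rows[row][col] != result[col]) {
                result[col] = '-';
                break;  // one mismatch is enough to reject the column
            }
        }
    }
    return result;
}
```

With the three sample sequences from the question this should give A--D--G.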


The real question is: why? You say that you are not familiar with programming. Why are you trying that approach? Is it that you have to program, or do you actually just want that output, no matter how?
@keskiverto
I am working on my thesis, there is this data and I have to do this by hand actually. Since my data is huge, it's going to take too much time. So I thought programming would make it easier for me. A friend told me it would be simple coding, but unfortunately he doesn't know how to do it. I was advised to do it in C++, so here I am, trying to figure it out.
And now is your problem solved?
Hi minomic, I really appreciate your help, and thank you for your concern. No, I couldn't solve the problem: I ran the code you gave, but I am getting an error. By the way, where should I get the output? In the same file? I am still in the process of understanding your code. Thank you so much again.

http://s29.postimg.org/a5dgwmwcn/Capture1.png
Then there must be something different in the text file. I used a text file which is exactly as follows:


ABCDEFG
AZQDZHG
AMNDKLG


And my output is


ABCDEFG
AZQDZHG
AMNDKLG

Comparing the sequences...

A--D--G
A--D--G


which I think is correct. This is the output that you wanted, right?
Try it on lines of different sizes.
It would be trivial to fix the existing code, but as you need only one line of output, I decided to write another variant.
Note that it operates on these assumptions:
1) Your file does not actually contain the words Sequence1: , etc.; it looks like what minomic showed. If it does contain them, you will need to skip those parts first.
2) Your sequences are separated by a correct newline symbol and do not have any excess trailing spaces.
3) It has very primitive handling of strings of different length (it simply does not process characters after the length of the shortest string).
4) It reads from the file "input.txt" and outputs to the screen.
If this is okay, here it is:
#include <algorithm>
#include <fstream>
#include <iostream>
#include <string>


char mask(char l, char r)
{
    return (l == r) ? l : '-';
}

int main()
{
    std::ifstream in("input.txt");
    std::string line;
    std::getline(in, line);
    std::string temp;
    while(std::getline(in, temp)) {
        std::size_t min = std::min(line.size(), temp.size());
        std::transform(line.begin(), line.begin() + min, temp.begin(),
                       line.begin(), mask);
    }
    std::cout << line;
}
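
If assumptions 1 and 2 do not hold, each line could be cleaned before it is compared. A minimal sketch, assuming the label ends at the first ':' as in "Sequence1: ABCDEFG"; the helper name clean_line is hypothetical, and the real data may use a different layout.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: drops an optional "Label:" prefix and any
// surrounding whitespace (including a '\r' left over from Windows
// line endings). Assumes the sequence itself never contains ':'.
std::string clean_line(const std::string& line)
{
    const std::string blanks = " \t\r\n";
    std::size_t start = line.find(':');
    start = (start == std::string::npos) ? 0 : start + 1;
    start = line.find_first_not_of(blanks, start);
    if (start == std::string::npos) return "";  // nothing but blanks left
    const std::size_t end = line.find_last_not_of(blanks);
    return line.substr(start, end - start + 1);
}
```

Calling clean_line on line and temp right after each std::getline would let the rest of the program stay as it is.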
Protein sequences? There are a lot of applications and libraries for operating on them. http://en.wikipedia.org/wiki/Bioinformatics#Software_and_tools
http://molbiol-tools.ca/Alignments.htm

The screenshot looks like "aligned" sequences, judging by the "gap" characters. What format is it in? FASTA? Interstitial collagenases?

I am working on my thesis, there is this data and I have to do this by hand actually.

Use your brain, not your hands. As part of that learning process you definitely should find out about the available tools and learn their merits and pitfalls. Your thesis supervisor should have introduced you to at least something at the start.

Handling sequences is a "basic wheel" in bioinformatics, and thus no thesis should reinvent it. Besides, it seems clear that you simply need to analyze data rather than develop new algorithms. If it were the latter, you would have a programming background.
@Keskiverto

Yes, I found that out just 30 minutes ago. Anyway, since I am already involved with it, I downloaded C++ and started to learn. Yes, I just need to analyze data rather than develop an algorithm. But I don't see how this harms you. I am trying to learn something here. Thanks for the advice, though.
@MiiNiPaa,

Yes, my file actually contains words (organism names), so I need to skip those parts, and yes, some strings are longer. Now I am going to take a look at your code and try to understand it.

Thanks for the reply.
example file:

Eagle
jskcfu eofeifhe efoehfefhenf eroih359 dsfef953....

Trassier
dsnf30ur30oe8hozzl.....
Bird2
dfgfdhfgj.....
Bird3
gsdhhhtrhthth......


You can extract the lines and keep them in an array (i.e. a vector).
Then string s = array[0]; s is initialized as the first string.

Now loop through array[1] to array[last]:
- wherever a char of array[i] is not the same as the char of s at that position, replace the corresponding char in s with '-'

s is the answer.


EDIT: here, the main part is extracting the lines. It depends on how the data is formatted.
An easy format would be:
keep the names on odd lines (1, 3, 5, ...)
and the "ABCDEFG"s on even lines (2, 4, 6, ...)
Judging by the lines shown in the screenshot, the format probably has entries, where one entry contains two fields: title and sequence.
The title is text on one line that starts with >.
The sequence can be divided into multiple lines. A sequence field ends at EOF or at the > of the next entry.

MiiNiPaa's code reads one line of text from the file at each of lines 16 and 18 (the two std::getline calls). Each of those should read one whole entry instead, if the format is what the screenshot hints at.


Both programs shown so far write output to std::cout (screen). You can redirect the output to a file.
foo.exe > output.txt
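
If redirecting is not convenient, the program could also write the result to a file itself. A minimal sketch using std::ofstream; the function name write_result and the file name in the note below are placeholders, not something taken from the programs above.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: writes `result` plus a newline to the file `path`,
// replacing any previous contents. Returns false if the file cannot be
// opened or the write fails.
bool write_result(const std::string& path, const std::string& result)
{
    std::ofstream out(path.c_str());
    if (!out) return false;
    out << result << '\n';
    out.flush();            // push the data out so errors show up now
    return out.good();
}
```

For example, write_result("output.txt", line); could take the place of std::cout << line; in the second program.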