Euler Method for DNA assembly

New to C++. I have absolutely no background in coding/programming but, at the suggestion of my PI, I'm currently taking a bioinformatics course. The purpose was to learn the theory, however, there are assignments that need to be completed that require coding.

I've googled and search C++ forums for this and have been unable to find anything that can help me get started. I need to create a program that uses the euler method that receives a set of k-mers as input and outputs a string of DNA. If anybody can provide an example or point me in the right direction just to get started.

Hello @brownie206,
You are into gene-sequencing and this forum is into c++ programming, so you may need to state your your actual problem in more familiar terms.

Does the following have any bearing on your problem:
http://bix.ucsd.edu/bioalgorithms/presentations/Ch08_GraphsDNAseq.pdf
and, if so, where in that presentation? (If UCSD is where you are doing your PhD then that's a tad embarassing!)

I'm hazarding a guess - which may be completely and utterly wrong - that you are after the "shortest superstring problem" - i.e. given a set of strings (like ACG, ATT, CGG, ... ) what is the shortest string that contains all of them. The problem is stated on slides 19-20 and the two approaches: Travelling Salesman Problem and Euler Path approach in slides 21-25 and 37-43 respectively. Sadly, the example given for the latter used such short sequences that I couldn't follow the generalisation to longer ones.

Or do you mean something else entirely?
Last edited on
Yes, you are correct. I've stumbled into that presentation and it is similar to what we have covered in class. The relevant portions are the Euler Method.

You're guess is correct. We were giving text files to test our program. Each contains a set of kmers of varying length. We're suppose to write a program that takes those kmers and contructs the shortest string that contains all of them using the Eular Method/path.

That slides you pointed out are correct, focusin on Eular Path. I understand the theory but have no idea how to start a program that does this.

The following mentions Euler (among other things):
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2874646/

However, it seems to take into account errors. That some of the input kmers are misleading noise.


Does that bioinformatics course state that ability to program is a prerequisite? It probably should.
Hmm, well going from that presentation
http://bix.ucsd.edu/bioalgorithms/presentations/Ch08_GraphsDNAseq.pdf
on slide 37, I think that if you had, for example, a 4-mer like ATGC then this would necessitate two 3-mer NODEs, ATG and TGC, with an EDGE between them. The latter would then need further EDGE connections for any 4-mer of the form TGCx from TGC to GCx. If you had k-mers then the NODEs are (k-1)-mers and the required string is the Euler path (visits each EDGE once: Euler's famous Konigsberg bridges problem).

Sadly, what I can't see from the algorithm, on slides 41-43, is what happens if you end up with a disconnected graph (other than simply appending the strings for each disconnected section).

I can start you off (perhaps), but I'm not sure that I could finish it. There are better graph theorists on this site ... ! Definitely need someone to advise on the best way to store NODEs and EDGEs in a network GRAPH.

Anyway,
- Read in all your k-mers (into a vector of strings).
- From these set up all the (k-1)-mer NODEs corresponding to first and last k-1 chars of the k-mers. Put these in an array or vector of NODE objects. A NODE class is going to have to store (a) its value (here, a string) and possibly (b) the collection of EDGEs that it starts (or pointers to the NODEs that these go to).
- Remove any duplicates from the (k-1)-mers.
- From your given k-mers set up all the EDGEs needed to connect the (k-1) mers.
- From these NODEs and EDGEs carry out the Euler path algorithm given in that presentation to get the shortest superstring (or superstrings if there are multiple disconnected subgraphs).
- Write out your shortest superstring.

Not sure if that is much of a help. This seems quite an advanced computational project. Google "Graph theory in C++" to see appropriate forms for your NODE class and how to store and use network GRAPHs (effectively, NODEs and EDGEs). EDIT: looks like @kemort has provided a good link below.

It is possible that you will get some lectures on coding network graphs during your course. Seems rather a lot to take on without any guidance.
Last edited on
Thanks @lastchance for the guidance. I'll google your suggestion.

@keskiverto It actually does not. Another student in my program took this class and had no programming experience. However, her husband is a programmer so she had help.
closed account (48T7M4Gy)
Euler paths in Java or C++ renditions are treated in Algorithms, a book by Princeton Prof Robert Sedgewick et al. There is a website for the Java version of their book.

The Knuth series of books Art of Computer Programming Vol 1 also has a section on Euler paths graphs etc.
closed account (48T7M4Gy)
http://www.geeksforgeeks.org/euler-circuit-directed-graph/
Thanks for all the hints. I'm still racking my head with all of this. To make things a bit more complicated, Im trying to compile my script (reading in the kmers) and it is telling me the following:

ld: warning: ignoring file readfile.txt, file was built for unsupported file format ( 0x23 0x69 0x6E 0x63 0x6C 0x75 0x64 0x65 0x20 0x3C 0x69 0x6F 0x73 0x74 0x72 0x65 ) which is not the architecture being linked (x86_64): readfile.txt
Undefined symbols for architecture x86_64:
"_main", referenced from:
implicit entry/start for main executable
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Can someone translate this to simple English, please. I'm googling all of this and it says that I need to change the libraries that are being called. I'm on a macOS recently updated to Sierra (specifically for this class).

Here is my "baby" code. Im following the steps that were outlined by @lastchance, using example codes and modifying a bit, based on my extremely limited knowledge:

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

void main() {

ifstream file (“kmers.txt”);
string content;

while(file >>content) {
cout << content << ‘ ‘;
}
return 0;
}
This should fix your typo's

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main() 
{

ifstream file ("kmers.txt");
string content;

while(file >>content) 
{
cout << content << " ";
}
return 0;
}
Thanks @SamuelAdams.

Still getting same error, though, when trying to compile.
closed account (48T7M4Gy)
Did you change void main() to int main()?
Yes, this is the code now:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
 #include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main() 
{

ifstream file ("kmers.txt");
string content;

while(file >>content) 
{
cout << content << " ";
}
return 0;
}
You know, I hate to come in and say this, but for the stuff you are doing and the relative inexperience with C++ — you should probably be playing with R and/or Python, or even Matlab. Those languages are designed to handle stuff like this very easily. In C and C++, you’d need to do a lot of work to set up the libraries to do the same.
@Duthomhas no offense taken. I know a bit of R but, unfortunately, they don't want us doing anything in R. They do allow Python. I think I'll go ahead and give it a try while I try to figure this out.
Topic archived. No new replies allowed.