"Registering" types to be used at run-time

I'm writing a program that needs to parse executable files. I have an "executable" base class and, so far, an "elf" class that inherits from it to parse ELF files. I will add more parsers (COM, MZ, PE, a.out, Mach-O, whatever) later on.

I want the program to detect automatically which kind of executable it is loading at runtime. That should be easy, because every executable format I'm aware of or plan to support starts with a magic number. However, the parsers can't skip checking the file type themselves (what if I re-use the code?), and I don't want to check each file twice, not just for performance but also because only the ELF parser should know that ELF files start with "\x7fELF", and so on. So I've come up with a pretty lazy solution: just try to parse the file with each known parser and have each one throw an exception ("exe_type_error") if it can't parse the file. If that exception gets thrown, try the next parser; if not, stop.
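For illustration, the try-each-parser idea could look roughly like the sketch below. Only "executable", "elf" and "exe_type_error" come from the post; the "mz" class, the parser_factory alias, load() and default_parsers() are hypothetical names, and the file is stood in for by a std::string of its bytes. The list of parsers is still hard-coded here, which is exactly the part the rest of the question is about.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Thrown by a parser when the input is not in its format.
struct exe_type_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

struct executable {
    virtual ~executable() = default;
    virtual std::string format() const = 0;
};

// Only this class knows that ELF files start with 0x7F 'E' 'L' 'F'.
struct elf : executable {
    explicit elf(const std::string& bytes) {
        // Two literals, so that 'E' is not swallowed by the \x escape.
        if (bytes.compare(0, 4, "\x7f" "ELF") != 0)
            throw exe_type_error("not an ELF file");
    }
    std::string format() const override { return "ELF"; }
};

// Hypothetical second parser: MZ files start with 'M' 'Z'.
struct mz : executable {
    explicit mz(const std::string& bytes) {
        if (bytes.compare(0, 2, "MZ") != 0)
            throw exe_type_error("not an MZ file");
    }
    std::string format() const override { return "MZ"; }
};

using parser_factory =
    std::function<std::unique_ptr<executable>(const std::string&)>;

// Tries each parser in turn; the first one that does not throw wins.
std::unique_ptr<executable> load(const std::string& bytes,
                                 const std::vector<parser_factory>& parsers) {
    for (const auto& make : parsers) {
        try {
            std::unique_ptr<executable> exe = make(bytes);
            assert(exe);  // a factory either throws or returns an object
            return exe;
        } catch (const exe_type_error&) {
            // Wrong format for this parser: fall through to the next one.
        }
    }
    throw exe_type_error("unrecognised executable format");
}

// Hard-coded list, for the sake of the sketch only.
std::vector<parser_factory> default_parsers() {
    return {
        [](const std::string& b) { return std::unique_ptr<executable>(new elf(b)); },
        [](const std::string& b) { return std::unique_ptr<executable>(new mz(b)); },
    };
}
```

Calling load(bytes, default_parsers()) returns the first parser object that accepted the bytes, and throws exe_type_error itself if none did.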

The remaining problem is how, at runtime, my program will know which parsers are available. I don't want to hard-code the list in the main function; instead, I'd like the parsers to "register" themselves as available. That way, if I decide to go down the route of adding new parsers via dynamic linking, I will only have to add an API for dynamic libraries to register their parser, without recompiling any of the main program's code. I also want to do the same thing for another key part of the program. It's a static executable optimizer: it will run a series of "tests" (e.g. "is xor eax, eax faster than mov eax, 0 on this machine?") and optimizations ("if yes, change all mov eax, 0 to xor eax, eax"), and I want to load those at runtime too.
So you want a collection of parsers. The register() function will just add a parser to the collection and your main() function will loop through the collection calling the parsers (which will need virtual methods of course).

You're trying to do something that is harder than you may realize. How will you disassemble the code? Not all bytes in the executable section are code: there can be jump tables, constants etc. If you replace an instruction with a new one of a different size, then lots of your branches will now go to the wrong place. How will you clean them up? If you plan to replace sequences of instructions then you have to ensure that the program doesn't jump into the middle of the sequence. How will you do that?
What about something like this:
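#include <istream>
#include <map>
#include <utility>
using namespace std;

class parserBase
{
public:
    virtual ~parserBase () {}   //  parsers are used through base pointers
    virtual unsigned long getMagic () const = 0;
    virtual void Execute (istream & is) = 0;
};

class ElfParser : public parserBase
{   //  0x7F 'E' 'L' 'F', packed with the first byte most significant
    static const unsigned long elf_magic_number = 0x7F454C46;
public:
    unsigned long getMagic () const
    { return elf_magic_number; }
    void Execute (istream & is);    //  defined elsewhere
};

//  Additional derived classes for other parsers

map<unsigned long,parserBase *> parsertable;

void registerParsers ()
{   parserBase * parser;
    unsigned long   magic;

    parser = new ElfParser ();
    magic = parser->getMagic();     //  Elf parser returns its magic number
    parsertable.insert (pair<unsigned long,parserBase *> (magic, parser));
    //  register additional parsers
}

void parsefile (istream & is)
{   unsigned char   bytes[4];
    unsigned long   magic;
    map<unsigned long,parserBase *>::iterator   iter;

    //  read the magic as four raw bytes, not as formatted text
    if (!is.read (reinterpret_cast<char *> (bytes), sizeof bytes))
        return;     //  file too short to hold a magic number
    magic = (static_cast<unsigned long> (bytes[0]) << 24)
          | (bytes[1] << 16) | (bytes[2] << 8) | bytes[3];
    iter = parsertable.find (magic);  // Search map for magic number
    if (iter == parsertable.end())
    {   // magic number not found: there is no parser to call
        return;
    }
    iter->second->Execute (is);
}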


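edit: I realize this doesn't achieve your goal of not modifying the program to add additional parsers, but at least the impact is limited to replicating the three lines in registerParsers() that create the parser, fetch its magic number, and insert it into the map.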

Now if you take the approach of dynamically loading the parser from a DLL, you still have the issue of relating the magic number of a DLL to a file name. If you're clever, you could possibly come up with an encoding scheme that encodes a magic number into a file name. By doing that, you could read the magic number and determine the DLL's file name without having any kind of lookup table.
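#include <algorithm>
#include <string>
#include <vector>

//Interface every parser implements; declared first because MasterParser uses it.
class Parser
{
    public:
        virtual ~Parser() {}
        virtual void parse() = 0;
};

class MasterParser
{
    public:
        MasterParser() {}
        void registerParser(Parser* parser)
        {
            //Only add the parser if it is not already in parser_list
            if (std::find(parser_list.begin(), parser_list.end(), parser) == parser_list.end())
                parser_list.push_back(parser);
        }

        void parse(std::string path_to_parsee)
        {
             //Iterate through parser_list until you find the right one.
        }

    private:
        std::vector<Parser*> parser_list;
};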


The idea is to have all parsers follow the Parser interface; then, if a new parser is needed at run-time, just call the registerParser() method and pass in your new parser. Obviously you'll want to make this nicer, but I think this is the basis of what you need, if I understood what you're asking. The program in general is a topic I know hardly anything about, so I could have completely misunderstood the question.
@All
These suggestions are similar to what I have now, but my problem is how (in ResidentBiscuit's example) does the registerParser function get called?

@dhayden
I hadn't thought about jump tables and branches. It will be challenging but I'll get there. I will scan for branches and make a list of offsets that get branched to. Then I'll track how the offsets change. Finally I'll scan for branches again and fix the offsets.
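A minimal sketch of that three-step plan, assuming the code has already been disassembled into a list of instructions and that every branch target is a constant absolute offset. The Insn struct and relocate() are hypothetical names made up for this illustration; nothing here deals with jump tables or computed targets.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

struct Insn {
    std::size_t offset;   // position of the instruction in the code section
    std::size_t size;     // encoded length in bytes
    bool is_branch;       // true if this instruction branches to `target`
    std::size_t target;   // absolute offset branched to (unused otherwise)
};

// Call after instruction sizes have been changed: recomputes every offset,
// then points each branch at the new offset of the instruction it targeted.
void relocate(std::vector<Insn>& code) {
    std::map<std::size_t, std::size_t> moved;  // old offset -> new offset
    std::size_t next = code.empty() ? 0 : code.front().offset;
    for (Insn& insn : code) {
        moved[insn.offset] = next;
        insn.offset = next;
        next += insn.size;
    }
    for (Insn& insn : code) {
        if (!insn.is_branch) continue;
        auto it = moved.find(insn.target);
        // In this sketch a target must be the start of a known instruction.
        assert(it != moved.end());
        insn.target = it->second;
    }
}
```

For example, shrinking a 5-byte mov eax, 0 at offset 0 to a 2-byte xor eax, eax moves every later instruction back by 3 bytes, and a branch that used to target offset 9 ends up targeting offset 6. A real version would also have to cope with branches whose own encoding changes size once the distance changes, which this sketch ignores.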

I will say self-modifying programs are unsupported :P
Hmm, what if MasterParser is a singleton? Just brainstorming here.
I thought about doing something with hidden static variables, because static variables are always initialised when the program is loaded. The problem there is that the order is not deterministic, so whatever list the objects get added to may not be constructed until after the parsers try to add themselves to it. Unless there's a rule I'm unaware of that says objects of base classes are constructed before objects of their derived classes, or something like that.
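One common way around the ordering problem is to keep the list in a function-local static instead of a global: a local static is constructed the first time its function is called, so the list is guaranteed to exist by the time anything is pushed into it. Below is a minimal sketch of that approach; parser_registry(), Registrar, registered_names() and the two toy parser classes are hypothetical names invented for the example.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Parser {
    virtual ~Parser() = default;
    virtual std::string name() const = 0;
};

// Construct on first use: `registry` is built the first time this function
// runs, even if that happens during static initialisation of another file.
std::vector<Parser*>& parser_registry() {
    static std::vector<Parser*> registry;
    return registry;
}

// One static Registrar<T> per parser adds a T to the list before main() runs.
template <typename T>
struct Registrar {
    Registrar() {
        static T instance;  // lives until program exit
        parser_registry().push_back(&instance);
    }
};

struct ElfParser : Parser {
    std::string name() const override { return "elf"; }
};
struct MzParser : Parser {
    std::string name() const override { return "mz"; }
};

namespace {
// Within one translation unit these run in order of definition.
Registrar<ElfParser> register_elf;
Registrar<MzParser> register_mz;
}

// Names of all registered parsers, in registration order.
std::vector<std::string> registered_names() {
    std::vector<std::string> names;
    for (const Parser* p : parser_registry()) {
        assert(p != nullptr);  // registrars only ever add valid objects
        names.push_back(p->name());
    }
    return names;
}
```

The order in which registrars from different source files run is still unspecified, so the main program should not depend on the order of the list. Also, if a parser lives in a static library and nothing references its object file, the linker may leave it out, and then its registrar never runs.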

// init.h

#pragma once

#include <typeinfo>

#include <iostream>

template<typename T>
class ParserInit
{
    static int count;
public:

    ParserInit()
    {
        if(!count++)
        {
            std::cout << "add here " << typeid(T).name() << std::endl;
        }
    }

    ~ParserInit()
    {
        if(!--count)
        {
            std::cout << "destroy here" << typeid(T).name() << std::endl;
        }
    }

};

template<typename T> int ParserInit<T>::count;


// parser.h

#pragma once

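class Parser
{
public:

    virtual ~Parser() = 0;

};

// A pure virtual destructor still needs a definition,
// because derived-class destructors call it.
inline Parser::~Parser() {}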


// parser1.h


#pragma once

#include "init.h"
#include "parser.h"

class Parser1 : public Parser
{

};

namespace
{
    // No double underscores: such identifiers are reserved for the implementation.
    ParserInit<Parser1> parser1_initializer;
}


// parser1.cpp

#include "parser1.h" 


// main.cpp

int main()
{

}



You'd then need to do the same kind of initialization for the std::map, vector, or whatever container holds the parsers.

I think having an initializer list somewhere in code with all the parsers would be the easiest to maintain, though.

Assuming each parser is going to be a separate DLL, you could get around the problem of associating each DLL's filename with its magic number by naming the DLL after the magic number, written as a hex string. Then, when you open the executable, you read its magic number and load the DLL with the matching name.

For example, for ELF (magic bytes 0x7F 'E' 'L' 'F'), you would call the DLL 7F454C46.dll. Obviously, the same thing could be done with shared objects on other operating systems, too.
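As a rough sketch of that naming scheme, the function below turns the leading bytes of a file into an upper-case hex file name. library_name_for() and its magic_len parameter are hypothetical; the parameter exists because not every format has a four-byte magic (MZ has two), which the scheme itself leaves open.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Builds a library file name from the first `magic_len` bytes of the file.
// Returns an empty string if the file is too short to contain the magic.
std::string library_name_for(const std::string& file_bytes,
                             std::size_t magic_len = 4) {
    assert(magic_len > 0);
    if (file_bytes.size() < magic_len) return "";

    static const char digits[] = "0123456789ABCDEF";
    std::string name;
    for (std::size_t i = 0; i < magic_len; ++i) {
        // Go through unsigned char so bytes above 0x7F index correctly.
        const unsigned char byte = static_cast<unsigned char>(file_bytes[i]);
        name += digits[byte >> 4];
        name += digits[byte & 0x0F];
    }
    return name + ".dll";
}
```

With the default length, a file starting with the four ELF magic bytes gives 7F454C46.dll, and a two-byte magic of 'M' 'Z' gives 4D5A.dll when magic_len is 2.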
> how, at runtime, my program will know what parsers are available.

You might want to read Chapter 8, "Programming with Exemplars in C++", in Coplien's 20-year-old classic 'Advanced C++ Programming Styles and Idioms':
http://www.amazon.com/Advanced-C-Programming-Styles-Idioms/dp/0201548550
> I will scan for branches and make a list of offsets that get branched to.

This is much harder than you may realize. What if the target of the branch is not a constant? This happens a lot - such as with a virtual function call.

I worked for a company that did this 20 years ago. It took several years to get the code working right and even then it required human intervention. Go for it, but realize that it's a big problem to tackle.
@myesolar
How can I be sure that the container is initialised before those constructors attempt to add themselves to it? AFAIK, using global variables from global constructors is undefined behaviour because the order that globals are constructed in is not deterministic.

@NT3
That still means reading each file twice, plus the filenames are awkward and look suspicious (if I downloaded a program with a bunch of DLLs with numeric names, rationally or not, I would be suspicious).

@JLBorges
Thanks for the recommendation. Unfortunately my budget is literally 0 right now, but now that I know what the general concept is I'll see if I can dig something up.

@dhayden
Thanks for the advice/info.

There is a program called exeopt ( http://timeslip.users.sourceforge.net/exeopt.html ) which was created for optimising (mainly) the game Morrowind. I wanted to generalise it to work with different programs, file formats, and eventually architectures.