Header files and symbol tables

Hey guys,

So obviously header files are used to separate, or distinguish, interface from implementation, but they have another useful property: they help the linker resolve function calls.

If you call a function and do not include the header that function was declared in, the compiler would have no way to link or find that function's definition. This is done with the help of a symbol table.

But my question is: how exactly is this done? Let's say we have a main.cpp, an another.cpp, and an another.h, and we include another.h in main.cpp. Let's pretend we have a function named doSomething(), with its definition in another.cpp and its forward declaration in the header.

So what really happens? When we call that doSomething function in main.cpp, how does the compiler know what memory address (the memory address of that function) to go to? Also, let's say we forgot to include another.h: what happens?

This may require a rather long answer, but I'm all ears (or in this case, eyes).

thanks
Oooh, an actual non-beginner question in the General forum. TL;DR: The compiler knows nothing about what address to go to.

First off, header files are pretty much irrelevant to the linking phase. They save you from having to retype a lot of declarations (to say nothing of templates), but you can write a program entirely without them. You'd just have to manually write out the relevant declarations. If you forget a declaration (e.g. by forgetting to include a header file), then the compiler will complain about you using something that you never told it about ahead of time.

A compiler does not directly produce executables/libraries (well, most compilers don't). Instead, it produces object code, which is at least one step away from the final executable/library. Object files contain machine code wrapped in a container that includes, among other things, a table that maps symbols to relative addresses, relative to the start of the file, which can also be used to indicate that a symbol exists but isn't defined in that file. The exact format of object code varies from OS to OS.

That object code is then passed to a linker, which is responsible for merging your object files into a more usable form. The relative addresses from above get reassigned to different relative addresses, or to absolute (read: usable) addresses. The linker can also complain about certain symbols not being found by the time it's done (e.g. undefined functions), or about two files being impossible to merge due to conflicts (e.g. two functions bearing the same name but otherwise being different).

Often, this linking happens twice: once when making the executable, and a second time just before/while running the executable (to link in DLLs/shared objects). Usually the program you call to compile your code will run both the compiler and the first linking pass for you.

Final note: when you name something in C++, that is not the same as the name the compiler produces. In order for namespaces and function overloading to work, the C++ compiler will transform (a.k.a. mangle) the names you give it into forms that can be translated back to the original name, and that can be used during linking. Compilers may differ on how they do this: MSVC++'s compiler differs from g++, for instance.

-Albatross
I'm doing my best to sound articulate on this subject, and as you said: so, in a sense, the object files produced by the compiler contain symbols that map functions to relative addresses, but technically these are not addresses but rather offsets marking where each function definition begins, relative to the start of the file?

So does the linker translate these symbols (still not entirely sure what a symbol even is, haha), which point to the starting point of a function definition, into physical RAM addresses? (Technically not physical, as most OSes use logical addressing.) Or is it the job of the OS itself to translate these symbols, or to assign them to memory locations?

I could be way off here, but thanks for the help.
A symbol is basically a fancy way of saying "unambiguous name". A symbol table is an associative structure that maps symbols to information about what those symbols represent (like their types, addresses, visibility, and so on). Apologies if that wasn't clear. Most of the above was about the symbol tables found in object files, not the ones that the compiler creates and uses internally.

You're right, such a relative address isn't the same kind of address as would be stored in your usual pointer variable. Unless you're compiling for something other than a multi-process OS, the compiler has no idea at what address your program will be loaded, so that's all it can really produce. It can tell you where to find stuff in its output, but it can't tell you where that stuff will be when the program is loaded1. Make sense?

The process of translating those addresses to usable absolute addresses is a form of relocation and, yes, it's often done by the dynamic linker. To be clear, such a linker is a key part of most OSes, responsible for loading as much as it is linking.

I hope that helps.

-Albatross

1: I'm oversimplifying here. It's possible for a compiler to generate code that will work regardless of what address it's loaded at (i.e. position-independent code), without needing load-time relocation aside from the entry point. There are reasons both to do this and not to do this. But let's not go too deep down the OS rabbit hole right now.
yes that actually kind of makes sense :)

So let's say we have two functions:

#include <iostream>

void sum(int a, int b) {
    std::cout << a + b;
}

void sum(double a, double b) {
    std::cout << a + b;
}


so the compiler will create symbols for both functions. Obviously C++ mangles the names, which is what allows us to use function overloading, so the compiler will create two symbols for these functions, right? But how does the compiler or linker tell whether a function is a definition or a declaration? And let's say we call these functions but forget to include their declarations: how does the compiler confirm this, and why is the declaration so important to the compiler (also in regards to symbols)?

sorry if I've gone off topic,

thanks :)
Yes, both functions will end up with different symbols.

The declaration in a source/header file is important because it tells the compiler how a function is allowed to be used. It doesn't make sense to pass strings to either of your sum functions (without converting them first), but if you didn't declare those sum functions ahead of time and the compiler were to assume every use of a function was valid, how would it know any better? It wouldn't, and then the issue wouldn't be found until link-time, at which point its cause is much less obvious. This is a language design decision, little more.

Otherwise, verifying that a function was declared before it's used is conceptually pretty simple: just have a set of declared functions, and look up in that set if a function was declared before it was used whenever parsing a call. That's a gross simplification of how actual production-grade compilers handle this, but hopefully it'll do.

I'm not sure if the following will help or not, but here's an illustration (using actual code and non-toy tools) that should hopefully show what happens a bit better.

Let's take those functions you provided but strip out the cout statements, so that we don't have to add <iostream>. If you have a GNU/Linux system, you can follow along by running the following:
g++ -c sums.cpp && nm sums.o

Which will get you an output that looks like:
000000000000000d T _Z3sumdd
0000000000000000 T _Z3sumii
nm lists the symbols in an object file, along with some extra information about them. The long hexadecimal numbers at the start are the (non-relocated) addresses. Your integer sum function comes first (it's at address 0), followed by the double sum function (starting at address 13). The T specifies that both addresses are to executable code (as for why it's "T", that has to do with the name of the section they're in). Obviously these are not the addresses that will be used when your program runs, right?

Now let's split the file in two. Each file will have a declaration for one of the sum functions, and an implementation for the other. Each implementation will call the declared function. We'll run them through the same process.
void sum(int a,int b); //Declaration.
void sum(double a,double b) {
    sum(1,1);
}
                 U _GLOBAL_OFFSET_TABLE_
0000000000000000 T _Z3sumdd
                 U _Z3sumii

void sum(double a, double b); //Declaration.
void sum(int a, int b) {
    sum(1.0, 1.0);
}
                 U _GLOBAL_OFFSET_TABLE_
                 U _Z3sumdd
0000000000000000 T _Z3sumii


We're going to ignore the global offset tables for now (though they are important). As you can see, each object file has one function defined, and the other listed in the symbol table with an undefined address.

Say we were to partially link together the two object files (EDIT: using ld -r) to produce a new object file. Then, nm would print the following for the result:
                 U _GLOBAL_OFFSET_TABLE_
0000000000000000 T _Z3sumdd
0000000000000024 T _Z3sumii


Neat, no? If you omit the actual calls, the compiler won't put anything for those functions in the symbol table as, quite simply, the declarations go unused and it'd be wasteful otherwise.

-Albatross
sorry if I've gone off topic,


I'm going to turn this self-observation into a point.

Based on a few posts I've seen, I realize you're thinking hard about this, and I would assess that you're doing well as a student, but I assert that you're overthinking.

One of the techniques to learning this material is to allow the unknowns to remain unknowns in the larger algebra describing this knowledge in your mind, trusting that you'll pick up each of those unknowns as you proceed. You don't actually need an answer to each of the questions you're asking, because, as those of us with years of this behind us recognize, a lot of this will become clearer as you practice the art.

I run the risk of that sounding like an instruction to stop asking so many questions.

Perhaps it is, to some extent, but taking it that way would go too far.

As I've read from your other posts, you have a growing list of these questions which form a long chain of inquiry, some of which require several different areas of study to answer (like the subject of project organization, object file content, etc).

What I know from experience is that you can actually trip up your advancement if you focus on each of these more than on the more general flow of study (through the computer science of algorithms, like qsort or binary searching, and through data structures like binary trees), or through the general and mundane practice of using IDEs and build systems.

However, I must contradict myself, as this is a paradox. You must ask these questions, and there should be no end to curiosity. I'm relieved, to some extent, that you haven't inquired about the assembler results from the compiler, about optimizations it produces, or how the CPU performs multiplication in hardware on multiple statements at one time, out of order, in a single core.

...and I probably shouldn't have said that :)


That said:

C and C++ compilers are traditionally single-pass. This is why declarations are required. Some other languages use multiple passes, and in those compilers it is assumed that if you call a function, it must be defined somewhere, so they don't complain. In C/C++, however, if you write a call to a function that has not been declared, the compiler issues an error message. This fully answers your question (and run-on)...

"but how does the compiler or linker tell if the actual function is a definition or declaration..."

A definition includes code to be compiled. A declaration is just the signature. These are easily recognizable from each other. The declaration is akin to an entry in the table of contents. Somewhere, later, that entry from the table is found within a chapter.

"...lets say we call these functions but forget to include their declarations..."

The compiler complains because said function(s) are not declared at the point of the call. It is that simple.

Put another way, you could analogize the idea of function names (signatures) as a kind of dictionary, a vocabulary. If you use a word that has not been declared, it isn't in the vocabulary being built as the program is compiled; it is like a word you want to look up but then realize isn't in the dictionary. That is an error to the compiler. If a function is to be called, there must, at a minimum, have been a declaration, which makes it a "legal" word in the dictionary.

"...and why is the declaration so important to the compiler..."

This is actually answered above, but to be clear: it is important because the compiler only makes one pass through the code. The function declaration must appear before the function can be called, just as a variable declaration must appear before the variable can be read or written.


@Albatross ahh ok, so in a rather simplified and abstracted view: a declaration will create a symbol, but its address will be empty unless there is a function body, or definition. If the compiler finds a body ( {} ), it will then record the address (where in the program) at which the function resides. Now, in your second example, we split the functions into two separate .cpp source files and defined sum(int,int) and sum(double,double) in their respective files, with a forward declaration of the other function in each. We then link both object files to create a new object file. Since we call sum(int,int) in sum(double,double), the linker will check both object files for a definition of sum(int,int)? If it finds one, everything will be fine; if not, we get a linker error?

@Niccolo, very true. I think eventually everything will start to fall into place; I could be diving into the deep end instead of letting things settle. I've worked on a few semi-large projects with teams in college, but never with C++ or C, and never to the point where our projects supported cross-platform capabilities. I think with experience it will eventually click. That being said, are there any books or sources that took you from beginner/intermediate to proficient in the language, to where you understand this more "advanced" subject matter? I have read a couple of C and C++ books and plenty of other programming books (in other languages), but none detail the nuances like the ones I ask about in some of my questions.



or how the CPU performs multiplication in hardware on multiple statements at one time, out of order, in a single core.


The CPU uses adders for multiplication, right? There is no instruction (as far as I remember, or from the book I read, which is pretty outdated) for multiplication. So, let's say, 5*3 would actually be 5 + 5 + 5 to a CPU.

thanks guys for your time! =)
> The CPU uses adders for multiplication right? there is no instruction
> ( as far I remember or from the book I read which is pretty outdated) for multiplication
Your book is very old!
https://en.wikipedia.org/wiki/Binary_multiplier
Even in software, multiplication used various "shift-and-add" techniques.

> so lets say 5*3 would actually be 5 + 5 + 5 to a CPU.
Imagine how that would slow down as the operands got bigger.
The CPU uses adders for multiplication right? there is no instruction...for multiplication.


Nearly all modern CPUs have multiplication instructions, and a lot more.

The evolution itself is quite interesting. Starting with the CPUs of the ancient 1950s, most did not have multiplication in hardware. It was a particular point to check, in the purchase of a high-performance computer in the '60s and '70s, whether hardware multiplication was (or wasn't) a feature.

The CPU in the first Apple II, and the ones in the TRS-80 and the Altair (a number of those 8-bit CPUs from the '70s), did not have hardware multiplication.

Nearly all CPUs since the 8086 and 8088 do have hardware multiplication, for integers at the very least.

Floating point hardware, even for addition and subtraction, was generally optional from the 8086 (about 1980) until the 80486 around 1990. When floating point hardware became generally standard, the entire complement of operations, from multiplication through square root, along with the standard fare of trig functions, came together in one package.

Then, it goes further.

Up to that point the operation was generally like that of any calculator in which two operands were treated with some chosen operation to produce a result. This means a statement like this:

a = b + c + d;

Is performed in at least two steps. An addition for, say, c + d produces a temporary result, which is then used in the addition with b. This may seem obvious, but there are a number of very common formulae for which this burden is a significant performance problem.

This led to the inclusion of vector or SIMD instructions, themselves descended from examples dating back to the late '60s and early '70s on high-performance machines.

There are a select number of patterns, but this illustrates what they are like:

a = b * c + d * e;

There are, of course, many that are more complex. This is typical of linear algebra and some scientific calculations. A slightly different pattern may be something like:

a.x = b.x + c.x;
a.y = b.y + c.y;
a.z = b.z + c.z;


It isn't important to detail the actual instructions, but to note generally that where such common patterns would lead to a series of instructions using the standard CPU complement, here, in a vector or SIMD processor, there are instructions which process the entire formula. We can fill registers with the operands, call one instruction, and get the result as a single step.

In the Intel line, these have followed a progression, each with expanding features, with names like "MMX" and "SSE" (in various versions).

Yet, look again at the last example sequence, which adds two 3D points. Note that the second addition has nothing to do with the first, and the third has nothing to do with either of the others, except for the fact that they are adjacent within a structure holding x, y and z.

Most modern CPUs can schedule these additions (and other instructions) to be executed simultaneously in a single core, even though they are emitted by the compiler as a series of additions. This seems to do something like the SIMD instructions, where a collection of simple operations may be performed at once. The concepts are loosely related, in that there is a performance gain from performing activity in some parallel fashion where possible, but the SIMD or vector instructions pack more work into fewer instructions defining some particular formula or pattern, like the process for multiplying a vector and a matrix. The latter will end up being more efficient, but those instructions are usually invoked rather deliberately. The ability of the CPU to schedule multiple simple operations for simultaneous execution is somewhat automatic, and independent of any intended formula.

The 8-bit CPUs of the '70s comprised some 3,500 transistors, each performing some portion of the work on individual bits. Most modern CPUs comprise a few billion transistors, with considerably more sophisticated features that complete the basic work far more speedily than their ancestors.

...sources that took you from beginner++/ intermediate to proficient in the language where you understand these more "advanced" subject matter...


I'm beginning to think at least some of us should write one.

I come from an era where assembling a computer required a soldering iron and weeks of patience.

I do consider it an asset to have emerged from primitive and relatively naive hardware. I've watched the industry expand many orders of magnitude.

Don Knuth's work may be of interest, as well as other computer scientists' reflections.





Suppose a.cpp has a function foo:
void foo()
{ do_some_stuff; }

And b.cpp calls foo:
...
foo();
...


When the compiler creates the object file b.o, it includes something like this:
Addr  Data
100   Call instruction
104   00 00 00 00 00 00 00 00
...
Symbol table:
Addr  Symbol
104    foo


When the linker combines a.o and b.o into a program, it puts the data from the object files together, one after the other. Let's say that function foo (from a.o) ends up at address 62, and b.o ends up starting at address 400. At this point, the linker knows where function foo will be located. So now it goes through the symbol tables:

- Hmm. b.o's symbol table says that at address 104 (in b.o) is a reference to symbol foo.
- Symbol foo is located at address 62 in the program.
- Address 104 in b.o is now at address 400+104 = 504 in the program (because b.o starts at address 400).
- So at address 504 in the program, I will write the value 62.

That's how it's done.

Now, this assumes that the program will always be placed at address 0 when it runs. On most modern computers, programs run in a virtual memory space, so they always start at the same (virtual) address. There's an exception for dynamically loaded libraries, but let's ignore that for now. For computers where the program can be run from different locations, the linker doesn't write the actual address for the symbol; rather, it writes the offset from the beginning of the program. The exe has its own symbol table, and when the OS loads the program to its final running address, it modifies the addresses by adding the beginning address where the program actually runs.

This stuff is covered in compiler theory and computer architecture, but those are more advanced classes.
thanks dhayden :) I think, as others said, it's also something that I'll pick up with time.

@Niccolo wow, then yes, my book is very outdated lol. When the author wrote it, it was back in the time of Windows 3.1, when GUIs were a new invention. Basically the book is almost the same age as me lol

Topic archived. No new replies allowed.