Trying to understand the AdaDelta training algorithm

Hello everyone!

I'm trying to implement AdaDelta in my simple feed-forward neural network,
but I think I'm having some trouble understanding the paper.

http://arxiv.org/pdf/1212.5701v1.pdf
It's a short paper introducing the AdaDelta algorithm;
only about a page and a half is devoted to the formulas.

Starting from the part:
"Algorithm 1 Computing ADADELTA update at time t"

Question 1
'3: Compute Gradient: gt'

How exactly do I calculate the gradient here?
Is my approach correct:
/* Calculating the gradient for a neuron inside a hidden layer:

   gradient = sum over outgoing connections of (target's gradient * connection weight)
              * derivative of the activation function
*/
double CalculateHiddenGradient() {
	double sum = 0.0;
	for (size_t i = 0; i < OutcomingConnections.size(); i++) {
		sum += OutcomingConnections[i]->weight * OutcomingConnections[i]->target->gradient;
	}
	return (1.0 - output * output) * sum; // tanh's derivative: 1 - tanh^2
}

// Calculating the gradient for an output neuron, where the desired output is known
double CalculateGradient(double TargetOutput) {
	return (TargetOutput - output) * (1.0 - output * output);
}
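To make the question concrete, here is a standalone sketch of the same two formulas as free functions, for a tanh activation (the function and parameter names here are mine, not from the paper or my network code):

```cpp
// Standalone versions of the member functions above, for a tanh activation.
// "output" is the neuron's activation value after tanh has been applied.
double output_gradient(double output, double target) {
    // Output layer: (target - output) scaled by tanh's derivative (1 - output^2)
    return (target - output) * (1.0 - output * output);
}

double hidden_gradient(double output, double downstream_weight, double downstream_gradient) {
    // Hidden layer: the backpropagated term scaled by tanh's derivative
    return downstream_weight * downstream_gradient * (1.0 - output * output);
}
```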


Question 2
'5: Compute Update: ∆xt'

formula (14) says the following:
∆xt = -(RMS[∆x]t−1 / RMS[g]t) * gt

Is RMS[∆x]t−1 calculated as follows,

RMS[∆x]t−1 = sqrt( E[∆x²]t−1 + e )

taking the body from formula (9)?
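In code, what I mean by RMS is just this (my own naming, following formula (9); E stands for the running average E[g²] or E[∆x²]):

```cpp
#include <cmath>

// What I mean by RMS[.]: the square root of the decayed average of squares,
// with the constant e added inside the root, as in formula (9) of the paper.
double rms(double E, double e) {
    return std::sqrt(E + e);
}
```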


Based on what I currently understand, I was able to write this piece of code:
class AdaDelta {
private:
	vector<double> Eg; // E[g²]
	vector<double> Ex; // E[∆x²]
	vector<double> g;  // gradients
	size_t windowsize; // number of stored steps before the oldest entry is dropped
	double p; // decay rate ρ
	double e; // epsilon constant

public:
	AdaDelta(size_t WindowSize = 32, double DecayRate = 0.95, double ConstantE = 0.001) { // initializing variables
		Eg.reserve(WindowSize + 1);
		Ex.reserve(WindowSize + 1);
		g.reserve(WindowSize + 1);

		Eg.push_back(0.0); // E[g²]₀
		Ex.push_back(0.0); // E[∆x²]₀
		g.push_back(0.0);  // g₀

		windowsize = WindowSize;

		p = DecayRate; // common value: 0.95
		e = ConstantE; // common value: 0.001
	}

	// Does this return the weight update value?
	double CalculateUpdated(double gradient) {
		double dx;
		size_t t;

		// Replay the stored window; this mirrors "for t = 1 : T do" in Algorithm 1
		// (it recomputes the same values on every call)
		for (t = 1; t < Eg.size(); t++) {

			// Accumulate gradient: E[g²]t = p * E[g²]t−1 + (1 − p) * gt²
			Eg[t] = p * Eg[t - 1] + (1.0 - p) * (g[t] * g[t]);

			// Compute update: ∆xt = −(RMS[∆x]t−1 / RMS[g]t) * gt
			dx = -(sqrt(Ex[t - 1] + e) / sqrt(Eg[t] + e)) * g[t];

			// Accumulate updates: E[∆x²]t = p * E[∆x²]t−1 + (1 − p) * ∆xt²
			Ex[t] = p * Ex[t - 1] + (1.0 - p) * (dx * dx);
		}

		/* calculate the new update
		=================================== */
		t = g.size();
		g.push_back(gradient);

		// Accumulate gradient
		Eg.push_back(p * Eg[t - 1] + (1.0 - p) * (g[t] * g[t]));

		// Compute update
		dx = -(sqrt(Ex[t - 1] + e) / sqrt(Eg[t] + e)) * g[t];

		// Accumulate updates
		Ex.push_back(p * Ex[t - 1] + (1.0 - p) * (dx * dx));

		// Drop the oldest entry once the window has grown past its limit
		// (note: zeroing Eg[1]/Ex[1] before erasing, as I first did, would
		// destroy the running averages, so the erase alone is enough)
		if (g.size() >= windowsize) {
			Eg.erase(Eg.begin());
			Ex.erase(Ex.begin());
			g.erase(g.begin());
		}
		return dx;
	}
};
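As a cross-check, my reading of Algorithm 1 is that the window buffer may not be needed at all, since E[·] is an exponential moving average that already forgets old steps on its own. A minimal per-parameter sketch of that reading (struct and member names are mine):

```cpp
#include <cmath>

// One AdaDelta accumulator per parameter, following Algorithm 1 directly:
// no window buffer, because E[.] is an exponential moving average.
struct AdaDeltaParam {
    double Eg  = 0.0;   // running E[g^2]
    double Ex  = 0.0;   // running E[dx^2]
    double rho = 0.95;  // decay rate
    double eps = 1e-6;  // epsilon constant

    // Returns the update dx, to be added to the parameter.
    double update(double g) {
        Eg = rho * Eg + (1.0 - rho) * g * g;                          // accumulate gradient
        double dx = -(std::sqrt(Ex + eps) / std::sqrt(Eg + eps)) * g; // compute update
        Ex = rho * Ex + (1.0 - rho) * dx * dx;                        // accumulate updates
        return dx;
    }
};
```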


Question 3
In backpropagation, updating a weight goes like this:
target's gradient * source's output * learning rate
but in the AdaDelta algorithm I don't see that step.

Should I multiply the source's output into the target's gradient before calling the
CalculateUpdated() function, or should I multiply the output with the returned value to get the new weight?
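For reference, the plain backpropagation update I'm used to looks like this for a single connection (names are mine):

```cpp
// The plain SGD backpropagation update for one connection weight:
// delta_w = learning_rate * target_gradient * source_output
double sgd_delta(double learning_rate, double target_gradient, double source_output) {
    return learning_rate * target_gradient * source_output;
}
```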

Question 4
The part that got me confused all the way is
"3.2. Idea 2: Correct Units with Hessian Approximation"

I got very confused here because I don't quite understand
which part of the formula we are updating, or what changes.

formula (13)
1/(∂²f/∂x²) = ∆x/(∂f/∂x)

Question 5
What do ∆x, ∂f, and ∂x stand for in formula (13)?

Thanks!


closed account (48bpfSEw)
Maybe this article can help you understand the subject better:
http://sebastianruder.com/optimizing-gradient-descent/
class adadelta {
public:
	double Eg; // running E[g²]
	double Ex; // running E[∆x²]

	// Returns the update for one weight; output * gradient forms that weight's gradient
	double Update( double output, double gradient, double p, double e ){
		double g = output * gradient;
		Eg = p * Eg + (1 - p) * (g * g);              // accumulate E[g²]
		double x = (sqrt(Ex + e) / sqrt(Eg + e)) * g; // compute update step
		Ex = p * Ex + (1 - p) * (x * x);              // accumulate E[∆x²]
		return x;
	}

	adadelta() {
		Eg = 0.0;
		Ex = 0.0;
	}
	~adadelta() {}
};


This is my current code. It seems to be doing worse than the momentum method from the link you gave me.

I got huge help from here:
https://www.quora.com/How-is-the-update-step-size-for-the-ADADELTA-backpropagation-algorithm-determined

Now the question is:
in the AdaDelta paper (http://arxiv.org/pdf/1212.5701v1.pdf),
the formula for calculating ∆xt has a minus sign in front of the division; however, if I change this part of my code:

 
double x = (sqrt(Ex + e) / sqrt(Eg + e)) * g;


to
 
double x = -(sqrt(Ex + e) / sqrt(Eg + e)) * g;


the algorithm doesn't learn anymore.
Why does the paper put a minus sign in front of the ratio of the RMS accumulators?

About Question 5:
I still haven't figured out what ∂f stands for.

I looked up derivatives and now understand that ∂f/∂x means the derivative of f with respect to x.
x should stand for double x from my code, but what about f?

Edit:
from: https://www.quora.com/How-is-the-update-step-size-for-the-ADADELTA-backpropagation-algorithm-determined

While calculating the gradient (from my code):
double g = output * gradient;
I left out one part of the calculation from that page:
gradW = f'(Wx) * x * error

I don't really understand what f'(Wx) is.
So far I understand that x is some sort of input and W is a weight matrix.

If we are currently updating neuron x's connection weight, then
the input should be neuron x's output, correct?
But what about the weight matrix? Is it the incoming or the outgoing connections of neuron x, or something else?
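My current guess, in code: if f is the activation function (tanh in my network), then f'(Wx) is the derivative of the activation evaluated at the pre-activation Wx. This is my assumption, not something stated on that page:

```cpp
#include <cmath>

// My guess: f is the activation function, so for f = tanh the term f'(Wx)
// is tanh's derivative evaluated at the pre-activation value Wx.
double tanh_prime_at(double preactivation) {
    double y = std::tanh(preactivation);
    return 1.0 - y * y; // tanh'(z) = 1 - tanh(z)^2
}
```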

Thanks!

closed account (48T7M4Gy)
It seems to me that you are trying to translate the mathematics in the PDF paper into C++ code in the hope that there is a simple one-to-one relationship and that you don't need to understand the mathematics. Regrettably, it's not that simple.

If you are not across calculus, then your task is next to impossible. f' is the derivative operator, grad is the gradient operator (a specialised version of the derivative), and so it goes on.

https://en.m.wikipedia.org/wiki/Calculus A good overview, history and pointers to detail
https://en.m.wikipedia.org/wiki/Gradient Another good start

None of this means you should give up. It just means maybe you should put the coding to one side and get a handle on the mathematics. A middle way in learning might be Mathematica or similar, to quickly shortcut the learning curve; from memory, SourceForge has open-source mathematics software.
I'm thankful for your reply, but you can't help me here without knowing about deep learning and neural networks.

You got me wrong. It's not about not understanding mathematical rules (calculus, as you pointed out); it's about not understanding the mathematics because something was left unexplained in a formula.

First of all, the paper uses stochastic gradient descent, which is rather different from the gradient in your link. I don't recommend reading about the gradient operator here, because it might bring more confusion when trying to understand how neural networks work.

I am sorry that I made it seem like I'm completely helpless with math here.
When I asked about f', I wanted to know about its components, the matrix W and x.

About Question 5's f: f should mean the cost function, but the problem is that they don't define it. There are many types of cost functions that are commonly used in neural networks.

It seems that I started from the middle; there are a lot of basics (things the paper expects you to know) that I have missed, and I can't seem to find the bottom.

I feel like I need to find a good book on deep learning. Books about deep learning often stray far from the area I'm currently focused on, because machine learning is a really, really big field.
closed account (48T7M4Gy)
Well, the paper assumes you read the language of mathematics and concentrates on the authors' interests, which happen to be described mathematically. The paper doesn't work in the opposite direction.

You don't have to apologise; all I am doing is pointing out where your expectations of the article go well beyond a straightforward programming exercise, and perhaps even beyond your ability to read the mathematical implications involved in it.

For instance, the prime after f has significance: f' is the derivative of f. It's not a simple mechanical rule; if only calculus were that easy.

Your problem in programming f'(Wx) is, quite rightly, knowing what Wx is. It involves a weight function they don't give you, as far as I can see, because the authors are writing generally. You have to generate it via the 'rules' involved in training a neural network, or make one up. You then have to go a step further and differentiate it with respect to x. And, worse still, that is only part of the grad function. It has virtually nothing to do with whether the specific case is 'stochastic gradient descent' or any other gradient; all they've done is pick a name to suit what they are doing. The principle is the same, and easily applied. Same for the term 'cost function': there's nothing wrong with it, but it's just network jargon that suits that system.

Maybe you've seen this freebie; there are millions learning the mathematics this way instead of waiting for a 'mathematics pill'. Good luck :)
http://www.dkriesel.com/_media/science/neuronalenetze-en-zeta2-2col-dkrieselcom.pdf
Topic archived. No new replies allowed.