Hello everyone!
I'm trying to implement ADADELTA in my simple feed-forward neural network,
but I think I'm having some trouble understanding the article:
http://arxiv.org/pdf/1212.5701v1.pdf
It's a short article explaining/introducing the ADADELTA algorithm.
Only about a page and a half is devoted to the formulas.
Starting from part:
"Algorithm 1 Computing ADADELTA update at time t"
Question 1
'3: Compute Gradient: gt'
How exactly do I calculate gradient here?
Is my approach correct?
|
/* Calculating the gradient for a neuron in the hidden layer:
   gradient = sum over outgoing connections of
              (target neuron's gradient * connection's weight) * activation derivative
*/
double CalculateHiddenGradient() {
    double sum = 0.0;
    for (size_t i = 0; i < OutcomingConnections.size(); i++) {
        sum += OutcomingConnections[i]->weight * OutcomingConnections[i]->target->gradient;
    }
    return (1.0 - output * output) * sum; // tanh's derivative: 1 - tanh(x)^2
}

// Calculating the gradient for output neurons, where the desired output is known
double CalculateGradient(double TargetOutput) {
    return (TargetOutput - output) * (1.0 - output * output);
}
|
Question 2
'5: Compute Update: ∆x_t'
Formula (14) says the following:
∆x_t = -( RMS[∆x]_{t-1} / RMS[g]_t ) * g_t
Is RMS[∆x]_{t-1} calculated as follows,
RMS[∆x]_{t-1} = sqrt( E[∆x²]_{t-1} + ε )
taking the body from formula (9)?
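For my own checking I wrote this tiny helper for the running average of formula (8) plus the RMS of formula (9); the struct and its names are mine, not from the paper:

```cpp
#include <cmath>

// Decaying average per formula (8): E[v²]_t = ρ·E[v²]_{t-1} + (1-ρ)·v_t²,
// then RMS per formula (9): RMS[v]_t = sqrt(E[v²]_t + ε). Names are my own.
struct RunningRMS {
    double E = 0.0; // decaying average of squared values
    double p;       // decay rate ρ
    double e;       // constant ε
    RunningRMS(double DecayRate, double Epsilon) : p(DecayRate), e(Epsilon) {}
    // Feed in one value, get back the updated RMS.
    double Accumulate(double v) {
        E = p * E + (1.0 - p) * v * v;
        return std::sqrt(E + e);
    }
};
```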
Based on what I currently understand, I was able to write this piece of code:
|
#include <cmath>
#include <vector>
using std::vector;

class AdaDelta {
private:
    vector<double> Eg; // E[g²]
    vector<double> Ex; // E[∆x²]
    vector<double> g;  // gradients
    size_t windowsize;
    double p; // decay rate ρ
    double e; // constant ε?

public:
    AdaDelta(size_t WindowSize = 32, double DecayRate = 0.95, double ConstantE = 0.001) { // initializing variables
        Eg.reserve(WindowSize + 1);
        Ex.reserve(WindowSize + 1);
        g.reserve(WindowSize + 1);
        Eg.push_back(0.0); // E[g²]_0
        Ex.push_back(0.0); // E[∆x²]_0
        g.push_back(0.0);  // g_0
        windowsize = WindowSize; // common value: ?
        p = DecayRate;           // common value: 0.95
        e = ConstantE;           // common value: ?
    }

    // Does it return the weight update value?
    double CalculateUpdated(double gradient) {
        double dx = 0.0;
        size_t t;

        // for t = 1 : T do %% loop over # of updates
        for (t = 1; t < Eg.size(); t++) {
            // Accumulate gradient, formula (8)
            Eg[t] = p * Eg[t - 1] + (1.0 - p) * (g[t] * g[t]);
            // Compute update, formula (14)
            dx = -(std::sqrt(Ex[t - 1] + e) / std::sqrt(Eg[t] + e)) * g[t];
            // Accumulate updates (note the ρ factor, same form as formula (8))
            Ex[t] = p * Ex[t - 1] + (1.0 - p) * (dx * dx);
        }

        /* Calculate the new update
           =================================== */
        t = g.size();
        g.push_back(gradient);
        // Accumulate gradient
        Eg.push_back(p * Eg[t - 1] + (1.0 - p) * (g[t] * g[t]));
        // Compute update
        dx = -(std::sqrt(Ex[t - 1] + e) / std::sqrt(Eg[t] + e)) * g[t];
        // Accumulate updates
        Ex.push_back(p * Ex[t - 1] + (1.0 - p) * (dx * dx));

        // Drop the oldest entries once the window has grown bigger than we allow
        if (g.size() >= windowsize) {
            Eg[1] = 0.0; // the entry after the erased one becomes the new base
            Ex[1] = 0.0; // of the window, so reset it before recomputing
            Eg.erase(Eg.begin());
            Ex.erase(Ex.begin());
            g.erase(g.begin());
        }
        return dx;
    }
};
|
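From re-reading Algorithm 1, I suspect the window may not be needed at all, because ρ already decays old terms exponentially. Here is the per-weight version of my reading (class name and default values are my own guesses):

```cpp
#include <cmath>

// Per-weight ADADELTA state, following Algorithm 1 line by line (my reading):
// only two running averages are kept, since ρ's exponential decay replaces the window.
class AdaDeltaPerWeight {
    double Eg = 0.0; // E[g²]
    double Ex = 0.0; // E[∆x²]
    double p, e;     // decay rate ρ, constant ε
public:
    AdaDeltaPerWeight(double DecayRate = 0.95, double Epsilon = 1e-6)
        : p(DecayRate), e(Epsilon) {}
    // Returns ∆x_t, to be ADDED to the weight: x_{t+1} = x_t + ∆x_t.
    double Update(double g) {
        Eg = p * Eg + (1.0 - p) * g * g;                          // line 4: accumulate gradient
        double dx = -(std::sqrt(Ex + e) / std::sqrt(Eg + e)) * g; // line 5: compute update
        Ex = p * Ex + (1.0 - p) * dx * dx;                        // line 6: accumulate updates
        return dx;
    }
};
```

Is this closer to what the paper intends?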
Question 3
In backpropagation, a weight update goes like this:
∆w = target's gradient * source's output * learning rate
but in the ADADELTA algorithm I don't see that step.
Should I multiply the source's output into the target's gradient before calling
CalculateUpdated(), or should I combine the output with the returned value to get the new weight?
Question 4
The part that got me confused all along is
"3.2. Idea 2: Correct Units with Hessian Approximation"
I got very confused here because I don't quite understand
which part of the formula we are updating, or what changes.
Formula (13), as I read it, rearranges Newton's method into:
1/(∂²f/∂x²) = ∆x/(∂f/∂x)
Question 5
What do ∆x, ∂f, and ∂x stand for in formula (13)?
Thanks!