2-3 Tree insertion

I'm having some trouble making the jump from the insertion algorithm for a 2-3 tree to the actual code. Here is my understanding of the algorithm:

1. Traverse to the leaf to be inserted at.
2. If leaf has one data item, insert 2nd item. (Base case, no further action).
3. If leaf has two data items, sort d1, item, and d2.
Middle value is pushed up to the parent node, while d1 and d2 are split into two nodes.
4. If 3, parent node receives the middle value, and attempts to insert it. If parent node has two data items, step 3 repeats.

This occurs recursively until either the root node is reached (and split if needed) or a parent node with only one data item is reached.

The concept is clear to me (I think), but I'm having trouble understanding the exact process of splitting the node. It seems to clearly call for head recursion, as parent nodes need to be updated on the way back "up" the tree. If splitting the node happens in the current function call, how does the parent's other child pointers get updated? The return statement presumably returns the middle value so that the next function call up can attempt to insert it into its data array. Do all of the parent's child pointers get passed in as arguments, to be updated as needed? I've been looking for example code to trace through, but all I've been able to find are explanations of the algorithm.
Last edited on
Don't pass the "middle value" back. Pass the entire node. Let the previous function instance handle it since it has the necessary parent node to split it if it's become a 4-node.

This is just the idea, it may not be quite right:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
insert( node, data ):
    // find correct leaf node recursively
    if node is not a leaf:
        find the correct 'child' and call insert again:
            x = insert( child, data )  // 'x' for '(possibly) expanded node'

    // First time here node will be a leaf,
    // all subsequent times it will be an inner node

    if node is not a leaf  // on our way back up
        if x is a 4-node:
            split x  // node is x's parent
    else
        expand the node to accomodate new value
           (no pointer stuff here since all child pointers are null
            or possibly non-existent in a space-efficient implementation)
    return node

Would it be better to handle the splitting within the insert function, or would it be better to delegate that to another function? Or does it matter?
Whether or not it's a function technically doesn't matter, but a function shouldn't get too complicated. It can help to delegate.

What are your Nodes going to look like?
Initially, I planned on something like:

1
2
3
4
struct node23 {
     int ** data = new int*[2];
     node23 * = new node23[3];
}


But what I'm seeing makes it sound like the data array has to be able to hold 3 data items (the third being temporary); is that a correct impression?
Conceptually it needs to hold 4, but you may be able to avoid that. Maybe just make room for 4 for now, and if it becomes clear how to remove it then remove it.

I don't understand why data is a double pointer (some other stuff is mangled too).
It looks like you might mean something like:

1
2
3
4
struct node23 {
     int*     data = new int[3];
     node23** children = new node23*[4];
}

Although you could also just:

1
2
3
4
struct node23 {
     int data[3];
     node23* children[4];
}

Do you have a way to identify which type a node is?

Another design for insert() is to pass the parent into each invocation. That might be cleaner. I'll try to implement something (but I won't post the code unless you say so).
Last edited on
My intent with making the data pointers to ints rather than ints was to have a good way to indicate a null value, rather than just deciding that if data[i] == 0 it's empty (This is also a solution, but feels kludgy to me). This way, if data[i] == nullptr, then the node only has one data item. It seems like a good solution to me- am I wrong in that?

Ok, let me check my understanding:

Steps I have (Not in correct order):

1. If root is nullptr, return null.
2. node23 * ptr = insert(root->child[i], item); //Whichever branch is correct
3. If root is leaf && has 1 item, insert and return null.
4. If root is leaf && has two items, insert into data[2] and return root
5. If ptr == null, return null
6. If ptr, then create two new nodes, insert the small and large values from ptr->data into them, and assign those nodes to the appropriate child ptrs.
7. If 6, then insert middle value of child; return null if root has 1 item, otherwise insert into data[2] and return root
8. If root == this->root && ptr && root has two items, create new node, insert the middle value, then split root and assign its values and children to the left/mid/right children of the new node, then set this->root = to the new node.

Questions:

1. If insertion is attempted into a full node (2 data items), the extra value gets stored in data[2], which is only used for this purpose, correct?

2. The case where the root (local variable) is the root of the 2-3 tree and a new root node has to be created needs to be handled as a special case as well, right? My understanding is that this is the basis of how b-trees maintain consistent runtime.

3. Would it be useful to add a bool flag to the node struct indicating whether it's full? I can see it simplifying some of the operations.

I want to try to work this out from the algorithm, for now- I think I lean too much on example code. If I'm really, truly stuck, I'll ask for the implementation (If you have it).
I'm going to start coding this later today or tomorrow, and will undoubtedly post back with followup questions!
Last edited on
Instead of using 0 as an "unused" value (which is worse than kludgy unless 0 is truly an unused value in the greater application), you could use numeric_limits<int>::min(). That value is unique in that it is the only negative value without a corresponding positive representation, kind of an odd-man-out, so it makes a good "unused" value.

1
2
3
4
5
6
7
8
const int IntMin = std::numeric_limits<int>::min();

struct Node {
    int   data[3]; // is 2-node if data[1] == IntMin
                   // else is 3-node if data[2] == IntMin
                   // else is 4-node
    Node* child[4];
};


About your "steps":

1. Presumably if root is nullptr you should return a new 2-node with the given data. This will be the initial node in the tree. It's a special case since after this new data is always added to an existing node, expanding it, and then of course splitting it if necessary.
2. ...
3. If root is leaf 2-node, expand to 3-node and return node. Why do you always want to return null? You should never return null.
4. If root is leaf 3-node, insert into the proper position (not necessarily data[2])), creating a 4-node which must be split at some point.
5. Never happens. Will always hit a leaf before this.
6. I don't understand this.
...

Anyway. Try writing some real code.

Questions:

1. Not exactly. The data is inserted in the proper order, not necessarily data[2]. Also, if this is not a leaf node (and therefore has children) then the children also need to be rearranged.

Say we insert 4 into the following tree:

       10
  3,5      15

First we get a temporary 4-node

         10
  3,4,5      15

Then we split the 4-node:

    4 , 10
  3   5    15

Note that we not only move the middle value upwards (into the parent node),
but also rearrage the children, creating a new node.

If we insert 1 into the following tree:

      4 , 10
  2,3   5    15

We get:

        4 , 10          (A)
  1,2,3   5    15

Splitting once:
 
    2 , 4 , 10          (B)
  1   3   5    15

Splitting again:

        4               (C)
    2       10
  1   3   5    15

The first split (A) to (B) creates 1 new node.
The second split (B) to (C) creates 2 new nodes since it needed to create a
new root node. This is the only way new roots are created and the only way
the tree height increases. The tree grows upwards instead of downwards.


2. I'm not sure. I haven't implemented it.

3. I don't think so. (I assume "full" means a 4-node?) If you did that then you may as well add an int to tell you if it's a 2, 3, or 4 node.
Last edited on
Now that I've thought about it I think something like the following might do. The idea is to maintain a parent pointer. You could remove the recursion entirely since you have a linked-list back to the root if necessary. As for a 4-node (3 data items), you can probably handle the split right away and won't really need to make an actual 4-node.

I've used a special value for child[2] to mark the node as a 2-node.
nullptr is of course (usually?) zero, which is therefore an unused address.
Assuming 1 is also an unused address, I set Unused to (Node*)(1).
Other than a trick like this it seems like yet another data member is needed, and there's already too many.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
#include <iostream>

struct Node23;
const auto Unused = reinterpret_cast<Node23*>(1);

struct Node23
{
    int     data[2];
    Node23* child[3];
    Node23* parent;

    // New nodes are always 2-nodes
    Node23(int d, Node23* p=nullptr, Node23* a=nullptr, Node23* b=nullptr)
        : data{d, 0}, child{a, b, Unused}, parent{p} { }

    bool is2node() const { return child[2] == Unused; }

    bool is_leaf() const { return child[0] == nullptr; }

    bool expand(int d) {
        if (!is2node())
            return false; // a 4-node; let someone else handle it
        if (d < data[0]) {
            data[1] = data[0];
            data[0] = d;
        }
        else
            data[1] = d;
        child[2] = nullptr; // mark as 3-node (overwriting Unused value)
        // If this is not a leaf not then this child and the others
        // will need to be properly set. For a leaf, set to nullptr.
        return true;
    }
};

void out(const char* s, bool b) {
    std::cout << s << "? " << (b ? "yes" : "no") << '\n';
}

int main() {
    auto a = new Node23(1);
    out("2node",  a->is2node());
    out("split", !a->expand(2));
    out("2node",  a->is2node());
    out("split", !a->expand(3));
}

Wow, that is elegant!
I coded the node like so:
1
2
3
4
5
6
7
8
9
struct node23 {                                                                                   
 13      int ** data; //Array of pointers to ints                                                    
 14      node23 ** child; //Array of pointers to child nodes                                       
 15      bool full;  //Flagged true if data contains 2 items                                         
 16                                                                                                  
 17      node23(); //Constructor                                                                   
 18      ~node23(); //Destructor                                                                      
 19                                                                                                 
 20 };             


This is my insert function:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
node23* tree23::insert(node23*& root, int* to_add)
143 {
144      if(root == nullptr) //Base case- went too far
145           return nullptr;
146      node23 * x; //Catches ptr from returning function calls
147      bool isleaf = true; //Check if this is a leaf
148      for(int i = 0; i < 3; ++i)
149      {
150           if(root->child[i])
151                isleaf = false;
152      }                
153      if(!isleaf) //If not, we do all this stuff
154      { //Find the right child to pass to the next call
155           if(*to_add < *(root->data[0])) 
156                x = insert(root->child[0], to_add);
157           else if(*to_add > *(root->data[1]))
158                x = insert(root->child[2], to_add);
159           else
160                x = insert(root->child[1], to_add);
161 
162           if(x) //If we got back a node from the recursion
163           {
164                if(!root->full) //Then we check if root can absorb the incoming value
165                {//Find where it goes, place, and return null
166                     if(*(x->data[2]) > *(root->data[0]))
167                     {
168                          root->data[1] = x->data[2]; 
169                          root->full = true;
170                          return nullptr; ///Our work is done
171                     }
172                     else if(*(x->data[2]) < *(root->data[0]))
173                     {
174                          root->data[1] = root->data[0];
175                          root->data[0] = x->data[2];
176                          root->full = true;
177                          return nullptr;
178                     }
179                }
180                else //This node is full and we need to split it  
181                {//Sort the data so that data[2] holds the mid value
182                     if(*(x->data[2]) > *(root->data[1]))
183                     {
184                          root->data[2] = root->data[1];
185                          root->data[1] = x->data[2];
186                     }
187                     else if(*(x->data[2]) < *(root->data[0]))
188                     {
189                          root->data[2] = root->data[0];
190                          root->data[0] = x->data[2];
191                     }
192                     else
193                          root->data[2] = x->data[2];
194                     split(x);
195                     return root;
196                }
197           }
198      }
199      else //Root is a leaf- time to insert!
200      {
201           if(!root->full)
202           {
203                if(*to_add >= *(root->data[0]))
204                {
205                     root->data[1] = to_add;
206                     root->full = true;
207                     return nullptr;
208                }
209                else if(*to_add < *(root->data[0]))
210                {
211                      root->data[1] = root->data[0];
212                      root->data[0] = to_add;
213                      root->full = true;
214                      return nullptr;
215                }
216           }
217           else //Leaf has no room
218           { //Insert the data, placing mid value in data[2]
219                if(*to_add < *(root->data[0]))
220                {
221                     root->data[2] = root->data[0];
222                     root->data[0] = to_add;
223                }
224                else if(*to_add >= *(root->data[1]))
225                {
226                     root->data[2] = root->data[1];
227                     root->data[1] = to_add;
228                }
229                else
230                     root->data[2] = to_add;
231           }
232           return root; //The leaf has three items- push it up to be split!
233      }
234      cout << "Insert function got too far! \n";
235      return nullptr;
236 
237 }

And my split function:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
void tree23::split(node23*& root)
123 {    
124      if(root == nullptr)
125           return;
126      node23 * n_root = new node23;
127      n_root->data[0] = root->data[2]; //This will always be the mid value
128      n_root->child[0] = new node23;
129      n_root->child[2] = new node23;
130      n_root->child[0]->data[0] = root->data[0]; 
131      n_root->child[2]->data[0] = root->data[1]; //Place the data in the new child nodes
132      //Now set the child child nodes to root's children
133      n_root->child[0]->child[0] = root->child[0];
134      n_root->child[0]->child[2] = root->child[1];
135      n_root->child[2]->child[2] = root->child[2];
136      delete root;
137      root = n_root;
138      n_root = nullptr;
139      return;
140 }


My public insert method still needs a little work, and the compiler warns me that I need to look at my paths through the insert function (Hence the warning message at the end :D), but I *think* this will function, though it's probably really sloppy. I'm going to write a display function and a driver tomorrow and give it a test run; any feedback you have, I would definitely appreciate!
You may find this print_structure function useful.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
#include <iostream>
#include <iomanip>

class Node23
{
    static Node23* const Unused;
    int     data[2];
    Node23* child[3];
    Node23* parent;

public:
    // New nodes are always 2-nodes
    Node23(int d, Node23* p=nullptr, Node23* a=nullptr, Node23* b=nullptr)
        : data{d, 0}, child{a, b}, parent{p} { child[2] = Unused; }

    // For manual tree-building we need to make 3-nodes
    Node23(int d0, int d1, Node23* p=nullptr,
           Node23* a=nullptr, Node23* b=nullptr, Node23* c=nullptr)
        : data{d0, d1}, child{a, b, c}, parent{p} { }

    ~Node23() {
        delete child[0]; // delete is okay with nullptr
        delete child[1];
        if (is3node()) delete child[2]; // but not with our special value
    }

    // "active" pointer is neither nullptr nor Unused.
    static bool active(Node23* node) { return node && node != Unused; }

    bool is3node() const { return child[2] != Unused; }
    bool is_leaf() const { return child[0] == nullptr; }

    bool expand(int d) {
        if (is3node())
            return false; // will become a 4-node; let caller handle it
        if (d < data[0]) {
            data[1] = data[0];
            data[0] = d;
        }
        else
            data[1] = d;
        child[2] = nullptr; // mark as 3-node (overwriting Unused value)
        return true;
    }

    void print(bool top = true) const {
        if (child[0]) child[0]->print(false);
        std::cout << data[0] << ' ';
        if (child[1]) child[1]->print(false);
        if (is3node()) {
            std::cout << data[1] << ' ';
            if (active(child[2])) child[2]->print(false);
        }
        if (top) std::cout << '\n';
    }

    void print_structure(int depth = 0) const {
        if (is3node()) {
            if (active(child[2])) child[2]->print_structure(depth + 1);
            std::cout << std::setw(depth * 4) << "" << data[1] << '\n';
        }
        if (child[1]) child[1]->print_structure(depth + 1);
        std::cout << std::setw(depth * 4) << "" << data[0] << '\n';
        if (child[0]) child[0]->print_structure(depth + 1);
    }
};

Node23* const Node23::Unused = reinterpret_cast<Node23*>(1);

int main() {

    //              10
    //    3 , 5             15
    //  1   4   6       12     18

    // Manually create a tree (with null parent pointers for now)
    auto a = new Node23(10,
                 nullptr,
                 new Node23(3, 5,
                     nullptr,
                     new Node23(1),
                     new Node23(4),
                     new Node23(6)),
                 new Node23(15,
                     nullptr,
                     new Node23(12),
                     new Node23(18)));

    a->print();
    
    // The structure printout shows the tree sideways.
    // Tilt your head to the left to view it. :)
    a->print_structure();
    
    delete a;
}


1 3 4 5 6 10 12 15 18 
        18
    15
        12
10
        6
    5
        4
    3
        1
Last edited on
Topic archived. No new replies allowed.