Understand of "quasi" external sort

How can I sort these 120 integers with only 20 integers at a time by using a "quasi" external sort? I don't understand how to code this.

CODE/DIRECTIONS:
Sort a file with 120 ints where only 20 ints maybe placed into memory at any
one time

general idea:
break data into blocks, sort blocks into runs, merge runs

less general idea:
Call inFile1 our source file ( the one with the initial 120 records to be sorted, an inFile2, and 2 other files outFile1 and outFile2.
Break the file into blocks of size 20: in this case, there will be 6 blocks
Sort the blocks alternative by
read a block, sort it, store in outFile1
read a block, sort it, store in outFile2
etc.
By definition, a run is a sorted block

Merge the runs
Merge data from outFile1 and outFile2 to inFile1 and inFile2.
Merge the first run on outFile1 and the first run on outFile2,
and store the result on inFile1:
Read two records in main memory, compare, store the smaller
on inFile1
Read the next record from either outFile1 or outFile2 the file
that had its record moved/stored to inFile1
Similarly, merge the second run on outFile1 and the second run on
outFile2, store the result on inFile2.
Merge the third run on outFile1 and the third run on outFile2, store
the result on inFile1... etc
merging each run and storing the result alternatively on inFile1
and inFile2.

At the end
inFile1 and inFile2 will contain sorted runs twice the size of the
previous runs on outFile1 and outFile2
Now merge data from inFile1 and inFile2 to outFile1 and outFile2.
Merge the first run on inFile1 and the first run on inFile2, and store
the result on outFile1.
Merge the second run on inFile1 and the second run on inFile2,
store the result on outFile2, Etc, merge and store alternatively on inFile1 and inFile2.
Repeat the process until only one run is obtained. This would be the sorted
file.
sort_algorithms.t

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
#ifndef SORT_ALGORITHMS_T__
#define SORT_ALGORITHMS_T__


#include <cstdio>
#include <cstdlib>
#include <fstream>
using std::fstream;
using std::streampos;

#include <iostream>
using std::ios;
using std::endl;



template <typename T>
void swap( T &a, T &b )
{
  T temp = a;

  a = b;
  b = temp;
}


template <typename T>
void selectSort( T* arrayptr, int arraySize,  bool ( *cmp )( T &baseData1, T &baseData2 ) )
{
  int smallindex; // index of smallest element in the sublist
  int pass, j;

  // pass has the range 0 to n-2
  for ( pass = 0; pass < arraySize - 1; pass++ )
    {
      // scan the sublist starting at index pass
      smallindex = pass;

      // j traverses the sublist arrayptr[pass+1] to aarrayptr[n-1]
      for ( j = pass + 1; j < arraySize; j++ )
        // update if smaller element found
        if ( cmp ( arrayptr[j], arrayptr[smallindex] ) )
          smallindex = j;

      // when finished,  exchange smallest item with arrayptr[pass]
      swap( arrayptr[pass], arrayptr[smallindex] );
    }
}

template <typename T>
void doubleSeletcSort( T* arrayptr, int arraySize,  bool ( *cmp )( T &baseData1, T &baseData2 ) )
{
  // index of smallest and largest elements in a sublist
  int smallIndex, largeIndex;
  int i, j, k;
  T temp;

  // start indices
  i = 0;
  j =  arraySize - 1;
  // as long as i < j
  while (i < j)
    {
      // scan the sublist {arrayptr[i], ..., arrayptr[j]}
      smallIndex = i;
      largeIndex = i;

      // k traverses the sublist {arrayptr[i+1], ..., arrayptr[j]}
      for (k = i+1; k <= j; k++)
        // update if smaller element found
        if (  cmp( arrayptr[k], arrayptr[smallIndex] )    )
          smallIndex = k;
      // update if larger element found
        else if (  cmp( arrayptr[largeIndex], arrayptr[k] )  )
          largeIndex = k;

      // if smallIndex and i are not the same location,
      // swap smallest item in the sublist with arrayptr[i]
      if (smallIndex != i)
        {
          swap( arrayptr[i], arrayptr[smallIndex] );

          // at index i, arrayptr[i] maybe largest element
          // if so, swap moves the largest value to index smallIndex
          if (largeIndex == i)
            largeIndex = smallIndex;
        }

      // if largeIndex and j are not the same location,
      // swap largest item in the sublist with arrayptr[j]
      if (largeIndex != j)
        {
          swap( arrayptr[j], arrayptr[largeIndex]  );
        }

      // move i forward and j backward
      i++;
      j--;
    }
}

template <typename T>
void insertionSort( T* arrayptr, int arraySize,  bool ( *cmp )( T &baseData1, T &baseData2 )   )
{
  int i, j;
  T target;

  // place arrayptr[i] into the sublist
  //   arrayptr[0] ... arrayptr[i-1], 1 <= i < n,
  // so it is in the correct position
  for (i = 1; i <  arraySize; i++)
    {
      // index j scans down list from arrayptr[i] looking for
      // correct position to locate target. assigns it to
      // arrayptr[j]
      j = i;
      target = arrayptr[i];
      // locate insertion point by scanning downward as long
      // as target < arrayptr[j-1] and we have not encountered the
      // beginning of the list
      while (j > 0 && target < arrayptr[j-1])
        {
          // shift elements up list to make room for insertion
          arrayptr[j] = arrayptr[j-1];
          j--;
        }
      // the location is found; insert target
      arrayptr[j] = target;
    }
}


template <typename T>
void bubbleSort( T* arrayptr, int arraySize,  bool ( *cmp )( T &baseData1, T &baseData2 ) )
{
  register int i,j;
  // index of last exchange
  bool did_swap = true;

  // i is the index of last element in the current sublist
  i = arraySize - 1;

  // continue the process until we make no exchanges or
  // we have made n-1 passes
  while ( i > 0 && did_swap )
    {
      // assume no exchange occurs
      did_swap = false;

      // scan the sublist arr[0] to arr[i]
      for ( j = 0; j < i; j++ )
        // exchange a pair and assign true to exchangeOccurs
        if ( cmp( arrayptr[j + 1], arrayptr[j] ) )
          {
            swap( arrayptr[ j ], arrayptr[ j+1 ] );
            did_swap = true;
          }
      // move i downward one element
      i--;
    }
}


template <typename T>
void basicBubbleSort( T* arrayptr, int arraySize,  bool ( *cmp )( T &baseData1, T &baseData2 ) )
{
  register int i,j;

  for ( i = 1; i < arraySize; i++ )
    {
      for ( j = arraySize - 1; j >= i; j-- )
        {
          if  ( cmp( arrayptr[ j ], arrayptr[ j - 1 ] )  )
            swap( arrayptr[ j -1 ] , arrayptr[ j ]   );
        }

    }

}



template <typename T>
void mergesort1(T* arrayptr, const int& arraySize )
{
  sortmerge1( arrayptr, arraySize, 0, arraySize - 1 );
}


template <typename T>
void sortmerge1( T* arrayptr, const int& arraySize, int l, int r )
{

  int mid, i, j, k;



  if ( l < r )
    {
      mid = (r + l)/2;

      sortmerge1( arrayptr, arraySize, l, mid );
      sortmerge1( arrayptr, arraySize, mid + 1, r );

      T* temp = new T[ arraySize ]; 

      for ( i = mid + 1; i > l; i-- )
        temp[ i - 1 ]= arrayptr[ i - 1 ];

      for ( j = mid; j < r; j++ )
        temp[ r + mid - j ] = arrayptr[ j + 1 ];

      for ( k = l; k <= r; k++)
        arrayptr[k] = ( temp[i] < temp[j] )  ?  temp[i++] : temp[j--];

    }


  temp = 0;
  delete [] temp;
}


template <typename T>
void msSort( T* arrayptr, const int& arraySize )
{

  T* copy = new T[ arraySize ];
  int i;
  for ( i =  0; i < arraySize; i++ )
    copy[i] = arrayptr[i];

  mergesort2( copy, arrayptr, 0, arraySize - 1 );

}


template <typename T>
void mergesort2( T* source, T* dest, int l, int r )
{

  if ( l != r )
    {
      int mid = ( l + r )/2;
      mergesort2( dest, source, l, mid );
      mergesort2( dest, source, mid + 1, r );
      merge2( source, dest, l, mid, r );
    }

}


template <typename T>
void merge2( T* source,  T* arrayptr , int l, int mid,  int r )
{

  int i = l;
  int j = mid + 1;
  int k = l;

  while ( ( i <= mid  ) && ( j <= r ) )   	// Compare current item from each list
    if ( source[ i ] < source[ j ]  )  		// Then i item comes first
      arrayptr[ k++ ] = source[ i++ ];
    else                                  	// j item comes first
      arrayptr[ k++ ] = source[ j++ ];
  						// Move what is left of remaining list
  if ( i > mid )
    while ( j <= r )
      arrayptr[ k++ ] = source[ j++ ];
  else
    while (i <= mid )
      arrayptr[ k++ ] = source[ i++ ];
      
     
}

#endif 
> I don't understand how to code this.
you were given quite detailed instructions, ¿what specific part you don't understand?

> Break the file into blocks of size 20
¿can you do that?
for simplicity, assume that the quantity of numbers is multiple of 20
so you end with block{1..6}

> read a block, sort it, store in outFile1
once you have read it, ¿do you know how to sort 20 numbers? any algorithm will do.
then write the sorted array to disk.
so you end with sorted_block{1..6}

> Merge the runs (...)
ok, that's confusing
I don't understand why it never talks about outFile{3..6}
or why it calls inFile to something it will be write to
or what's the first, second, third run, ¿what's a run?

I guess that it wants to reuse the files, but let's ignore that for now, optimize later.
you'll perform a merge operation in two files, and will obtain a sorted file with all the elements
merge(sorted_block1, sorted_block2) -> merged_12
merge(sorted_block3, sorted_block4) -> merged_34
merge(sorted_block5, sorted_block6) -> merged_56
merge(merged_12, merged_34) - > merged_1234
merge(merged_1234, merged_56) -> merged_123456
now merged_123456 has the elements of your input file in sorted order

for the merge
1
2
3
4
5
6
7
8
9
10
11
int a, b;
in_a >> a
in_b >> b
while in_a.has_data() and in_b.has_data():
   if a<b:
      output << a
      in_a >> a
   else:
      output << b
      in_b >> b
//output << whatever remains in in_a or in in_b  


for the optimization of data in disk
- inFile has the input
- when you divide it into block{1..6}, inFile is no longer needed
- when you sorted them into sorted_block{1..6}, block{1..6} are no longer needed
- after each merge, the input files are no longer needed (after merge(a, b) -> ab, you don't need a, b anymore)
Last edited on
1. For breaking a file into 120 blocks. I tried to make a 'for' loop for reading 20 integers and another 'for' loop for blocks assign these 20 numbers into a block. Doing all of that 6 times. I still don't know how to break them into blocks (each block can hold 20 integers).

2. For reading, sorting, and storing each block, it has to be alternatively stored to outfile1 and outfile2.

3. A run, by definition, is a sorted block. However, note that each file outFile1 and outFile2 will have half of the runs.

4. Merging part from two output files to two input files. And do the same thing as merging from two input files to two output files. I have to alternatively store two output files.I don't get this part at all.
do you understand the merge (not merge sort, just merge) algorithm?
if you have 0,2,4 and 1,2,3, you want 0,1,2,3,4 out. how do you do that, only looking at the top value from each list?
read each list once.
0, 1 you have. write the smallest value from this to new file and discard it. you have 1.
now read the file that had the value you threw away in it (gotta track this): you now have 1,2 write the smallest and discard.... over and over. If you finish off one file, write the one you have left over and then just read/write directly the rest of the other file until its done.
Last edited on
I haven't done the merge algorithm before.
I haven't done the merge algorithm before.

You have an example of it in the code you posted.
the point of such assignments is to do things you haven't done before.
just work from what you know about things.
you know a sorted list, that if you take the top number off, the next number can't be bigger than the one you just took off. Now double that, work off two lists, then what do you know? Use the facts you know to figure out the process... if you take 2 numbers off a pair of sorted list, like 5 and 10, then you KNOW that if you take a third number from the list that gave you 10, it can't be smaller than 5 or 10. All you know about the next number in the 5's list is that its >5. It could be 6 or it could be 23. If its 6, its < 10, and goes after 5, but if its 23, 10 is next. So that is why i said track which list the smallest thing that you save to a new file is from, and get the next value from that one. Try it on paper with 2 lists of 5 or so numbers each, sorted. Work out what I am saying to see if you agree that it works. Then you can code it; its simple code, the hard part is understanding what you have and how to use what you have to get what you want.

Sure, you can also just copy from the web or even your own program... but it would be better if you figure it out anew and write it again yourself, understanding it.

by the way, you can bubble sort a file easily. read the file one by one find smallest, write it to new file appended. Repeat, remembering the last value you found, looking for the smallest value >= to the one you just found (and if its ==, you have to know that its a new copy not the one you found last time, so you need to track that piece of info somehow, possibly just the file position). Its ugly, but it works, you never need more than 3 or 4 ints in memory at any one time. If it were a binary file of ints, it would be easy to treat that as an array, and not really that inefficient if you were on a near 0 memory wristwatch sized computer.
Last edited on
Ok, that's helpful. Thanks for the information.
Topic archived. No new replies allowed.