how to sort large file content

Hi,
I have a 2 GB file with 10 million lines. Is there any way to sort the file content without loading the full file into memory?
Divide your file into chunks.
Sort the chunks.
Merge the sorted chunks.
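A minimal sketch of the first two steps (divide and sort), assuming plain text lines and an illustrative chunk size of 100000 lines; the file names are made up:

#include <algorithm>
#include <fstream>
#include <string>
#include <vector>

int main() {
    std::ifstream in("big.txt");          // hypothetical 2 GB input
    std::vector<std::string> lines;
    std::string line;
    int chunk = 0;

    // Sort the buffered lines and write them out as one chunk file.
    auto flush = [&]() {
        std::sort(lines.begin(), lines.end());
        std::ofstream out("chunk" + std::to_string(chunk++) + ".txt");
        for (const auto& l : lines) out << l << '\n';
        lines.clear();
    };

    while (std::getline(in, line)) {
        lines.push_back(line);
        if (lines.size() == 100000) flush();  // a full chunk fits in memory
    }
    if (!lines.empty()) flush();              // last, partial chunk
}

Each chunk then fits in memory on its own; the sorted chunk files get merged afterwards.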
Considering you are in the Linux/UNIX section: the *nix sort command uses an external merge sort algorithm, which means it will stay within the available memory no matter how large the file being sorted is. This is not thoroughly documented on Wikipedia, but here are some articles.

http://en.wikipedia.org/wiki/Merge_sort

http://en.wikipedia.org/wiki/External_sorting

You could use the source of "sort", or you could call the command when needed. Either way, it will affect your code's portability outside the *nix family of OSes.

e.g.

sort unsorted_file > sorted_file

Very simple
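
If you go the route of calling the command, a minimal sketch from C++ (the file names are placeholders, and this assumes a *nix system with sort on the PATH):

#include <cstdlib>

int main() {
    // Let the shell run the external sort; returns sort's exit status.
    return std::system("sort unsorted_file > sorted_file");
}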
Use a memory-mapped file: you will "get" a contiguous piece of memory, sort that, and the operating system will take care of reading/writing the file without loading the whole file into memory. On *nix I think you need to use mmap() for this. (Windows supports this too, but has different APIs.)
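A minimal sketch of that idea, assuming POSIX mmap() and a file of fixed-size binary records (sorting variable-length text lines in place is much trickier); the record layout and file name are made up:

#include <algorithm>
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct Record { char key[16]; };  // hypothetical fixed-size record

int main() {
    int fd = open("records.bin", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return 1; }

    // Map the whole file; pages are faulted in only as they are touched.
    void* base = mmap(nullptr, st.st_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    Record* first = static_cast<Record*>(base);
    Record* last  = first + st.st_size / sizeof(Record);

    // std::sort works on the mapping as if it were an in-memory array;
    // the OS writes dirty pages back to the file.
    std::sort(first, last, [](const Record& a, const Record& b) {
        return std::memcmp(a.key, b.key, sizeof a.key) < 0;
    });

    munmap(base, st.st_size);
    close(fd);
}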
Use a memory mapped file
That'll use the whole machine's address space for the file. But 2 GB on a machine is nothing these days, so it's probably the best method.

If the file is larger than the address space, though, you will need to use a different method.
Use this idea (called merge sort):

First divide the file into small parts (say, 1000 of them :D),
then sort each of them.

Now pick two small files and load the first line of both.
Compare those,
call the smaller one tmp (or anything else),
and write tmp to the answer file.
Then read the next line from the file tmp came from, and repeat until you reach EOF.

Repeat this until you get everything in one file.

It uses a lot of time, but less memory.
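
A minimal sketch of that two-file merge step, assuming the inputs a.txt and b.txt are already sorted line by line (the file names are just placeholders):

#include <fstream>
#include <string>

int main() {
    std::ifstream a("a.txt"), b("b.txt");
    std::ofstream out("merged.txt");

    std::string la, lb;
    bool haveA = static_cast<bool>(std::getline(a, la));
    bool haveB = static_cast<bool>(std::getline(b, lb));

    // Write the smaller of the two current lines ("tmp" above), then
    // read the next line from the file it came from.
    while (haveA && haveB) {
        if (la <= lb) { out << la << '\n'; haveA = static_cast<bool>(std::getline(a, la)); }
        else          { out << lb << '\n'; haveB = static_cast<bool>(std::getline(b, lb)); }
    }
    // Once one file hits EOF, copy whatever remains in the other.
    while (haveA) { out << la << '\n'; haveA = static_cast<bool>(std::getline(a, la)); }
    while (haveB) { out << lb << '\n'; haveB = static_cast<bool>(std::getline(b, lb)); }
}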
I want to know what kind of program is 2 GB and 10-million-or-so lines? Just curious...
It's not a program, it's a data file.