Parsing a large file into smaller units

I have a large binary file (84 GB) that needs to be broken down into smaller files (~1 GB to 8 GB each) for analysis. The file lives on a 32-bit machine and cannot be copied to another machine for analysis. The PC has Visual Studio 6.0 and cannot be upgraded.
My issue is with the following generic code I'm using to create the smaller files.

fseek(f, start, SEEK_SET);
end = start + (variable based on file size);
for (i = start; i < end; i++)
{
	if (!feof(f))
	{
		byte = fgetc(f);
		fputc(byte, new_file);
	}
}

However, on a 32-bit machine the loop counter can only count up to ~2 billion, which means I'm unable to copy anything past ~2 GB. My original idea was to delete from the large binary file as I read from it, so that I could reset the counter on every read. However, I haven't come across a way to delete entries from a binary file.

Is there any other way to break down a large binary file into smaller units? Or is there a way to delete binary file entries, in sections or per entry?
Any help will be appreciated.
On a 64-bit machine I could use _fseeki64. I've been reading that some versions of Visual C++ 6.0 can support 64-bit integers, but on this machine both _fseeki64 and _lseeki64 come back as an "undeclared identifier".
That would be such an easy solution. Unfortunately I don't have access to a Linux-based machine; I'm using Windows XP on this PC.
split is available under cygwin, and there are native win32 builds around.
You do know that the standard Windows file functions can handle huge files?

You don't even need to do quadword arithmetic. Just read the file sequentially, changing output files at some dword limit count.
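
The sequential approach above can be sketched with plain stdio and nothing larger than a 32-bit count: copy a fixed number of chunks to each output file, then switch files. A sketch only, not tested against VC6; split_file and CHUNK_SIZE are names made up for this example.

```c
#include <stdio.h>

#define CHUNK_SIZE (64u * 1024u)   /* 64 KB per fread; any size works */

/* Split in_path sequentially into pieces of chunks_per_file * CHUNK_SIZE
   bytes each.  No file offset is ever stored, so no counter needs more
   than 32 bits.  Returns the number of pieces, or -1 on error. */
static int split_file(const char* in_path, const char* out_prefix,
                      unsigned long chunks_per_file)
{
    static char buf[CHUNK_SIZE];
    char out_path[260];
    FILE* in = fopen(in_path, "rb");
    FILE* out = NULL;
    unsigned long chunks = 0;       /* chunks written to current piece */
    int file_no = 0;
    size_t got;

    if (!in) return -1;

    while ((got = fread(buf, 1, CHUNK_SIZE, in)) > 0)
    {
        if (!out)                   /* open the next piece on demand */
        {
            sprintf(out_path, "%s%03d.dat", out_prefix, file_no++);
            out = fopen(out_path, "wb");
            if (!out) { fclose(in); return -1; }
        }
        fwrite(buf, 1, got, out);
        if (++chunks == chunks_per_file)  /* piece full: switch files */
        {
            fclose(out);
            out = NULL;
            chunks = 0;
        }
    }
    if (out) fclose(out);
    fclose(in);
    return file_no;
}
```

With CHUNK_SIZE at 64 KB, split_file("input_file.dat", "output_file", 16384) would give 1 GB pieces; neither the chunk count nor the piece number ever approaches the 32-bit limit.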
I've not used the standard Windows file functions before, so there will be a learning curve there. I've managed to use _fseeki64 and _ftelli64 to get past the 2 GB point. However, fgetc, fputc, fread and fwrite seem limited by the same 32-bit values.
Is there a way to get around this? When my start = 0 and end = 0x200000000 (8 GB), the output file is only ~1.3 GB.
I'm assuming this is a limitation of fgetc() and fputc()?
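
One possible cause (an assumption, since the exact loop isn't shown): if the byte counter or the computed length ever lands in a 32-bit int or long, it overflows well before 0x200000000. A sketch that keeps the running total in a 64-bit integer while each individual fread call stays within size_t. copy_range, i64 and SEEK64 are made-up names, and the non-MSVC branch is only a shim so the sketch compiles elsewhere:

```c
#include <stdio.h>

#ifdef _MSC_VER
typedef __int64 i64;                          /* VC6's 64-bit type */
#define SEEK64(f, off) _fseeki64((f), (off), SEEK_SET)
#else
typedef long long i64;                        /* shim for other compilers */
#define SEEK64(f, off) fseek((f), (long)(off), SEEK_SET)
#endif

#define BUF_SIZE (1024u * 1024u)              /* 1 MB per fread call */

/* Copy "length" bytes from offset "start" of in to out.  The running
   total lives in a 64-bit integer; only the per-call count is a size_t.
   Returns the number of bytes actually copied, or -1 on seek failure. */
static i64 copy_range(FILE* in, FILE* out, i64 start, i64 length)
{
    static char buf[BUF_SIZE];
    i64 remaining = length;
    size_t want, got;

    if (SEEK64(in, start) != 0) return -1;

    while (remaining > 0)
    {
        want = remaining > (i64)BUF_SIZE ? BUF_SIZE : (size_t)remaining;
        got = fread(buf, 1, want, in);
        if (got == 0) break;                  /* end of file or error */
        fwrite(buf, 1, got, out);
        remaining -= got;
    }
    return length - remaining;
}
```

Calling copy_range once per output file, with length set to the piece size, keeps every 32-bit quantity safely bounded by BUF_SIZE.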
You know, unless you are trying to do something like preserve complete lines of text, why not use the split program like kbw suggests (e.g. split -b 1024m input_file.dat output_file)? It would make your life so much easier.
> I've not used the standard Windows file functions before so there will be a learning curve there.

They are fairly straightforward. It would look something like this (not tested):

#include <iostream>
#include <string>
#include <sstream>
#include <iomanip>

#ifndef _WIN32_WINNT
#define _WIN32_WINNT 0x0501 // XP
#endif // _WIN32_WINNT

#include <windows.h>

int main()
{
    const char* srce_path = "input_file.dat" ;
    const char* dest_path_prefix = "output_file" ;
    const char* dest_path_suffix = ".dat" ;
    // split filenames: output_file000.dat, output_file001.dat etc.

    // http://msdn.microsoft.com/en-us/library/windows/desktop/aa363858(v=vs.85).aspx
    HANDLE infile = CreateFile( srce_path, GENERIC_READ, 0, 0, OPEN_EXISTING,
                                FILE_FLAG_SEQUENTIAL_SCAN, 0 ) ;
    if( infile != INVALID_HANDLE_VALUE )
    {
        // http://msdn.microsoft.com/en-us/library/windows/desktop/aa364957(v=vs.85).aspx
        LARGE_INTEGER li ;
        GetFileSizeEx( infile, &li ) ;
        const __int64 file_size = li.QuadPart ;


        const __int64 SPLIT_FILE_SZ = 1024*1024*1024 ; // 1 GB each

        // number of bytes to read and write at one go (factor of SPLIT_FILE_SZ)
        static const std::size_t CHUNKSZ = 1024*1024*64 ; // 64MB

        // http://msdn.microsoft.com/en-us/library/windows/desktop/aa366887(v=vs.85).aspx
        void* buffer = VirtualAlloc( 0, CHUNKSZ, MEM_COMMIT, PAGE_READWRITE ) ;

        for( __int64 pos = 0 ; pos < file_size ; pos += SPLIT_FILE_SZ )
        {
            std::string dest_path ; // form the dest file name
            {
                static int cnt = 0 ;
                std::ostringstream stm ;
                stm << dest_path_prefix << std::setw(3) << std::setfill('0')
                    << cnt++ << dest_path_suffix ;
                dest_path = stm.str() ;
            }
            HANDLE outfile = CreateFile( dest_path.c_str(), GENERIC_WRITE, 0, 0,
                                         CREATE_ALWAYS, FILE_FLAG_SEQUENTIAL_SCAN, infile ) ;
            if( outfile != INVALID_HANDLE_VALUE )
            {
                // read and write CHUNKSZ bytes at a time
                for( __int64 n = 0 ; n < SPLIT_FILE_SZ ; n += CHUNKSZ )
                {
                    DWORD read = 0 ;
                    // http://msdn.microsoft.com/en-us/library/windows/desktop/aa365467(v=vs.85).aspx
                    if( !ReadFile( infile, buffer, CHUNKSZ, &read, 0 ) ) break ;
                    // http://msdn.microsoft.com/en-us/library/windows/desktop/aa365747(v=vs.85).aspx
                    if(read) WriteFile( outfile, buffer, read, 0, 0 ) ;
                    if( read < CHUNKSZ ) break ; // last chunk
                }
                CloseHandle(outfile) ;
            }
        }

        CloseHandle(infile) ;
        VirtualFree( buffer, 0, MEM_RELEASE ) ;
    }
}