Finding a char array in a large file

I have a specific byte (that is unsigned char) array, that I need to find in a very big file (2GB or so), currently I'm using:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
size_t fsFind(FILE* device, byte* data, size_t size)
{
        int SIZE = (1024 > size) ? 1024 : size;
        byte buffer[SIZE];
 
        int pos = 0;
        int loc = ftell(device);
 
retry:
        while(fread(buffer, sizeof(byte), SIZE, device) == SIZE)
        {
                for(int i = 0; i < SIZE; i++)
                {
                        if(buffer[i] == data[0])
                        {
                                for(int j = 0; j < size; j++)
                                {
                                        if(i + j < 1024)
                                        {
                                                if(buffer[i + j] != data[j])
                                                        goto nofound;
                                        }
                                        else
                                        {
                                                int pos = ftell(device);
                                                pos -= j;
                                                fseek(device, pos, SEEK_SET);
                                                goto retry;
                                        }
                                }
                                fseek(device, loc, SEEK_SET);
                                return pos + i;
                        }
nofound:;
                }
                pos += SIZE;
        }
        fseek(device, loc, SEEK_SET);
        return -1;
}


Which seems to find proper result on first use, but on subsequent searches it seems to find invalid positions, is there something I'm doing wrong, or is there a better way?
Last edited on
Have you tried rewriting it in C++ instead of C?
Does your OS have the grep command? Use it if possible.
@LB:
I don't see how using ifstream::read would make any difference, I thought these were merely wrappers

@dhayden
wouldn't grep be slower?
wouldn't grep be slower?

No, grep (or fgrep) would probably be faster. There are incredibly clever algorithms for searching for one string inside another. The larger the search string, the faster they go.

You should be limited only by the speed of your disks. In other words. you should be able to search the data as fast as you can read it from the disk drive.
interesing, but I can't seem to find any docs on fgrep, except for the shell, also does it work with binary files?
Last edited on
Use a string search algorithm (instead of a pattern-matching algorithm) for this.

Boyer–Moore: https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm

http://www.boost.org/doc/libs/1_57_0/libs/algorithm/doc/html/algorithm/Searching.html

Something along these lines (off the cuff, untested):
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
#include <iostream>
#include <vector>
#include <string>
#include <boost/algorithm/searching/boyer_moore.hpp>

std::streamoff find( std::string substr, std::istream& stm /* opened in binary mode */ )
{
    std::size_t str_size = substr.size() ;
    if( str_size > 1024 ) str_size = 1024 ;
    boost::algorithm::boyer_moore<std::string::iterator> boyer_moore( substr.begin(), substr.begin() + str_size ) ;

    std::size_t buff_size = 1024 * 1024 * 32 ; // 32 MB
    std::vector<char> corpus(buff_size) ;
    char* begin = std::addressof( corpus.front() ) ;

    std::streamoff offset = stm.tellg() ;
    std::size_t nread = buff_size ;

    while( nread == buff_size )
    {
        stm.seekg(offset) ;
        stm.read( begin, buff_size );
        nread = stm.gcount() ;
        if( nread == 0 ) break ;

        char* end = begin + nread ;
        char* found = boyer_moore( begin, end ) ;
        if( found != end ) return offset + (found-begin) ;

        offset += (buff_size-str_size) ;
    }

    return std::streamoff(-1) ;
}
Topic archived. No new replies allowed.