Recursive Directory Scan: Can we use multi-threaading for speed?

In our Qt application, we have to load several files on application start-up starting from a root directory.

We are recursively scanning the directories and loading the files.

The total time it takes is about 12 seconds. Can we reduce this time if we use multi-threading?

I have heard that multi-threading does not work everywhere and it increases code complexity and debugging issues.

Would multi-threading solve the above problem? We want it to be platform independent (Mac, Linux, Windows).
Last edited on
In our Qt application, we have to load several files on application start-up starting from a root directory.
Maybe start searching from directories where you expect to find those files?
Or exclude directories where they should not be?

Would multi-threading solve the above problem?
Hint: hard disc drives cannot read two different areas at the same time.
It depends on what the bottlenecks are. If it's slow mainly because it's waiting for the data to be read from the hard drive you will not see much benefit using multiple threads.
Thanks for your responses!

The files are small component files residing in different "category" folders in hierarchical order. Each directory has atleast one file or a sub-directory. So unfortunately we can't skip any folder in the scan process.

1
2
3
4
5
6
7
8
root
  - Category 1
     - Cat 1.1
     - Cat 1.2
      ...
  - Category 2
  -
  -


I have tried to time the application, and I see that maximum time is consumed will trying to recurse the directory structure, not while reading the files.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
void FileElementsCategory::reload() {

	// ... some code
	
	
	QStringList dirs = cat_dir.entryList(QStringList(), QDir::AllDirs | QDir::NoSymLinks | QDir::NoDotAndDotDot, QDir::Name);

        //Maximum time is consumed in this loop
	foreach(QString dir, dirs) {

                // this creates a new element category and calls its "reload" function again.
		categories_.insert(dir, new FileElementsCategory(cat_dir.absoluteFilePath(dir), this, file_parent_collection_));
	}
	
	// Process the file (does not take much time)
	QStringList elmts = cat_dir.entryList(QStringList("*.elmt"), QDir::Files, QDir::Name);
	foreach(QString elmt, elmts) {
		elements_.insert(elmt, new FileElementDefinition(cat_dir.absoluteFilePath(elmt), this, file_parent_collection_));
	}
}
Last edited on
@abhishekm71

Be VERY careful what you use multi-threading for. If you use multi-threading to load data, and you write it wrong, you can have a race-condition nightmare. I would suggest each thread load 1 file.

Another thing you can do is change the file-structure so that it loads, reads, or writes differently. In other words: you could have some sore of reading mechanism in place that allows you to skip to different areas in the file.
> We want it to be platform independent (Mac, Linux, Windows).

Use the standard C++ library.


> Would multi-threading solve the above problem?

With a truly re-entrant kernel, it may. Measure the performance for your specific use-cases.

This would be the general idea:

Single thread:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
#include <iostream>
#include <vector>
#include <string>
#include <chrono>
#include <boost/filesystem.hpp>

int main()
{
    using namespace std::chrono ;
    const auto start = steady_clock::now() ;
    
    std::vector<std::string> file_list ; 
    
    auto begin = boost::filesystem::recursive_directory_iterator("/usr/local");
    const auto end = boost::filesystem::recursive_directory_iterator() ;
    unsigned long long cnt = 0 ;
    
    for( ; begin != end ; ++begin )
    {
        ++cnt ;
        const std::string path = begin->path().string() ;
        if( path.find( ".a" ) != std::string::npos ) file_list.push_back(path) ;
    }
    
    const auto dur = steady_clock::now() - start ;

    std::cout << cnt << " directory entries were searched\n" 
              << file_list.size() << " entries with '.a' in the path were added to the vector\n" 
              << "elapsed wall-clock time " << duration_cast<milliseconds>(dur).count() << " milliseconds.\n" ;
}

g++-4.8 -std=c++11 -O2 -Wall -Wextra -pedantic-errors -pthread main.cpp -lboost_system -lboost_filesystem && ./a.out
13923 directory entries were searched
226 entries with '.a' in the path were added to the vector
elapsed wall-clock time 108 milliseconds.

http://coliru.stacked-crooked.com/a/6caa6d81c8a0d14b

One thread per top-level sub-directory:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
#include <iostream>
#include <vector>
#include <string>
#include <chrono>
#include <boost/filesystem.hpp>
#include <atomic>
#include <future>

std::atomic<unsigned int> nthreads{1} ;

std::vector<std::string> find_files( boost::filesystem::path p, std::string substr )
{
    ++nthreads ;
    std::vector<std::string> file_list ;
    
    auto begin = boost::filesystem::recursive_directory_iterator(p);
    const auto end = boost::filesystem::recursive_directory_iterator() ;

    for( ; begin != end ; ++begin )
    {
        const std::string path = begin->path().string() ;
        if( path.find( substr ) != std::string::npos ) file_list.push_back(path) ;
    }
    
    return file_list ;
}

int main()
{
    using namespace std::chrono ;
    const auto start = steady_clock::now() ;
    
    std::vector<std::string> file_list ; 
    
    auto begin = boost::filesystem::directory_iterator("/usr/local");
    const auto end = boost::filesystem::directory_iterator() ;
    
    std::vector< std::future< std::vector<std::string> > > futures ;
    for( ; begin != end ; ++begin ) 
    {
        const boost::filesystem::path p = begin->path() ;
        if( boost::filesystem::is_directory(p) ) 
           futures.emplace_back( std::async( std::launch::async, find_files, p, ".a" ) ) ;
        else
        {
            const std::string path_str = p.string() ;
            if( path_str.find( ".a" ) != std::string::npos ) file_list.push_back(path_str) ;
        }
    }
    
    for( auto& f : futures ) 
    {
        auto v = f.get() ;
        file_list.insert( file_list.end(), v.begin(), v.end() ) ;
    }
    
    const auto dur = steady_clock::now() - start ;

    std::cout << file_list.size() << " entries with '.a' in the path were added to the vector\n" 
              << "elapsed wall-clock time " << duration_cast<milliseconds>(dur).count() << " milliseconds.\n" 
              << "#threads: " << nthreads << '\n' ;
}

g++-4.8 -std=c++11 -O2 -Wall -Wextra -pedantic-errors -pthread main.cpp -lboost_system -lboost_filesystem && ./a.out
226 entries with '.a' in the path were added to the vector
elapsed wall-clock time 104 milliseconds.
#threads: 13

http://coliru.stacked-crooked.com/a/b4a07c10ebffec6c
Thanks a lot!

The second option does seem more complicated (though very interesting). Can you please explain what this line does:

futures.emplace_back( std::async( std::launch::async, find_files, p, ".a" ) ) ;

I guess I will first try to implement it with single thread. Then if required, I will try out the second option.

Thanks again!
> Can you please explain what this line does:
> futures.emplace_back( std::async( std::launch::async, find_files, p, ".a" ) ) ;

Breaking it down:
1.
1
2
std::async( std::launch::async /* execute in a separate thread*/, 
            find_files /* this callable object (function) */, p, ".a" /* with these arguments */ )

http://en.cppreference.com/w/cpp/thread/async

2. std::aync() returns a std::future<>; through which we can access the result of asynchronous operation (in this case, a vector of strings) later.
http://en.cppreference.com/w/cpp/thread/future

3. futures.emplace_back( std::async( ... ) ) ;
Save the returned std::future<> in a vector of futures for later use.
Moves the std::future<> into the vector; future is not CopyConstructible, but it is MoveConstructible

4. And later:
1
2
3
4
5
6
7
8
9
    for( auto& f : futures ) // for each future that was saved
    {
        auto v = f.get() ; // move the result (vector of strings) into v
                           // if the result is not yet available, wait for it to become available
                           // http://en.cppreference.com/w/cpp/thread/future/get
        
        // append the vector of strings returned by the async call to file_list
        file_list.insert( file_list.end(), v.begin(), v.end() ) ;
    }


Note: The code snippet makes use of the knowledge that /usr/local contains just a bunch of sub-directories.
Gr8! Thanks a ton JLBorges!
There's also the Qt way:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
#include <QDir>
#include <QDebug>
#include <QDirIterator>
#include <ctime>
#include <chrono>

int main()
{
    auto start_time = std::chrono::steady_clock::now();
    QDirIterator it("/usr/lib",QDirIterator::Subdirectories);
    QStringList list;
    quint16 cnt(0);
    while (it.hasNext()) {
        it.next();
        ++cnt;
        if (QFileInfo(it.filePath()).isFile())
            if (QFileInfo(it.filePath()).suffix() == "a")
                list << it.filePath();
    }
    qDebug() << QString("Number of files/directories searched in %1 = %2").arg(it.path()).arg(cnt);
    qDebug() << QString("Number of files found = %2").arg(list.count());
    auto end_time = std::chrono::steady_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);
    qDebug() << elapsed.count() <<" milliseconds";
    return 0;
}

Output:
"Number of files/directories searched in /usr/lib = 34958"
"Number of files found = 625"
210 milliseconds

Topic archived. No new replies allowed.