Hash table for 2-sum problem

Hi everyone,
2-sum problem: find two numbers (x, y) in an array such that x + y equals a given t.
I have solved this problem by sorting and then using binary search, but it is very slow (the array has 1 million elements and each one is really large; some have ten digits).
The lecturer suggests using a hash table, but I am confused about whether to use map or unordered_map (I have no idea how to deal with collisions; the array is so large that it is likely several elements end up with the same key).
Can anyone help me set up the initial data structure?

Thanks.
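What the lecturer most likely has in mind is something along these lines (a minimal sketch with made-up sample data, not the assignment's input): put every value seen so far into a std::unordered_set and, for each element x, look up t - x. The container resolves hash collisions internally (each bucket holds all colliding elements), so you never deal with them yourself.

#include <iostream>
#include <unordered_set>
#include <vector>

// returns true if two elements of 'values' sum to t
bool has_pair_with_sum( const std::vector<long long>& values, long long t )
{
    std::unordered_set<long long> seen ; // collisions are handled inside the container
    for( long long x : values )
    {
        if( seen.count( t - x ) ) return true ; // a partner was inserted earlier
        seen.insert( x ) ;
    }
    return false ;
}

int main()
{
    const std::vector<long long> values { 99999662302LL, -3, 5, 7 } ; // made-up data
    std::cout << std::boolalpha << has_pair_with_sum( values, 12 ) << '\n' ; // true: 5 + 7
}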
Any restrictions on x and y? Can they be negative? How fast exactly do you need it to be?

With a plain std::set, generating 1,000,000 numbers from -1,000,000,000,000 to 1,000,000,000,000 and counting the matches for 100 different values of t took 34 seconds. If I restrict the range of elements to [-1000000; 1000000], it takes 24 seconds.
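Roughly what that std::set version looks like (a sketch of the idea, not the exact benchmark code): insert everything into the set, then for each t and each element x look up t - x, an O(log n) tree lookup per probe.

#include <cstddef>
#include <set>
#include <vector>

// counts how many of the targets can be written as the sum of two distinct values
std::size_t count_reachable_targets( const std::vector<long long>& values,
                                     const std::vector<long long>& targets )
{
    const std::set<long long> lookup( values.begin(), values.end() ) ;
    std::size_t count = 0 ;
    for( long long t : targets )
        for( long long x : values )
            if( t - x != x && lookup.count( t - x ) ) // skip the degenerate x + x case
                { ++count ; break ; }                 // one pair is enough for this t
    return count ;
}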
The largest number is 99999662302; the most negative one has nearly the same number of digits.
(I store them in a long long.)
For 20001 different values of t (the range [-10000, 10000]), my program took 20 minutes.
This is the programming assignment in "Algorithms: Design and Analysis, Part 1" on Coursera.
I got the right answer, but it seems too slow, so I want to improve it.
#include <iostream>
#include <unordered_map>
#include <vector>
#include <random>
#include <ctime>

std::vector<int> make_data_set( int minv = 1'000'000, int maxv = 2'000'000'000, std::size_t N = 1'000'000 )
{
    static std::mt19937 rng( std::time(nullptr) ) ;
    std::uniform_int_distribution<int> distrib( minv, maxv ) ;
    std::vector<int> result ;
    while( result.size() < N ) result.push_back( distrib(rng) ) ;
    return result ;
}

constexpr std::pair< std::size_t, std::size_t > NOT_FOUND { -1U, -1U } ;

// returns the positions of two elements that sum to 'sum', or NOT_FOUND
std::pair< std::size_t, std::size_t > find_matching_pair( const std::vector<int>& seq, int sum )
{
    std::unordered_map< int, std::size_t > map ; // value -> index where it was first seen
    for( std::size_t i = 0 ; i < seq.size() ; ++i )
    {
        const auto iter = map.find( sum - seq[i] ) ; // has the required partner already been seen?
        if( iter != map.end() ) return { i, iter->second } ;
        else map[ seq[i] ] = i ;
    }
    return NOT_FOUND ;
}

int main()
{
    const auto data = make_data_set() ;
    const auto p = find_matching_pair( data, 200'000'000 ) ; // sums up to 200000000
    if( p != NOT_FOUND )
    {
        std::cout << "found: " << data[p.first] << " at position " << p.first << " &&\n"
                  << data[p.second] << " at position " << p.second << '\n'
                  << data[p.first] << " + " << data[p.second] << " == " << data[p.first] + data[p.second] << '\n' ;
    }
} 

clang++ -std=c++14 -stdlib=libc++ -O3 -Wall -Wextra -pedantic-errors -pthread main.cpp -lsupc++ && time ./a.out
found: 139871013 at position 104959 &&
60128987 at position 90612
139871013 + 60128987 == 200000000

real	0m0.209s
user	0m0.168s
sys	0m0.036s

http://coliru.stacked-crooked.com/a/2a79def4c4cab277



long long, numbers up to 16 digits

clang++ -std=c++14 -stdlib=libc++ -O3 -Wall -Wextra -pedantic-errors -pthread main.cpp -lsupc++ && time ./a.out
found: 431042883 at position 101279 && -431042883 at position 50819
431042883 + -431042883 == 0

found: 1583284879 at position 172514 && -583284879 at position 168676
1583284879 + -583284879 == 1000000000

found: 1290191863 at position 136047 && 709808137 at position 82770
1290191863 + 709808137 == 2000000000

found: 1441703358 at position 112879 && 1558296642 at position 18134
1441703358 + 1558296642 == 3000000000

did not find numbers summing up to 4000000000

real	0m1.875s
user	0m1.704s
sys	0m0.164s

http://coliru.stacked-crooked.com/a/b3d419f2e93e41b6
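The essential change for long long is just the value type of the data and the map; a sketch of find_matching_pair adapted accordingly (not necessarily identical to the code behind the link):

std::pair< std::size_t, std::size_t > find_matching_pair( const std::vector<long long>& seq, long long sum )
{
    std::unordered_map< long long, std::size_t > map ; // value -> index where it was first seen
    for( std::size_t i = 0 ; i < seq.size() ; ++i )
    {
        const auto iter = map.find( sum - seq[i] ) ;
        if( iter != map.end() ) return { i, iter->second } ;
        map[ seq[i] ] = i ;
    }
    return NOT_FOUND ;
}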
I think I found your assignment. You have a sorted array of numbers and you need to output how many numbers from some range can be represented as a sum of two values in the array, right?

Here is my take on this:
#include <algorithm>
#include <ctime>
#include <iostream>
#include <random>
#include <vector>

long long random_number()
{
    const long long min = -100'000'000'000;
    const long long max =  100'000'000'000;
    static std::mt19937_64 gen(std::time(nullptr));
    static std::uniform_int_distribution<long long> dist(min, max);
    return dist(gen);
}

bool is_sum_exist(const std::vector<long long>& data, long long x)
{
    if (data.empty()) return false;
    // the larger element of a valid pair is at most x - data.front() (x minus the smallest
    // value), so the right pointer can start at the last element within that bound;
    // starting it at lower_bound(x) would miss pairs that involve a negative number
    auto j = std::upper_bound(data.begin(), data.end(), x - data.front()) - data.begin();
    if (j > 0) --j;
    decltype(j) i = 0;
    while (i < j)  {
        long long sum = data[i] + data[j];
             if (sum == x)  return true;
        else if (sum >  x)  --j;
        else                ++i;
    }
    return false;
}

int main()
{
    const std::size_t data_size = 1'000'000;
    const std::size_t requests = 100'000;
    std::ios_base::sync_with_stdio(false); // in case I need to output the actual numbers

    //creating dataset
    std::vector<long long> data(data_size);
    std::generate_n(data.begin(), data.size(), random_number);
    std::sort(data.begin(), data.end());

    //creating values we will search sum of
    std::vector<long long> rand_values(requests);
    std::generate_n(rand_values.begin(), rand_values.size(), random_number);
    std::cout << "done" << std::endl;

    long long counter = 0;
    for(auto t: rand_values) {
        if(is_sum_exist(data, t))
            ++counter;
    }
    std::cout << counter;
}
done
21637
Process returned 0 (0x0)   execution time : 75.488 s
75 seconds for 100 000 searches. Your performance might differ depending on your PC specs. Threading will really help here too.
EDIT: made local code highlighter happy
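A rough sketch of what that threading could look like, reusing is_sum_exist above and splitting the requests across std::async tasks (the chunking scheme is just an illustration, not part of the assignment):

#include <algorithm>
#include <cstddef>
#include <future>
#include <thread>
#include <vector>

// count the targets that have a matching pair, splitting the work across tasks;
// is_sum_exist only reads 'data', so the workers can share it safely
long long count_parallel(const std::vector<long long>& data,
                         const std::vector<long long>& targets)
{
    const std::size_t workers = std::max(1u, std::thread::hardware_concurrency());
    const std::size_t chunk = (targets.size() + workers - 1) / workers;

    std::vector<std::future<long long>> results;
    for (std::size_t begin = 0; begin < targets.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, targets.size());
        results.push_back(std::async(std::launch::async, [&, begin, end] {
            long long counter = 0;
            for (std::size_t k = begin; k < end; ++k)
                if (is_sum_exist(data, targets[k])) ++counter;
            return counter;
        }));
    }

    long long total = 0;
    for (auto& f : results) total += f.get();   // blocks until each task finishes
    return total;
}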
Thanks for your help.
Topic archived. No new replies allowed.