Frequency counting in large arrays

Hi! Alright, so i'm just starting a data structures class, and we're studying arrays. We were asked to come up with a some code to match multiple entries from two arrays, and this is what I have:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
 
#include <iostream>
#include <array>
using namespace std;

int main() {
    array<int,5> arr = {5,8,2,5,3};
    array<int, 7> arr2 = {2,89,89,14,7,5,23};
    array<int, 100> f1;
    array<int, 100> f2;
    for (int i=0; i< arr.size(); i++){
      f1[arr[i]]++; //counter that increments for every unique element encountered 

    }

    for (int i=0; i<arr2.size(); i++){
        f2[arr2[i]]++;
    }

    for (int i=0; i<=100; i++){
        if ( f1[i] + f2[i] >= 2)
            cout << i << ": " << f1[i] + f2[i] << endl;

    }

    return 0;
}


My question is: how well would this work on an array that has, say, 300 elements? Or 1,000 elements? Are the quicker ways of doing something like this?
Hi,

for storing or inputting lets say 1000 elements you can use a file

http://www.cplusplus.com/reference/fstream/

the 1000 elements can be written in a file and taken

also if you want to sort/update/edit/work etc... them there's always sql

https://en.wikipedia.org/wiki/SQL

PS: welcome to cplusplus.com
Last edited on
My question is: how well would this work on an array that has, say, 300 elements? Or 1,000 elements?

What happens if one of arr or arr2 elements has a value larger than 100? It also appears that your f1 and f2 arrays are being used uninitialized.

We were asked to come up with a some code to match multiple entries from two arrays,

What exactly are you trying to match? Are you trying to find duplicate entries in each array, or are you trying to find common sets or subsets contained on both arrays.

so i'm just starting a data structures class,

Are there any limitations, for example are you able to use any of the standard containers?
@programmer007: Thanks!

@jlb: The program is just supposed to count the frequency of an integer in both of the arrays, then output the combined frequency if there is more than one occurrence of the integer. I fixed the code to reflect this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
#include <iostream>
#include <array>
using namespace std;

int main() {
    array<int,7> arr = {5,89,2,5,3,89,};
    array<int, 10> arr2 = {2,5,89,2,89,14,7,5,23};
    array<int, 100> f1 = {0};
    array<int, 100> f2 = {0};
    for (int i=0; i<arr.size(); i++){
      f1[arr[i]]++; //counter that increments for every unique element encountered

    }

    for (int i=0; i<arr2.size(); i++){
        f2[arr2[i]]++;
    }

    for (int i=0; i<100; i++){
        if (f1[i] >= 2 && f2[i] >= 2){
            cout << i << ": " << f1[i] + f2[i] << endl;
          }
        }

    return 0;
}


I guess my question just boils down to how effective this algorithm would be for arrays that contain large amounts of elements. This is more just out of my curiosity, so there are no limitations as to what I can use in terms of containers/functions/etc.
This is more just out of my curiosity, soThis is more just out of my curiosity, so there are no limitations as to what I can use in terms of containers/functions/etc. there are no limitations as to what I can use in terms of containers/functions/etc.

Then perhaps you should look into using one or two of the standard containers. A std::map perhaps?

I guess my question just boils down to how effective this algorithm would be for arrays that contain large amounts of elements.

IMO, not very effective at all.

By the way you didn't answer the question I asked in the first paragraph in my last post. What happens if a number larger than the size of your f1 or f2 arrays is contained within your arr or arr2?

@jlb: Wouldn't having an element larger than the size of the frequency array cause a segmentation fault? So would something like this be more desirable?:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28

#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;

int main() {
    vector<int> arr1 = {5,89,2,5,3,89,};
    vector<int> arr2 = {2,5,89,2,89,14,7,5,23};
    vector<int> f1 = {0};
    vector<int> f2 = {0};
    for (int i=0; i<arr1.size(); i++){
        f1[arr1[i]]++; //counter that increments for every unique element encountered

    }

    for (int i=0; i<arr2.size(); i++){
        f2[arr2[i]]++;
    }

    for (int i=0; i<100; i++){
        if (f1[i] >= 2 && f2[i] >= 2){
            cout << i << ": " << f1[i] + f2[i] << endl;
        }
    }

    return 0;
}
Wouldn't having an element larger than the size of the frequency array cause a segmentation fault?

Most likely.

So would something like this be more desirable?:

No, it's actually worse. Now the program will try to access an empty vector the first time you try to access the vector, which will probably crash the program the first time through your loops.

I still suggest that you investigate using a std::map.

The way you're currently trying to use the array or vector has several problems. First since you're trying to use the elements of the first two arrays as the index of the second two arrays you must insure you have enough elements to handle the highest value in your arrays, and don't forget that arrays start at zero and stop at size - 1. And even if you size your arrays or vectors correctly you will probably have quite a few elements that are "empty" which will slow down your searches.

Looking at your code you will need to have an array with the size of 90, and only 8 unique elements. So you have 82 elements that will have zero and 8 elements with values of one or more. Seems like a large waste of space to me.

Here's an example of code that jlb's is championing:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
#include <iostream>
#include <unordered_map>
#include <vector>


int main() {
    std::vector<int> seq_a = { 5, 89, 2, 5, 3, 89, };
    std::vector<int> seq_b = { 2, 5, 89, 2, 89, 14, 7, 5, 23 };

    struct count 
    { 
        unsigned contributor = 0; 
        std::size_t n = 0; 
    };

    std::unordered_map<int, count> counts;

    for (auto val : seq_a)
    {
        auto& c = counts[val];
        c.contributor |= 1;
        ++c.n;
    }

    for (auto val : seq_b)
    {
        auto& c = counts[val];
        c.contributor |= 2;
        ++c.n;
    }

    for (auto& pair : counts)
    {
        auto& c = pair.second;
        if ((c.contributor & 3) == 3)
            std::cout << pair.first << ": " << c.n << '\n';
    }
}


Although I would hesitate to call it "much more effective" since it has the same algorithmic complexity as the original code.

[edit: cut & paste cut off some includes]
Last edited on
Actually I was envisioning something like:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
#include <iostream>
#include <vector>
#include <map>


int main() {
    std::vector<int> seq_a = { 5, 89, 2, 5, 3, 89, };
    std::vector<int> seq_b = { 2, 5, 89, 2, 89, 14, 7, 5, 23 };

    std::map<int, int> num_occurances;

    // Count the number of times each time a number appears in seq_a.
    for (auto& val : seq_a)
    {
        num_occurances[val]++;
    }

    // Count the number of times each time a number appears in seq_b.
    for (auto& val : seq_b)
    {
          num_occurances[val]++;

    }
    
    // Print the map, showing the frequency of all the values in the arrays.
    for(auto& itr : num_occurances)
       std::cout << itr.first << " : " << itr.second << std::endl;

}


Output, showing all the values that appear in the arrays at least once.

1
2
3
4
5
6
7
2 : 3
3 : 1
5 : 4
7 : 1
14 : 1
23 : 1
89 : 4


The efficiencies come from only having 8 elements in the map, where in this case the vector would need to have 90 elements.

Actually I was envisioning something like:

Unfortunately, I don't think that satisfies the constraints of the problem (which were very poorly expressed.) Actually, I don't think mine does either, given the last loop in this post:
http://www.cplusplus.com/forum/general/193027/#msg929156

The constraints appear to be: "output the result if an element is present in both sequences an occurs more than one time in each sequence"

The constraint your code satisfies appears to be: "output the result if an element occurs in either sequence"

The constraint for mine: "output the result if an element is present in both sequences"

Perhaps the OP will chime in and clarify.
Unfortunately, I don't think that satisfies the constraints of the problem (which were very poorly expressed.)

I totally agree, my program most likely doesn't solve the required constraints. That is why I specified the constraints I used for the program.

However I think both programs show examples of different approaches that can be modified to solve the actual problem.

IMO the original approach tried by the OP is not feasible because it doesn't take into account numbers less than zero or very large positive values.

Perhaps the OP will chime in and clarify.

We can only hope.
Use an unordered_map<int,int> (H1).

Scan the first sequence (using a for loop). Increment the map[n] for every n in the first sequence. (This creates a histogram.)


Use another unordered_map<int,int> (H2).

Histogram the second sequence the very same way you did the first sequence.


Now iterate through every elt in H1. For every elt.second > 1, if elt.first is also a key in H2 where H2[elt.first].second > 1, print elt.first.


Your array<...> was a primitive form of map<int,int>, which works fine and is just as efficient time wise, but much less so in terms of space.

Hope this helps.
closed account (48T7M4Gy)
http://www.cplusplus.com/reference/algorithm/set_intersection/
Topic archived. No new replies allowed.