Statistical groupings in data

I have data files that I sort by column and analyze. My program imports .csv files, reads them, and gives me an overall average, but I'd like it to also find the groups of 1's or 0's. Here's an example:

-0.321513 -3.31541 40.4054 43.6341 61.2809 45.0144 -2.9939 3.22868 -16.2665 1
-0.439189 -1.81366 35.9093 43.7601 45.1267 43.0331 -1.37447 7.85073 -2.09354 1
-0.509661 -0.624818 72.5532 70.8309 66.2656 66.9575 -0.115158 -1.72222 0.6918 0
-0.544503 -3.81871 64.3498 66.2933 70.8058 59.0786 -3.2742 1.94344 -11.7272 1
-0.671295 -0.175805 35.6315 49.8103 60.4154 63.7496 0.495489 14.1788 3.33419 0
-0.7236 -1.51911 68.0611 85.547 74.4077 70.7544 -0.795514 17.486 -3.65325 1
-0.783681 -0.254386 64.0422 69.0418 60.075 59.3475 0.529295 4.99957 -0.72747 1
-0.803735 -1.25944 54.5853 59.6619 55.9453 55.4728 -0.455708 5.07659 -0.47252 1
-0.881432 -2.51315 72.5712 62.8337 61.8619 55.6253 -1.63172 -9.73754 -6.23667 0
-1.05064 -2.53533 42.8778 51.6764 71.0452 55.6805 -1.4847 8.79854 -15.3647 1
-1.13961 -1.24645 31.4673 34.0486 40.9442 37.5082 -0.106843 2.58132 -3.436 1
-1.26901 -1.15085 54.7223 58.9615 56.8615 62.0377 0.118157 4.23914 5.17621 1
-1.2718 -0.15924 73.5444 75.4138 74.1406 77.9504 1.11256 1.8694 3.80979 0
-1.37387 -1.38522 59.3103 62.8116 76.7923 72.4399 -0.011359 3.50124 -4.35236 0
-1.40155 -2.73804 35.3245 40.9177 56.0745 48.5125 -1.33649 5.59314 -7.56195 1
-1.40275 -4.61204 52.9487 54.1711 52.7928 40.8785 -3.20929 1.22237 -11.9143 0
-1.64058 -4.78766 40.8223 26.2052 56.7414 42.1925 -3.14708 -14.6172 -14.5488 1
-1.64741 -0.987463 65.3714 70.6175 66.3515 65.2419 0.659952 5.24606 -1.10963 0
-1.65478 -2.81474 69.3575 71.4106 70.0456 66.8539 -1.15996 2.05304 -3.19165 1
-1.6621 -1.32968 68.5087 67.7369 62.7819 62.2135 0.332421 -0.7718 -0.56841 0
-1.72677 -1.03044 51.1376 65.8048 69.6936 69.8582 0.696329 14.6673 0.16457 0
-1.77181 -3.91872 41.4588 45.2993 54.1688 41.9548 -2.14691 3.84049 -12.2139 1
-2.13617 -0.756203 34.4823 49.9223 57.3266 65.3138 1.37997 15.4401 7.98721 1
-2.22578 -1.19136 46.1496 56.6722 71.2347 70.6149 1.03441 10.5226 -0.61982 1
-2.35969 -0.0405831 68.3096 69.6786 72.4597 79.8935 2.31911 1.36898 7.43388 0


At the end of each row there is a 1 or 0, and the rows are sorted in descending order of the first column. I'd like to identify where there are, for example, at least 10 instances with a percentage of at least 70%.

Here is my code so far:

#include <iostream>
#include <vector>
#include <algorithm>
#include <fstream>

using namespace std;

struct data //Create columns to store data in each row.
{
    float macB1, mac, stochB1, stoch, rB1, r, macMove, stochMove, rMove, matchCol;
};


void printData(data data)
{
    cout << data.macB1 << " " << data.mac << " " << data.stochB1 << " " << data.stoch << " " << data.rB1 <<
        " " << data.r << " " << data.macMove << " " << data.stochMove << " " << data.rMove << " " <<
        data.matchCol << endl;
}

int main()
{
    vector<data> antiSixth; //Create vector that will store data in rows.

    float macB1, mac, stochB1, stoch, rB1, r, macMove, stochMove, rMove, matchCol;
    char delim;

    ifstream myFile ("C:/Users/zachm/Documents/Anti Sixth Seal (CSV).csv"); //Opens the file to read and manipulate data.

    if (myFile.is_open())
    {
        cout << "The file has been opened.\n" << endl;

        float totalInstance(0), totalMatch(0), totalAvg(0);

        while (myFile >> macB1 >> delim >> mac >> delim >> stochB1 >> delim >> stoch >> delim >>
               rB1 >> delim >> r >> delim >> macMove >> delim >> stochMove >> delim >> rMove >>
               delim >> matchCol)
        {
            if (matchCol == 1)
            {
                totalInstance++;
                totalMatch++;
            }

            else if (matchCol == 0)
            {
                totalInstance++;
            }

            antiSixth.push_back({macB1, mac, stochB1, stoch, rB1, r, macMove, stochMove, //Creates each row and places the value in each column.
                                rMove, matchCol});
        }

        totalAvg = (totalMatch / totalInstance);

        sort (antiSixth.begin(), antiSixth.end(),
              [] (const data& a, const data& b)
              {
                  return a.macB1 > b.macB1;
              }
              );

        for (unsigned int i(0); i < antiSixth.size(); i++)
        {
            printData(antiSixth.at(i));
        }

        cout << "\nTotal number of instances: " << totalInstance << endl;
        cout << "Total number of matches: " << totalMatch << endl;
        cout << "Match %: " << totalAvg << endl;

    }

    else
    {
        cout << "The file was not opened." << endl; //Let's me know when the file was not opened.
    }


    return 0;
}
Sorry, try again:

> At the end of each row there is a 1 or 0, and the rows are sorted in descending order of the first column. I'd like to identify where there are, for example, at least 10 instances with a percentage of at least 70%.

What/where is a percentage? Column 1? Column 5?
Are 10 instances 10 rows? If so, is this related to how it is sorted?
You want to find groups of 1 and 0; what does that mean?

Are you looking for 10 rows with a 1 on the end?



Apologies. Allow me to clarify.

In the matrix above the code (sorry, it is not laid out well enough to visualize), there is a "1" or "0" at the end of every row. The first column is sorted in descending order. I would like to find where there are large concentrations of 1's.

For example, in the 6th-15th rows, the last column reads "1, 1, 1, 0, 1, 1, 1, 0, 0, 1". That is a run of 10 rows with an average of 0.7. How would I write the code to identify the range of first-column values that such a run covers? In this case, the first-column range "-0.7236 to -1.40155" covers 10 rows with an average of 0.7 in the last column.

If that makes sense, any help would be appreciated.
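void groupdata( const vector<data> &rows, int number  )
{
   const int total = static_cast<int>( rows.size() );
   if ( number <= 0 || number > total ) return;   // no complete window fits (also avoids unsigned wrap-around)

   for ( int startRow = 0; startRow + number <= total; startRow++ )
   {
      int matching = 0;   // number of 1's in rows startRow .. startRow + number - 1
      for ( int r = startRow; r < startRow + number; r++ ) matching += static_cast<int>( rows[r].matchCol );
      cout << "Range: " << rows[startRow].macB1 << " to " << rows[startRow+number-1].macB1 << "     "
           << "Match factor = " << (float)matching / number << endl;
   }
}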


Call with (after sorting):
groupdata( antiSixth, 10 );

Produces output (for now - adapt to your own needs):
The file has been opened.

-0.321513 -3.31541 40.4054 43.6341 61.2809 45.0144 -2.9939 3.22868 -16.2665 1
-0.439189 -1.81366 35.9093 43.7601 45.1267 43.0331 -1.37447 7.85073 -2.09354 1
-0.509661 -0.624818 72.5532 70.8309 66.2656 66.9575 -0.115158 -1.72222 0.6918 0
-0.544503 -3.81871 64.3498 66.2933 70.8058 59.0786 -3.2742 1.94344 -11.7272 1
-0.671295 -0.175805 35.6315 49.8103 60.4154 63.7496 0.495489 14.1788 3.33419 0
-0.7236 -1.51911 68.0611 85.547 74.4077 70.7544 -0.795514 17.486 -3.65325 1
-0.783681 -0.254386 64.0422 69.0418 60.075 59.3475 0.529295 4.99957 -0.72747 1
-0.803735 -1.25944 54.5853 59.6619 55.9453 55.4728 -0.455708 5.07659 -0.47252 1
-0.881432 -2.51315 72.5712 62.8337 61.8619 55.6253 -1.63172 -9.73754 -6.23667 0
-1.05064 -2.53533 42.8778 51.6764 71.0452 55.6805 -1.4847 8.79854 -15.3647 1
-1.13961 -1.24645 31.4673 34.0486 40.9442 37.5082 -0.106843 2.58132 -3.436 1
-1.26901 -1.15085 54.7223 58.9615 56.8615 62.0377 0.118157 4.23914 5.17621 1
-1.2718 -0.15924 73.5444 75.4138 74.1406 77.9504 1.11256 1.8694 3.80979 0
-1.37387 -1.38522 59.3103 62.8116 76.7923 72.4399 -0.011359 3.50124 -4.35236 0
-1.40155 -2.73804 35.3245 40.9177 56.0745 48.5125 -1.33649 5.59314 -7.56195 1
-1.40275 -4.61204 52.9487 54.1711 52.7928 40.8785 -3.20929 1.22237 -11.9143 0
-1.64058 -4.78766 40.8223 26.2052 56.7414 42.1925 -3.14708 -14.6172 -14.5488 1
-1.64741 -0.987463 65.3714 70.6175 66.3515 65.2419 0.659952 5.24606 -1.10963 0
-1.65478 -2.81474 69.3575 71.4106 70.0456 66.8539 -1.15996 2.05304 -3.19165 1
-1.6621 -1.32968 68.5087 67.7369 62.7819 62.2135 0.332421 -0.7718 -0.56841 0
-1.72677 -1.03044 51.1376 65.8048 69.6936 69.8582 0.696329 14.6673 0.16457 0
-1.77181 -3.91872 41.4588 45.2993 54.1688 41.9548 -2.14691 3.84049 -12.2139 1
-2.13617 -0.756203 34.4823 49.9223 57.3266 65.3138 1.37997 15.4401 7.98721 1
-2.22578 -1.19136 46.1496 56.6722 71.2347 70.6149 1.03441 10.5226 -0.61982 1
-2.35969 -0.0405831 68.3096 69.6786 72.4597 79.8935 2.31911 1.36898 7.43388 0

Total number of instances: 25
Total number of matches: 15
Match %: 0.6
Range: -0.321513 to -1.05064     Match factor = 0.7
Range: -0.439189 to -1.13961     Match factor = 0.7
Range: -0.509661 to -1.26901     Match factor = 0.7
Range: -0.544503 to -1.2718     Match factor = 0.7
Range: -0.671295 to -1.37387     Match factor = 0.6
Range: -0.7236 to -1.40155     Match factor = 0.7
Range: -0.783681 to -1.40275     Match factor = 0.6
Range: -0.803735 to -1.64058     Match factor = 0.6
Range: -0.881432 to -1.64741     Match factor = 0.5
Range: -1.05064 to -1.65478     Match factor = 0.6
Range: -1.13961 to -1.6621     Match factor = 0.5
Range: -1.26901 to -1.72677     Match factor = 0.4
Range: -1.2718 to -1.77181     Match factor = 0.4
Range: -1.37387 to -2.13617     Match factor = 0.5
Range: -1.40155 to -2.22578     Match factor = 0.6
Range: -1.40275 to -2.35969     Match factor = 0.5



There are quite a few issues with your code. I would baulk at lines like
void printData(data data)
Try to avoid giving a variable the same name as your struct (e.g. call the struct Data with a capital D).

It would have been helpful if you had provided an actual CSV file to run with.
Ok. That's very interesting. I'll definitely need to adapt but this is what I was looking for. Apologies for the code structure. Just started programming like 2 weeks ago but I feel like I'm catching on. Any other tips for the code?
> Any other tips for the code?


- Simplify it (e.g. lines 40 - 49, the if / else if block that counts instances and matches).
- Use double rather than float for accuracy.
- Counting variables should be integers.
- In your comments don't state the obvious! (cout << "The file was not opened." << endl; //Let's me know when the file was not opened.)
- Turn up warnings and error messages in your compiler.
- Reiterate: don't name a variable (data) the same as a type (also, data).
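
As a rough illustration of the tips above, here is a minimal sketch of the counting step. It assumes the last column is always exactly 0 or 1. The names Data, Summary and summarise are made up for this example and are not from the original program.

```cpp
#include <vector>

// Capitalised type name, so a variable can be called `data` or `row` without clashing.
struct Data
{
    double macB1, mac, stochB1, stoch, rB1, r, macMove, stochMove, rMove;
    int matchCol;   // assumed to be 1 (match) or 0 (no match)
};

// Hypothetical result type: integer counters, double for the ratio.
struct Summary
{
    int totalInstance;
    int totalMatch;
    double matchFraction;
};

Summary summarise(const std::vector<Data>& rows)
{
    Summary s{0, 0, 0.0};
    for (const Data& row : rows)
    {
        ++s.totalInstance;
        s.totalMatch += row.matchCol;   // replaces the if / else if: adds 1 for a match, 0 otherwise
    }
    if (s.totalInstance > 0)            // guard against dividing by zero on an empty file
        s.matchFraction = static_cast<double>(s.totalMatch) / s.totalInstance;
    return s;
}
```

If the file could contain values other than 0 and 1 in the last column, the original if / else if check would still be needed to skip those rows.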
> Any other tips for the code?

If the data set is very large, avoid the nested inner loop.
(If the count of 1's for vec[15] to vec[24] is already available,
getting the count of 1's for vec[16] to vec[25] requires only an addition and a subtraction).
It is pretty large. And I'm not entirely sure what you mean.
Let us say that the last columns of antiSixth[15] to antiSixth[25] are 0 0 1 1 1 0 1 1 0 1 1

The count of ones in antiSixth[15] to antiSixth[24] is 0 + 0 + 1 + 1 + 1 + 0 + 1 + 1 + 0 + 1 == 6

The count of ones in antiSixth[16] to antiSixth[25] is 6 - 0 + 1 == 7
i.e. previous count (6) - last col of antiSixth[15] (0) + last col of antiSixth[25] (1)
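
The running-count idea above could be sketched roughly like this. It assumes the 0/1 last column has already been copied into a plain vector of ints after sorting. The name windowCounts and its parameters are invented for the example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: flags[i] is the 0/1 last column of row i (after sorting).
// Returns the number of 1's in every window of `width` consecutive rows;
// element s of the result covers rows s .. s + width - 1.
std::vector<int> windowCounts(const std::vector<int>& flags, std::size_t width)
{
    std::vector<int> counts;
    if (width == 0 || width > flags.size()) return counts;   // no complete window fits

    counts.reserve(flags.size() - width + 1);

    // Count the first window directly.
    int current = 0;
    for (std::size_t i = 0; i < width; ++i) current += flags[i];
    counts.push_back(current);

    // Slide one row at a time: add the row that enters, subtract the row that leaves.
    for (std::size_t start = 1; start + width <= flags.size(); ++start)
    {
        current += flags[start + width - 1] - flags[start - 1];
        counts.push_back(current);
    }
    return counts;
}
```

For the eleven values 0 0 1 1 1 0 1 1 0 1 1 and a width of 10, this gives the two counts 6 and 7 from the worked example. Dividing a count by the width gives the match factor for that window, and each row is touched only a constant number of times however wide the window is.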
If the data set is large, pre-allocate that vector (e.g. with reserve()). push_back does grow the capacity in ever larger steps, but it isn't as smart as YOU about your data, so you can decrease the number of reallocate/copy hits in push_back if you reserve at least as much capacity as the amount of data you expect.
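
A small sketch to show the effect, not a benchmark: it counts how often the capacity changes while n values are pushed back, with and without a reserve() call first. The function name countReallocations is made up for the example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical experiment: push n values into a vector and count how many
// times its capacity changes, i.e. how often it had to reallocate and copy.
std::size_t countReallocations(std::size_t n, bool reserveFirst)
{
    std::vector<double> values;
    if (reserveFirst) values.reserve(n);   // sets capacity only; size() is still 0

    std::size_t reallocations = 0;
    std::size_t lastCapacity = values.capacity();
    for (std::size_t i = 0; i < n; ++i)
    {
        values.push_back(static_cast<double>(i));
        if (values.capacity() != lastCapacity)   // the buffer was replaced by a bigger one
        {
            ++reallocations;
            lastCapacity = values.capacity();
        }
    }
    return reallocations;
}
```

With reserveFirst set to true the count stays at 0, because no reallocation happens until the size would exceed the reserved capacity. Without it the count depends on the library's growth strategy. In the original program the equivalent would be calling antiSixth.reserve(...) with a rough row estimate before the read loop.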
