Couple of questions about my attempt at a "Benford's law" program

I'm trying to write a program that reads in a data file, skips any comments, counts the first digit of each number in the file, and calculates each digit's frequency. I've finally got something that feels close to working, but there are still some major issues. It reads in the file and filters out most of the comments, but it doesn't count the first digits properly. For some reason it also fails to filter out exactly one of the commented lines, which seems odd to me. I suspect the method I'm using to count the digits isn't valid, but I'm struggling to figure out what to use instead. I'd appreciate any help or comments.

#include <iostream>
#include <vector>
#include <iterator>
#include <algorithm>
#include <fstream>
#include <stdlib.h>

using namespace std;


int main() {

    cout << "Enter the name of  the file to use, please use the specific path name with a forward slash instead of a back slash:" << endl;

    string text_file;

    getline (cin, text_file);


    vector<int> count(10);

    int temp = 0;

    double count_total = 0.0;

    ifstream infile{text_file.c_str()};


    string line;


       while(infile >> line){
           if(line[0] =='(' && line[1] == '*'){
               while (infile >> line) {
                   if(line[0] == '*' && line[1] == ')' ){
                      infile >> line;
                       break;
                   }
               }
           }
           else if (isdigit(line[0])){
               temp = (line[0] - '0');

               count.at(temp)++;
           }
        cout << line << endl;
    }



    infile.close();

    for(unsigned int i = 1; i <= 9; i++){
        count_total = count_total + count.at(i);
    }

    cout << "Digit      Count       Frequency" << endl;

    for(unsigned int i = 1; i <= 9; i++ ){
        cout << "   " << i << "         " << count.at(i) << "           " << count.at(i)/count_total << endl;
    }



    //copy(begin(count), end(count), ostream_iterator<int>(cout, " "));
    return 0;
}


Here is the data I'm currently testing it with.

(* Sunspot data collected by Robin McQuinn from *)
(* http://sidc.oma.be/html/sunspot.html *)

(* Month: 1749 01 *) 58
(* Month: 1749 02 *) 63
(* Month: 1749 03 *) 70
(* Month: 1749 04 *) 56
(* Month: 1749 05 *) 85
(* Month: 1749 06 *) 84
(* Month: 1749 07 *) 95
(* Month: 1749 08 *) 66
(* Month: 1749 09 *) 76
(* Month: 1749 10 *) 76
(* Month: 1749 11 *) 159
(* Month: 1749 12 *) 85
(* Month: 1750 01 *) 73
(* Month: 1750 02 *) 76
(* Month: 1750 03 *) 89
(* Month: 1750 04 *) 88
(* Month: 1750 05 *) 90
(* Month: 1750 06 *) 100
(* Month: 1750 07 *) 85
(* Month: 1750 08 *) 103
(* Month: 1750 09 *) 91
(* Month: 1750 10 *) 66
(* Month: 1750 11 *) 63
(* Month: 1750 12 *) 75
(* Month: 1751 01 *) 70
(* Month: 1751 02 *) 44
(* Month: 1751 03 *) 45
(* Month: 1751 04 *) 56
(* Month: 1751 05 *) 61
(* Month: 1751 06 *) 51
(* Month: 1751 07 *) 66
(* Month: 1751 08 *) 60
(* Month: 1751 09 *) 24
(* Month: 1751 10 *) 23
(* Month: 1751 11 *) 29
(* Month: 1751 12 *) 44
(* Month: 1752 01 *) 35
(* Month: 1752 02 *) 50
(* Month: 1752 03 *) 71
(* Month: 1752 04 *) 59
(* Month: 1752 05 *) 60
(* Month: 1752 06 *) 40
(* Month: 1752 07 *) 78
(* Month: 1752 08 *) 29
(* Month: 1752 09 *) 27
(* Month: 1752 10 *) 47
(* Month: 1752 11 *) 38
(* Month: 1752 12 *) 40
(* Month: 1753 01 *) 44
(* Month: 1753 02 *) 32
(* Month: 1753 03 *) 46
(* Month: 1753 04 *) 38
(* Month: 1753 05 *) 36
(* Month: 1753 06 *) 32
(* Month: 1753 07 *) 22
(* Month: 1753 08 *) 39
(* Month: 1753 09 *) 28
(* Month: 1753 10 *) 25
(* Month: 1753 11 *) 20
(* Month: 1753 12 *) 7
(* Month: 1754 01 *) 0
(* Month: 1754 02 *) 3
(* Month: 1754 03 *) 2
(* Month: 1754 04 *) 14
(* Month: 1754 05 *) 21
(* Month: 1754 06 *) 27
(* Month: 1754 07 *) 19
(* Month: 1754 08 *) 12
(* Month: 1754 09 *) 8
(* Month: 1754 10 *) 24
(* Month: 1754 11 *) 13
(* Month: 1754 12 *) 4
(* Month: 1755 01 *) 10
(* Month: 1755 02 *) 11
(* Month: 1755 03 *) 7
(* Month: 1755 04 *) 7
(* Month: 1755 05 *) 0
(* Month: 1755 06 *) 0
(* Month: 1755 07 *) 9
(* Month: 1755 08 *) 3
(* Month: 1755 09 *) 18
(* Month: 1755 10 *) 24
(* Month: 1755 11 *) 7
(* Month: 1755 12 *) 20
(* Month: 1756 01 *) 13
(* Month: 1756 02 *) 7
(* Month: 1756 03 *) 5
(* Month: 1756 04 *) 9
(* Month: 1756 05 *) 13
(* Month: 1756 06 *) 13
(* Month: 1756 07 *) 4
(* Month: 1756 08 *) 6
(* Month: 1756 09 *) 12
(* Month: 1756 10 *) 14
(* Month: 1756 11 *) 17
(* Month: 1756 12 *) 9
(* Month: 1757 01 *) 14
(* Month: 1757 02 *) 21

If I change the vector<int> count(10) into an array (which I don't necessarily want to do), it counts 1 instance of 2 and 8 instances of 8, which I suppose is an improvement on counting nothing. When I print out each string being evaluated after the comments are filtered out, that part seems to work correctly, except that it doesn't filter out this line:
(* http://sidc.oma.be/html/sunspot.html *)

OK, I think I've gotten a step closer: it now counts the occurrences of the first digit and calculates the frequency successfully.

What I'm still not sure of is why my code for filtering out the comments works sometimes but not always. With the data format I posted above, it filters out every comment except the one I mentioned.

If I try another file with different comments, such as the following, it filters out the first, third, and fifth lines, but not the second and fourth:

(* LiveJournal data collected by Shirley Man from *)
(* http://www.livejournal.com/stats/stats.txt *)
(* Number of new accounts on LiveJournal, *)
(* day by day from 2000/1/1 to 2005/2/28 *)
(* Individual data are NOT labelled. *)

10
4
12
5
7

5
15
23
56



#include <iostream>
#include <vector>
#include <iterator>
#include <algorithm>
#include <fstream>
#include <stdlib.h>
#include <iomanip>

using namespace std;


int main() {

    cout << "Enter the name of  the file to use, please use the specific path name with a forward slash instead of a back slash:" << endl;

    string text_file;

    getline (cin, text_file);


    vector<int> count(10);

    double count_total = 0.0;

    ifstream infile{text_file.c_str()};


    string line;


    while(infile >> line){
        if(line.at(0) =='(' && line.at(1) == '*'){
            while (infile >> line) {
                if(line.at(0) == '*' && line.at(1) == ')' ){
                    infile >> line;
                    break;
                }
            }
        }
        if (isdigit(line[0])){
            count.at(line.at(0) - '0')++;


        }

         cout << line << endl;
    }



    infile.close();

    for(unsigned int i = 1; i <= 9; i++){
        count_total = count_total + count.at(i);
    }

    cout << "Digit      Count       Frequency" << endl;

    for(unsigned int i = 1; i <= 9; i++ ){
        cout << "  " << i << "         " << count.at(i) << "           " << fixed << setprecision(2) << count.at(i)/count_total <<  endl;
    }



    //copy(begin(count), end(count), ostream_iterator<int>(cout, " "));
    return 0;
}
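You won't successfully process two comments that come back to back. Consider what happens when you reach the end of the first comment (line numbers refer to your second listing, counted from the first #include):
- Line 34 (the inner if) detects the end of the comment.
- Line 35 (infile >> line;) reads the next word in the file, which is "(*".
- Line 40 (if (isdigit(line[0]))) skips it, because it doesn't start with a digit.
- Line 46 (cout << line << endl;) prints it out.
- Back at line 31 (the outer while), you read the next word, which is the first word of the second comment. Because the opening "(*" has already been consumed, the loop no longer recognizes the second comment and treats its words as data.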

You can fix it like this:
    while (infile >> line) {
        if (line.at(0) == '(' && line.at(1) == '*') {
            // Inside a comment: discard words until the closing "*)".
            while (infile >> line) {
                if (line.at(0) == '*' && line.at(1) == ')') {
                    break; // Note: don't read the next word here.
                }
            }
            continue; // Let the outer loop read the word after the comment.
        }
        if (isdigit(line[0])) {
            count.at(line.at(0) - '0')++;
        }

        cout << line << endl;
    }


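Your comment detection code also assumes that the comment delimiters are always surrounded by whitespace. (*This is a comment*) won't work: the first word read, "(*This", is detected as the start of a comment, but the last word read is "comment*)", which does not start with "*)", so the end of the comment is missed and skipping continues until some later word does start with "*)".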


Thank you, that was very helpful. It appears I need to improve on tracing through exactly what my programs are doing line by line; now that you've explained it like that, it seems completely obvious.
"It appears I need to improve on tracing through exactly what my programs are doing line by line"

Exactly. Programming requires that you be extremely specific - much more specific than when communicating with people. It's a type of thinking that doesn't come naturally to most people.

Learn to use the debugger. It will let you step through the program line by line and see what's going on. That can really help you find and fix bugs.