Extracting email addresses from text file

Hi! This program is supposed to extract email addresses from arbitrary text.
However, I just can't get it to work: it outputs Avocado every time, and that if statement is never true... I can't figure it out. I'd be really grateful if anybody could help me with this. Thanks!!

int main()
{
  //Data
  bool enterKey = false; //if user press enter key
  string iFileName; //input file name
  string dFileName = "fileContainingEmails.txt"; //default file name
  string oFileName; //output file name
  string lineFromFile; //text
  int i = 0; //loop index
  int s = 0; //loop index
  int e = 0; //loop index

  //instruction
  cout << "If nothing entered, the default file name will be used.\n\n";

  //get i/o file name
  cout << "Enter input filename [default: fileContainingEmails.txt]: ";
  getline(cin, iFileName);
  if (iFileName.length() == 0)
    enterKey = true;
  if (enterKey == true)
    {
    iFileName = dFileName;
    dFileName = "copyPasteMyEmails.txt";
    }
  else 
    dFileName = iFileName;
  cout << "Enter output filename [default: " << dFileName << "]: ";
  getline(cin, oFileName);
  enterKey = false;
  if (oFileName.length() == 0)
    enterKey = true;
  if (enterKey == true)
    oFileName = dFileName;
  cout << "\nInput file: " << iFileName << endl;
  cout << "\nOutput file: " << oFileName << endl;
  cout << "\nPlease check your files are in the correct position,\nthen press ENTER key to continue:\n ";
  cin.ignore(1000,10);
 
  //open input file
  ifstream fin;
  fin.open(iFileName.c_str());
  if (!fin.good()) throw "I/O error";

  //extract email
  do
  {
    getline(fin, lineFromFile);
    for (i = 0; i < lineFromFile.length(); i=i+1)
    {
      if (lineFromFile[i] == '@')
      {
        for (s = i; s > 0; s = s - 1)
        {
          if (s < 0) break;
          if(isValidEmailCharacter((lineFromFile[s])) == false) break;
        }
        s = s + 1;
        for (e = i; e < lineFromFile.length(); e = e + 1)
        {
          if (e < lineFromFile.length()) break;
          if (isValidEmailCharacter((lineFromFile[e])) == false) break;
          if (hasDot(lineFromFile[e])) break;
        }//for
      }//if
      if(s < i && e > i && hasDot(lineFromFile[e]) == true)
        {
        string anEmail = lineFromFile.substr(s, e-s);
        cout << anEmail << endl;
        }
      else 
        cout << "Avocado";
    }//for
  }while(fin.good()); //do while
  fin.close(); //close the input file
}//main

//if the chracter is good as for email address
bool isValidEmailCharacter(char c)
{
bool result = false;
if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
  result = true;
else if (c >= '0' && c <= '9')
  result = true;
else if (c == '_' || c == '+' || c == '-' || c == '.')
  result = true;
return result;
}

//if there is at least a dot after @
bool hasDot(char c)
{
bool result = false;
if (c == '.')
  result = true;
return result;
}
Help!!!QAQ
Well, the code was well-commented on the easy stuff, such as //if user press enter key, but there were few clues to the algorithm in the most important part. The first thing I figured out was that s and e were meant to identify the start and end of a possible email.

Anyway, the code. Both inner loops start at the position of the '@' symbol, which immediately fails all the tests in isValidEmailCharacter(), so each loop's start needs to be moved by one position: backwards for the start search, forwards for the end search. The check if (e < lineFromFile.length()) break; is also always true inside that loop, so the forward scan exits on its first pass; it can simply be removed. The logic around finding a dot is a bit flawed: it should not stop (break) when it finds a dot, since more characters may follow. Better to set a boolean dotFound to assist with that. The main loop searching for the '@' symbol doesn't need to bother trying to output an email if there was no '@'; you could add a continue to bypass the rest of that code.
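
To make the off-by-one point concrete, here is a minimal sketch of the backward scan in both forms. The function names scanBackOriginal and scanBackAdjusted are made up for illustration; only isValidEmailCharacter comes from the posted code.

```cpp
#include <string>

// Same character test as in the original post.
bool isValidEmailCharacter(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') ||
           c == '_' || c == '+' || c == '-' || c == '.';
}

// Hypothetical helper mirroring the original backward loop:
// it begins ON the '@', which is not a valid email character.
std::size_t scanBackOriginal(const std::string& line, std::size_t at)
{
    std::size_t s = at;
    for (; s > 0; --s)
    {
        if (!isValidEmailCharacter(line[s]))
            break;              // breaks at once, because line[at] is '@'
    }
    return s + 1;               // ends up one past the '@'
}

// Hypothetical helper with the adjusted start: it begins one
// position before the '@' and walks left over valid characters.
std::size_t scanBackAdjusted(const std::string& line, std::size_t at)
{
    std::size_t s = at;
    while (s > 0 && isValidEmailCharacter(line[s - 1]))
        --s;
    return s;                   // index of the first character of the name
}
```

For the line "hi bob@x.com" the '@' is at index 6. The original form returns 7, so the later test s < i can never pass, while the adjusted form returns 3, the position of the 'b' in "bob".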

One more thing, the loop to read the lines from the file is better written in this idiom:
    while (getline(fin, lineFromFile))
    {
        // do some processing of the line here

    }
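
In case it helps to see why that idiom is preferred, here is a small sketch that compares it with the do/while shape from the original post. The names readLinesIdiom and readLinesDoWhile are invented for this example, and a std::istringstream stands in for the file so it can be tried without any input file.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Recommended idiom: the body runs only when a line was actually read.
std::vector<std::string> readLinesIdiom(std::istream& in)
{
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line))
        lines.push_back(line);
    return lines;
}

// Shape of the original loop: the line is used before the read is checked.
std::vector<std::string> readLinesDoWhile(std::istream& in)
{
    std::vector<std::string> lines;
    std::string line;
    do
    {
        std::getline(in, line);   // may fail, but the result is used anyway
        lines.push_back(line);
    } while (in.good());
    return lines;
}
```

With input that ends in a newline, such as "x@y.z\nplain\n", the idiom yields two lines, while the do/while version yields three, the last one being an empty string left over from the failed read. In the email program that extra pass happens to be harmless, but it is an easy source of bugs in general.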



    while (getline(fin, lineFromFile))
    {   
        // First, search for '@' character
        for (int i = 0; i < lineFromFile.length(); i=i+1)
        {            
            if (lineFromFile[i] != '@')
                continue;

            // Identify start of the part of the email before the '@'
            int start;   // start of possible email
            for (start = i-1; start >= 0; --start)
            {
                if (!isValidEmailCharacter(lineFromFile[start])) 
                    break;
            }
            
            ++start;
            
            // identify end of next part of email
            bool dotFound = false;
    
            int end;   // end of possible email
            for (end = i+1; end < lineFromFile.length(); ++end)
            {
                if (!isValidEmailCharacter(lineFromFile[end])) 
                    break;

                if (hasDot(lineFromFile[end])) 
                    dotFound = true;
            }
            
            // final reasonableness check
            // Note: consecutive dots are allowed
            if (start < i && end > i && dotFound)
            {
                string anEmail = lineFromFile.substr(start, end-start);
                cout << anEmail << endl;
            }
        }
    }
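
For anyone who wants to try the scanning logic without a file, the loop above can be restated as a stand-alone function. This is only a sketch: extractEmails is a name chosen for this example, and it uses std::size_t indices instead of int to avoid signed/unsigned comparisons, so the backward scan is written slightly differently.

```cpp
#include <string>
#include <vector>

// Same character test as in the original post.
bool isValidEmailCharacter(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') ||
           c == '_' || c == '+' || c == '-' || c == '.';
}

// Hypothetical wrapper: returns every candidate email found in one line.
std::vector<std::string> extractEmails(const std::string& line)
{
    std::vector<std::string> found;
    for (std::size_t at = 0; at < line.size(); ++at)
    {
        if (line[at] != '@')
            continue;

        // Walk left from the character before the '@'.
        std::size_t start = at;
        while (start > 0 && isValidEmailCharacter(line[start - 1]))
            --start;

        // Walk right from the character after the '@', noting any dot.
        std::size_t end = at + 1;
        bool dotFound = false;
        while (end < line.size() && isValidEmailCharacter(line[end]))
        {
            if (line[end] == '.')
                dotFound = true;
            ++end;
        }

        // Need at least one character before the '@' and a dot after it.
        if (start < at && dotFound)
            found.push_back(line.substr(start, end - start));
    }
    return found;
}
```

Calling extractEmails("write to joe.smith@example.com or x_y@test.org today") gives the two addresses in order. One thing to be aware of: because '.' counts as a valid character, an address at the end of a sentence keeps the sentence's full stop, so "mail a@b.com." is reported as "a@b.com." with the trailing dot.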
A couple of comments on email: almost all characters are legal in an email address, including Unicode.
On top of that, the trailing .XXXX (the top-level domain) comes from a known public list maintained by IANA, so you can use that as additional validation. Also make sure it works on deep multi-dot emails like joe.schmoe.smith@somepart.ofsomecompany.atsomelocation.com
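
A rough sketch of that extra check might look like the following. Both hasKnownTld and kSampleTlds are invented for this example, and the set holds only a handful of sample entries; the real list is far longer and changes over time, so in practice it would be loaded from the published list.

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <string>

// Hypothetical sample only, NOT the real list of top-level domains.
const std::set<std::string> kSampleTlds = {"com", "org", "net", "edu", "gov"};

// Returns true if the text after the last '.' of the domain part
// is in the given set. The comparison ignores letter case.
bool hasKnownTld(const std::string& email, const std::set<std::string>& tlds)
{
    const std::size_t at = email.find('@');
    if (at == std::string::npos)
        return false;

    // The last dot must come after the '@', however many dots there are.
    const std::size_t dot = email.rfind('.');
    if (dot == std::string::npos || dot < at)
        return false;

    std::string tld = email.substr(dot + 1);
    std::transform(tld.begin(), tld.end(), tld.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return tlds.count(tld) > 0;
}
```

Since it looks only at the last dot, the deep multi-dot address above passes with "com", while something like a@b.notatld is rejected by the sample set.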
Thank you soooooo much chervil and jonnin!!! Thank you Chervil it works perfectly now!! :D