Feedback on Text Processing

Hey there, I'm new to the site.

I'd like to put up a piece of code I've been working on and get some feed back on what I could do differently, or what I could be doing that would improve how this works.

It's a text parser built with the intention to make it a subroutine, which is why it's all self contained.

EDIT: To clarify a bit, it's written this way at the moment for testing, before I implement it I'll end up converting to a single function, as I needed to test how I could do this to get an idea of how to structure the rest of the pieces.

Please do be gentle and I'll respond as much as I can.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94

/*
*  Text Parser
*
*  Basically I'm trying to parse a given line at a time.
*
*  I need to:
*    * Take in a line.
*    * Seperate each word on the line, removing punctuation as I go.
*    * Convert every word stored to entirely uppercase
*
*  I provide output in most of the steps below, formatted so its easier to see the words stored at each stage.
*  This is for debugging and verifying that things are working.
*
*
*  Quite a bit of this is based on work demonstrated on a tutorial
*  blog located here:
*   http://cplussplussatplay.blogspot.com/2012/11/text-adventure-games-c-part-1.html
*/

#include <iostream>
#include <string>
#include <vector>
#include <cctype>

using namespace std;


int main()
{

   char lookingFor = ' ';
   // char searchAndDelete[] = {'.', ',', '?','!'}; // Use this instead of hard coding?
  string rawInput;
  string tempString;
  vector<string> word;

  /// This isn't elegant by any means, but it works, lets go through it:

  cout << ">: ";

  getline (cin, rawInput);  // Get one line at a time of input


  for (unsigned int i = 0; i < rawInput.size(); i++)                 // For the size of the line
  {
    if (rawInput.at(i) != lookingFor)                        // If the char@i is NOT a space
    {
      if (rawInput.at(i) != '.' && rawInput.at(i) != ',' && rawInput.at(i) != '?' && rawInput.at(i) != '!')
      {
        tempString.insert(tempString.end(), rawInput.at(i));  // Add it to the temporary string
        cout << tempString << " " <<endl;
      }
    }

    if (rawInput.at(i) == lookingFor)  // If the car@i IS a space
    {
      word.push_back(tempString);     // Add the entire temporary string to the word vector
      tempString.clear();             // Then clear the temporary string
    }

    /// This one is a little more tricky:
    if (i == (rawInput.size() - 1))  // If (the value of)i is equal to the size of the entire line minus 1
    {                                // (This is how we check if we are at the end of the line)
      word.push_back(tempString);    // Add the entire temporary string to the word bank.
      tempString.clear();            // Then clear the String.
    }
  }

  // At this point we arent using the raw input anymore so we'll go ahead and clear it as well
  rawInput.clear();

  cout << "====================\n";

  for(unsigned int i = 0; i < word.size(); i++)
  {
    cout << word[i] << endl;
  }

  cout << "====================\n";

  for(unsigned int i = 0; i < word.size(); i++)               // For the size of the vector (# of words)
  {
    for(unsigned int j = 0; j < word.at(i).size(); j++)       // For the # of letters in the word@i
    {
      word.at(i).at(j) = toupper(word.at(i).at(j));  // Convert each letter to uppercase
    }
    cout << word.at(i) << endl;                      // Print the finished word.
  }

  cin.ignore();

return 0;
}
Last edited on
It's a text parser built with the intention to make it a subroutine, which is why it's all self contained.

If you have the intention to make it a subroutine then is should be a subroutine (a function) instead of all contained in main().

Please use code tags when posting code.

Thank you for the input, I added an edit to clarify why I had it written this way, as for the code tags, I haven't been in a forum in quite awhile so I wasn't sure how to do it, but it should be fixed now.

Regardless I'm going to do some reading to make sure I can use tags a bit better in the future.
Using library functions, C++11 and some logical separation of concerns to break things down into functions:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
#include <algorithm>
#include <functional>
#include <iostream>
#include <string>
#include <vector>
#include <cctype>

std::string filter(std::string s, std::function<bool(char)> f)
{
    s.erase(std::remove_if(s.begin(), s.end(), std::not1(f)), s.end());
    return s;
}

std::string transform(std::string s, std::function<char(char)> t)
{
    std::transform(s.begin(), s.end(), s.begin(), t);
    return s;
}

std::vector<std::string> split(const std::string& s, const std::string& delimiters)
{
    std::vector<std::string> result;

    std::size_t start = 0;
    std::size_t end = start;
    while ((end = s.find_first_of(delimiters, start)) != std::string::npos)
    {
        if (end != start)
            result.push_back(s.substr(start, end - start));
        start = end+1;
    }

    if (start < s.size())
        result.push_back(s.substr(start));

    return result;
}

int main()
{
    std::string line;

    std::cout << ">: ";
    getline(std::cin, line);

    line = filter(std::move(line), [](char ch) { return std::isalnum(ch) || std::isspace(ch); });
    line = transform(std::move(line), std::toupper);

    for (auto& word : split(line, " \t"))
        std::cout << word << '\n';
}
When I compile this, I seem to be running into an error. I'm sure this can be something simple on my end, I am using GCC with CodeBlocks, with the c++11 standard enabled(it's a little check box) on Ubuntu Studio. I'll put the output here:

|47| error: no matching function for call to ‘transform(std::remove_reference<std::basic_string<char>&>::type, <unresolved overloaded function type>)’|


If there is anything I'm missing please let me know, I get the jist of what the code is meant to do but I lack experience with the STL and the C++11 standard.

Aside from that, I'm a bit jealous of how much cleaner that looks, haha. Thank you for the response, it occurs to me I need to read up on the STL much more.

I am no where near where I need to be in terms of skill for how long I've been programming, so I appreciate the help!

In the mean time I'll do some light research on the STL.

There are two overloads of std::toupper; so template type deduction fails.

int toupper( int ); declared in <cctype>
and template< class charT > charT toupper( charT , const locale& ); declared in <locale>

modify line 47 to:
1
2
// line = transform(std::move(line), std::toupper);
line = transform( std::move(line), []( char c ) { return std::toupper(c) ; } ) ;


Version using the standard regular expressions library:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
#include <iostream>
#include <regex>
#include <string>
#include <cctype>

std::vector<std::string> split( const std::string& str )
{
    // pattern: one or more alphanumeric characters, in a numbered capture group
    static const std::regex pattern( "(\\w+)" ) ;

    // http://en.cppreference.com/w/cpp/regex/regex_token_iterator
    return { std::sregex_token_iterator( str.begin(), str.end(), pattern, 1 ), // start with capture #1
             std::sregex_token_iterator() } ;
}

std::string toupper( std::string str )
{
    for( char& c : str ) c = std::toupper(c) ; // http://www.stroustrup.com/C++11FAQ.html#for
    return str ;
}

int main()
{
    const std::string line = "How now, brown (coffee-brown!) cow?" ;
    for( const auto& token : split(line) ) std::cout << toupper(token) << '\n' ;
}

http://coliru.stacked-crooked.com/a/2d352b48522d0f63
I must thank you for the correction, as well as the very readable code and the reference URL's. Those were proven valuable as I started to read through this as well as the follow up reading on the c++14 standard I was able to do in looking for a solution to the (easily solved) compiler errors I was getting, I just had to update my compiler to one that recognized the -std=c++14 flag.

Aside from that, again I am incredibly jealous of how clean your version looks, even more so than the previous example. I am going to explore the c++14 features, as I wait for future feedback and/or for my thread to be closed if the discussion dies down.

Thank you very much for the help, I don't often get the chance to receive feedback on these things due to a lack of local community interest in this particular fiend, which is why I'm looking to reach out a bit, haha.


If I may ask a related question, is there anything here I'm doing that I should be avoiding?

Learning from out of date references as I have been in the past I'm sure there are some bad habits I might have developed that I should be correcting.
Last edited on
Using a simple state machine:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
#include <iostream>
#include <string>
#include <vector>
#include <cctype>

int main()
{
    std::string line;
    std::vector<std::string> words;

    std::cout << ">: ";
    getline(std::cin, line);

    bool inWord = false;
    std::string word;           // current word
    for (auto ch : line) {
        ch = toupper(ch);
        switch(isalpha(ch)) {
        case true:
            if (!inWord) {
                if (word.size()) {
                    words.push_back(word);
                    word.clear();
                }
                inWord = true;
            }
            word += ch;
            break;
        case false:
            inWord = false;
            break;
        }
    }
    // Do the last word
    if (word.size()) {
        words.push_back(word);
    }

    for (auto &word : words) {
        std::cout << word << '\n';
    }
}

If I may ask a related question, is there anything here I'm doing that I should be avoiding?

Use [] instead of at() to make the code more readable. Also you can append to a string with +=. Combining these, line 51 of the OP could be
tempString += rawInput[i]; // Add it to the temporary string
Lines 84-87 go back over the strings to convert to upper case. It would be easier to convert as you go along. With this in mind, line 51 becomes:
tempString += toupper(rawInput[i]); // Add it to the temporary string
> how clean your version looks

The code looks cleaner because it uses facilities (regular expressions) in the standard library,
and a couple of language features (uniform initialisation, range-based for loops) that you were unaware of when you wrote the code.
Link for uniform initialisation (function return value): http://www.stroustrup.com/C++11FAQ.html#uniform-init

Your code would progressively start looking cleaner, and be more robust as you "explore the c++14 features" .

Prefer the standard library to other libraries and to "handcrafted code"

Reason
Code using a library can be much easier to write than code working directly with language features, much shorter, tend to be of a higher level of abstraction, and the library code is presumably already tested. The ISO C++ standard library is among the most widely known and best tested libraries. It is available as part of all C++ Implementations.


Prefer suitable abstractions to direct use of language features

Reason
A "suitable abstraction" (e.g., library or class) is closer to the application concepts than the bare language, leads to shorter and clearer code, and is likely to be better tested.

- CppCoreGuidelines
https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md



> is there anything here I'm doing that I should be avoiding?

You may want to reconsider having 'using namespace std;' at global namespace scope as the default in every source file.

See: http://www.cplusplus.com/forum/general/72248/#msg385442

I've seen working C++98 code break because of careless use of 'using namespace xxx;'

Somewhat simplified example of the code:
(C++98 did not have std::copy_if, it was an omission; this algorithm was added in C++11)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
#include <iostream>
#include <algorithm>
#include <vector>
#include <functional>
#include <iterator>

namespace utility
{
    template< typename INPUT_ITERATOR, typename OUTPUT_ITERATOR, typename PREDICATE >
    void copy_if( INPUT_ITERATOR begin, INPUT_ITERATOR end, OUTPUT_ITERATOR dest, PREDICATE fn )
    {
        for( ; begin != end ; ++begin ) if( fn(*begin) ) *dest++ = *begin ;
    }
}

using namespace utility ; // *** trouble

int main()
{
    std::vector<int> all_numbers { 1, 2, 3, 4, 5, 6, 7 } ;
    std::vector<int> odd_numbers ;

    // without the using directive, the code would have been this and things would be fine with C++11
    utility::copy_if( all_numbers.begin(), all_numbers.end(), std::back_inserter(odd_numbers),
                      std::bind2nd( std::modulus<int>(), 2 ) ) ;

    // with the using directive, the code was this. *** error ***: call to 'copy_if' is ambiguous in C++11
    // see Koenig lookup (ADL): https://en.wikipedia.org/wiki/Argument-dependent_name_lookup
    copy_if( all_numbers.begin(), all_numbers.end(), std::back_inserter(odd_numbers),
             std::bind2nd( std::modulus<int>(), 2 ) ) ;
}

http://coliru.stacked-crooked.com/a/f0f45cbde661a349

Note: CppCoreGuidelines appears to have nothing on 'using namespace' directives; so I suppose this would count as personal opinion.
Topic archived. No new replies allowed.