Is this possible?

got it, thanks for the help everyone :D
Hello! How are you? :)

You can do a few things. But if you're talking about millions of lines, I would discard one of them:
1. You may download a dictionary text file, read it into a vector, then search for each word in this vector (but a set will have better performance...). This could crash if memory runs out, though there's a chance it won't.
2. (Discard this one) Use network programming. Search for a site which tells you whether the word exists. It will not increase the RAM usage as the first option will, but the connection may be slow or may not work at all.
3. Search for a library ;). Code::Blocks uses one of these.

iQChange
Read your dictionary into a std::set. A standard English dictionary (words only) will easily fit into memory. Read your file of words to test sequentially. There's no reason to read it into memory in its entirety. Test words by using set::find and write them to a new file if found in your dictionary set.

One thing you did not indicate was if you needed to detect duplicate words in the file. That's harder if the word file is not in alphabetical order.
Thanks for the replies. :)

The list doesn't have any duplicates and is already in alphabetical order. I just need to get rid of unwanted gibberish and words with numbers/symbols in them. All I want is to see what words in the list are actual real, English words.

How do I make the lists compare with each other? Sorry if I sound really stupid; I'm pretty new to this.

I have my new list of English words I want it to be compared to.
I want it to get rid of any line that contains a word that isn't in an English dictionary.


You'll have to adjust the following, which assumes that each "line" is a single token and uses stringstreams rather than filestreams for my convenience, but it should give you the basic idea:

//  http://ideone.com/EfbLPW
#include <iostream>
#include <sstream>
#include <unordered_set>
#include <vector>
#include <string>
#include <iterator>

std::istringstream tokens_in(
R"(00sdfdsf
ahdadsg
angel
ksjflsjdf
green
green000
carrot
)"
);

std::istringstream dict_in(
R"(angel
carrot
green
kitten
zoo
)"
);

using dictionary_type = std::unordered_set<std::string>;
dictionary_type read_dictionary(std::istream& is)
{
    using iter_type = std::istream_iterator<std::string>;
    return dictionary_type(iter_type(is), iter_type());
}

std::vector<std::string> filter(std::istream& token_stream, const dictionary_type& dictionary)
{
    std::vector<std::string> filtered_tokens;

    std::string token;
    while (token_stream >> token)
        if (dictionary.count(token))
            filtered_tokens.push_back(token);

    return filtered_tokens;
}


int main()
{
    auto filtered = filter(tokens_in, read_dictionary(dict_in));

    for (auto& token : filtered)
        std::cout << token << '\n';
}
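
If a line can hold more than one word, one possible adjustment is to read whole lines with std::getline and keep a line only when every word on it is in the dictionary. This is just a sketch under that assumption; all_words_known and filter_lines are names invented here, and blank lines are dropped as a design choice:

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <unordered_set>

// True if the line has at least one word and every word is in the dictionary.
bool all_words_known(const std::string& line,
                     const std::unordered_set<std::string>& dictionary)
{
    std::istringstream words(line);
    std::string word;
    bool saw_word = false;
    while (words >> word) {
        if (!dictionary.count(word))
            return false;       // one unknown word rejects the whole line
        saw_word = true;
    }
    return saw_word;            // blank lines are rejected too
}

// Copy only the fully recognized lines from `in` to `out`.
void filter_lines(std::istream& in,
                  const std::unordered_set<std::string>& dictionary,
                  std::ostream& out)
{
    std::string line;
    while (std::getline(in, line))
        if (all_words_known(line, dictionary))
            out << line << '\n';
}
```

To work on real files, pass a std::ifstream as `in` and a std::ofstream as `out`; the file names are up to you.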


OP can you put your original post back please?
Cheers.