Find Whether a string exists(or can be made up of) in a given set of characters

I have written a solution (kind of c/c++) , but looking for kind of optimized c++ solution for below problem.

example:"Shrek" can be found in "abcdgdeSghjurjkasdl"

case/order of characters does not matter.

sorry, I tried formatting, but somehow its not applying.., preview does not show anything.

int class_test ::check_is_pattern_ok(std::string message,std::string pattern)
{
int i = 0;
int base_message_count[27] ={'\0'};
while(message[i])
{
// pattern is madeup of alphabets, so ignore other special chars
if(isalpha(message[i]))
{
base_message_count[((tolower(message[i]))-'a')]++;
}
i++;
}
return check_for_pattern( base_message_count, pattern);
}

int class_test::check_for_pattern( int base_message_count[], std::string pattern)
{
int i =0;
int temp_base_msg_count[27] = {0}; // Number of Alphabets + 1
int pattern_len = pattern.size();
memcpy(temp_base_msg_count,base_message_count,27*sizeof(int) );

while(i<26 && (pattern[i]) )
{
int index = tolower(pattern[i]) - 'a';
if(temp_base_msg_count[index]) {
pattern_len--;
--temp_base_msg_count[index];
}
i++;
}
if(!pattern_len) {
return true;
}else {
return false;
}

}
Last edited on
"Shrek" can be found in "abcdgdeSghjurjkasdl"
Let the first string be x and the second string be y.
1. Sort the characters of y lexicographically. E.g. "abcdgdeSghjurjkasdl" sorted is "Saabcdddegghjjklrsu".
2. For every character c of x:
2.a. Perform binary search of c on y.
2.b. If c is not found then "x cannot be found in y".
2.c. Otherwise continue to the next character.
3. If the loop passed through all the characters then "x can be found in y".
Last edited on
Optimal is going to depend a bit on the use-case. Are you looking for one word in a whole bunch of short strings? That is one use-case. Are you looking for a whole lot of words in one input string, is another. Or something else -- will any of the strings be massive? Are they ascii or unicode or other codeset?


for example, ascii lends itself to a 256 bucket of 'has, does not haz' for each letter...

Twisting the above example algorithm a little, in pseudocode:
table[256] = false;
for(all the letters in your input)
table[string[letter]] = true;
...
for(result = true, all the letters in your 'is it there' string and result, letter++)
result &= table[string[letter]]; //only true at the end if each letter was in the input

this avoids a NlgN sort and lg(n) search per letter by exploiting direct access. If its unicode, its not practical to do that.

if you need copies, eg is
shreeeek in shrek (its not, needed 3 es distinctly)
you can do it with int instead of bool and ++ instead of true to see if you have enough copies of a letter.
if case does not matter, you can toupper or tolower the inputs (both!) before and it still works, but that takes extra work. You can bypass this extra loop as well, by rolling it forward into your sort code (set both versions there) and the search code still remains the same.

bottom line of all that: its can be solved in O(n) where n= the length of both (or in the case of searching for many, all of the) strings added together for ascii and other 8-bit codes.
Last edited on
Even if Unicode, you can still perform an extra-quick associative array lookup using an unordered_map.
Jonnin, Duthomhas,
Can you guys please give an example by some C++ code, I am not really good with modern C++.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
#include <iostream>
#include <limits>
#include <string_view>

// modern C++: so we use the string_view type
// https://devblogs.microsoft.com/cppblog/stdstring_view-the-duct-tape-of-string-types/ 
// we assume that the character encoding is a single byte encoding: 
// ie. each byte (char) encodes exactly one character.   
bool can_be_made_from( std::string_view string_to_be_made, std::string_view given_chars )
{
    // NUM_DISTINCT_CHAR_VALUES - the number of distinct values that a char can take.
    // why we can't (or at least shouldn't) hard-code this as the value 256:
    // https://isocpp.org/wiki/faq/intrinsic-types#bits-per-byte

    // relies on: "Type char is a distinct type that has an implementation-defined choice
    //            "of “signed char” or “unsigned char” as its underlying type."
    //            https://eel.is/c++draft/basic.fundamental#7
    // and:       "An unsigned integer type has the same object representation, value representation, and alignment
    //             requirements as the corresponding signed integer type."
    //            https://eel.is/c++draft/basic.fundamental#3
    // and:       "For narrow character types, each possible bit pattern of the object representation
    //            "represents a distinct value."
    //            https://eel.is/c++draft/basic.fundamental#7

    // https://en.cppreference.com/w/cpp/types/numeric_limits/max
    // note:      + 1 because zero too is a distinct value
    constexpr std::size_t NUM_DISTINCT_CHAR_VALUES = std::numeric_limits<unsigned char>::max() + 1 ;

    // get the frequency count of the characters in given_chars
    // std::size_t: https://en.cppreference.com/w/cpp/types/size_t 
    std::size_t frequency_count[NUM_DISTINCT_CHAR_VALUES] = {0} ; // initialise to all zeroes

    // range based loop: http://www.stroustrup.com/C++11FAQ.html#for
    // note: unsigned char because the default char may be a signed integer type
    for( unsigned char u : given_chars ) ++frequency_count[u] ;

    // try to take characters one by one for forming the required string
    for( unsigned char u : string_to_be_made ) // for each character
    {
        // if a character is available, take it (after which, one less remains available)
        if( frequency_count[u] > 0 ) --frequency_count[u] ;

        // otherwise there is no char available; the check fails
        else return false ;
    }

    return true ; // if we reach here, we were able to form the required string
}

int main()
{
    // example: "Shrek" can be found in "abcdgdeSghjurjkasdl"
    std::cout << std::boolalpha << can_be_made_from( "Shrek", "abcdgdeSghjurjkasdl" ) << '\n' ; // true

    // example: "sleeplessness" can't be found in "eeeeeelllllnnppssss" (missing an 's')
    std::cout << std::boolalpha << can_be_made_from( "sleeplessness", "eeeeeelllllssssnnpp" ) << '\n' ; // false
}

http://coliru.stacked-crooked.com/a/7cae5ea4d19d7af7
JLBorges, your code and comments are good, infact this is almost same as my c version as per the logic goes.

Thanks for converting my C to proper C++ code, which I lack really :(

May be I have not explained well, we need not to bother the case of characters when checking,

the function can_be_made_from should return true for both below cases

1) can_be_made_from( "Shrek", "abcdgdeSghjurjkasdl" ) == true // all characters from given_chars are matching case with string_to_be_made

2) can_be_made_from( "Shrek", "abcdgdeSghjuRjkasdl" ) == false //(not all characters from given_chars are matching case, because r is capital)

which is anyway pretty easy, i simply use frequency_count[tolower(u)] instead of frequency_count[u]

I have a doubt here , why we have to go with string_view instead of string
Last edited on
> I have a doubt here , why we have to go with string_view instead of string

We don't have to; but going with std::string_view offers more flexibility, and often better performance. std::string_view can represent any contiguous sequence of characters; it allows us to avoid creating temporary std::string objects; it allows us to work with substrings without having to make a copy of the characters; and it is easy to write correct code while using it.

for example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
#include <iostream>
#include <string_view>
#include <string>
#include <vector>

// foo accepts a generic sequence of characters
void foo( std::string_view str )
{
    std::cout << "foo( '" << str << "' )\n" ;
}

int main( int argc, char* argv[] )
{
    foo( "abcd" ) ; // call with literal with near zero overhead (constexpr constructor)

    if(argc) foo( argv[0] ) ; // call with c-style string

    const char cstr[] = "****abcd****" ;
    foo( { cstr+4, 4 } ) ; // call with (non-null-terminated) substring of a c-style string

    const std::string str = "abcd" ;
    foo(str) ; // call with standard string

    const std::vector<char> vec { '$', 'a', 'b', 'c', 'd', 'e' } ;
    foo( { vec.data()+1, 4 } ) ; // call with a arbitrary (contiguous) sequence of characters
}

http://coliru.stacked-crooked.com/a/4075d29a2f76fa5b

Note: std::string_view provides a non-owning, immutable (read-only) view of a (contiguous) sequence of characters; in situations where we want to modify the character sequence or take ownership of the characters, we would continue to use std::string. -
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
#include <iostream>
#include <string>
#include <vector>
using namespace std;

class Letterset
{
   vector<int> freq;

public:
   Letterset( const string &str );
   bool makeable( const string &test );
};

Letterset::Letterset( const string &str )
{
   freq.resize(256);
   for ( unsigned char c : str ) freq[tolower(c)]++;
}

bool Letterset::makeable( const string &test )
{
   vector<int> f = freq;
   for ( unsigned char c : test ) if ( --f[tolower(c)] < 0 ) return false;
   return true;
}

//----------------------------------------------------------

int main()
{
   Letterset letterset( "abcdgdeSghjurjkasdlmnif" );
   string tests[] = { "Shrek", "Bambi", "Gremlins", "Elf", "Casablanca" };
   for ( const string &s : tests ) cout << s << ( letterset.makeable( s ) ? " can " : " can not " ) << "be made\n";
}
Thanks JLBorges and All for your answers
Last edited on
Weird.

Both solve the immediate problem if the strings involved are so constrained - string_views make particular sense.

Out of interest, if this is a general attempt though, and who knows, what is the utility of it given the monkey-typewriter-Shakespeare hypothesis?

The question is, given there are only 26 letters, "Are the 'hidden' words meaningful?".

I suppose they are if someone has written a naive code making/breaking program, or whatever.
I just realized , string_view is only supported in C++17, currently what i use is only C++14, i can enable C++17, but would like to know if there is any way to do the following.

My intention is just to write a portable code across all compiler exist so far.

1
2
3
4
5
6
7
8
9
if(CPP_VERSION == C++17)

bool can_be_made_from( std::string_view string_to_be_made, std::string_view given_chars )

else if(CPP_VERSION == C++14)

bool can_be_made_from( std::string string_to_be_made, std::string given_chars )

else if (.. check for old version till the first..)
I am not surprised about the C17 feature, I wasn't aware of it until now.

Other people will have to advise you on portability of your code, my interest is primarily the utility of your program as my comment above.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
#include <iostream>

int main() {

#if __cplusplus == 199711L
    std::cout << "C++98\n";
#elif __cplusplus == 201103L
    std::cout << "C++11\n";
#elif __cplusplus == 201402L
    std::cout << "C++14\n";
#elif __cplusplus == 201703L
    std::cout << "C++17\n";
#else
    std::cout << "WTF???\n";
#endif

}

> currently what i use is only C++14, i can enable C++17,
> My intention is just to write a portable code across all compiler exist so far.

Simple and portable: just use const std::string& for immutable input string parameters.
againtry, don't understand what you are trying say, sorry, can you please write in simple words to understand..,or any example...
@ravss2
Knowing about this might help.
https://en.wikipedia.org/wiki/Infinite_monkey_theorem

In short, if your string being tested is random and long enough, with so few search words you will probably always find them. If that's the case then what's the use of the program? Purely out of interest :)
1
2
3
4
5
6
#include <iostream>
int main( int argc, char** argv )
{
  std::cout << __cplusplus << std::endl;
  return 0;
}


And from C++ shell:
201300
ohh, I understand , actually I am working on a problem where I need to check for a message from a stream of characters , so just working on solutions that's all,

and I have certain long messages which does not have my secret messages I am looking for..
ohh, I understand
Good! I was guessing some form of secret messaging might be involved. The next problem is to make sure the message you have uncovered from the stream is real or a random monkey. That would be 'interesting'.

Cheers


Topic archived. No new replies allowed.