stirngs

How many distinct substrings does a given string S have?
For example, if S = "abc", S has 7 distinct substrings: {"","a","b","c","ab","bc","abc"}. Note that
the empty string and S itself are considered substrings of S.
On the other hand, if S = "aaa", S has only 4 distinct substrings: {"","a","aa","aaa"}.
The first line of the input file contains N, the number of test cases. For each test case, a line
follows giving S, a string of from 1 to 1000 alphanumeric characters. Your output consists of one
line per case, giving the number of distinct substrings of S. Try to write an efficient program.
Sample Input
2
abc
aaa
Output for Sample Input
7
4
****please dont use pointer. i haven't learnt them yet. please help me, i sont even know from where to begin****

This is how I would check for a single string. You can write the reader for the sample input

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
#include<string>
#include<vector>
#include<iostream>
#include <algorithm> 

int main()
{
  std::string toCheck("aaa");
 
  //Find all substrings
  std::vector<std::string> substrings;
  for(std::size_t i=0;i<=toCheck.size();i++)
    for(std::size_t j=i;j<=toCheck.size();j++)
      substrings.push_back(toCheck.substr(i,j-i));
  
  //sort the substrings alphabetically - this step is required for the unique algorithm
  std::sort(substrings.begin(), substrings.end());
 
  //eliminate duplicates
  std::vector<std::string>::iterator it;
  it = std::unique(substrings.begin(), substrings.end());
  substrings.resize(std::distance(substrings.begin(),it));

   //answer
  std::cout<<substrings.size()<<std::endl;
 
  //sanity check - print all unique substrings
  for (std::size_t i=0;i<substrings.size();i++)
    std::cout<<'"'<<substrings[i]<<'"'<<std::endl;
}


hi:
i think i have an idea about the solution, try searching for the combinatorial function in maths, this function calculates the number of subsets of a given set of values, you can then make an easy algorithm to subtract the identical subsets.
i'm not putting any code, you have to do that part yourself, programming is about experimenting.
Last edited on
@ats15

Nice idea, however, I don't understand the following statement:

1
2
3
  for(std::size_t i=0;i<=toCheck.size();i++)
    for(std::size_t j=i;j<=toCheck.size();j++)
      substrings.push_back(toCheck.substr(i,j-i));


According to me, it should be something like:

1
2
3
  for (std::size_t i=0; i<toCheck.size(); i++)
    for (std::size_t j=1; j <= (toCheck.size() - i); j++)
      substrings.push_back ( toCheck.substr ( i, j ) );

sorry but i can't use vector...can u tell me a simple way to do it?
First, understand the program logic (the algorithm) from this code.
Then write a program which is your own program.

If you skip the second step, you would have learned nothing.
To learn even more: How much time does the program take? Can you make it faster?

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
#include <string>
#include <iostream>

namespace no_stl
{
    constexpr std::size_t MAX_UNIQUE = 1024*128 ;
    std::string uniques[MAX_UNIQUE] ;
    std::size_t n_unique = 0 ;

    bool is_unique( const std::string& str )
    {
        for( std::size_t i = 0 ; i < n_unique ; ++i )
            if( uniques[i] == str ) return false ;
        return true ;
    }

    std::size_t insert_unique( const std::string& str )
    {
        if( is_unique(str) && n_unique<MAX_UNIQUE )
        {
            uniques[n_unique] = str ;
            ++n_unique ;
        }
        return n_unique ;
    }

    std::size_t distinct_substrings( const std::string& str )
    {
        if( !str.empty() )
        {
            for( std::size_t i = 0 ; i <= str.size() ; ++i )
                insert_unique( str.substr(0,i) ) ;
            distinct_substrings( str.substr(1) ) ;
        }
        return n_unique ;
    }
}

namespace use_stdlib
{
    template < typename SET >std::size_t distinct_substrings( SET& set,
                                                       const std::string& str )
    {
        if( !str.empty() )
        {
            for( std::size_t i = 0 ; i <= str.size() ; ++i )
                set.insert( str.substr(0,i) ) ;
            distinct_substrings( set, str.substr(1) ) ; // O(N*N)
        }
        return set.size() ;
    }
}

#include <set>

int main()
{
    {
        using namespace no_stl ;
        const std::string str = "abababc" ;

        distinct_substrings(str) ;
        std::cout << "#distinct substrings of '" << str << "' " << n_unique << '\n' ;
        for( std::size_t i = 0 ; i < n_unique ; ++i )
            std::cout << "'" << uniques[i] << "'\n" ;
    }

    std::cout << "\n---------------------------\n" ;

    {
        std::set<std::string> set ; // or use std::unordered_set
        const std::string str = "abababc" ;

        use_stdlib::distinct_substrings( set, str ) ;
        std::cout << "#distinct substrings of '" << str << "' " << set.size() << '\n' ;
        for( const auto& s : set ) std::cout << "'" << s << "'\n" ;
    }
}

http://ideone.com/aOEVhB
@abhishekm71
1
2
3
  for(std::size_t i=0;i<=toCheck.size();i++)
    for(std::size_t j=i;j<=toCheck.size();j++)
      substrings.push_back(toCheck.substr(i,j-i));

In the first line i is the beginning of substring, in the second line j is the end position. It cannot be less than i. If j=i, then you have the empty string. You can start from i+1, but then you need to add it to your unique substrings. Of course, you are generating the empty substring a lot of times, but the program is supposed to keep only one of them. And it does :)
In the 3rd line j-i is the length of the substring. But I think the two loops are similar
ok.. thanks for the explanation ats15. I actually got confused earlier.

and thanks JLBorges for giving us the 2 options for this same operation without the need for sorting. Although please clarify the following:

1
2
    constexpr std::size_t MAX_UNIQUE = 1024*128 ;
    std::string uniques[MAX_UNIQUE] ;


1. Why have you used array instead of vector? Is it to optimize on time complexity (removes the need for multiple memory allocation in code)?
2. What is the significance of the number 1024*128? Why couldn't you just multiply it and define MAX_UNIQUE as const std::size_t MAX_UNIQUE = 131072 ; ? Any specific reason?
> Why have you used array instead of vector?

Because
sorry but i can't use vector...can u tell me a simple way to do it?
and
****please dont use pointer. i haven't learnt them yet.


> without the need for sorting

The versions I posted are not more efficient. Standard library algorithms were not allowed; the intent was to make the code easy to understand; ergo linear search and append instead of sort/unique/erase.


> What is the significance of the number 1024*128? Why couldn't you just multiply it and define MAX_UNIQUE as const std::size_t MAX_UNIQUE = 131072 ; ?

I find it simpler to visualize the magnitude of the number as 128K if it is written as 1024*128 (or 128*1024 )
ok.. :)
A string with distinct components has exactly s*(s+1)/2+1 substrings where s is the string size(e.g. the string "abc" has 3*4/2+1=7 substrings and the string "abc...xyz" i.e. english alphabet with 26 characters has 26*27/2+1=352 substrings). This fact can be proved as follows:

Let S be a string with s distinct characters.

S = "C1C2...Cs"

Now, selecting substrings whith one component, we have exactly s substrings:

"C1", "C2",...,"Cs".

Now substrings of two components:

"C1C2", "C2C3",...,"Cs-1Cs".

Important note: the index of the first component of each substring gives its rank in the sequence and, of course, the last one gives the number of substrings, s-1.

So, the last substring with three components will be Cs-2Cs-1Cs and its rank will be s-2 who is the very number of substrings whith 3 components, s-2.

Continuing this process we have for the penultimate selection by s-1:

"C1C2...Cs-1", "C2C3...Cs" and, obviously, 2 substrings whith s-1 components each one.
And finally, the last substring is S itself:

"C1C2...Cs", one substring.

Reviewing the entire process we have the final result, adding all the obtained results:

Total_number = s + s-1 + ... + 2 + 1 or:
Total_number = 1 + 2 + ... + s-1 + s and adding the two sums we have:
-------------------------------------------------------------------------------
2*Total_number =(s+1)+(s+1) + ... + (s+1)+(s+1) here we have s times (s+1) and finally:

Total_number = s*(s+1)/2.

Adding the empty string we have the number of distinct substrings of a given string S whith s distinct components.

Important note: this process may be used as a guide to write the necessary code for the final step: removing duplicates if S has some identical components.
Last edited on
Topic archived. No new replies allowed.