Data design and boolean text search query evaluation for a small text search engine

Hello,
I'm doing a small school project to design a program to use search if a word/words exist in text file(s) or not, using boolean text search query like foo and fum, foo and test or bar, etc...
I'm still struggling with the data design, what I have in mind right now is to read in all the data text files (it's a small project so only maximum 30 of them I think) and store the words in an AVL tree. Each node of the tree will contain a string, which is the word stored and a boolean array with size equals to the number of data files, each element in the array will correspond to 1 file : true if the word is in the file or false otherwise.
Is that a good idea?

One more thing that's bugging me is how to evaluate the Boolean Text Search Queries like when the user input foo AND fum, the program will return the name of all files containing both foo and fum. I'm thinking about using an expression tree for this but I haven't practiced with it much so I don't know if that will work. Can you please give me an example to turn something like "test or bar and ben" into an expression tree and work on it (and takes precedence over or) or is there a better way?. And since the number of words to be searched in a query can be arbitrary large but the my search function for the AVL tree only returns 1 node at a time, how can I deal with that?

Thank you for your time and patience.
Cheers
A trie would be more efficient and easier to implement.

First convert the expression into postfix form. Shunting yard works for this.
1
2
3
4
5
6
7
8
9
10
11
12
13
for each token in expression
    if token is word
        push onto the evaluation stack a list of all files containing the word
        continue
    //token is not word
    pop two lists off the top of the evaluation stack
    if token is and
        get the intersection of the lists
    else
        get the union of the lists
    push the result onto the evaluation stack
the final result is at the top of the evaluation stack
if the evaluation stack has more or less than one element, there was an error
Thank you for your suggestion.
I'm still a bit unclear on using the expression in postfix form, would it be more efficient and faster compared to using an expression tree in terms of converting from the infix form and evaluation time.
By the way, I'm curious about the algorithm to convert the infix boolean expression into postfix form or a binary expression tree. Can you share it with me please.
And if I use tree, will I need 1 Trie for each text data file or is there any other way around? Sorry that I can't quite figure it out since I've just only read and learn about Trie after your suggestion
Last edited on
would it be more efficient and faster compared to using an expression tree in terms of converting from the infix form and evaluation time.
Yes.

By the way, I'm curious about the algorithm to convert the infix boolean expression into postfix form or a binary expression tree. Can you share it with me please.
One algorithm is called "shunting yard".

And if I use tree, will I need 1 Trie for each text data file or is there any other way around?
You can store which files have that word at the leaves.
There's still 1 thing that I'm not certain after reading some documents about Trie is that as I understand it, I would need either an array or a vector to store the child nodes for each nodes in the tree. In the examples I've read, often people use an array of size 26, each character in the alphabet is 'assigned to an index, that's quite fast for searching since I can just use the array index directly but I can't use any other things like numbers and stuffs like %^&$,... in my program since I don't know how big an array I will need. Another example I've read use a vector, which can be as big as I need but then I can't map a character to its corresponding index, when performing a search I'd have to loop through all the child nodes in the worst case, would that still be faster than an AVL tree and can there be a more efficient way to implement a Trie class.
Thank you very much for the great support you have provided me.
but I can't use any other things like numbers and stuffs like %^&$,... in my program since I don't know how big an array I will need.
An array of size (1 << CHAR_BIT) is big enough to hold all possible values of a char, but such an array of pointers will, in most platforms, use 1 or 2 kilobytes of memory per node.

I'd have to loop through all the child nodes in the worst case, would that still be faster than an AVL tree
Let's assume that checking for existence of a character in a trie node needs a linear search.
A trie requires up to k*n character comparisons to check for existence, where n is the length of the query and k is the size of the alphabet. An AVL requires up to log m string comparisons where m is the number of nodes in the tree.
AVL: O(log m) over size of structure, O(n log m) over size of query
Trie: O(1) over size of structure, O(n) over size of query
However, note that a trie with a linear search at each node can be beaten by an AVL tree if the word count is very small, because the constant factor is bigger for the trie than for the AVL tree.
A trie that can find the next node in O(1) is always faster than the AVL tree, but uses more memory.
Wow, your answers are always come very fast and full of valuable details. My most sincere appreciation to you, mate. Hope one day I can be as good as you to help others like this
Topic archived. No new replies allowed.