Creating a hash table from file contents (unknown size)

I would like to create a hash table by reading the entries from a file; the file size may be in KBs, MBs, or even GBs.
For example, my file has data like this:
person1_name,person1_father,person1_mother
:
:
personK_name, personK_father, personK_mother

Every entry is terminated by a newline.

As text-mode behaviour varies from system to system, I would like to read the file in binary mode.

I can get the file size with

fseek(fp, 0, SEEK_END);
size = ftell(fp);

then allocate memory with

char *buffer = malloc(size);

and read the contents of the file (after seeking back to the start) with

fread(buffer, 1, size, fp);
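A minimal self-contained sketch of that sequence, since ftell() takes only the FILE pointer and the seek to the end has to be done separately with fseek() (the helper name read_whole_file is just illustrative):

#include <stdio.h>
#include <stdlib.h>

/* Read an entire file into a NUL-terminated heap buffer; returns NULL on failure. */
char *read_whole_file(const char *path, long *out_size)
{
	FILE *fp = fopen(path, "rb");
	if (fp == NULL)
		return NULL;

	fseek(fp, 0, SEEK_END);      /* jump to the end...                    */
	long size = ftell(fp);       /* ...and the current offset is the size */
	rewind(fp);                  /* back to the start before reading      */
	if (size < 0) {
		fclose(fp);
		return NULL;
	}

	char *buffer = malloc((size_t)size + 1);
	if (buffer == NULL) {
		fclose(fp);
		return NULL;
	}

	size_t got = fread(buffer, 1, (size_t)size, fp);
	buffer[got] = '\0';          /* NUL-terminate so string parsing is easy */
	fclose(fp);

	*out_size = (long)got;
	return buffer;
}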


The challenge is that, at the same time, I have to parse the file for the number of entries:
void parseBuffer(char *buffer)
{
	char *line = getLine(buffer);                 /* pull one "name,father,mother" line out of the buffer */
	Person *temp_p = getPersonDetails(line);      /* split the line into a Person record                  */
	creatAHashTable(temp_p->person_name, temp_p); /* insert the record into the hash table                */
}

unsigned int GetHashKey(char *person_name)
{
	unsigned int hash = 0;
	/* some hashing process on person_name goes here */
	return hash % NUMBER_OF_ENTRIES;
}
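The body of GetHashKey is left open above; purely as an illustration, one common choice is something like the djb2 string hash:

/* Illustrative only: the classic djb2 string hash; any reasonable hash will do. */
unsigned int GetHashKey(char *person_name)
{
	unsigned long hash = 5381;

	while (*person_name != '\0')
		hash = hash * 33 + (unsigned char)*person_name++;

	return (unsigned int)(hash % NUMBER_OF_ENTRIES);
}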

Now, how will I calculate NUMBER_OF_ENTRIES and create the hash table at the same time?
I can parse the entire buffer to calculate the number of entries, but then I will have to go over the entire buffer a second time to create my

Person(name, fathername, mothername);

records, so I will have to read the data twice: once to get the count and then again to build the hash table.
Is there some better design to achieve this?
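For what it's worth, once the whole file is in the buffer, the counting pass is at least over memory rather than the file. A rough sketch of such a count, assuming every entry ends with a newline (the function name is just illustrative):

#include <string.h>

/* Count entries by counting newlines in the already-loaded buffer. */
size_t count_entries(const char *buffer, size_t size)
{
	size_t count = 0;
	const char *p = buffer;
	const char *end = buffer + size;

	while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
		++count;                 /* one more entry found           */
		++p;                     /* step past the newline just hit */
	}
	return count;
}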
You could write the number of entries at the start of the file (a sketch of this follows below).
You could make all the entries have the same width.
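A sketch of what the first suggestion could look like on both ends (the stream names out and in are placeholders):

/* Writing: put the entry count on the first line of the file. */
fprintf(out, "%zu\n", entry_count);

/* Reading: pick the count back up before parsing the records. */
size_t entry_count = 0;
if (fscanf(in, "%zu\n", &entry_count) != 1) {
	/* missing or garbled header; fall back to counting lines */
}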
Is there another solution? I tried thinking hard, but nothing clicked.
I guess there is no good way other than iterating over the contents twice.
Or, maybe don't care about the exact number of entries? You don't need it to be exact. Write a little function that relates the size of the file to the number of entries, err on the side of making your table too big, and use this best guess as your table size. Also, you don't want the table to be the same size as the number of entries anyway; you would get a huge mess of collisions. You want the table to be a good multiple of the entry count, e.g. 3-5 times bigger than the data count, so your best guess, overestimated, is going to do just fine.

Binary files are almost always done with fixed-width records. Are you not using fixed width?! I highly advise it; then you can just take the file size and divide by the record width to get the number of records, as already stated. Variable-length records in a binary file take away most of the advantages!
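A sketch of that size estimate; the minimum bytes per record and the multiplier below are assumptions to tune against the real data:

#include <stddef.h>

/* Guess the table size from the file size alone.
   MIN_RECORD_BYTES is a rough lower bound on one "name,father,mother\n" line;
   TABLE_FACTOR keeps several slots per expected entry to limit collisions. */
#define MIN_RECORD_BYTES 12
#define TABLE_FACTOR      4

size_t estimate_table_size(long file_size)
{
	size_t max_entries = (size_t)file_size / MIN_RECORD_BYTES + 1;
	return max_entries * TABLE_FACTOR;
}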
Yeah, you are also right, but if there are some invalid entries in the file, then we allocate more memory than necessary. That could be avoided, but anyway, that's case dependent.

I don't want to add the size as the first line, since it's a text file.
I guess it would be better to just use XML.
"then we allocate more memory than necessary" <- this is the point of hash tables, really. It's the time-space tradeoff: using a lot of space to get near-zero lookup time. If you want to be space efficient, you don't want to hash. If you want a happy medium, you can make a smaller table with deeper buckets (vectors that grow to fit). Even a linear search on 1000 items is pretty fast.
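A sketch of the smaller-table-with-deeper-buckets idea, with each bucket as a growable array (all the names here are illustrative, not from the code above):

#include <stdlib.h>

typedef struct Person Person;     /* stands in for the poster's Person record */

typedef struct {
	Person **items;               /* entries that hashed to this slot */
	size_t   count;
	size_t   capacity;
} Bucket;

typedef struct {
	Bucket *buckets;
	size_t  num_buckets;          /* can be far smaller than the entry count */
} HashTable;

/* Append one person to the bucket its hash selects, growing the bucket as needed. */
static int table_insert(HashTable *t, unsigned int hash, Person *p)
{
	Bucket *b = &t->buckets[hash % t->num_buckets];

	if (b->count == b->capacity) {
		size_t new_cap = b->capacity ? b->capacity * 2 : 4;
		Person **tmp = realloc(b->items, new_cap * sizeof *tmp);
		if (tmp == NULL)
			return -1;            /* out of memory; leave the bucket unchanged */
		b->items = tmp;
		b->capacity = new_cap;
	}
	b->items[b->count++] = p;
	return 0;
}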
creatAHashTable

Even Ken Thompson admitted that leaving the 'e' off of "creat" was a mistake.
"binary files are almost always done with fixed width. Are you not using fixed width?" ==> I am not creating the file in "wb" mode; I created a .txt file with some contents and I am reading it with "rb". Is that wrong to do?
It's not technically wrong, but you are most likely making more trouble for yourself.