Work with own file format

Hi everyone!
I am having trouble creating my own file format.
For example, say there is a document markup file. I have no problem reading it. But how should I organize a container that holds the formatted document, and how would I use it? I have tried various approaches, but something always broke and ruined the idea. Google does not turn up any information about this.
I'm not interested in ready-made libraries or in switching to HTML (although the idea comes down to parsing HTML, I don't know how to do that myself).
Thanks for any help)
What information does this file contain?

I don't think that is important. But let's imagine it is a regular text file, for example:
Hello my |dear| friend.
I want every word marked with | to be bold.
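
For this specific example, a minimal sketch could treat each | as a toggle that switches bold on or off. The names Span and parsePipes are made up for illustration, and the sketch assumes a | never needs to appear as literal text.

```cpp
#include <string>
#include <vector>

// Hypothetical type: one run of text plus whether it should be bold.
struct Span {
    std::string text;
    bool bold;
};

// Split a line on '|' markers; each marker toggles bold on or off.
std::vector<Span> parsePipes(const std::string& line) {
    std::vector<Span> spans;
    std::string current;
    bool bold = false;
    for (char c : line) {
        if (c == '|') {
            // Close the run collected so far, then flip the bold state.
            if (!current.empty()) spans.push_back({current, bold});
            current.clear();
            bold = !bold;
        } else {
            current += c;
        }
    }
    if (!current.empty()) spans.push_back({current, bold});
    return spans;
}
```

For the line above this gives three spans: "Hello my " (regular), "dear" (bold), and " friend." (regular).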
It's a wider data structures issue, rather than a single container issue. It isn't the kind of thing you can Google.

Perhaps, if you think about the kinds of operations you want to perform on this data, that'll guide you to a convenient way to store it in memory.

But to be honest, from what you've posted, I can't think of any single thing that might help.

Perhaps you are right. But there is HTML, which is parsed using similar logic, and I could not find even the simplest HTML parser. Can you tell me how one works?
Your own format needs your own reader. There is no way to use your personal format to tell most editors to make text bold based on some tag.

But HTML, let's talk about that.
HTML is plain text that contains special groups of symbols. Let's keep it at the pseudocode level for now.
So your text file, in an HTML-like language, looks something like

Hello my <bold>dear <endbold> friend.
and your program does a getline and looks through the string for <bold>, <italics>, <red>, <underlined>, and whatever else you want to support. If it finds one, it then looks for the matching end marker. Then it chops the string up at those markers and uses a rich-text-style box to display each piece in the color, font, or enhancement that was selected. The parser has to be smart enough to handle nested tags, such as bold and italics both applying to one chunk.

It's really that simple: you need these 3 things and you can build a simple markup viewer/file format
1) tags you can parse
2) a viewer that can apply the commands given by the tags (e.g. Notepad can't do that)
3) the ability to handle nested tags.

It can get more complicated from there, of course, but that is the first take at it.
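
A rough sketch of items 1 and 3, assuming the <bold> ... <endbold> pseudo-syntax from the example above. The names Attr, Chunk, and parseMarkup are invented for illustration. Each attribute is one bit, so nested tags simply combine bits; anything in angle brackets that is not a known tag is kept as ordinary text.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical attribute bits; nested tags combine them with bitwise OR.
enum Attr : unsigned { Bold = 1u, Italic = 2u, Underline = 4u };

struct Chunk {
    std::string text;
    unsigned attrs;  // bitwise OR of the Attr values in effect for this text
};

std::vector<Chunk> parseMarkup(const std::string& src) {
    // Tag names follow the pseudo-markup above: <bold> ... <endbold>.
    static const std::map<std::string, unsigned> tags = {
        {"bold", Bold}, {"italics", Italic}, {"underlined", Underline}};

    std::vector<Chunk> out;
    std::string text;
    unsigned attrs = 0;
    std::size_t i = 0;
    while (i < src.size()) {
        std::size_t close =
            (src[i] == '<') ? src.find('>', i) : std::string::npos;
        if (close != std::string::npos) {
            std::string name = src.substr(i + 1, close - i - 1);
            bool isEnd = name.compare(0, 3, "end") == 0;
            auto it = tags.find(isEnd ? name.substr(3) : name);
            if (it != tags.end()) {
                // Known tag: flush the text so far, then update the state.
                if (!text.empty()) out.push_back({text, attrs});
                text.clear();
                if (isEnd) attrs &= ~it->second;
                else attrs |= it->second;
                i = close + 1;
                continue;
            }
        }
        text += src[i++];  // ordinary character (or an unknown tag)
    }
    if (!text.empty()) out.push_back({text, attrs});
    return out;
}
```

One limitation of using plain bits: opening the same tag twice and closing it once switches the attribute off at the first end tag. A counter per attribute would fix that if it matters.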

There are libraries that parse HTML. The problem with HTML is that it supports a lot of other stuff, like embedded images, frames, spreadsheets, videos, scripting, and so much more. Reading the source code for one is going to be a nightmare for a beginner. I would say take a look at Rich Text Format; I believe it has gotten more complex as well, but it may still be simpler than HTML. From there, I dunno… Notepad++ supports this kind of thing without the images/widgets/other stuff, so maybe there is something in there?


Let me ask something else... can you write a very simple text editor program that uses a rich-text-style display where you set the font/color/enhancements/etc. in code, for example to make every other word red and underlined? If not, you need to get this skill first, so that you have a way to test your file format parser. Don't forget some way to recover from errors gracefully if there are missing tags or other problems. For now, I would ignore end tags that lack a beginning, and for begin tags that never end, just keep applying their attributes through to the end of the text.
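
If you would rather report such problems than silently ignore them, one possible sketch is a stack-based check. The function name checkTags and the warning wording are invented here, and it assumes the <name> / <endname> pseudo-syntax, strict nesting, and that everything in angle brackets is a tag.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns human-readable warnings; an empty result means the tags balance.
std::vector<std::string> checkTags(const std::string& src) {
    std::vector<std::string> open;  // stack of currently open tag names
    std::vector<std::string> warnings;
    std::size_t pos = 0;
    while ((pos = src.find('<', pos)) != std::string::npos) {
        std::size_t close = src.find('>', pos);
        if (close == std::string::npos) break;  // stray '<': rest is text
        std::string name = src.substr(pos + 1, close - pos - 1);
        pos = close + 1;
        if (name.compare(0, 3, "end") != 0) {
            open.push_back(name);  // begin tag
        } else if (!open.empty() && open.back() == name.substr(3)) {
            open.pop_back();       // end tag matching the innermost begin tag
        } else {
            warnings.push_back("unexpected <" + name + ">");
        }
    }
    // Whatever is still on the stack was never closed.
    for (const std::string& name : open)
        warnings.push_back("<" + name + "> is never closed");
    return warnings;
}
```

The viewer can still display the document after a warning; the check only tells the user that something in the file looks off.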
Oh, and once you have the thing working, your editor/viewer should let users select a chunk of text and apply or remove an attribute, and it would adjust the file accordingly.
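
Adjusting the file mostly means writing the in-memory chunks back out as markup. A small sketch of that direction, assuming only bold and italics and the <bold> / <endbold> pseudo-syntax; Chunk and toMarkup are invented names.

```cpp
#include <string>
#include <vector>

// Hypothetical in-memory form: text plus the styles that apply to it.
struct Chunk {
    std::string text;
    bool bold;
    bool italic;
};

// Write chunks back out as markup, emitting a tag only when a style changes.
std::string toMarkup(const std::vector<Chunk>& doc) {
    std::string out;
    bool bold = false, italic = false;
    for (const Chunk& c : doc) {
        if (bold && !c.bold) out += "<endbold>";
        if (italic && !c.italic) out += "<enditalics>";
        if (!bold && c.bold) out += "<bold>";
        if (!italic && c.italic) out += "<italics>";
        bold = c.bold;
        italic = c.italic;
        out += c.text;
    }
    // Close anything still open at the end of the document.
    if (bold) out += "<endbold>";
    if (italic) out += "<enditalics>";
    return out;
}
```

Note that this can produce overlapping rather than strictly nested tags, which is fine as long as the reader tracks each attribute independently.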

It is quite possible that Visual Studio or other tools let you save and load a rich text box as an RTF file directly.
Wow, thank you very much for such a detailed explanation of this subject. I have a few more questions.

Do I understand correctly that there should be a variable that stores all the tags and by which the text is divided?
I.e. if there is this HTML text
Hello my <b>dear</b> <i>friend</i>.
then it is divided into parts like [Hello my ][dear][ ][friend][.]?
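
One way to picture that split is as a list of tokens, where the tags are kept as tokens too so that a later step can decide what each one means. This is only a sketch; Token and tokenize are invented names, and it assumes every <...> pair is a tag.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct Token {
    bool isTag;        // true for <b> or </i>, false for plain text
    std::string text;  // tag name (leading '/' for end tags) or the text
};

std::vector<Token> tokenize(const std::string& src) {
    std::vector<Token> tokens;
    std::size_t pos = 0;
    while (pos < src.size()) {
        if (src[pos] == '<') {
            std::size_t close = src.find('>', pos);
            if (close != std::string::npos) {
                // Store what is between '<' and '>' as the tag name.
                tokens.push_back({true, src.substr(pos + 1, close - pos - 1)});
                pos = close + 1;
                continue;
            }
        }
        // Plain text runs up to the next '<' (or the end of the input).
        std::size_t next = src.find('<', pos + 1);
        if (next == std::string::npos) next = src.size();
        tokens.push_back({false, src.substr(pos, next - pos)});
        pos = next;
    }
    return tokens;
}
```

For the HTML line above, the text tokens come out as "Hello my ", "dear", " ", "friend", and ".", with the tag tokens b, /b, i, and /i in between.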

Item 3 deals with this split, but the resulting data needs to be passed to item 2. How should I save this information so that it is easy to use? I need to know that "dear" should be bold and "friend" should be italic. Should I save it in a structure like this
struct formatText {
    std::string text;  // the run of characters this entry covers
    int format;        // style code for that run
};

where the format is 0 - regular, 1 - bold, 2 - italic etc.
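
One thing to watch with codes like 0, 1, 2: they cannot express text that is bold and italic at the same time, which nested tags will produce. A possible variation, with invented names, gives each style its own bit so that styles can be combined.

```cpp
#include <string>

// Hypothetical flag values: each style gets its own bit.
enum Format : unsigned {
    Regular   = 0,
    Bold      = 1u << 0,
    Italic    = 1u << 1,
    Underline = 1u << 2
};

struct FormatText {
    std::string text;
    unsigned format;  // bitwise OR of Format values

    // True if the given style is switched on for this piece of text.
    bool has(Format f) const { return (format & f) != 0; }
};
```

A piece that is both bold and italic is then FormatText{"dear", Bold | Italic}, and the viewer asks has(Bold), has(Italic), and so on when it draws the text.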

I think I will be able to write a simple notepad-style program that displays formatted text. I do not need pictures, videos, etc. yet. Just plain text.
Why not store it like this? (These are not real GUI commands, but you will have access to something very like what I am describing here.)

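struct chunk
{
    std::string text;
    font f;  // placeholder for whatever font/style type your GUI toolkit provides
};
std::vector<chunk> document;
for (const chunk& c : document)
    richedit.addtext(c.text, c.f);  // placeholder call, not a real API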
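
Until a real rich text control is wired up, the same loop can be tried out in a terminal. The sketch below is a stand-in, not a GUI API: it assumes a terminal that understands ANSI escape codes, supports only bold, and uses the invented names Chunk and renderAnsi.

```cpp
#include <string>
#include <vector>

struct Chunk {
    std::string text;
    bool bold;
};

// Stand-in for the rich text control: wrap bold chunks in the ANSI
// "bold on" (ESC[1m) and "reset" (ESC[0m) escape sequences.
std::string renderAnsi(const std::vector<Chunk>& document) {
    std::string out;
    for (const Chunk& c : document) {
        if (c.bold) {
            out += "\x1b[1m";
            out += c.text;
            out += "\x1b[0m";
        } else {
            out += c.text;
        }
    }
    return out;
}
```

Printing the result with std::cout should show the bold chunks in bold on most Linux and macOS terminals; swapping the body of the loop for real GUI calls later leaves the document structure unchanged.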
