About file management

Do you think there's any value in a file management system that lets you have distinct entries in the same directory that have the same name?
I.e. you list the contents of a directory and the system answers
file.txt file.txt
and if you're somehow able to specify which one you want to open, they turn out to have different contents.

zapshe (1834)

Hmm.. No? Why would they have the same name? If they're to be differentiated, they should have different names to begin with. How would you even differentiate? Based on file size?

Such a thing is not even possible on Windows. You can only have the same name for a folder and file in the same directory.

Why do you ask?

keskiverto (10365)

You are not talking about case, I presume.
file.txt File.txt
On most commonly used filesystems these are different names.

You do need some unique identifier for each file. The concept of the (shown) name not being that identifier is ... interesting.

TheIdeasMan (6782)

Yes, if they are different revisions of the same file as in git.

On a doc management system I used for work, one wasn't allowed to put the revision in the file name for that reason. But that was RDBMS driven.

I did see once in the 1980s, IIRC, a VAX system that had revisions built in: it appended the revision number in parentheses to the file name.

Merging git-like features into an OS file system would be interesting, though obviously not as simple as having a separate inode for each file, because of the diff-file nature of git. But if one could make it look like it does, that would be awesome.

But I wonder if that would be any different to using git from the shell?

zapshe (1834)

You are not talking about case, I presume.
file.txt
File.txt
On most commonly used filesystems these are different names.

Windows 10 will not allow it.

The concept of the (shown) name not being that identifier is ... interesting.

Interesting yea, but I can't see the practical application of it. What's the point? Even if possible, what population would benefit from this?

The first thing that came to mind would be some sort of system which keeps daily logs. It may not want one huge .txt to open every time it wants to open the logs. However, this is easily fixed by numbering them, having a new .txt for every day/month/year as needed. And it would make more sense to have them numbered than try to find some other way to differentiate.

Another thought is just mass data where order doesn't matter, and you just want a system to chug through it. There could be some performance benefits to having smaller txt files rather than one large one if RAM is an issue, but there's still no real reason why they can't be numbered to be different.

I suppose in a situation where you want small files, order doesn't matter, and you've got lots of files to create (where you might run out of possible names for your text files), then this may be of some benefit. But it seems like a rare situation with other ways of dealing with the issue.

coder777 (8439)

That would open another door wide for attackers: you think you are opening one file but actually get another.

What would be the problem solved by this?

If you have a lot of files with the same name (for automatic processing) I would say provide an index or timestamp.

helios (17506)

Why would they have the same name?

For example, you merged two directories and there happened to be files with the same name and different contents.

You are not talking about case, I presume.

No, exact same name. They're indistinguishable by just looking at the file name.

You do need some unique identifier for each file. The concept of the (shown) name not being that identifier is ... interesting.

Yes, that's exactly what I'm doing. The identifier is whatever ID the database assigned to the row when it was inserted.
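
To make the idea concrete, here is a minimal sketch of a directory table in which the database row ID is the only unique identifier and names may repeat. It is only an illustration: the table name `entries`, its columns, and the helper functions are assumptions, not the actual schema of the system described here.

```python
import sqlite3

# Hypothetical schema: the row id, not the name, identifies an entry.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE entries (
        id        INTEGER PRIMARY KEY,  -- the only unique identifier
        parent_id INTEGER,              -- NULL for top-level entries
        name      TEXT NOT NULL         -- deliberately not UNIQUE
    )
    """
)


def add_entry(parent_id, name):
    """Insert an entry and return the id the database assigned to it."""
    cur = conn.execute(
        "INSERT INTO entries (parent_id, name) VALUES (?, ?)",
        (parent_id, name),
    )
    return cur.lastrowid


def list_directory(parent_id):
    """Return (id, name) pairs for every entry directly under parent_id."""
    return conn.execute(
        "SELECT id, name FROM entries WHERE parent_id IS ? ORDER BY id",
        (parent_id,),
    ).fetchall()


root = add_entry(None, "library")
first = add_entry(root, "file.txt")
second = add_entry(root, "file.txt")  # same name, same directory, new id
listing = list_directory(root)
```

Listing the directory returns two rows that share a name but differ in ID, so a client would open a file by ID rather than by name.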

Yes, if they are different revisions of the same file as in git.

No, my system is fundamentally incapable of dealing with file mutability, so file history is meaningless as a concept.

What would be the problem solved by this?

While I understand why they exist, I've always thought the restrictions file systems impose on file names to be too arbitrary, and when I started to design the system I wanted to stay away from any decisions that would make the system unnecessarily inflexible. Actually, I think as data organization systems, file systems are wholly inadequate. Ooh, I can put related files in a directory. Whoop-de-doo!

What I'm working on is an image library system that lets you organize files both by metadata (e.g. author, date, contents) and by a user-defined directory tree (like a regular file system). So, you enter a directory and you see both subdirectories and files, represented as thumbnails.
Actually, after posting the OP I started to realize that file names for files are redundant. If you can already look at the contents in a good-enough preview, the file name adds no new information. At best it gives the system a hint on how to initially sort them, if order is important (e.g. the directory contains scans of pages from a publication).
Even for directories the file name can be made redundant to some extent, because a feature I will implement eventually is the ability to set a "cover" for the directory so you can scan the listing visually. Actually I have a screenshot I made a few years ago of what it might look like (I've been trying to scratch this itch for a good while): https://ibb.co/k31f8kY

It does make paths a bit unwieldy, but with any luck the user will never have to deal with them:
[1746197,"20200622.jpg",]
["test","1320885.gif",]
(The comma at the end is optional, but serializers always add them. Quotes distinguish file names from IDs. IDs allow referencing a deep directory directly, even if its location in the tree changes.)
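
A reader could parse paths of this shape roughly as follows. This is a sketch under stated assumptions: it treats the path as a JSON array once the optional trailing comma is dropped, and it assumes names are quoted and escaped the way JSON strings are. The function name `parse_path` and the `("id", ...)` / `("name", ...)` tuples are made up for the example; the real grammar may differ.

```python
import json
import re


def parse_path(text):
    """Parse a path such as [1746197,"20200622.jpg",] into tagged components."""
    # Drop the optional trailing comma so the remainder is valid JSON.
    cleaned = re.sub(r",\s*\]\s*$", "]", text.strip())
    components = json.loads(cleaned)
    if not isinstance(components, list):
        raise ValueError("a path must be a bracketed list")
    parsed = []
    for item in components:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(item, bool) or not isinstance(item, (int, str)):
            raise ValueError(f"unexpected path component: {item!r}")
        kind = "id" if isinstance(item, int) else "name"
        parsed.append((kind, item))
    return parsed
```

Unquoted integers come back tagged as IDs and quoted strings as names, so a numeric-looking file name such as "1320885.gif" stays a name.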

keskiverto (10365)

https://github.com/openstack/swift and https://ceph.io/ceph-storage/object-storage/ are object storage systems. I don't think they have any "directories" at all, just "objects".

I can't wrap my mind around the concept -- too stuck with filesystems -- but apparently they are for scaling up/out?

helios (17506)

My understanding is that object storage systems are distributed systems that treat the details of storing a piece of data (where it is physically, how it's stored, how to retrieve it, etc.) as a low level detail and instead present a unified interface for the entire system. So, you request object 42 from IP 10.11.12.13, but actually the object is assembled from data distributed over multiple machines, and later on the data might have been redistributed for whatever reason. They're distinct from distributed file systems in that they store structured data, not arrays of bytes.

My system is nowhere near that sophisticated. It's just an SQLite database with a couple of secondary databases to store large blobs, plus a few programs to access the data.

zapshe (1834)

I love Gantz! Anime ending was total BS though :(

One issue right off the bat is that many people already have many files with the same names. "READ ME", "Game.exe", "General.hpp" (Idk where that came from, but apparently I have two of them), etc.

This is not to mention the files that are exactly the same but located in different directories - like when you download something in your downloads folder and make a copy of it somewhere else. For me, I keep a backup of my files on my computer (so that making a backup from an external drive is easier), so I have many duplicate files.

Many times when looking for a file, there's very little you actually know about it aside from its name. I think that letting the user have files with the same name in a single directory would be chaotic.

In a server situation it may make more sense, since the system will probably not be arbitrarily looking for a file and will likely have the secondary information needed to differentiate.

helios (17506)

This is not to mention the files that are exactly the same but located in different directories

That's one of the annoyances I'm trying to solve. The system already keeps track of file identity by SHA-256 digest, so exact binary duplicates are impossible. If I try to add a file that's already "known", the system just links both file entries to the same data. In effect, all files that appear in the hierarchy are just links.
I plan to also implement approximate duplicate searching, but that's much more compute-intensive. I wrote about different methods I researched almost nine years ago: http://www.cplusplus.com/forum/lounge/47826/
(Like I said, I've been trying to scratch this itch for a while.)
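
The exact-duplicate handling described above can be sketched in a few lines. This is a toy in-memory version, not the real implementation: the class name `ContentStore` and its methods are hypothetical, and the actual system keeps this information in SQLite rather than in dictionaries.

```python
import hashlib


class ContentStore:
    """Toy store: blobs are keyed by SHA-256 and entries are links to blobs."""

    def __init__(self):
        self.blobs = {}    # hex digest -> bytes, each distinct content stored once
        self.entries = {}  # entry id -> (name, hex digest)
        self._next_id = 1

    def add_file(self, name, data):
        """Register a new entry; the data is stored only if it is not known yet."""
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)  # no-op when the digest already exists
        entry_id = self._next_id
        self._next_id += 1
        self.entries[entry_id] = (name, digest)
        return entry_id

    def read(self, entry_id):
        """Follow the link from an entry to its (possibly shared) data."""
        _, digest = self.entries[entry_id]
        return self.blobs[digest]


store = ContentStore()
a = store.add_file("photo.jpg", b"same bytes")
b = store.add_file("copy of photo.jpg", b"same bytes")  # links to existing data
c = store.add_file("photo.jpg", b"other bytes")         # same name, new content
```

Because the contents are assumed never to change, sharing one blob between several entries is safe: there is no copy that could later diverge.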

Many times when looking for a file, there's very little you actually know about it aside from its name.

Nah. If you know the name you already know plenty about the file. When people search things, like books in a library, they aren't searching by name, they're searching by content. Anyone can index a library by titles; it's called an inventory. The real difficulty of archival and indexing is finding efficient ways of finding objects by what they contain. An ideal index is one where you give a vague description of what you're looking for and it instantly brings you a handful of items that best match your description. The only practical approximation of this that I've found for images is a tagging system. The fundamental problem is that we have easy ways of entering text and of comparing text, but no easy ways of extracting representative text descriptions from image data. There are computer vision methods where you draw a doodle and the system searches for files that match the general shape, but I've found them to be of dubious effectiveness and I've never seen them demonstrated on large collections, not to mention some images contain just too much (or too little!) detail.
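
For readers who have not used one, a tagging system boils down to an inverted index from tags to file IDs. The sketch below is only an assumption about how such a lookup could work (the class `TagIndex` and its methods are invented for the example); it returns the files that carry every requested tag.

```python
from collections import defaultdict


class TagIndex:
    """Minimal inverted index: tag -> set of file ids."""

    def __init__(self):
        self._by_tag = defaultdict(set)

    def tag(self, file_id, *tags):
        """Attach one or more tags to a file (tags are case-insensitive)."""
        for t in tags:
            self._by_tag[t.lower()].add(file_id)

    def search(self, *tags):
        """Return the ids of files that carry every requested tag."""
        if not tags:
            return set()
        # .get avoids creating empty entries for unknown tags
        sets = [self._by_tag.get(t.lower(), set()) for t in tags]
        return set.intersection(*sets)


index = TagIndex()
index.tag(1, "cat", "outdoors")
index.tag(2, "cat", "indoors")
index.tag(3, "dog", "outdoors")
matches = index.search("Cat", "outdoors")
```

The hard part remains what the post says it is: someone still has to enter the tags, because nothing here extracts them from the image data.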

I think that letting the user have files with the same name in a single directory would be chaotic.

I think an organization system's purpose is to assist the user, not to impose any particular organization method on the user. There's no way to prevent someone from being disorganized, file names or no. Ideally you want a system that works so well that you don't even realize it's doing anything.

zapshe (1834)

That's one intense itch.

A tagging system can benefit from a thesaurus.

It may not be ideal, but you can have it so that images require unique names. The metadata from an image can only help if you're going to separate them by downloaded vs photographed and time taken.

I think an organization system's purpose is to assist the user, not to impose any particular organization method on the user.

There are limitations with that though. A good OS generally will not want the user to be able to hurt themselves. Macs take this to the extreme. It sounds like a neat idea, but I just can't find the practicability behind it.

helios (17506)

A tagging system can benefit from a thesaurus.

Meaning...?

It may not be ideal, but you can have it so that images require unique names.

And what would that solve? The system already has unique "names": the database IDs.

The metadata from an image can only help if you're going to separate them by downloaded vs photographed and time taken.

That's useless, since a lot of formats don't have metadata, and anyway EXIF only tells you about the context of an image (how, when, and where it was taken), not about the content, although occasionally people do fill the tag field, which doesn't help if the tags are irrelevant to you or if they don't fit into your system, which they almost certainly are and don't.

There are limitations with that though. A good OS generally will not want the user to be able to hurt themselves.

That's a fool's errand. It reminds me of an article I read a few days ago about an MMO type-thing Disney wanted in the 90s.
http://habitatchronicles.com/2007/03/the-untold-history-of-toontowns-speedchat-or-blockchattm-from-disney-finally-arrives/

"OK. That means Chat Is Out of HercWorld, there is absolutely no way to meet your standard without exorbitantly high moderation costs,"

"Couldn’t we do some kind of sentence constructor, with a limited vocabulary of safe words?"

"That won’t work. We tried it for KA-Worlds. We spent several weeks building a UI that used pop-downs to construct sentences, and only had completely harmless words – the standard parts of grammar and safe nouns like cars, animals, and objects in the world. We thought it was the perfect solution, until we set our first 14-year old boy down in front of it. Within minutes he’d created the following sentence:

I want to stick my long-necked Giraffe up your fluffy white bunny.

Then there's the story at the end, where users figured out a protocol involving moving furniture around to communicate secret codes so they could use unrestricted chats with each other.

My point being, the only way to keep people from hurting themselves with their tools is to give them tools so restricted that they're useless, like a styrofoam hammer, or a rounded screwdriver.

It sounds like a neat idea, but I just can't find the practicability behind it.

You're just too habituated to file systems to challenge their basic assumptions, like "files have names" or "files have exactly one parent".

zapshe (1834)

That's a fool's errand

It is, but there's a difference between trying to get rid of all ways a user can hurt themselves and giving a user a live grenade and letting them figure it out.

You're just too habituated to file systems to challenge their basic assumptions, like "files have names" or "files have exactly one parent".

Maybe.

helios (17506)

It is, but there's a difference between trying to get rid of all ways a user can hurt themselves and giving a user a live grenade and letting them figure it out.

That's an interesting viewpoint for someone who's a regular in a C++ forum to express. One would think a C++ programmer of all people would prefer having the option of a hand grenade to the safety of not having one.

I mean, if an adult is too stupid to not handle a grenade responsibly, they were probably going to eventually find some other retarded way to blow themselves up. Probably something involving gasoline and their own asshole, or something equally embarrassingly moronic.

zapshe (1834)

One would think a C++ programmer of all people would prefer having the option of a hand grenade to the safety of not having one.

It's not a one-way street. On my computer, I love to have all the bells and whistles, every grenade I can have. I hate Macs for this very reason. However, my phone is an iPhone. I like my phone to be simple, have my emails, texts, calls, and apps that I like. I don't wanna deal with anything "extra" with my phone - so an iPhone's lack of options compared to Android phones doesn't bother me (though I hate Apple as a company).

I mean, if an adult is too stupid to not handle a grenade responsibly, they were probably going to eventually find some other retarded way to blow themselves up. Probably something involving gasoline and their own asshole, or something equally embarrassingly moronic.

You know, there's a part of me that wishes this was true. However, you can easily bump into several people a day who ought to be labelled mentally retarded but are not only alive and in one piece, but somehow manage to hold down a job.

Your idea sounds cool but impractical, just because I don't see a need for it. If it ain't broke and all.

keskiverto (10365)

zapshe wrote:
I can't imagine myself using ...

zapshe wrote:
impractical, just because I don't see a need for it.

Teenagers tend to have that view, but they can grow.

helios (17506)

I don't see a need for it

Oh, well, that's a different matter altogether.
Yeah, almost nobody needs something like this. Most people nowadays just stream their content. They don't have multi-drive file servers to store all their bits, or large media collections. They've outsourced their need for organization to some web server out there.
If you've never had to organize -- and especially search through -- a large media collection, you can't know what problems need solving, what solutions would help, and what things don't matter.

If it ain't broke and all.

Well, that's just veering into insult territory. Do you really think I'd expend all this effort if I could just work around the problem using traditional file systems? I tried it. I tried making it work with symlinks, hardlinks, all that crap. File systems do not want to be used this way. When you start to do that you run into other problems, like "wait a second. How the hell do I back this up??", or "the file browser handles this particular type of link in a way that's almost pointless*", or "is this a link or is it the actual file? Am I breaking something if I delete or move this?"

* Man, I just remembered. A few years ago I'd written a hook for Explorer that would intercept certain file attribute calls and make it think it was looking at a file rather than a symlink. The bastard a) was not rendering thumbnails for symlinks, and b) when I tried to open the file it would give the associated program the path to the target, not to the link, so if I moved to the next file within the program it would display a completely different file than expected. So annoying.

Don't get me wrong. Like I've said, I understand why these dumb problems arise. File systems are trying to solve a different problem. What allows me to massively simplify my solution is my assumption that file contents never ever need to change. This is a valid assumption for an archive, but not for the basic infrastructure of a computer system. If you copy a file and modify the copy, you don't want the changes to be applied to both copies. If you start from the assumption that the file will never change, then you never need to make copies, only links. You can rethink the entire problem.

jonnin (11333)

This has a place.
Someone already mentioned version control, where the same file is kept with different contents under the same name. You can do that with an interface to a normal file system by adding a date or something to the names and stripping it off in the interface, or by poking the files in a database or something that isn't a real file system so the name does not matter; either one works.

It is perfectly doable with a normal file system, of course. A filesystem is typically (dumbed down here) just a lookup table of 'this is the file's name, and its data is on the disk at this address' using one scheme or another. If you can get to the file data at whatever disk address/location, you can get to that file. So instead of storing by name in the table, just have file 1 @ location 0, file 2 @ location 1000, whatever info, and the name becomes 'tag along' data that does not 'define' the file, but lets the user 'locate' it.

That all works, but then the user types ls and gets the same file name 5 times; how do they know which is which? That can be OK in unix, and to a much lesser extent windows, via extending the normal attribute flags or something (this one is the file that is executable, that one is the file that is code, see the execute flag?) ... point being that it's not an insurmountable task to build such a thing even on a conventional OS/disk setup.

The main issue here is user friendliness -- users can barely handle name.extensions and would have head melting freakouts if they had to deal with the same name with no extensions in a folder. And my idea would be limited to one file per attribute, or you need some other scheme to differentiate which one you want -- may have to allow a power of 2 attributes, and let the user define the majority of them, and probably at least a 32 bit marker here... again, not insurmountable, but are you really gonna have 1 million files each of a different 'attribute/marker' all with the same name in a folder?

It would also blow up conventional tools -- zip tools and other such things may not be able to handle this and would need a re-write, as would your file search tools etc. It's a total do-over of a lot of things to build a system that would drive humans nuts trying to use it (a computer using such a thing may have some merits, though).

helios (17506)

users can barely handle name.extensions

I don't agree with this. A user typically only cares about the general type of file they're looking at. Is this text? Is it video? Is it audio? Is it an image? Something else? I'm starting from the assumption that all the files I'll deal with are images, so extensions (which honestly are a bit of a hack anyway) are redundant. The system can keep track of the particular file format, determined by parsing the file, and use it when necessary without the user being concerned with it.
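
As a rough illustration of identifying a format from the file itself rather than from its extension, the sketch below checks the leading "magic" bytes of a few common image formats. It is a simplification and an assumption about approach: the function name `sniff_format` is made up, and a real system would parse further than the first few bytes.

```python
# Leading byte signatures of a few common image formats.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"BM", "bmp"),
]


def sniff_format(data):
    """Guess the image format from the first bytes; return None if unknown."""
    for magic, fmt in SIGNATURES:
        if data.startswith(magic):
            return fmt
    return None
```

The detected format can then be stored next to the entry, so a file named "holiday" with no extension is still opened with the right decoder.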

It is perfectly doable with a normal file system, of course. [...]

You went way off course after this. The problem I'm trying to solve is how to efficiently (in human terms) organize and search for data. Yes, I could figure out some utterly contrived way to somehow make it work "seamlessly" with existing file management tools, but why would I want to do that? All I need is some program or combination of programs that let me manage and access the data. Whether that program is the one provided by the system or something else, I don't care, so if interfacing directly with existing facilities just creates more problems then there's no point in doing it. I'll just write my own storage system, my own file browser, my own file viewer, and so on, which is what I've done. Was it a lot of work? You bet your ass, and I'm still not done! But by doing it like this I don't have to resort to weird hacks and workarounds that may not even work in the future. My only constraint is what I can imagine.
