Copyrights and intent

If I use cURL to download a webpage and save it to disk solely to compare it with that same webpage one hour later, so that I can tell the user whether something has changed (a new post on this website's forums, for instance, or new books available on gutenberg.org), is that going to run into copyright issues?

I know that making a copy of a text/website for redistribution to others is illegal, but how about making a copy of that same text/website in which the user is not able to access anything except for the changes since the last download (in which only the machine has any knowledge of the full text)? Or perhaps only for the purpose of telling the user to go to that website to view the changes?

Would it magically become legal if I said that the content shown to the user is the sole property of the domain the download came from (like a lot of YouTubers do when they make their fan videos)?

If you just need to know if a page has changed and want to avoid legal headaches, consider this. There exist mathematical functions f() for which, given any two bit strings S1 and S2, the following statements hold:

* if S1 == S2 then f(S1) == f(S2)
* if f(S1) != f(S2) then S1 != S2
* there exist some pairs of bit strings T1, T2 such that T1 != T2 and f(T1) == f(T2)
* the output of f() is a constant-sized bit string

If f() has all these properties, f() is known as a "one-way compression function", or "hash function". Common hash functions useful for identifying files are: MD5, SHA-1, SHA-256.
You can download the file, apply the hash function to it in memory, and just save the output of the function. Most hash function implementations even have a stream-like interface, allowing you to hash bit strings larger than memory by processing the input one chunk at a time, so you don't even need to hold the entire page in memory at once to compute the hash.
When you want to see if the page changed, you download and hash again. If the hash is different, you know the page definitely changed. If the hash is the same, there's a very, very small chance that it has changed (e.g. for SHA-256, lower than 1 in 10^77).
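
A minimal sketch of that download-hash-compare loop, using only the Python standard library. The function names, the 64 KiB chunk size, and the idea of keeping the digest in a small text file are assumptions made for illustration; nothing here is specific to cURL.

```python
import hashlib
import urllib.request
from pathlib import Path


def sha256_of_chunks(chunks):
    """Feed an iterable of byte chunks to SHA-256 and return the hex digest."""
    hasher = hashlib.sha256()
    for chunk in chunks:
        hasher.update(chunk)
    return hasher.hexdigest()


def read_chunks(url, chunk_size=64 * 1024):
    """Yield the response body piece by piece so the page is never held whole."""
    with urllib.request.urlopen(url) as response:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            yield chunk


def page_changed(url, digest_file):
    """Hash the page, compare with the saved digest, then save the new digest.

    Returns None on the first run, when there is nothing to compare against.
    (digest_file is a hypothetical path chosen by the caller.)
    """
    path = Path(digest_file)
    new_digest = sha256_of_chunks(read_chunks(url))
    old_digest = path.read_text().strip() if path.exists() else None
    path.write_text(new_digest)
    if old_digest is None:
        return None
    return new_digest != old_digest
```

Only the 64-character digest ever reaches the disk. One caveat: pages that embed timestamps, ads, or session tokens will produce a different hash on every fetch even when nothing of interest has changed.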

EDIT: To put the number 10^77 in perspective, if I had a random 256-bit number and asked you to guess it, and you tried to guess 256-bit numbers at random at a rate of 1 per Planck time (essentially the "tick" of the universe, or 10^-44 seconds) it would take you 1.3 quadrillion times the current age of the universe to have a 50% chance of guessing.
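
The estimate can be reproduced with a few lines of arithmetic. This is a rough sketch under stated assumptions: the age of the universe is taken as about 13.8 billion years, the tick is the rounded 10^-44 seconds used above, and a 50% chance is treated as covering half of the 2^256 possible values without repeating a guess.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_S = 13.8e9 * SECONDS_PER_YEAR  # assumed: ~13.8 billion years
TICK_S = 1e-44                                  # rounded Planck time, as above

# Half of the 2**256 possible values gives even odds of a hit.
guesses_for_even_odds = 2 ** 255
seconds_needed = guesses_for_even_odds * TICK_S
universe_ages = seconds_needed / AGE_OF_UNIVERSE_S

print(f"{universe_ages:.2e} times the age of the universe")
```

The result lands at roughly 1.3 x 10^15, i.e. about 1.3 quadrillion.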
In most (if not all) countries, there is no copyright violation if web content is temporarily stored in a web cache.

For instance: "In 1998, the DMCA added rules to the United States Code (17 U.S.C. § 512) that relinquishes system operators from copyright liability for the purposes of caching."
https://en.wikipedia.org/wiki/Web_cache#Legal_issues
closed account (48T7M4Gy)
Interesting.

http://fairuse.stanford.edu/overview/website-permissions/websites/

You might or you might not be infringing the owner's copyright. The uncertainty of that answer is what makes good court cases. (At least for the lawyers.)

If you are storing the material on your cache then probably you won't get caught, so who cares.

If it's fair use then there is probably no infringement.

But, by announcing specific changes to the world, if that's what is happening, then it depends. If it's a revised URL, there's probably no infringement. If it's detailed word-for-word changes beyond fair use quotation, it probably amounts to redistribution of the owner's material, and if that is against their terms of use then watch out. Newspaper bloggers know all about that.

So it depends what you do with the cached material - quoting it verbatim would be perilous as you already stated.

All of this would also be tempered by how many dollars are involved. If your 'product' is a large money spinner or adversely affects the copyright owner's pocket and market then expect a knock on the door, fair use or not. NewsCorp have deep pockets. :)

PS I'll probably get a letter from Stanford.
Thanks kemort, that's definitely my problem: the more convenient I make the program for the user, the more likely it is to cause some infringement. I can't simply store it in my program's memory cache (in RAM, which I think is actually considered legal), as I would need a more permanent copy for comparison after the device is turned off and on again. It's always a programmer's dream to make a cool program that earns them a million dollars, so of course I want to protect my hypothetical money before I become too devoted to the project.

I like helios's idea of using a hash function. I haven't done much as far as compression goes, but it is an area I've always wanted to play with. This sounds like the best way to avoid copyright infringement altogether. It just feels so limiting for the program: I wouldn't even be able to tell the user how many new posts were available.

As for 17 U.S.C. § 512, JLBorges: the way I was going to go about it would definitely have modified the source (though now that I think about it, a couple of changes to my order of operations would fix that), and it would not have been stored on the computer by a third party. I think the main reason this wouldn't work for me is that it protects server-side and not client-side programs. I will look further into these "safe harbors"; that's just what I got after a short perusal of https://en.wikipedia.org/wiki/Online_Copyright_Infringement_Liability_Limitation_Act

Perhaps there is some loophole if I could say that the program is a browser. Technically I would not have foreknowledge of which URLs the user types in for monitoring. I do know that Lynx (a text-only browser for Linux) allows the user to download any page they are currently viewing by hitting 'd'. How is that legal? Even Firefox, and I think all major browsers, have a save-page option that saves a permanent copy. What's their loophole? Any guesses there?
I would guess that making a program capable of performing functions that could in certain circumstances result in breaking the law would not itself be illegal; it would only be illegal for the user to use it in those circumstances. For example, making a BitTorrent client is not illegal, but using one to download copyrighted films is.
Helios, that makes sense, though I don't like the idea of shifting the blame down the line to the user. I guess the license agreement shown when you first launch a program would be where I'd put full disclosure that a hard copy is saved to the user's device, along with a warning to beware of copyright infringement. Yuck.