This repository has been archived by the owner on Feb 8, 2023. It is now read-only.

if the Permanent Web is “content-addressable”, could it be designed so that each file has only one address? #126

Open
Mithgol opened this issue Apr 15, 2016 · 9 comments

Mithgol commented Apr 15, 2016

Recently at ipfs/kubo#875 (comment) I have once again encountered the following fact: 

  • one file might have several different IPFS hashes,
  • if several files have different IPFS hashes, then these files still might be the same (content-wise),
  • the result of ipfs cat $HASH | ipfs add is not always the original hash

(it's all the same fact, just rephrased). Apparently there are several factors (encoding, sharding, IPLD) that influence the IPFS hash and make it different even if the file's contents are not different at all.
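A toy sketch of why this happens (plain Python, not the real IPFS chunkers or dag format): the hash of the raw bytes is stable no matter what, but a merkledag-style root hash depends on the chunk size that was chosen at `ipfs add` time.

```python
import hashlib

def content_hash(data: bytes) -> str:
    # Hash of the raw bytes: deterministic regardless of chunking.
    return hashlib.sha256(data).hexdigest()

def merkle_root(data: bytes, chunk_size: int) -> str:
    # Toy stand-in for a merkledag root: the hash of the chunk hashes.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    chunk_hashes = b"".join(hashlib.sha256(c).digest() for c in chunks)
    return hashlib.sha256(chunk_hashes).hexdigest()

data = b"the same file contents " * 1000
# Same content, same content hash...
assert content_hash(data) == content_hash(data)
# ...but a different chunking yields a different "address".
assert merkle_root(data, 256) != merkle_root(data, 1024)
```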

I have then found myself questioning whether the current IPFS URIs (designed in ipfs/kubo#1678) are useful in hyperlinks of the Permanent Web. After all, if a hyperlink references an IPFS hash, then that hyperlink by design becomes broken (forever) once that hash can no longer be used to retrieve the file. Even if someone somewhere discovers such lost file in some offline archive and decides to upload that file (or the whole archive) to the Permanent Web, the file is likely to yield a different IPFS hash and thus an old hyperlink (which references the original IPFS hash) is still doomed to remain broken forever. Such behaviour is not okay for the Permanent Web.

What can be done to improve it?

After a few minutes of hectic googling I've encountered @btrask's design of “Common content address URI format” which uses URIs such as hash://sha256/98493caa8b37eaa26343bbf7 that are based on cryptographic hashes of the addressed content. As long as the hash (“sha256”) stays the same, each file has only one address.

In addition to its main advantage (the improved immutability of the addresses), it also has a couple of additional advantages:

  • Different content-addressable P2P-distributed storages can use mutually compatible URI.
  • Well-known cryptographic tools can be used to calculate hashes (and subsequently generate addresses) of files even if an implementation of a specific P2P-distributed storage is not readily available. For example, consider SubtleCrypto.digest available in some Web browsers before JS IPFS is completed.
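As a sketch of that second point, an address in @btrask's format can be produced with nothing but a standard-library hash implementation (plain Python here, standing in for `SubtleCrypto.digest`):

```python
import hashlib

def hash_uri(data: bytes, algo: str = "sha256") -> str:
    # Build an address in the hash://algorithm/hashString format
    # from nothing but the content and a well-known hash function.
    digest = hashlib.new(algo, data).hexdigest()
    return f"hash://{algo}/{digest}"

print(hash_uri(b"hello world\n"))
```

No P2P implementation needs to be running: the address is fully determined by the bytes.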

Therefore here's a proposal: implement such addressing on top of IPFS to ensure that each file has only one address (minor correction: “only one” until multihash is upgraded from sha256 to another algorithm and changes the address inevitably), an address that is determined only by the cryptographic hash of the file's content.

As an address, @btrask's scheme of hash://algorithm/hashString is too long and also not similar to the other IPFS addresses. I propose the form /ipmh/hashString, where hashString is a base58-encoded multihash of the file's content (not of the file's merkledag!) and ipmh means “InterPlanetary Multihash”. It's better to refrain from the idea of /iphs/ (“InterPlanetary Hash System”) because iphs and ipns are visually alike (their likeness might cause perception errors in OCR and human vision).
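A minimal sketch of how such an /ipmh/ address could be derived (hand-rolled base58btc; the 0x12/0x20 prefix is the registered multihash code and digest length for sha2-256):

```python
import hashlib

BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58_encode(data: bytes) -> str:
    n = int.from_bytes(data, "big")
    out = ""
    while n > 0:
        n, r = divmod(n, 58)
        out = BASE58_ALPHABET[r] + out
    # Preserve leading zero bytes as leading '1' characters.
    return "1" * (len(data) - len(data.lstrip(b"\x00"))) + out

def ipmh_address(content: bytes) -> str:
    # Multihash framing: 0x12 = sha2-256 code, 0x20 = 32-byte digest length.
    multihash = bytes([0x12, 0x20]) + hashlib.sha256(content).digest()
    return "/ipmh/" + base58_encode(multihash)
```

Because of the fixed 0x12/0x20 prefix, every such address starts with /ipmh/Qm, just like today's sha256-based IPFS hashes start with Qm.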

I am certain that an implementation won't be an easy task and would need at least the following:

  • some DHT to track a multitude of IPFS hashes that correspond to cryptographic IPMH hashes
    • Note 1. These are currently sha256 hashes. Such (or similar) DHT would also eventually be necessary to find new (upgraded) multihashes that correspond to the current sha256 multihashes.
    • Note 2. If a system starts with /ipfs/ address which it cannot resolve (the “forever dead hyperlink” case, discussed above), it should try using DHT backwards (to find an IPMH for such IPFS) and then use IPMH to look for equivalent IPFS hashes (where “equivalent” means that they designate the same content as the original IPFS).
  • changes in ipfs add to ensure that /ipmh/ addresses are issued by default
  • changes in ipfs get and ipfs cat to ensure that /ipmh/ addresses can be used to retrieve files
  • changes in ipfs mount to ensure that /ipmh mountpoint is mounted
  • similar changes in other commands
  • changes in the main gate to ensure that https://ipfs.io/ipmh/ addresses are served
  • similar changes in the local gates listening on /ip4/127.0.0.1/tcp/8080
  • changes in Firefox addon and Chrome extension to redirect https://ipfs.io/ipmh/ addresses (and, optionally, also @btrask's hash://sha256/ addresses).
    • Here “optionally” means that hash:// (unlike ipmh:// or https://ipfs.io/ipmh/) is not necessarily IPFS-related and thus the user might want another application (such as StrongLink) to handle it. (Such ambiguity is similar to the case of magnet: hyperlinks.)
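A rough sketch of what the first bullet's DHT mapping might look like, modeled here as an in-memory table (the class and method names are hypothetical, not an existing IPFS API):

```python
from collections import defaultdict

class HashMappingDHT:
    """Toy in-memory stand-in for the proposed IPMH <-> IPFS mapping DHT."""

    def __init__(self):
        self.ipmh_to_ipfs = defaultdict(set)  # one content hash, many dag hashes
        self.ipfs_to_ipmh = {}                # each dag hash maps back to one IPMH

    def publish(self, ipmh: str, ipfs_hash: str) -> None:
        self.ipmh_to_ipfs[ipmh].add(ipfs_hash)
        self.ipfs_to_ipmh[ipfs_hash] = ipmh

    def resolve(self, ipmh: str) -> set:
        # Forward lookup: all known IPFS hashes for this content.
        return set(self.ipmh_to_ipfs[ipmh])

    def recover(self, dead_ipfs_hash: str) -> set:
        # The "backwards" lookup from Note 2: find the IPMH for a dead
        # /ipfs/ address, then return equivalent (possibly live) hashes.
        ipmh = self.ipfs_to_ipmh.get(dead_ipfs_hash)
        if ipmh is None:
            return set()
        return self.ipmh_to_ipfs[ipmh] - {dead_ipfs_hash}
```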

However, it really seems that there's no other way to make the Permanent Web more permanent, to prevent dead hyperlinks from staying dead.

(Everything that is said here about the files can probably be also said about IPFS directory structures; but I am not sure.)

@MichaelMure

I don't really see the problem. In my understanding, here is what could alter the final hash of a file:

  • chunking: this is how the file is broken down into chunks. Currently there are two chunkers: a basic fixed-size chunker and the Rabin chunker. The latter is designed to find common chunks in two slightly different files for better de-duplication.
  • graph layout: this is how blocks are gathered together, possibly in several layers. Currently there are two layouts available: a balanced layout and a trickle layout designed for streaming. This affects the final hash, but not the hash of each chunk.
  • hashing: the hash function used to generate the hash. It's likely that no more than 2 or 3 hash functions will co-exist, since the default will change only if the crypto gets broken. Note that if #875 is implemented, a file could be published with several hash functions at the same time without too much overhead.
  • encoding: if the file's encoding changes, it's effectively another file and it needs another hash.
  • IPLD: not sure here, but I think it will only alter the metadata block. After the change to IPLD, old blocks will cease to exist, so it won't be a problem for long.

All in all, I don't see this being too much of a problem.
At the same time, if you only use the file content to derive the final hash, you will need infrastructure to map this hash to an actual merkledag hash. You will also need to decide which merkledag you want (rabin+trickle? balanced+fixed-size?), and that combination might not even exist in the network.

Actually, the biggest problem would be that you can check the file's content hash only once you have all the blocks on disk, rather than checking each block individually.
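That last point can be sketched like this (an assumed helper, not an IPFS API): with a flat content hash, each arriving block can feed an incremental digest, but nothing can be rejected until the whole file has arrived.

```python
import hashlib

def verify_whole_file(chunks, expected_hex: str) -> bool:
    # The digest is fed incrementally as chunks arrive, but a corrupt
    # chunk is only detected after the *last* chunk has been hashed.
    # A merkledag, by contrast, lets each block be verified on its own.
    h = hashlib.sha256()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest() == expected_hex
```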

@hackergrrl

Hey @Mithgol. I totally grok your line of reasoning, but I also believe IPFS offers exactly what you're proposing already plus the natural next step.

Therefore here's a proposal: implement such addressing on top of IPFS to ensure that each file has only one address (minor correction: “only one” until multihash is upgraded from sha256 to another algorithm and changes the address inevitably), an address that is determined only by the cryptographic hash of the file's content.

You can already do this:

$ echo "hello world" | ipfs block put
QmZjTnYw2TFhn9Nn7tjmPSoTBoY7YRkwPzwSrSbabY24Kp

$ ipfs block get QmZjTnYw2TFhn9Nn7tjmPSoTBoY7YRkwPzwSrSbabY24Kp | xxd
00000000: 6865 6c6c 6f20 776f 726c 640a            hello world.

At the block level, objects are exactly the multihash of their content. You can totally 100% use IPFS this way. Things like IPFS protobuf objects, IPLD, and chunkers are simply serialization formats on top of blocks. They are something you can opt out of, and you can happily use raw blocks wherever you'd like.

As you work this way (with raw blocks) you may begin to lament the lack of more powerful linking data structures (like IPLD for linked data) and the advantages that different chunkers bring. How could you add these without modifying the hash? Say I had sintel.mp4 and wanted to do chunking underneath but keep its original hash.

It's not a crazy idea: it'd be really nice if QmSiiwxEyyuZggNX7aVQzekoWuDjAWB61dxqQ2wWiGLqSq referred to sintel.mp4 regardless of how it was chunked or what links it had! But.. how do you choose which chunked version you'd like? You probably want a trickledag chunker if you're streaming the video, but maybe just the whole thing as a raw block is also acceptable. How would you specify to the network which chunking mechanism you want your video backed by? And how about linked data? If we referenced only the raw video data how could we get integrity checking on its outgoing links?

The natural result: you begin to hash the linked data + chunks instead of the raw data, and refer to that by its hash, which is exactly where IPFS is today.
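A toy illustration of that endpoint (not the real dag-pb encoding): once links are part of the hashed node, the address commits to both the content and its outgoing references.

```python
import hashlib
import json

def node_hash(data: bytes, links: list) -> str:
    # Toy merkledag node: the identity covers payload *and* links,
    # so integrity-checking extends to outgoing references. Changing
    # either the data or a link changes the address.
    node = {"data": data.hex(), "links": links}
    encoded = json.dumps(node, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()

plain = node_hash(b"video bytes", [])
linked = node_hash(b"video bytes", ["QmSomeSubtitleTrack"])
assert plain != linked  # same raw data, different merkledag identity
```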

btrask commented Apr 16, 2016

First of all, many thanks @Mithgol for clearly articulating this problem!

To be clear, I have no desire or expectation for IPFS to switch to my hash URI scheme. IPFS is focused on paths (and the file system model in general), and regardless of my personal preferences, I think it's a valid idea worth trying. And, after all, the particular string format doesn't really matter (as long as programs can understand it).

What does matter, of course, is the way hashes are computed. I'm afraid I've been a constant thorn in @jbenet's side on this issue (sorry!). See our back and forth ipfs/kubo#1953 and the resolution we eventually came to #89.

Recently I've been working on a very simple project called the Hash Archive (https://hash-archive.org) (BTW it's still unstable, so please don't spread that link too far...) that builds a mapping of hashes to web URLs. I realized just the other day that the exact same system could be extended to IPFS (and, incidentally, BitTorrent). So while IPFS continues to focus on its own merkle-DAG hashes (for the reasons @noffle explains), it may eventually be possible to build a mapping system of file hashes on top of it.

To me, this seems to fit with IPFS's desire to be an infrastructure component. Things can and will be built on top of it to make it more user-friendly and interoperable.

That said, I would suggest being careful about embedding IPFS paths in files intended to be "permanent," since those particular hashes may be long gone, even if IPFS is still around.

btrask commented Apr 16, 2016

Also, BTW, I asked @jbenet about using raw IPFS blocks to store files the way @noffle mentions, but he pointed out that blocks have an (IIRC) 4MB limit. So while I wish that option would work, it doesn't really.

jefft0 commented Apr 16, 2016

But using a hash of the raw file would work for a great number of cases where the files are small enough, and would be a good convention to support even though it doesn't solve every case. For example, most source code files are small enough to link by the raw hash (for an IPFS code repository system).

@daviddias daviddias added the IPLD label Aug 28, 2017
mitra42 commented Sep 14, 2017

Excellent discussion, and timely for our work on making files from the Internet Archive collections available in a world where IPLD is non-deterministic (same file, different IPLDs).

Kubuxu commented Sep 19, 2017

@jefft0 this is an option: ipfs add --raw-leaves will use the straight-up hash if the file is smaller than 256 KiB.

@MarkusTeufelberger

This /ipmh/ approach assumes that current cryptographic hashes are never broken, though...
hash://sha1/38762cf7f55934b34d179ae6a4c80cadccbb7f0a, for example, no longer points to just a single file since https://shattered.it, and sooner or later the same fate might hit other algorithms.

Maybe there should be support for an arbitrary number of hashStrings, in arbitrary order, after /ipmh/? This could still result in some wrong data being downloaded (e.g. if you download files by looking up a CRC32 of 0x12345678), but it would at least be possible to know which hash was useless. You'd know for each piece which IPFS hash was used to download it, and for each IPFS hash which multihash corresponded to it. That way you'd quickly learn that you got a file with the correct CRC32 but an incorrect SHA3, for example; in that case SHA3 is more exact than CRC32. This means you drop all CRC32-derived IPFS hashes that are not also referenced by other, still-trusted hash functions, unless none of those yield any hits.
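The cross-checking described above can be sketched as follows (CRC32 stands in for a broken hash; the function names are made up for illustration):

```python
import hashlib
import zlib

def digests(data: bytes) -> dict:
    # Compute both a weak and a strong hash of the same content.
    return {
        "crc32": format(zlib.crc32(data), "08x"),
        "sha3-256": hashlib.sha3_256(data).hexdigest(),
    }

def cross_check(data: bytes, claimed: dict) -> dict:
    # For each claimed hash, report whether the downloaded bytes match,
    # so a weak-hash collision is exposed by the stronger function.
    actual = digests(data)
    return {algo: actual.get(algo) == digest for algo, digest in claimed.items()}
```

A file that satisfies the CRC32 but fails the SHA3 check would be flagged, and its IPFS hash could then be dropped from the mapping.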

The issue remains that there needs to be a location where the Multihash --> [IPFS hashes] mappings are stored, and it needs to be world-writable and world-readable. Ideally there must also be some way to ensure that vandalism (inserting bogus hashes) or Sybil attacks ("I'm 5000 users and we all think this bogus hash is valid!") are not possible, or at least not rewarded, without making it necessary to have a central "blessed" authority that downloads data 24/7 (something which might very well be illegal) and verifies hashes to resolve conflicts or hand out certificates/signatures.
This already makes the very first bullet point ("some DHT...") quite interesting to reason about.

mitra42 commented Sep 29, 2017

One part of this solution, which I've been suggesting, is that the IPLD produced by IPFS should also contain the hash of the content when it is known. That way someone who built up a document from a list of shards would know whether they had the right final result.

Of course, it would be great if there was a canonical IPLD spec, so we'd all know where to put this hash but I'm told that isn't part of the plan. :-(
