Suggestion for PTIO web mirroring / archiving service

Would it be possible for PTIO to offer a mirroring service like https://archive.today? (You will be redirected to one of their other domains.)

On r/privacy we rely on mirroring services to post articles that are worth reading but are hosted on platforms that have trackers and/or soft paywalls.

Archive.today is a great archiving tool that strips out tracking scripts. The problem with archive.today is that it relies on Google ReCAPTCHA to keep automated abuse of the service at bay. Using anything but a Chromium browser in combination with a VPN or Tor means a high risk of running into the ReCAPTCHA hurdle.

Their FAQ is limited and makes no mention of ReCAPTCHA, but it does have this to say about privacy:

Do you preserve archivers’ privacy? E.g. not disclose the source IP address?

Yes.

But take in mind that when you archive a page, your IP is being sent to the website you archive as though you are using a proxy (in X-Forwarded-For header). This feature allows websites (e.g shops or the sites with weather forecast) target your region, not mine.
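To make the quoted behaviour concrete, here is a minimal sketch of how a proxy-style fetcher could pass the archiver's IP along in an X-Forwarded-For header. This is only an illustration and not archive.today's actual code; the function name `build_archive_request` and the example IP are made up.

```python
from urllib.request import Request


def build_archive_request(target_url: str, archiver_ip: str) -> Request:
    """Build (but do not send) a request that forwards the archiver's IP.

    Hypothetical helper: the target site sees the archiving server's own
    address on the connection, plus the archiver's IP in X-Forwarded-For,
    which it may use for things like region targeting.
    """
    return Request(target_url, headers={"X-Forwarded-For": archiver_ip})


# 203.0.113.7 is a documentation-only address (TEST-NET-3).
request = build_archive_request("https://example.com/", "203.0.113.7")
print(request.header_items())
```

In other words, the target website does get to see the IP of whoever requested the archive, even if that IP is never shown on the archived page.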

To deal with the problem of automated abuse, I figure the PTIO-version could have users sign in through Reddit, and only accept accounts that are older than a month or so, perhaps with a karma-threshold as well. (I know signing in with Reddit is possible, but I’m having trouble finding any official Reddit page about this feature.)
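Roughly, the eligibility rule could look like the sketch below. It assumes the sign-in step gives us the account's creation time as a Unix timestamp and a karma total; the threshold values and the name `is_eligible` are placeholders I made up, not anything Reddit or PTIO defines.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Placeholder thresholds: "older than a month or so" plus some karma minimum.
MIN_ACCOUNT_AGE = timedelta(days=30)
MIN_KARMA = 100


def is_eligible(created_utc: float, karma: int,
                now: Optional[datetime] = None) -> bool:
    """Return True if the account is old enough and has enough karma."""
    if now is None:
        now = datetime.now(timezone.utc)
    created = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    old_enough = (now - created) >= MIN_ACCOUNT_AGE
    return old_enough and karma >= MIN_KARMA
```

Both thresholds would need tuning: too low and bots get through, too high and legitimate new users are locked out.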

AFAIK there’s no data-sharing with Reddit, only authentication, and the access can be set to time out after an hour.

In the interest of limiting the amount of work it would take, perhaps archive.today would be open to collaborating on a modified instance of their service: an instance hosted, modified, vetted and endorsed by PTIO.

Thanks for reading. I’m looking forward to hearing your thoughts.


Do you know of any FOSS software that could be run for this purpose? I can only think of the IPFS Companion browser extension together with IPFS Desktop or Go-IPFS having this feature, although https://ipfs.io/ipfs/QmVYq4GjAXb6ya1RmV4a7J3AKBc9gWEh5NJKh8nCFBLuYs (a snapshot of the PrivacyTools homepage) doesn’t seem to work very well.

I’m also not sure there would be resources for one, considering how heavy @jonah said Matrix media storage (or something like it) was a few days ago, but I haven’t come across statistics on any service like that.

What about people who don’t have a Reddit account?


Not my area of expertise, unfortunately. Hopefully someone on this forum will be able to answer that.

Good question, and not one that I personally have an answer to right now. Ideally, we could figure out a way together to address that problem without having to rely on Reddit at all.

I don’t know if Discourse has such a feature? If it does, it would allow the service to rely on accounts of this forum, for example.


Interesting idea, but we would have to figure some things out first:

What do we do if someone requests something to be removed?
How much disk space would this take? (Do we also save files and videos, or only pictures and pure text?)
And, as pointed out above, authentication.


Another thing to keep in mind is maintenance: Sites change, which means the parsers need to be kept up to date. That’s why I suggested a collaboration; they’ve been doing the labor for years and years already.

Please:

  • is there any reason to not also use the Internet Archive Wayback Machine?

https://web.archive.org/web/20190719063712/https://old.reddit.com/r/savefrom/comments/c7rmso/savefrom_has_been_created/ for example – neither Firefox 73.0.1 (with strict protection), nor Malwarebytes Browser Guard, finds any tracker.


I do use the Wayback Machine as a fallback. There are three differences as far as I know:

  • Wayback uses their own servers to retrieve content, whereas Archive.today uses a user’s own IP to ensure that content that is geo-specific to the archiver is archived.
  • Wayback respects robots.txt, so it doesn’t always work.
  • Wayback doesn’t work around soft paywalls, so more often than not, you get a copy of a paywalled page.
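For anyone wondering what "respects robots.txt" means in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not the Wayback Machine's actual logic; the robots.txt content, the `archiver-bot` user agent and the URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt that blocks all crawlers from one section of a site.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /premium/
"""


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for url in ("https://example.com/news/story",
            "https://example.com/premium/story"):
    print(url, allowed_by_robots(EXAMPLE_ROBOTS, "archiver-bot", url))
```

An archiver that honours these rules would skip the second URL, which is why a robots.txt-respecting service can't capture every page.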