Private Set Intersection for Web Content Anonymization

How multiple parties can collaborate to anonymize web content that's interspersed with PII.

#crawling, #scraping, #archiving, #digipres, #anonymization, #sanitization, #journalism

See https://docs.sweeting.me/s/cookie-dilemma for background context.

HTML-Based Intersection

[Figure: HTML capture A + HTML capture B ➡️ redacted intersection]

Image-Based Intersection

[Figure: image capture A + image capture B ➡️ redacted intersection]

Pre-Requisites

Two nodes that have each independently archived the same page (e.g. a Facebook post) while logged in with their own accounts, ideally at around the same time and with the same browser, language, font settings, and light/dark mode.

Both WARC/HTML/PNG captures contain the main content they were trying to capture (photos, comments, etc.), but it's mixed in with unsharable PII specific to their individual accounts (e.g. their name, profile picture, email, session tokens, private notifications, recent DMs, etc.).

The Goal

The client should produce a final, modified version of the content that contains only the intersection of the bytes that both nodes share, without ever revealing its cleartext to the server. The final copy should effectively be "anonymized" because it excludes any bytes that are specific to either user (e.g. their PII). The client should be able to repeat this process with multiple servers to further anonymize the content, becoming increasingly confident that nothing in it will reveal their identity (hopefully) or their cookies/auth tokens (definitely).
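To make the idea of intersecting without revealing cleartext more concrete, here is a minimal sketch of one classic construction, Diffie-Hellman-style PSI. This is an assumption for illustration only: nothing here confirms which protocol `psi.js` actually implements. The modulus, the helper names (`hashToGroup`, `blind`, `clientIntersection`), and the line-sized chunks are all hypothetical. Both parties are simulated in one process; in practice only the blinded values would cross the network.

```javascript
// Toy Diffie-Hellman-style PSI. Illustration only: not confirmed to be the
// protocol psi.js uses, and the group parameters are NOT vetted for real use.
const { createHash, randomBytes } = require('node:crypto');

// Assumption: 2^255 - 19, picked only because it is a well-known prime.
const P = (1n << 255n) - 19n;

// Square-and-multiply modular exponentiation on BigInts.
function modPow(base, exp, mod) {
  let result = 1n;
  base %= mod;
  while (exp > 0n) {
    if (exp & 1n) result = (result * base) % mod;
    base = (base * base) % mod;
    exp >>= 1n;
  }
  return result;
}

// Hash a chunk of content to a number mod P (squared, so it is a quadratic residue).
function hashToGroup(chunk) {
  const digest = createHash('sha256').update(chunk).digest('hex');
  const h = BigInt('0x' + digest) % P;
  return (h * h) % P;
}

// Random secret exponent in [1, P - 2].
const randomSecret = () =>
  (BigInt('0x' + randomBytes(32).toString('hex')) % (P - 2n)) + 1n;

// Raise every value to a secret exponent.
const blind = (values, secret) => values.map((v) => modPow(v, secret, P));

function clientIntersection(clientChunks, serverChunks) {
  const a = randomSecret(); // known only to the client
  const b = randomSecret(); // known only to the server

  // Client -> server: H(x)^a for each client chunk.
  const clientBlinded = blind(clientChunks.map(hashToGroup), a);
  // Server -> client: (H(x)^a)^b in the same order, plus H(y)^b for its own chunks.
  const clientDouble = blind(clientBlinded, b);
  const serverBlinded = blind(serverChunks.map(hashToGroup), b);
  // Client: compute (H(y)^b)^a and keep its own chunks whose values match.
  const serverDouble = new Set(blind(serverBlinded, a).map(String));
  return clientChunks.filter((_, i) => serverDouble.has(String(clientDouble[i])));
}

const shared = clientIntersection(
  ['<p>shared post</p>', 'Hello Alice', 'token=abc'],
  ['<p>shared post</p>', 'Hello Bob', 'token=xyz'],
);
console.log(shared);
```

Because exponentiation commutes, `H(x)^(ab)` matches on both sides only when the underlying chunks are equal, so neither side sees the other's non-matching chunks in the clear. A real deployment would use a vetted prime-order group and a maintained PSI library rather than this toy arithmetic.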

This allows you to build a whole new digital ontology for human "perspective". You can start to cluster and intersect groups of people's perspectives on websites and see how content is rendered differently over time to different groups, without giving away the individual identity of everyone contributing to the public archive. It also addresses a long-standing problem: traditional archiving tools struggle with private content (e.g. Discord, Facebook groups, WhatsApp channels, etc.) because capturing it requires a login, which used to force archivists to burn their credentials every time they shared WARCs. With good PSI tooling we can arrive at safe(r) anonymized versions and share them more freely, increasing the immediate and long-term value of archiving.

The open question then becomes: how do you manage identity in this system so that pairs of people can trust each other enough to go through the PSI process? And, ideally, how do you reward them for doing that labor (without inviting copyright lawsuits)?

Who pays for hosting of the non-anonymized and anonymized captures, and who responds to DMCA notices and subpoenas?

Also, how do you collect, tag, curate, and swap bundles of this content between institutional servers (including governments, law enforcement, lawyers, journalists, etc.)?

Quickstart

git clone https://github.com/pirate/html-private-set-intersection.git
cd html-private-set-intersection

# make sure bun is installed
curl -fsSL https://bun.sh/install | bash

# install the npm dependencies
npm install

# on node1 run the server
./psi.js --server --file test2a.html --reveal-intersection

# on node2 run the client
./psi.js --client node1.local:5995 --file test2b.html --reveal-intersection --highlight

# on node2 save the output as redacted html that can be viewed in a browser
./psi.js --client node1.local:5995 --file test2b.html --reveal-intersection --redact > out.html
open out.html

# find the intersection of images instead of text
./psi_image.js --server --reveal-intersection --file version_a.png
./psi_image.js --client localhost:5995 --reveal-intersection --file version_b.png
open ./psi_output.png

# try the in-browser demo UI (WebRTC P2P PSI)
cd ui/
npm install
node server.js &
npm run dev
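To give a feel for what the `--redact` output in the commands above represents, here is a rough, non-private sketch of redacting everything that is not shared between two captures. It assumes line-level comparison, and `redactToIntersection` is a hypothetical helper, not a function from `psi.js`, which may chunk and mask content differently.

```javascript
// Hypothetical helper: keep lines present in both captures, mask the rest.
// Assumption: chunks are whole lines, compared after trimming whitespace.
function redactToIntersection(mine, theirs) {
  const shared = new Set(theirs.split('\n').map((line) => line.trim()));
  return mine
    .split('\n')
    .map((line) => (shared.has(line.trim()) ? line : '█'.repeat(line.length)))
    .join('\n');
}

const captureA = '<h1>Group post</h1>\n<span>Alice</span>\n<p>Photo caption</p>';
const captureB = '<h1>Group post</h1>\n<span>Bob</span>\n<p>Photo caption</p>';
const redacted = redactToIntersection(captureA, captureB);
console.log(redacted);
```

Only the account-specific line is masked; the shared heading and caption pass through unchanged.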

Threat Model

Nodes should only attempt to anonymize with other trusted peers. The output of the PSI between two trusted peers is a result that is then safe(r) to share with untrusted peers. It doesn't protect against de-anonymization, but it does protect fairly well against people stealing your cookies / auth tokens. The PSI process itself should never be attempted directly between untrusted peers, especially for images.

How would an attack work? Consider the "hangman" attack. An adversary can send 26 screenshots of the facebook.com homepage in which the logged-in user's name in the upper left is replaced with aaaaaaaaaaaaaaa, bbbbbbbbbbbbbbb, ccccccccccccccc, ddddddddddddddd, etc. After only 26 screenshots they know every letter in every position, because they are testing all positions for matches in parallel. This is far easier than brute-forcing the entire name at once. It's even worse if the malicious peer has any inside knowledge of who the other peer might be, as this narrows the search space and lets them spot-check specific values. The attack gets harder the larger you make the tiles, because eventually each tile contains multiple letters or words.
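The attack can be simulated in a few lines. This sketch models each image tile as a `position:letter` string, which assumes exactly one lowercase letter per tile; `revealIntersection` and `hangmanAttack` are hypothetical names, not part of `psi_image.js`.

```javascript
// Victim side: reveals which of the probe's tiles also occur in its capture.
// Each tile is modelled as "position:letter" (assumption: one letter per tile).
function revealIntersection(secretName, probeTiles) {
  const mine = new Set([...secretName].map((ch, i) => `${i}:${ch}`));
  return probeTiles.filter((tile) => mine.has(tile));
}

// Adversary side: one probe per letter, testing every position in parallel.
function hangmanAttack(oracle, nameLength) {
  const recovered = Array(nameLength).fill('?');
  let probes = 0;
  for (const letter of 'abcdefghijklmnopqrstuvwxyz') {
    const probe = Array.from({ length: nameLength }, (_, i) => `${i}:${letter}`);
    probes++;
    for (const tile of oracle(probe)) {
      const [pos, ch] = tile.split(':');
      recovered[Number(pos)] = ch;
    }
  }
  return { name: recovered.join(''), probes };
}

const secret = 'alice';
const result = hangmanAttack((probe) => revealIntersection(secret, probe), secret.length);
console.log(result); // { name: 'alice', probes: 26 }
```

The probe count stays at 26 no matter how long the name is, which is what makes revealing per-tile matches to an untrusted peer so dangerous.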

[Figure: adversary probe + victim capture ➡️ revealed matches]

Mitigation: paranoid peers can increase their tile size from 5px to ~200px so that each tile covers entire words and sentences, making this attack much harder.
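A rough way to see why larger tiles help: if a tile spans several letters, the adversary has to guess the whole tile at once. The sketch below counts worst-case guesses per tile under simplifying assumptions (lowercase a-z only, no prior knowledge of the victim). How many letters actually fit in a tile of a given pixel size depends on the font and rendering, so `lettersPerTile` is a hypothetical parameter.

```javascript
// Worst-case probes needed to enumerate one tile spanning `lettersPerTile` letters.
// Assumptions: lowercase a-z only, and the adversary has no prior knowledge.
function worstCaseProbes(lettersPerTile, alphabetSize = 26) {
  return BigInt(alphabetSize) ** BigInt(lettersPerTile);
}

for (const lettersPerTile of [1, 2, 5, 10]) {
  console.log(`${lettersPerTile} letter(s) per tile -> ${worstCaseProbes(lettersPerTile)} probes`);
}
```

The count grows exponentially with the number of letters per tile. An adversary who already suspects who the peer is can skip enumeration and test a handful of candidate values directly, so larger tiles raise the cost of blind guessing but do not stop targeted checks.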

Images

The adversary generates images that look like the info they want to test for (e.g. your name, email, profile picture, most recent notification timestamp, etc.). If you confirm the presence of that info, they know it must be you, and they can send you to jail for whistleblowing, copyright violation, etc.

HTML

The adversary tests for words in the HTML, e.g. first name, last name, email. Or they can convince you to archive a malicious page that embeds some text that they later test for, which lets them definitively prove the identity of the archivist.

The solution to all of this is to manually review the output, or to apply defense-in-depth: use burner accounts for archiving and semi-automatically review PSI output before sharing.
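As one possible shape for the semi-automated review step, the sketch below scans PSI output for strings the archivist already knows are sensitive. `findLeaks` and the example identifiers are hypothetical. A check like this can only catch values you thought to list, so it supplements manual review and does not replace it.

```javascript
// Hypothetical pre-share check: flag known-sensitive strings left in PSI output.
// Assumption: the archivist maintains their own list of identifiers to look for.
function findLeaks(output, sensitiveStrings) {
  const haystack = output.toLowerCase();
  return sensitiveStrings.filter((s) => s && haystack.includes(s.toLowerCase()));
}

const output = '<h1>Group post</h1><p>Reply from alice@example.com</p>';
const leaks = findLeaks(output, ['Alice Example', 'alice@example.com', 'sessionid=']);
if (leaks.length > 0) {
  console.error('Do not share yet, found:', leaks);
}
```

Matching is case-insensitive on purpose: over-flagging costs a second look, while a missed identifier could reveal who made the capture.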

Beyond direct token attacks, there's also the issue that PSI results can be de-anonymized simply by checking for set intersection with another dataset. There is no technical defense against this, only defense-in-depth with the other techniques.
