New paper claims LLM-powered stylometric de-anonymization? Anonymity is defense-dominant if you want it

anon_danu said in #5176 24h ago:

We must assume that a motivated attacker will be able to do bayesian-optimal stylometric de-anonymization, and that this will be far more accurate than you might expect. There is no free speech without anonymity, and there is no free thought without free speech, so weapons-grade anonymization tech is existential for any philosophical community that takes itself seriously. We students of philosophy therefore need to think through how strong this could get, and how it could be effectively countered.

https://x.com/alex_prompter/status/2026951395753213970

The paper claims:
>Users who post under persistent usernames should assume that adversaries can link their accounts to real identities.
Whether they claim to get results stronger than that I don't know. We should read the paper. The twitter hypester claims:
>Every throwaway account. Every anonymous forum post. Every “nobody will connect this to me” comment.
But note that's the hypester's breathless extrapolation, not a direct quote from the paper.

So let's review the full "de-anonymization kill chain":

1. you post sufficient information and stylometric hints in linkable contexts that a bayesian superintelligence could de-anonymize you.
2. the attacker is able to get access to the information in question.
3. the information posted in that context is such as to piss off and motivate an attacker.
4. the attacker is able to economically use de-anonymization tech to identify you.
5. the attacker decides that the attack will do enough damage, or deter enough others, to be worth its cost.
6. the attacker is able to create a sufficiently convincing case beyond mere allegation to get you cancelled in court of law or public opinion.

These are conjunctive, so defeating any of them defeats the attack. There are all kinds of tactics to play, like flooding the zone with false positives, heresy normalization, decreasing your physical vulnerability to dox, using stylometry attacks preemptively against yourself, etc. But when I go through these systematically, I find one dominant tactic which injects difficulty into almost every step: fragment your corpus over more identities. You need over 9000 handles.
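To make the "conjunctive" point concrete, here is a minimal sketch in Python. It assumes the six steps are independent, and every probability in it is a made-up placeholder, not an estimate. It only shows that the overall odds are a product, so knocking down any one factor knocks down the whole thing.

```python
from math import prod

# Hypothetical per-step success probabilities for the six steps above.
# These numbers are invented for illustration only.
steps = {
    "enough linkable signal posted": 0.5,
    "attacker can access the data": 0.9,
    "content motivates an attacker": 0.3,
    "de-anonymization is economical": 0.5,
    "attack judged worth the cost": 0.5,
    "convincing case can be made": 0.4,
}

def attack_success(p_steps):
    """Probability that every step succeeds, assuming independence."""
    return prod(p_steps)

baseline = attack_success(steps.values())

# Fragmenting the corpus is modeled here (an assumption) as a tenfold
# cut in the first step's probability; the other steps are left alone.
fragmented = {**steps, "enough linkable signal posted": 0.05}
hardened = attack_success(fragmented.values())

print(f"baseline: {baseline:.5f}, fragmented: {hardened:.5f}")
```

In reality fragmentation would plausibly move several of the factors at once, which is the argument for it being the dominant tactic; the sketch only moves one.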

The keyword is "linkable". If you have one big pseudonym and one big real name identity, it's almost trivial to link them together. The hypothesis space is relatively tiny, even if there are lots of such accounts to be cross-referenced with lots of real names. But by increasing the fragmentation level, you can explode the enemy's hypothesis space beyond feasible inference.

Suppose they attack by embedding each nym in high dimensional stylometry space and looking for clusters or near-neighbors. If error bars are small, they can de-anon. If they are large and overlap many people, they cannot. Error bars will fall with the amount of information available on each nym. A reddit account has orders of magnitude more information and smaller error bars than a 4chan post. By fragmenting across many non-linkable identities, you can drive the error bars up to the point of making attacks infeasible. You dissolve your corpus into impersonal clusters like /pol/, sofiechan, etc.
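A toy simulation of the error-bar argument, as a sketch under loud assumptions: authors are random points in a 5-dimensional "style space", a nym's measured style is the author's true point plus noise that shrinks like 1/sqrt(words), and the attacker just picks the nearest known author. None of these modeling choices come from the paper; the function name and all parameters are hypothetical.

```python
import math
import random

def attribution_accuracy(words_per_nym, n_authors=200, dims=5,
                         noise=10.0, trials=200, seed=0):
    """Fraction of nyms attributed to the right author by nearest neighbor.

    Assumed model: measurement noise per dimension is noise / sqrt(words),
    i.e. the error bars shrink as more text is available under the nym.
    """
    rng = random.Random(seed)
    authors = [[rng.gauss(0, 1) for _ in range(dims)]
               for _ in range(n_authors)]
    sigma = noise / math.sqrt(words_per_nym)
    hits = 0
    for _ in range(trials):
        true_idx = rng.randrange(n_authors)
        # Noisy stylometric estimate built from this nym's text.
        observed = [x + rng.gauss(0, sigma) for x in authors[true_idx]]
        guess = min(range(n_authors),
                    key=lambda i: math.dist(authors[i], observed))
        hits += guess == true_idx
    return hits / trials

for n in (10, 100, 1_000, 10_000):
    print(n, attribution_accuracy(n))
```

Under this model small nyms sit near chance level and large ones get attributed reliably, which is the qualitative claim above. Where the crossover falls depends entirely on the invented noise parameter.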

Coefficients on cost of attack may go to zero in the long run, but the cost of bayesian inference still grows exponentially with the complexity of the inference: the size of the hypothesis space, the number of handles to be linked together, etc. You can drive cost of attack arbitrarily high to encrypt your philosophical radar cross-section. Anonymity, like cryptography, is defense-dominant given appropriate care and technology.
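One way to put a number on the hypothesis space, as a sketch: if the attacker has n unlinked handles and no prior about which belong together, the number of possible groupings into identities is the Bell number B(n), which grows faster than exponentially. Treating that as the search space is an assumption; a real attacker with good pairwise similarity scores would prune most of it, so read this as a ceiling on naive search, not a cost estimate.

```python
def bell_numbers(n_max):
    """Return [B(0), ..., B(n_max)] computed with the Bell triangle.

    B(n) counts the ways to partition n handles into groups, where each
    group stands for "these handles are the same person".
    """
    bells = [1]
    row = [1]
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells

for n, count in enumerate(bell_numbers(20)):
    if n % 5 == 0:
        print(f"{n:2d} handles -> {count} possible groupings")
```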

All this to say, this is why we post on anonymous chans, not pseudonymous platforms. There is an orders-of-magnitude difference in cost and feasibility of attack between disposable fragmented anonymity and persistent pseudonymity. This warrants more careful analysis, but I would bet anonymity remains a strong possibility.

xenophon said in #5177 21h ago:

The distinction between anonymity and persistent pseudonymity is important. It's best to assume that a persistent pseudonym could be doxxed at any time, and post accordingly.

I'm skeptical of the ability to identify a low-volume anonym (easily achieved through fragmenting) via stylometric means. One can always just deny the identification and point out that it rests on low-confidence statistics, not hard evidence.

referenced by: >>5178

anon_danu said in #5178 20h ago:

>>5177
>anonyms better than persistent pseuds
he says from behind a known pseud. But yeah, exactly, fragmentation of identities can impose almost arbitrary levels of error and murkiness on attackers.

It's actually an interesting question: if we assume an arbitrarily intelligent adversary with lots of computing power, at what identity size (measured in words written under one nym) do we achieve reliable bombproof anonymity? SHA-512 would remain secure against such an adversary; is there an analogy here?

You could measure the maximum information content of typical writing by summing the entropy of each word choice as estimated by an LLM base model. Probably many bits per word. But that's an upper bound on total semantic and incidental information. How much of that is identifying information? A lot less. I would guess much less than one bit per word as an upper bound. Still, at anything near that rate, a typical paragraph would be enough to single out any person on earth (about 33 bits). However, the information rate would fall off very fast, as you learn in the first few words that the writer is an intelligent westerner and thereafter are hunting for scraps. Assume the falloff model is a power law with power around 1.0. Then information content is logarithmic in length. Every time you halve the size of a nym, you take away one fixed increment of identifying information. I wonder how many bits per doubling. Probably many, meaning the average sofie nym is quite a bit less doxable than a twitter account.
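The arithmetic behind that, as a rough sketch: assume the k-th word under a nym leaks c/k identifying bits (the power law with exponent 1.0 from above). The constant c, here `first_word_bits`, is a made-up parameter and nobody has measured it. The cumulative total is then a harmonic sum, roughly c * ln(n), so each doubling of nym size adds about c * ln(2) bits regardless of how big the nym already is.

```python
import math

# Bits needed to single out one person among roughly 8 billion.
POPULATION = 8e9
bits_to_single_out = math.log2(POPULATION)

def identifying_bits(n_words, first_word_bits=1.0):
    """Cumulative identifying bits if word k leaks first_word_bits / k.

    first_word_bits is a hypothetical constant chosen for illustration.
    """
    return sum(first_word_bits / k for k in range(1, n_words + 1))

print(f"bits to single out one human: {bits_to_single_out:.1f}")

previous = None
for n in (100, 200, 400, 800, 1600):
    total = identifying_bits(n)
    gain = "" if previous is None else f" (+{total - previous:.3f} per doubling)"
    print(f"{n:5d} words: {total:.3f} bits{gain}")
    previous = total
```

Whether a given nym size stays under the roughly 33-bit line then comes down to c, which is exactly the "how many bits per doubling" question.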

Once you cross the noise threshold there’s an explosive growth in information content from tying all your nyms together and cross-referencing. This is the “dox” moment. Anonymity depends on ability to not cross that threshold. In the other direction, there is some threshold at which it becomes impossible to dox you. Your identity is effectively encrypted with many bits. Brute force is not feasible because there’s no ability to “check”; even a robust 8 bits of anonymity is immensely anonymous.

If I had to guess, nym size on the order of a few paragraphs is going to remain very anonymous with a little care even under intense scrutiny, unless you are literally doxing yourself, whereas book-length profiles are probably all going to get doxed in the limit if there’s anything to dox (i.e. any other information that the profile could be unintentionally connected to).

Also worth considering that the “danger” of doxing comes from the cross-referencing of the dox with some specific vulnerabilities like an employer, a “crime”, and something that would motivate the attacker to attempt an attack. If they can’t assemble an actual attack, you’re fine. If they need to assemble two pieces of info, your impunity is the square of your anonymity. If they need to assemble three, it’s the cube. So that makes the anonymity threshold effect much stronger. You go from total impunity to total vulnerability quite quickly if your nyms get too large. At least in the limit where well-funded stasibots are running around.

referenced by: >>5179

xenophon said in #5179 17h ago:

>>5178
> he says from behind a known pseud.

I've posted nothing as xenophon that I would mind being tied to my IRL identity, exactly conforming to my advice to "post accordingly."
