
Designing a word frequency auto-tagger for sofiechan

admin said in #3250 7d ago:

Let's talk about the upcoming auto-tagger architecture. All it has to do is produce affinity scores between threads and tags; our now-existing smart tag selector will then pick a high-affinity set of tags that also makes a good index. To do this, it's going to consider word frequencies: word frequencies that match the distribution seen in a particular tag get a high affinity score with that tag.

The hypothesis is that at least some word frequencies are significantly correlated with tags. If any words like "lesswrong", "sofiechan", or "god" are correlated with any tags like "rationality", "meta", or "theology", then we should be able to automatically measure that and use it to add information to the tag system, unless we are utterly incompetent at statistics.

I tried various things and was defeated and humbled. However, I persevered and now have a working similarity metric that is pretty fast and correctly guesses the thread of held-out test-set posts 60% of the time, and the author 20% of the time. That author result is crap considering how much I post here, but the thread result is much better, and more relevant to tagging. I think it's going to work!

The metric I've settled on is a fairly standard cosine similarity over TF-IDF frequencies (term frequency times negative log document frequency), with the slight modification that I do presence-conditional normalization of the term frequencies. "Term frequency" is how often the term in question appears in the post in question. "Document frequency" is the fraction of posts the term appears in. The "inverse document frequency" factor is the negative logarithm of document frequency, representing the information content (in nits) of the presence of that term. Presence-conditional normalization means dividing each term frequency by the average frequency of that term in the documents that contain it, so every term has frequency 1.0 on average where it appears, and the normalized term frequency measures how relatively often it appears. Presence-conditionally normalized tf-idf works slightly better than raw idf and is theoretically more correct IMO, while term frequency alone barely works because very common low-information words dominate the metric.
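
Concretely, the weighting looks something like this in Go (names and the corpus-stat plumbing are illustrative, not the actual sofiechan code; it assumes every term in the post has precomputed stats):

```go
package tagger

import "math"

// Post holds raw term counts for one post: term -> count.
type Post map[string]float64

// Weights computes presence-conditionally normalized tf-idf for one post.
// docFreq[t] is the fraction of all posts containing t (0 < df <= 1);
// meanFreq[t] is the mean count of t across only the posts that contain it.
func Weights(post Post, docFreq, meanFreq map[string]float64) map[string]float64 {
	w := make(map[string]float64, len(post))
	for term, count := range post {
		idf := -math.Log(docFreq[term]) // information content of presence, in nits
		ntf := count / meanFreq[term]   // presence-conditional normalization: averages to 1.0 where the term appears
		w[term] = ntf * idf
	}
	return w
}
```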

Cosine similarity measures symmetrically how "close" two frequency distributions are by computing the cosine of the "angle" between them in their high-dimensional space. The nice thing about cosine similarity is that zeros make zero contribution so it can be super fast on sparse vectors (like our per-post term frequencies). We measure "similarity" between posts or threads and the bulk statistics of whatever category we are sorting them into. We're going to need something quite fast for the rapid iterations of measuring all threads against all tags that we're going to do.
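The sparse version is essentially this (same illustrative package as above, not the exact production function):

```go
// Cosine computes cosine similarity over sparse term-weight vectors.
// Terms absent from either vector contribute nothing, so the cost scales
// with the number of nonzero entries, not the vocabulary size.
func Cosine(a, b map[string]float64) float64 {
	var dot, na, nb float64
	for term, x := range a {
		na += x * x
		if y, ok := b[term]; ok {
			dot += x * y
		}
	}
	for _, y := range b {
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / math.Sqrt(na*nb)
}
```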

The actual tag machine wants input in the form of thread-tag affinities that behave something like log likelihood ratios. We can get those by normalizing our cosine similarities into z-scores for each tag, or something like that; I haven't done this yet, so it will take some experimentation. Log likelihood ratios of tags given the words in a thread are a perfect input for our tag selector. We combine them with information from user judgements and feed the result to the tag selector, which picks a size-diverse, non-redundant set of the highest-affinity tags to be the actual tags of any given post.
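
If I go the z-score route, it would be something like this (very much a sketch, subject to experimentation):

```go
// ZScores turns raw thread->similarity scores for one tag into z-scores,
// so a thread's score says how unusually close it is to this tag compared
// to threads at large.
func ZScores(simByThread map[string]float64) map[string]float64 {
	n := float64(len(simByThread))
	if n == 0 {
		return nil
	}
	var mean float64
	for _, s := range simByThread {
		mean += s
	}
	mean /= n
	var ss float64
	for _, s := range simByThread {
		d := s - mean
		ss += d * d
	}
	sd := math.Sqrt(ss / n)
	z := make(map[string]float64, len(simByThread))
	for thread, s := range simByThread {
		if sd > 0 {
			z[thread] = (s - mean) / sd
		}
	}
	return z
}
```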

We only need this to work well enough in practice to roughly but thoroughly imitate our tag preferences, but I think it will work better than that. If we iterate this process many times, the tag set will theoretically mutate on the margin to be both more predictive of post content and a better index from a structural perspective, while respecting our expressed preferences.

If this works, we're not too far off from having robot slaves to optimally organize our superintelligent discourse here on sofiechan dot com.

referenced by: >>3268


anon_tona said in #3251 7d ago:

Good work!

If you want to get those accuracy scores up, the next step would be to convert from TF-IDF to embeddings. You'd replace the sparse TF-IDF vectors with dense vector representations generated by transformer-based models. Libraries like Hugging Face Transformers and Sentence-Transformers make it easy to compute these embeddings for both documents (precomputed) and queries (at search time). Similarity is still measured using cosine similarity, but now over dense vectors. For scalable search, tools like FAISS, Annoy, or ScaNN enable efficient approximate nearest neighbor lookup. Optionally, you can enhance retrieval quality by re-ranking results with cross-encoders or combining embedding-based and traditional keyword methods in a hybrid approach.

referenced by: >>3252


admin said in #3252 6d ago:

>>3251
Yeah, I originally was going to use embeddings, but all the off-the-shelf embeddings were confusing and huge, and I failed when I attempted to train one myself. I may revisit, but I suspect this keyword system is going to be fine for our purposes. And the sparsity is actually nice because the cosine similarity (or Hellinger similarity) gets zero contribution from zeros, so the complexity is proportional to activated dimensions, not total dimensions. Just for fun, I may later attempt to train a custom sparse nonnegative low-dimensional pseudo-vocabulary/embedding.


admin said in #3268 4d ago:

>>3250
It works! But slow as all hell. Right now we fully rebuild the "taste machine", which now includes this word frequency tagger, about once a minute. That used to do 50 iterations of the expectation-maximization algorithm we use to refine everyone's tastes, which took a small fraction of a second. But now even 5 iterations takes 10 seconds. Brutal. Obviously we're not pushing it in this condition.

Most of that (92% of total) is the "learntags" function, which takes the previous tag-thread posteriors, counts up the probability-weighted word counts for each tag, normalizes them by tf-idf, and then re-tests every thread against every tag to get new word-frequency-evidence z-scores (which are elsewhere combined with our votes and indexing considerations into fully decided posteriors by the somewhat tricky "infertags" function).
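
The rough shape of it, with simplified signatures (the real function also does the tf-idf normalization and z-scoring described above):

```go
// learnTags: probability-weighted word counts per tag, then re-score every
// thread against every tag. Simplified from the real thing.
func learnTags(
	posterior map[string]map[string]float64, // tag -> thread -> P(tag|thread)
	threadWords map[string]map[string]float64, // thread -> term -> weight
	sim func(a, b map[string]float64) float64, // e.g. Hellinger similarity
) map[string]map[string]float64 { // tag -> thread -> raw similarity
	tagWords := make(map[string]map[string]float64, len(posterior))
	for tag, threads := range posterior {
		acc := make(map[string]float64)
		for thread, p := range threads {
			for term, w := range threadWords[thread] {
				acc[term] += p * w // probability-weighted counts
			}
		}
		tagWords[tag] = acc
	}
	sims := make(map[string]map[string]float64, len(tagWords))
	for tag, words := range tagWords {
		byThread := make(map[string]float64, len(threadWords))
		for thread, tw := range threadWords {
			byThread[thread] = sim(tw, words) // the expensive part
		}
		sims[tag] = byThread
	}
	return sims
}
```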

The good news is that at least 83% of total time is spent in the golang runtime doing map access and iteration. Most of that (65% of total time) is in the Hellinger similarity function, which compares two word frequency vectors (Hellinger similarity replaced cosine similarity for autistic theoretical reasons but is effectively equivalent). Either golang maps are really slow, or everything else is really fast. For the sake of simplicity I had started with maps for the word->frequency vectors, but obviously that's about to change. When I previously wiped out maps from the rest of the taste machine, it got many multiples faster. Now I'm going to do the same here.
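
For reference, the map-based version is essentially this, reading "Hellinger similarity" as the Bhattacharyya coefficient (sum of sqrt(p*q) over shared terms, with p and q normalized to frequency distributions):

```go
// Hellinger computes the similarity of two sparse frequency distributions.
// Like cosine, zeros contribute nothing, but every shared term costs a map
// lookup, which is exactly what the profile is complaining about.
func Hellinger(p, q map[string]float64) float64 {
	var sum float64
	for term, x := range p {
		if y, ok := q[term]; ok {
			sum += math.Sqrt(x * y)
		}
	}
	return sum
}
```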

Here's the plan: we're first going to switch from strings as our word keys to indexes, leaving the word->index map in the corpus rebuild function, which we run only once. We'll need two separate formats for our frequency vectors: a dense one at full vocab width for quickly adding up word counts, and a sparse compact one optimized for rapid scans in similarity comparisons. Because of cache locality, simplicity, etc., this will all be about an order of magnitude faster than maps.
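
Something like this (illustrative types and names):

```go
// Dense is the accumulator format: indexed by word id, full vocab width,
// cheap to add counts into.
type Dense []float64

// Sparse is the scan format: nonzero entries only, sorted by word id,
// stored in parallel slices for cache locality.
type Sparse struct {
	Idx []uint32
	Val []float64
}

// Compact converts a dense accumulator into the sparse scan format.
func Compact(d Dense) Sparse {
	var s Sparse
	for i, v := range d {
		if v != 0 {
			s.Idx = append(s.Idx, uint32(i))
			s.Val = append(s.Val, v)
		}
	}
	return s
}

// HellingerSparse is a two-pointer merge over sorted word ids: no map
// lookups, sequential memory access, cost proportional to nonzero dims.
func HellingerSparse(a, b Sparse) float64 {
	var sum float64
	for i, j := 0, 0; i < len(a.Idx) && j < len(b.Idx); {
		switch {
		case a.Idx[i] < b.Idx[j]:
			i++
		case a.Idx[i] > b.Idx[j]:
			j++
		default:
			sum += math.Sqrt(a.Val[i] * b.Val[j])
			i++
			j++
		}
	}
	return sum
}
```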

Separately, there are a lot of ways we can make the whole tag learning process "sparser". We can set a high noise floor on tf-idf frequencies so threads have about 2x fewer nonzero dimensions, and we can skip writes to low-probability tags, which I'd guess cuts word count writes by another 2x-4x. Vectorization may help a bit too here and there. The big gains will come from taking all this stuff off the critical path and putting it in a background worker process. Right now page loads block on this stuff, which is no good. Offline computing power is cheap but realtime milliseconds are expensive. But I want to make it not insanely inefficient first.
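
The noise floor itself is trivial once vectors are in the sparse format (hypothetical helper):

```go
// ApplyFloor drops entries below the threshold, shrinking the vector that
// every subsequent similarity scan has to walk.
func ApplyFloor(s Sparse, floor float64) Sparse {
	var out Sparse
	for k, v := range s.Val {
		if v >= floor {
			out.Idx = append(out.Idx, s.Idx[k])
			out.Val = append(out.Val, v)
		}
	}
	return out
}
```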

As for actual statistical performance, I don't know what to expect yet. It is working though. I can see it rethinking the tags in real time as it occasionally does another 5 iterations. Once it's ready to go, we're going to have a big tagpocalypse where we find out our current set of tags sucks and we rename most of them to fit with the more scientific concepts this thing will discover. Is it worth it? Maybe. But it is cool.

