Date: 2022-07-24 04:39 pm (UTC)
spamsink: (0)
From: [personal profile] spamsink
The BWT algorithm does not look for any particular sequences; it performs an ingenious form of reversible sorting of the rotations of the input string, so that all characters preceding similar contexts are grouped together (reading an article about the algorithm soon after it first became known, in 1994, has caused me an "intellectual orgasm") which causes a lot of character repetitions; then a "move-to-front" transform, so that repeated characters are encoded as zeros, the next previously seen character is encoded as 1, etc.; then a run length encoding pass, then a Huffman coding pass.
As a result, most utf-8 first bytes would be grouped together and will shrink to almost nothing after MTF+RLE, which should have had almost no negative effect on the compression rate, but by themselves they could not bring about any positive effect, because extra bytes are extra bytes.

But, that perturbs the relative frequencies of the byte values, which may be enough to change the Huffman encoding, making it closer to the optimal, if it happened that in compressing of the original file the encoding was far from the optimal.
This account has disabled anonymous posting.
If you don't have an account you can create one now.
No Subject Icon Selected
More info about formatting

Profile

spamsink: (Default)
spamsink

November 2025

S M T W T F S
      1
2345678
910 11 12131415
16171819202122
23242526272829
30      

Most Popular Tags

Style Credit

Expand Cut Tags

No cut tags
Page generated Nov. 13th, 2025 02:28 am
Powered by Dreamwidth Studios