README

Humanzip copyright (C) 2007 Matthew Strait

*** Purpose and Description ***

humanzip is a compression program that operates on text files.  Unlike
most compression algorithms, its output is human readable.  Indeed, it
is explicitly meant to be read by humans and might even be easier to read
than the original.

humanzip compresses files by looking for common strings of words and
replacing them with single symbols.  The idea is to reduce the screen and
print size of documents.  Humanzip does not explicitly try to reduce the
size of the file as measured in bytes, although this usually happens
incidentally.  For instance, lines in a file that looks like this:

	This is a test, please panic.
	This is a test, hide under your desk.
	This is a test, close the curtains.

might be converted to:

	Å, please panic.
	Å, hide under your desk.
	Å, close the curtains.

A key is included at the top of the compressed file.  It is always given
with lines like "å/Å - this is a test", where å represents "this is a
test" and Å represents "This is a test".
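
The substitution described above can be illustrated with a minimal
Python sketch.  This is not how humanzip itself is implemented; the
function name compress_phrase is made up for illustration, the phrase
and symbols are chosen by hand rather than discovered, and the blank
line between the key and the text is an assumption.

```python
def compress_phrase(text, phrase, lower_sym, upper_sym):
    """Replace one lowercase phrase and its capitalized form with symbols.

    Returns the key line, a blank line (an assumed separator), and the
    rewritten text.
    """
    capitalized = phrase[0].upper() + phrase[1:]
    # Plain substring replacement: this sketch ignores word boundaries.
    body = text.replace(capitalized, upper_sym).replace(phrase, lower_sym)
    # Key line in the "å/Å - this is a test" style described above.
    key = f"{lower_sym}/{upper_sym} - {phrase}"
    return key + "\n\n" + body


sample = (
    "This is a test, please panic.\n"
    "This is a test, hide under your desk.\n"
    "This is a test, close the curtains.\n"
)
print(compress_phrase(sample, "this is a test", "å", "Å"))
```

The real program also has to find the phrases worth replacing and
assign symbols to them, which this sketch leaves out entirely.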

Don't expect dramatic compression here.  Most files will be reduced by
5-15%.

humanunzip will (in theory) restore files exactly to their original
state.  I've included in this package a very short shell script called
"dotest.sh" to test this.

If you don't like the exact way that humanzip has chosen abbreviations,
you can change it around by hand.  As long as you follow the format,
humanunzip will still work.
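
To show why hand-edited files can still be restored, here is a rough
Python sketch of expanding a file from its key.  It is only an
illustration, not humanunzip's actual code: the function name expand is
hypothetical, and it assumes the key block ends at the first blank line,
which the real format may not do.

```python
def expand(compressed):
    """Restore text using key lines of the form 'å/Å - this is a test'."""
    # Assumption: the key block is separated from the text by a blank line.
    key_block, _, body = compressed.partition("\n\n")
    for line in key_block.splitlines():
        symbols, _, phrase = line.partition(" - ")
        lower_sym, _, upper_sym = symbols.partition("/")
        body = body.replace(lower_sym, phrase)
        body = body.replace(upper_sym, phrase[0].upper() + phrase[1:])
    return body


edited_by_hand = (
    "å/Å - this is a test\n"
    "\n"
    "Å, please panic.\n"
    "Surely å.\n"
)
print(expand(edited_by_hand))
```

Because the expansion is driven only by the key lines, any abbreviation
listed there in the expected format can be undone, whoever chose it.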

Interestingly, sometimes you'll end up with a slightly smaller file if
you humanzip it and then [gb]zip it than if you just [gb]zip it.  But
sometimes it goes the other way, and in any case it's almost never going
to be worth the trouble.
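
If you are curious how this plays out on your own files, a comparison
along these lines can be sketched in Python with the standard gzip and
bz2 modules.  The two sample texts below are made up, and the helper
name compressed_sizes is hypothetical; which version ends up smaller
depends on the input.

```python
import bz2
import gzip


def compressed_sizes(data):
    """Return the gzip and bzip2 sizes, in bytes, of a bytes object."""
    return {
        "gzip": len(gzip.compress(data)),
        "bzip2": len(bz2.compress(data)),
    }


original = "This is a test, please panic.\n" * 50
substituted = "å/Å - this is a test\n\n" + "Å, please panic.\n" * 50

for label, text in (("original", original), ("substituted", substituted)):
    print(label, compressed_sizes(text.encode("utf-8")))
```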

*** Original motivation ***

I compile and print a listing of all Magic: The Gathering cards several
times a year so that I can carry it around with me when I play.  There
are upwards of 8500 different cards, each with an average of ~150
characters of text.  Even if I use a very small font and take care to
avoid wasting space on the paper, the result is well over 100 pages.

However, this text is very repetitive.  The word "creature" appears over
10,000 times, and "target" over 3600 times, for instance.  There are
roughly 1300 repetitions of "until end of turn" and 900 of "comes into
play" and so forth.  By replacing these words and phrases with single
characters, the print size can be reduced by ~13%.  (I also use some
other tricks which are specific to Magic to reduce it somewhat further.
For instance, each time the name of a card appears in that card's entry
after the first time, I replace it with "«".)
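
Counts like the ones above can be gathered with a few lines of Python.
This is only a sketch of the idea of counting repeated word sequences,
not the method humanzip uses; the function name count_phrases and the
sample card text are invented for the example.

```python
import re
from collections import Counter


def count_phrases(text, n):
    """Count every run of n consecutive words, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))


cards = (
    "Target creature gains flying until end of turn. "
    "Target creature gets first strike until end of turn."
)
print(count_phrases(cards, 4).most_common(1))
print(count_phrases(cards, 2)["target creature"])
```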

Once you've gotten used to it, "T: « deals 1 Đ to Ŧ Ĉ or ƣ" is no harder
to read than "T: Prodigal Sorcerer deals 1 damage to target creature or
player", and maybe easier because your eyes don't have to move as far.

I imagine that there are other people out there with lengthy, repetitive
text which might benefit from some human readable compression.  While of
course text on any topic can be operated on by humanzip, I envision it
being used mainly on technical documents.  (Not only does it not work as
well on narrative, since it is less repetitive, but I'm guessing most
people would just be annoyed.)

*** Limitations ***

The programs you use for manipulating text must support UTF-8.  If
between these quotation marks: "Å", you do not see capital A with a
small circle above it, you're in trouble.  Fortunately, modern GNU/Linux
distributions have full support for UTF-8, so you're probably fine.
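
If the character does not display properly, it can help to look at the
bytes involved.  The following Python sketch (the helper name utf8_hex
is made up) prints the UTF-8 encoding of a character; a program that
does not understand UTF-8 will typically show the two bytes of "Å" as
two separate characters.

```python
def utf8_hex(char):
    """Return the UTF-8 bytes of a character as space-separated hex."""
    return " ".join(f"{byte:02x}" for byte in char.encode("utf-8"))


print(utf8_hex("Å"))  # c3 85 (two bytes)
print(utf8_hex("A"))  # 41 (one byte, plain ASCII)
```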

Because humanzip uses non-ASCII characters (encoded as UTF-8) to replace
the strings it finds, files that already contain such characters cannot
be compressed.  In fact, humanzip will refuse to compress any file that
has any non-ASCII characters (that is, bytes whose most significant bit
is 1).  In the future, humanzip might be able to work around this by
just not using any characters that are already there, but right now it
just plays it safe.
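
The check described above amounts to testing the top bit of every byte.
A minimal Python sketch, with a hypothetical function name and no claim
to match humanzip's own code:

```python
def is_pure_ascii(data):
    """True if no byte in the input has its most significant bit set."""
    return all((byte & 0x80) == 0 for byte in data)


print(is_pure_ascii(b"This is a test, please panic.\n"))
print(is_pure_ascii("Å, please panic.\n".encode("utf-8")))
```

You could run a check like this on a file's raw bytes before trying to
compress it, to find out ahead of time whether it would be refused.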

humanzip is optimized to work on English text.  This means two things:
(1) it has a blacklist of English 'grammar' words that would be
annoying to replace and (2) it looks for English-style plurals so that
it can treat the singular and plural as the same word.  Neither of these
will prevent it from working on non-English text, but the result will
not be as pleasing in some cases.
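
As a rough illustration of these two ideas, here is a Python sketch.
The word list and the plural rules below are invented for the example
and are far cruder than anything a real program should rely on; they
are not taken from humanzip's source.

```python
# Hypothetical blacklist: a few short English 'grammar' words.
BLACKLIST = {"a", "an", "and", "is", "of", "or", "the", "to"}


def singular(word):
    """Naively map an English-style plural to its singular form."""
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def is_candidate(word):
    """True if the word is not on the blacklist and so may be replaced."""
    return word.lower() not in BLACKLIST


for word in ("creatures", "abilities", "class", "The"):
    print(word, singular(word.lower()), is_candidate(word))
```

Rules this simple misfire on words such as "this" or "boxes", and on
text in other languages they would match endings that are not plurals
at all, which is one way the result can be less pleasing there.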