$#<html>
$#<head>
$#<title>netrik hacker's manual: layout engine</title>
$#</head>
$#<body>

$#<h1 align="center">netrik hacker's manual<br />>========================<</h1>
� netrik hacker's manual
�>========================<

[This file contains a description of the layouting module. See hacking.txt or
$$<a$+href="hacking.html">$$hacking.html$$</a>$$ for an overview of the manual.]

$=$$<h2>$$$_0. Overview$_$$</h2>$$

The whole layouting is split up into several, fairly simple passes, which are
executed one after the other. See the
$$<a$+href="hacking.html#notes">$$notes in hacking.*$$</a>$$ for a discussion of
this approach.

The first pass is the $$<a$+href="#parseSyntax">$$$_parse_syntax()$_$$</a>$$
function, which creates a $$<a$+href="#syntaxTree">$$$_Syntax Tree$_$$</a>$$ of
the document. This tree contains all HTML elements and their content, but the
elements have no special meaning yet.

$$<a$+href="#dumpTree">$$$_dump_tree()$_$$</a>$$ can be used to output the syntax
tree.

In the next pass ($ $$<a$+href="#parseElements">$$$_parse_elements()$_$$</a>$$$ ),
all element and attribute names are looked up in tables and stored as enums to
facilitate further processing.

If not compiled with -DXHTML_ONLY, an additional pass is inserted after element
parsing: In $$<a$+href="#sgmlRework">$$$_$5$.$ sgml_rework()$_$$</a>$$$ , the
syntax tree is modified to fix the wrong element nesting caused by missing end
tags in SGML documents.

dump_tree() can be used again to dump all element and attribute types as found
in the lookup, and the possibly modified tree structure.

The third pass is the central processing step.
$$<a$+href="#parseStruct">$$$_parse_struct()$_$$</a>$$ interprets the elements
and their attributes, and creates a $$<a$+href="#itemTree">$$$_Structure
Tree$_$$</a>$$$ , which contains all the items that will be visible on the
output page.

The fourth pass prepares the page for rendering. In
$$<a$+href="#preRender">$$$_pre_render()$_$$</a>$$, all items created in
parse_struct() are assigned actual sizes and positions in the output page. Also,
a structure $$<a$+href="#pageMap">$$"page_map[]"$$</a>$$ is created, needed for
fast lookup of which items are present in any given line.

All of the passes mentioned above are necessary to prepare the rendering, and
are executed from $$<a$+href="#layout">$$$_layout()$_$$</a>$$$ .

The actual rendering is done in render.c. However, this isn't done for the
whole page like the other layouting passes. Instead, every time some region of
the output page needs to be displayed,
$$<a$+href="#render">$$$_render()$_$$</a>$$ is called to render exactly that
region.

Alternatively, the whole page can be dumped to the terminal line by line, using
$$<a$+href="#dump">$$$_dump()$_$$</a>$$$ .

The third function in render.c is $$<a$+href="#dumpItems">$$$_dump_items()$_$$</a>$$$ .
This is not really a rendering function; it only dumps the item tree, including
the (coloured) text.

$=$$<h2>$$$_1. layout.c$_$$</h2>$$

This file forms the framework for the layouting process. It contains functions
to load a file and prepare it for rendering, but also to free the memory used
by a document when it is no longer needed.

$#<a name="layout" id="layout">

$=$$<h3>$$$_layout()$_$$</h3>$$

layout() is given a URL of a file or web resource to load, and does all
actions necessary to be able to render the corresponding page.

Before starting any of the loading or layouting operations, a descriptor is
allocated where all the data structures created inside layout() will be stored.

The descriptor is a "struct Layout" pointer.
It contains the following data:

$#<ul> <li>
��$- A pointer to the input resource descriptor ("input")
$#</li> <li>
��$- An additional pointer "url" to the effective page URL, necessary to hold the
��$ URL after the input resource descriptor is freed
$#</li> <li>
��$- Pointers to all data structures necessary for the layouting ("syntax_tree",
��$ "item_tree", "page_map[]")
$#</li> <li>
��$- Pointers to the "$$<a$+href="hacking-links.html#linkList">$$link_list$$</a>$$" and "$$<a$+href="hacking-links.html#anchorList">$$anchor_list$$</a>$$" data structures
$#</li> </ul>

After allocating the descriptor, layout() first opens the resource with
$$<a$+href="hacking-load.html#initLoad">$$init_load()$$</a>$$. (Described in
hacking-load.*)

Afterwards, $$<a$+href="#parseSyntax">$$$_parse_syntax()$_$$</a>$$$ ,
$$<a$+href="#parseElements">$$$_parse_elements()$_$$</a>$$$ ,
$$<a$+href="#sgmlRework">$$$_$5$.$ sgml_rework()$_$$</a>$$$ ,
$$<a$+href="#parseStruct">$$$_parse_struct()$_$$</a>$$$ , and
$$<a$+href="#preRender">$$$_pre_render()$_$$</a>$$ are called in sequence.
These functions are responsible for preparing the page for rendering.

The file loading itself is done inside parse_syntax(), which uses the
$$<a$+href="hacking-load.html#load">$$load()$$</a>$$ function from load.c (see
$$<a$+href="hacking-load.html">$$hacking-load.*$$</a>$$) to read a data block
every time the input buffer is empty. It processes the data in the buffer
character by character (keeping track of the current read position by
"input->buf_ptr"), and when it reaches the end it calls load() again to get the
next data block.

After parse_struct(), the syntax tree is no longer needed. It is freed by
$$<a$+href="#freeSyntax">$$$_free_syntax()$_$$</a>$$$ .
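The call sequence described above can be pictured with the following minimal
sketch. It is not netrik's code: "syntax_tree" and "item_tree" are the field
names from the text, but all types, the "_sketch" names, the "trace" buffer and
the function bodies are hypothetical stand-ins that only record the order of
the passes. Freeing the syntax tree directly after the structure pass is our
reading of the description, not something checked against layout.c.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the real trees. */
struct Syntax_sketch { int unused; };
struct Item_sketch { int unused; };

struct Layout_sketch {
    struct Syntax_sketch *syntax_tree;  /* only needed up to parse_struct() */
    struct Item_sketch *item_tree;      /* kept for rendering */
    char trace[64];                     /* records the order of the passes */
};

/* Append a step name; "trace" is large enough for all steps used below. */
static void note(struct Layout_sketch *layout, const char *step)
{
    strcat(layout->trace, step);
}

struct Layout_sketch *layout_sketch(void)
{
    /* calloc() zeroes the descriptor, so "trace" starts as an empty string */
    struct Layout_sketch *layout = calloc(1, sizeof *layout);
    if (layout == NULL)
        return NULL;

    layout->syntax_tree = calloc(1, sizeof *layout->syntax_tree);
    note(layout, "syntax;");      /* stands for parse_syntax() */
    note(layout, "elements;");    /* stands for parse_elements() */
    note(layout, "rework;");      /* sgml_rework(); absent with -DXHTML_ONLY */

    layout->item_tree = calloc(1, sizeof *layout->item_tree);
    note(layout, "struct;");      /* stands for parse_struct() */

    free(layout->syntax_tree);    /* stands for free_syntax() */
    layout->syntax_tree = NULL;
    note(layout, "free;");

    note(layout, "prerender;");   /* stands for pre_render() */
    return layout;
}

void free_layout_sketch(struct Layout_sketch *layout)
{
    if (layout == NULL)
        return;
    free(layout->item_tree);
    free(layout);
}
```

A caller would obtain the descriptor from layout_sketch(), use the item tree,
and release everything with free_layout_sketch() when the page is unloaded.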

Also after parse_struct(), the "link_list" and "anchor_list" data structures are
created using
$$<a$+href="hacking-links.html#linkList">$$make_link_list()$$</a>$$ and
$$<a$+href="hacking-links.html#anchorList">$$make_anchor_list()$$</a>$$. (See
$$<a$+href="hacking-links.html">$$hacking-links.*$$</a>$$ )

$#</a> <!-- layout -->

$#<a name="freeLayout" id="freeLayout">

$=$$<h3>$$$_free_layout()$_$$</h3>$$

When a page is unloaded (usually before loading a new page), this function is
called to free all the data structures created by the layouting process to
allow rendering (Item tree, page map, link list, anchor list), i.e. all data
stored in the "Layout" descriptor, except for "input" and "syntax_tree", which
are already freed during the layouting process. (See above.) The descriptor
itself is also freed.

$#</a> <!-- freeLayout -->

$#<a name="resize" id="resize">

$=$$<h3>$$$_resize()$_$$</h3>$$

The resize() function is somewhat similar to
$$<a$+href="#layout">$$$_layout()$_$$</a>$$ -- it calls the same subfunctions
to create a combination of an item tree with assigned coordinates and a page
usage map. The difference is that resize() does not start from scratch, but
only repeats the steps necessary to adapt to a new screen width; the properties
determined by the document itself (i.e. the item tree) are kept.

So we actually just call pre_render() (see $$<a$+href="#preRender">$$$_7.
pre-render.c$_$$</a>$$$ ) again. (Note that the minimal item sizes calculated
in $$<a$+href="#calcWidth">$$$_calc_width()$_$$</a>$$$ could be kept also; for
simplicity, we just pre-render completely again anyway -- this shouldn't be
too big a loss, we believe. When implementing rendering of incompletely loaded
pages, we will have to create some mechanism to skip such unnecessary
re-calculations per item anyway.)
160 161As resize() starts from a "ready" page, not from scratch, it has to free the 162old data structures (page map) before creating new ones. 163 164$#</a> <!-- resize --> 165 166$#<a name="parseSyntax" id="parseSyntax"> 167 168$=$$<h2>$$$_2. parse-syntax.c$_$$</h2>$$ 169 170The first thing to be done when layouting is parsing the syntax of the input 171file. This is done by parse_syntax(). This function creates a syntax tree. The 172pointer to the head of this tree is returned to layout() and stored as 173"layout->syntax_tree" there. 174 175$=$$<h3>$$$_Syntax Tree$_$$</h3>$$ 176 177Every node of "syntax_tree" is a structure of the type "Element" (defined in 178"syntax.h") and corresponds to one HTML element. (An element in an HTML 179document is represented by an HTML-tag, and the corresponding end tag, if any.) 180End tags do not create tree nodes, as they only close elements already stored. 181 182$#<a name="testHtml" id="testHtml"> 183 184For the supplied "test/0.html": 185 186$� header text 187$� 188$� <html><head> </head> 189$� <body> 190$� <h1> heading </h1> 191$� <p> 192$� first paragraph of text; 193$� includes multiple spaces and newlines, 194$� <em> emphasized text </em>and 195$� <strong> strong text </strong> 196$� </p> 197$� <p> 198$� <center>starting with an evil center tag,</center> 199$� this very long second paragraph contains some special characters (including a simple space...): 200$� &; <>"=/ plus a big gap and two unicode escapes 201$� (decimal: ¡ and hexal: ¿) 202$� but also an anchor em<a name="anchor" href="">bed</a>ded inside a word 203$� (this anchor also is the only tag with parameters); 204$� and finally a blank row <br /> (a single tag) 205$� </p> 206$� </body> 207$� </html> 208 209$#</a> <!--testHtml--> 210 211the syntax tree looks like this: 212 213$#<a name="syntaxTree" id="syntaxTree"> 214 215$� ++>NULL 216$� + 217$�+---+ 218$�| ! |-. <++ 219$�+---+ | + 220$� v + 221$� ,-----------. 
222$�("header text") 223$� `-+------+--' 224$� | html |-. <++++++++ 225$� +------+ | <+ + 226$� v + + 227$� +------+ +------+ <+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 228$� | head |->| body |-. <++++++++++++++ + 229$� +------+ +------+ | <+ + + 230$� v + + + <+++++++++++++++++++++++++++++++++++++++++++++++++ 231$� +----++ +---+ <++++++++++++++++++++ +---+ <++++++++++++++++++++++++++++++++ + 232$� | h1 |-. <+ ,->| p |---. <++ + ,->| p |-. <++++++++++++++++++ + + (back 233$� +----+ | + | +---+ | + + | +---+ | <+ + + + to top) 234$� v + | v + + | | + + + + ^ 235$� ,--------. | ,------------------. ,----. | | + ,----------. ,---------. ,---------------. | 236$� (" heading") | (" first...newlines,") (" and") | v + (" this...em") ("ded...row") (" (a single tag)") | 237$� `--+---+-' | `-----+----+-------' +-+----+-+ | +--------+ `---+---+--' `--+----+-' `-----+---+-----' | 238$� | ? |----' | em |-. <++ ,->| strong |-. <++ | | center |-. <++ ,->| a |-. <+ ,->| br |---------->| ? |--------' 239$� +---+ +----+ | + | +--------+ | + | +--------+ | + | +---+ | + | +----+ +---+ 240$� v + | v + | v + | v + | 241$� ,----------------. | ,------------. | ,----------------. | ,---.+ | 242$� (" emphasized text") | (" strong text") | (" starting...tag,") | ("bed") | 243$� `--------+---+---' | `-----+---+--' | `--------+---+---' | +---+ | 244$�+++> "parent" | ? |------' | ? |-----' | ? |------' | ? |--' 245$�---> "list_next" +---+ +---+ +---+ +---+ 246 247$#</a> <!--syntaxTree--> 248 249(I'm really curious if anyone can read this ;-) ) 250 251The "Element" structure includes: 252 253$#<ul> <li> 254��$- The pointers "list_next" and "parent" describe the tree structure. 255��$ "parent" points to the element (node) which contains this one in its content 256��$ (text) area. "list_next" points to the next element as it appears in the 257��$ input stream. 
258$#</li> <li> 259��$- The "closed" flag is a helper flag for $$<a$+href="#sgmlRework">$$$_$5$.$ sgml_rework()$_$$</a>$$ and has 260��$ no meaning outside of it. 261$#</li> <li> 262��$- The union "name" describes what kind of element this node represents. ("html", 263��$ "head" etc.) It can store the element name either as a pointer to a string 264��$ (as appears in the input stream), or as an enum number. 265$#</li> <li> 266��$- "attr_count" stores the number of attributes of this element. (Attributes 267��$ are the parameters of an element, which appear inside the start tag, like: 268��$ href="foo" etc.) 269$#</li> <li> 270��$- "attr" points to an array of "Attr" structures. Each of these structures 271��$ contains the data for one attribute; it consists of a union of type 272��$ "Attr_name", which, like "Element_name", stores the attribute name either as a 273��$ string or as an enum; and a union of type "Attr_value", which stores the 274��$ value of the attribute. (String or number.) 275$#</li> <li> 276$#<a name="elementText"> 277��$- The "content" string stores the content. (The text between the tags.) Every 278��$ element stores the content between the previous tag and the start tag of this 279��$ element. Thus it does not store the content of the element itself, but part 280��$ of the content of the *parent* element. This simplifies processing a bit, 281��$ because this way no facility for storing content blocks divided by 282��$ sub-elements is needed -- the sub-elements store the content themselves. The 283��$ caveat is that a lot of dummy elements are needed to store the content if no 284��$ further sub-element follows them. This is quite a big inefficiency, as nearly 285��$ every real element also needs a dummy element to store its content. This 286��$ should change in the future -- if we won't drop the syntax-tree in its 287��$ present form at all... 
Which we will :-)
$#</a> <!--elementText-->
$#</li> </ul>

$=$$<h3>$$$_Initialization$_$$</h3>$$

Before starting parsing, we have to create the tree top. (We call it the global
element.) This is done by setting "cur_el" to NULL and calling add_element().

$#<a name="addElement" id="addElement">

$=$$<h4>$$$_add_element()$_$$</h4>$$

This function creates a new node and inserts it into the syntax tree; thus it
has to set the "parent" and "list_next" pointers too, and adjust some
pointers of other nodes to point to this one.

"parent" is set to "cur_el", as any new tag is created while parsing the
content area of its parent. "list_next" is set to NULL, as the new node is
always the last one in the list. "list_next" of "last_el" (the last node in the
list up to now) is set to point to the new node; this is omitted if "cur_el" is
NULL, indicating that there are no other nodes yet.

$#</a> <!--addElement-->

$#<a name="parsing" id="parsing">

$=$$<h3>$$$_Parsing$_$$</h3>$$

The parser itself works in a very simple way. It is some kind of state machine.
For every input character, one action is taken, selected by a dispatcher
depending on the current state (stored in "parse_mode") and the input char
itself. Several combinations (e.g. tag start) change the current state, thus
the following character(s) are parsed in a different mode. (Other actions are
taken.)

Sometimes a character that causes a mode switch has to be parsed in the new
mode itself. In this case the flag "recycle" is set after the mode change,
causing the dispatch to be repeated for the same char, but in the new mode.

Again, the parsing is not very efficient in the present implementation. (In
fact, it is by far the most time consuming part of the whole layouting.)
Especially the huge switch is quite slow.
(Good compilers have a fairly
efficient implementation of the switch itself; however, it still causes many
unpredictable branches.) There are some possibilities to optimize this. The
bigger problem is that the inner loop is quite big, and may not fit into the
processor's instruction cache, thus making it terribly slow. Maybe splitting
the parsing into several simpler passes would help. However, we are planning to
switch to a completely different, (hopefully) much more efficient parser system
in the next major release...

The default parsing mode is "PM_CONTENT", which is the mode for parsing element
content. Any normal character encountered in this mode is simply added to
"text_buf" by "buf_add_char()". A ' ', '\t', '\n', '\r' or '\f' isn't stored;
we switch to "PM_BLANK" instead. Any following blank space is ignored. As soon
as a normal character occurs again, we store a single ' ' and switch back to
"PM_CONTENT".

$�input:
$� first paragraph of text;
$� includes multiple spaces and newlines,
$� ^
$� file position
$�
$�text_buf: " first paragraph of text; includes mul"

$=$$<h4>$$$_<pre> Blocks$_$$</h4>$$

After a <pre> tag, the mode isn't switched back to "PM_CONTENT", but to
"PM_PRE". In this mode all blank space characters are stored to "text_buf"
as non-breakable spaces, except newlines which are stored directly.

The mode is ended and switched to "PM_CONTENT" again when a closing "</pre>"
tag is encountered.

PM_PRE is also (mis-)used for <textarea>: The content of a <textarea> is used
as the initial value; and this is plain text, so it has to be treated
literally, without messing with the blanks. Thus it can be handled similarly to
<pre>, except that blanks are really stored directly, not even converted to
non-breakable spaces. The "textarea" flag indicates we are in a <textarea> not
a real <pre>, and this exception needs to be applied.
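The blank handling in "PM_CONTENT", "PM_BLANK" and "PM_PRE" described above can
be sketched as a tiny state machine. This is only an illustration, not the real
parser: the mode names are taken from the text, but store_text(), is_blank()
and the one-byte NBSP stand-in are assumptions made for the example, and the
<textarea> exception is left out.

```c
#include <assert.h>
#include <string.h>

enum Mode { PM_CONTENT, PM_BLANK, PM_PRE };

/* Assumption: a one-byte stand-in for a non-breakable space. */
#define NBSP '\xa0'

static int is_blank(char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f';
}

/* Copies "in" to "out", treating blanks according to the parsing mode.
 * "out" must hold at least strlen(in) + 1 bytes.
 * Returns the mode that is active after the last character. */
enum Mode store_text(const char *in, char *out, enum Mode mode)
{
    size_t len = 0;

    for (; *in != '\0'; ++in) {
        char c = *in;

        switch (mode) {
        case PM_PRE:
            /* blanks become non-breakable spaces, newlines are kept */
            out[len++] = (is_blank(c) && c != '\n') ? NBSP : c;
            break;
        case PM_CONTENT:
            if (is_blank(c))
                mode = PM_BLANK;   /* the blank itself is not stored */
            else
                out[len++] = c;
            break;
        case PM_BLANK:
            if (!is_blank(c)) {    /* first normal char after the blanks */
                out[len++] = ' ';
                out[len++] = c;
                mode = PM_CONTENT;
            }
            break;
        }
    }
    out[len] = '\0';
    return mode;
}
```

Note that trailing blanks stay pending: if the input ends with blank space, the
function returns "PM_BLANK" and no ' ' has been stored for it yet.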

Note that this is quite a dirty hack, which may not work in all situations.
(<textarea> inside <pre>...) However, the new parser in 2.x will handle this
totally differently anyway, so it's not worth more effort with the old parser.

$=$$<h4>$$$_References$_$$</h4>$$

An '&' indicates a character reference (unicode escape) or entity reference
(named escape), and starts the reference parsing mode "PM_AMP". On entering
this mode, the current write position in "text_buf" is saved to "amp_pos".

$�input: [...] < [...]
$� ^
$�text_buf: "[...] &"
$� ^^
$� text_buf_len
$� amp_pos

There are several submodes in the PM_AMP family, keeping track of the reference
syntax -- this is necessary in SGML mode, as there is no other method to
reliably discover the end of the reference or a '&' character which isn't
actually a reference.

Nonetheless, all characters occurring in any of these submodes are added to
"text_buf"; actually evaluating the reference is done only when its end is
encountered. The mode is then switched back to the previous parsing mode before
the reference occurred (saved in "prev_mode_amp") -- references can occur in
content and in attribute values.

$�input: [...] < [...]
$� ^
$�text_buf: "[...] <"
$� ^ ^
$� text_buf_len
$� amp_pos

The text between the saved start position of the escape sequence and the
current position is then converted, depending on the type of the reference.

If it's a symbolic (named) reference, the string is looked up in "ref_table[]",
which is a table of named characters, defined in facilities.c$ .

For numerical character references, the integer value is extracted, using
decimal or hexadecimal conversion, depending on whether the number starts with
'x'.
(Note that we have different parser states for decimal and hexal; however,
testing for the 'x' instead of the parser state saves us one condition, as we
would have to test the 'x' anyways -- the parser treats every alphanumerical
sequence starting with a letter as a hex number.)

If a replacement char was found either in the table or by the number
conversion, the reference is removed from "text_buf" and the replacement char
is inserted instead.

$�text_buf: "[...] <"
$� ^
$� text_buf_len

If no replacement was found, the string is left unchanged. Probably we will
mark unknown escapes with some visible attribute in the future.

$=$$<h4>$$$_Tags$_$$</h4>$$

A '<' starts tag parsing. There is a whole bunch of tag parsing modes. The
one entered after the '<' is "PM_TAG_START", which indicates that the tag name
should follow next.

$�input: [...] but also an anchor em<a name="anchor" href="">bed</a>ded inside a word
$� ^
$�text_buf: "[...] but also an anchor em"
$�tree:
$� +
$� +
$� +---+
$�-->| p |-. <== cur_el
$� +---+ | <+
$� | +
$� | +
$� v +
$� +--------+
$� | center |-. <++
$� +--------+ | +
$� v +
$� ,----------------.
$� (" starting...tag,")
$� `--------+---+---' <-- last_el
$� | ? |->NULL
$� +---+

$=$$<h5>$$$_Start Tags$_$$</h5>$$

If the following character is a normal char, it's a start tag. (Or a single
tag, which is treated the same way for now.) "PM_TAG_NAME" is entered. A new
element node is created by
$$<a$+href="#addElement">$$$_add_element()$_$$</a>$$$ . Any content in front of
this new element, which was stored in "text_buf" up to now, is stored to the
new node's "$$<a$+href="#elementText">$$text$$</a>$$" field by "insert_buf".

$�input: [...] but also an anchor em<a name="anchor" href="">bed</a>ded inside a word
$� ^
$�text_buf: ""
$�tree:
$� +
$� +---+
$�->| p |-.
<++++++++++++++++++
$� +---+ | <+ +
$� | + <== +
$� | + ,----------.
$� v + (" this...em")
$� +--------+ `---+---+--' <--
$� | center |-. <++ ,->| |->NULL
$� +--------+ | + | +---+
$� v + |
$� ,----------------. |
$� (" starting...tag,") |
$� `--------+---+---' |
$� | ? |------'
$� +---+

Normal characters encountered in "PM_TAG_NAME" mode (including the one that
started the mode) are stored to "text_buf".

$�input: [...] but also an anchor em<a name="anchor" href="">bed</a>ded inside a word
$� ^
$�text_buf: "a"

A blank space character ends "PM_TAG_NAME" and switches to "PM_TAG", which indicates that
attributes may follow. "text_buf" is stored as the element name.

$�input: [...] but also an anchor em<a name="anchor" href="">bed</a>ded inside a word
$� ^
$�text_buf: ""
$�tree:
$� +
$� +---+
$�->| p |-. <++++++++++++++++++
$� +---+ | <+ +
$� | + <== +
$� | + ,----------.
$� v + (" this...em")
$� +--------+ `---+---+--' <--
$� | center |-. <++ ,->| a |->NULL
$� +--------+ | + | +---+
$� v + |
$� ,----------------. |
$� (" starting...tag,") |
$� `--------+---+---' |
$� | ? |------'
$� +---+

A following normal char is the beginning of an attribute name, and switches to
"PM_ATTR_NAME". Characters encountered in "PM_ATTR_NAME" mode are also stored to "text_buf".

$�input: [...] but also an anchor em<a name="anchor" href="">bed</a>ded inside a word
$� ^
$�text_buf: "name"

The attribute name ends with an '=' or a blank char. A new entry is created in the "attr[]"
array, and "text_buf" is stored as the attribute name. Mode is switched to
"PM_ATTR_NAME_END" first, which indicates that the attribute value should follow.

$�input: [...]
but also an anchor em<a name="anchor" href="">bed</a>ded inside a word
$� ^
$�text_buf: ""
$�attr: name: data:
$� "name" ""

If the attribute name was ended by an '=', mode is switched immediately to
"PM_ATTR_VALUE", otherwise as soon as an '=' is encountered. (After any amount
of whitespace.)

White space in "PM_ATTR_VALUE" mode (after the '=') is ignored too.

Next char must be either a '"' or a '\'', and switches to "PM_ATTR_DATA_QUOT"
or "PM_ATTR_DATA_APOS", respectively. In these modes characters are stored to
"text_buf" again.

$�input: [...] but also an anchor em<a name="anchor" href="">bed</a>ded inside a word
$� ^
$�text_buf: "anchor"
$�attr: name: data:
$� "name" ""

A second '"' (or '\'', respectively) ends this mode. "text_buf" is stored as
the attribute value for the new "attr" entry.

$�input: [...] but also an anchor em<a name="anchor" href="">bed</a>ded inside a word
$� ^
$�text_buf: ""
$�attr: name: data:
$� "name" "anchor"

Mode is switched back to "PM_TAG". Now blank space may follow (which is
ignored), followed by another attribute.

$�input: [...] but also an anchor em<a name="anchor" href="">bed</a>ded inside a word
$� ^

In "PM_TAG" mode also a '>' may occur, ending tag parsing and switching back to
the mode before tag parsing had begun. (PM_CONTENT or PM_BLANK.) In this case,
"cur_el" is set to "last_el"; this means descending in the syntax tree to the
newly created node.

$�input: [...] but also an anchor em<a name="anchor" href="">bed</a>ded inside a word
$� ^
$�text_buf: ""
$�attr: name: data:
$� "name" "anchor"
$� "href" ""
$�tree:
$� +
$� +---+
$�->| p |-. <++++++++++++++++++
$� +---+ | <+ +
$� | + +
$� | + ,----------.
$� v + (" this...em")
$� +--------+ `---+---+--' <==
$� | center |-.
<++ ,->| a |->NULL <--
$� +--------+ | + | +---+
$� v + |
$� ,----------------. |
$� (" starting...tag,") |
$� `--------+---+---' |
$� | ? |------'
$� +---+

A '>' may also occur in "PM_TAG_NAME" mode, meaning the element has no
attributes.

$�input: <html> <head> [...]
$� ^

In this case creating the new node and storing the name, and descending into
the element are done in one step. (By "recycle".)

$=$$<h5>$$$_End Tags$_$$</h5>$$

If the first character after the '<' is a '/', the tag is an end tag, and we
switch to "PM_END_TAG_START", and then to "PM_END_TAG_NAME" on the first
letter.

If any text was pending in "text_buf" before the tag, we have to store it
somewhere. As an end tag normally does not create a new element node, we have
to create a dummy node for this. (Very inefficient, see above.)

$�input: [...] but also an anchor em<a name="anchor" href="">bed</a>ded inside a word
$� ^
$�tree:
$� +
$� +---+
$�->| p |-. <++++++++++++++++++
$� +---+ | <+ +
$� | + +
$� | + ,----------.
$� v + (" this...em")
$� +--------+ `---+---+--' <==
$� | center |-. <++ ,->| a |-. <+
$� +--------+ | + | +---+ | +
$� v + | v +
$� ,----------------. | ,---.+
$� (" starting...tag,") | ("bed")
$� `--------+---+---' | +---+ <--
$� | ? |------' | ? |->NULL
$� +---+ +---+

Normal chars in "PM_END_TAG_NAME" mode are stored to "text_buf", too.
"PM_END_TAG_NAME" can be ended immediately by a '>', or by blank space
(switching to "PM_END_TAG_SPACE") followed by '>'.

The tag name extracted to "text_buf" is compared against the element name of
the current element, to see if the end tag matches, and then abandoned. The
element is closed by ascending to the parent.

$�input: [...] but also an anchor em<a name="anchor" href="">bed</a>ded inside a word
$� ^
$�tree:
$� +
$� +---+
$�->| p |-.
<++++++++++++++++++ 647$� +---+ | <+ + 648$� | + <== + 649$� | + ,----------. 650$� v + (" this...em") 651$� +--------+ `---+---+--' 652$� | center |-. <++ ,->| a |-. <+ 653$� +--------+ | + | +---+ | + 654$� v + | v + 655$� ,----------------. | ,---.+ 656$� (" starting...tag,") | ("bed") 657$� `--------+---+---' | +---+ <-- 658$� | ? |------' | ? |->NULL 659$� +---+ +---+ 660 661$=$$<h5>$$$_Single Tags$_$$</h5>$$ 662 663If a '/' appears instead of an attribute name in "PM_TAG" mode, "parse_mode" is 664set to "PM_SINGLE_TAG", indicating an (XML) single tag. 665 666$�input: [...] and finally a blank row <br /> (a single tag) 667$� ^ 668 669The '/' can also immediately follow the element name. (In "PM_TAG_NAME" mode.) 670 671$�input: <hr/> 672$� ^ 673 674In this case, creating the node and switching to a single tag are done in one 675step by "recycle". 676 677In any case, a '>' has to follow, and switches back to normal mode just like in 678a start tag, only it does not descend (set the new node as "cur_el") -- a 679single tag has no content area; the content following a single tag still 680belongs to the parent. 681 682$�input: [...] and finally a blank row <br /> (a single tag) 683$� ^ 684$�tree: 685$� + 686$� +---+ <++++++++++++++++++++++++++++++++ 687$�->| p |-. <++++++++++++++++++ + 688$� +---+ | <+ + + 689$� | + <== + + 690$� | + ,----------. ,---------. 691$� v + (" this...em") ("ded...row") 692$� +--------+ `---+---+--' `--+----+-' <-- 693$� | center |-. <++ ,->| a |-. <+ ,->| br |->NULL 694$� +--------+ | + | +---+ | + | +----+ 695$� v + | v + | 696$� ,----------------. | ,---.+ | 697$� (" starting...tag,") | ("bed") | 698$� `--------+---+---' | +---+ | 699$� | ? |------' | ? |--' 700$� +---+ +---+ 701 702$=$$<h4>$$$_Comments$_$$</h4>$$ 703 704In "PM_TAG_START" mode (after the '<'), also an '!' can follow, indicating that 705we have not any tag at all, but either a comment, a DOCTYPE declaration, or a 706CDATA section. 
"parse_mode" is set to "PM_EXCLAM" in this case. 707 708$�input: some text <!--a test-comment--> and more text 709$� ^ 710 711If a '-' follows, it's a comment. Mode is switched to "PM_COMMENT_START". 712 713$�input: some text <!--a test-comment--> and more text 714$� ^ 715 716Now a second '-' has to follow, switching to "PM_COMMENT". In this mode any 717characters but a '-' are simply ignored. 718 719$�input: some text <!--a test-comment--> and more text 720$� ^ 721$�text_buf: "some text" 722 723A '-' switches to "PM_COMMENT_END1", which means that it *may* be the comment end. 724 725$�input: some text <!--a test-comment--> and more text 726$� ^ 727 728However, if it is followed by any other char than a second '-', mode is 729switched back to "PM_COMMENT". 730 731$�input: some text <!--a test-comment--> and more text 732$� ^ 733 734A second '-' in "PM_COMMENT_END1" switches to "PM_COMMENT_END2", which means 735that now the comment really ends. 736 737$�input: some text <!--a test-comment--> and more text 738$� ^ 739 740Now the '>' has to follow, and switches back to parsing mode before the 741beginning of the comment ("prev_mode_tag"). 742 743$�input: some text <!--a test-comment--> and more text 744$� ^ 745 746$=$$<h4>$$$_DOCTYPE Declarations$_$$</h4>$$ 747 748If a normal char occurs in "PM_EXCLAM", we assume it is the "D" in "<!DOCTYPE". 749 750$�input: garbage <!DOCTYPE somedoc> more garbage 751$� ^ 752 753We treat DOCTYPE declarations as comments. Any characters but '>' are ignored. 754 755$�input: garbage <!DOCTYPE somedoc> more garbage 756$� ^ 757$�text_buf: "garbage" 758 759A '>' returns to normal mode. 760 761$�input: garbage <!DOCTYPE somedoc> more garbage 762$� ^ 763 764This isn't a very reliable detection, as according to the grammer, an unescaped 765'>' may appear in some system literal inside the declaration. However, we 766assume that this won't happen... (We would have to parse the whole declaration 767otherwise.) 

$=$$<h4>$$$_CDATA Sections$_$$</h4>$$

A '[' in "PM_EXCLAM" mode starts a CDATA section, indicated by
"PM_CDATA_START". If there is a pending blank ("prev_mode_tag" is "PM_BLANK"),
it has to be stored *before* the CDATA.

$�input: some text <![CDATA[a tricky ]> CDATA section]]> and more text
$� ^
$�text_buf: "some text "

Following normal chars (should) belong to the "CDATA" string, and are ignored.

$�input: some text <![CDATA[a tricky ]> CDATA section]]> and more text
$� ^
$�text_buf: "some text "

A second '[' in "PM_CDATA_START" mode switches to "PM_CDATA", indicating that
the actual data will follow.

$�input: some text <![CDATA[a tricky ]> CDATA section]]> and more text
$� ^

Any characters in "PM_CDATA" mode but '>' are stored directly to "text_buf".

$�input: some text <![CDATA[a tricky ]> CDATA section]]> and more text
$� ^
$�text_buf: "some text a tricky ]"

When a '>' occurs, the previous two chars (in "text_buf") are tested against
"]]". If they do not match, the '>' is simply stored just as any other
character.

$�input: some text <![CDATA[a tricky ]> CDATA section]]> and more text
$� ^
$�text_buf: "some text a tricky ]>"

If they match, the last two characters are removed from "text_buf" (they belong
to the CDATA terminator), and mode is switched back to "PM_CONTENT". (It
doesn't need to be switched back to the mode before the CDATA section, as any
pending blanks already have been stored, and a CDATA section can't start in
other modes than "PM_CONTENT" or "PM_BLANK".)
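The '>' test in "PM_CDATA" mode described above can be sketched like this. The
function names cdata_char() and cdata_section() and the explicit length
parameter are assumptions made for the example; only the idea of checking the
last two stored characters against "]]" is taken from the text.

```c
#include <assert.h>
#include <string.h>

/* Handles one character in "PM_CDATA" mode. "text_buf" holds "*len"
 * characters plus a terminating NUL and must have room for one more.
 * Returns 1 if the character ended the CDATA section, 0 otherwise. */
int cdata_char(char *text_buf, size_t *len, char c)
{
    if (c == '>' && *len >= 2
        && text_buf[*len - 1] == ']' && text_buf[*len - 2] == ']') {
        *len -= 2;                  /* the "]]" belongs to the terminator */
        text_buf[*len] = '\0';
        return 1;
    }
    text_buf[(*len)++] = c;         /* everything else is stored directly */
    text_buf[*len] = '\0';
    return 0;
}

/* Feeds "in" to cdata_char() until the section ends or the input runs out.
 * Returns a pointer to the first character after the terminator. */
const char *cdata_section(char *text_buf, const char *in)
{
    size_t len = strlen(text_buf);

    while (*in != '\0')
        if (cdata_char(text_buf, &len, *in++))
            break;
    return in;
}
```

Called with "text_buf" already containing the text in front of the section,
cdata_section() appends the CDATA content and leaves the read position right
behind the "]]>".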
810 811$�input: some text <![CDATA[a tricky ]> CDATA section]]> and more text 812$� ^ 813$�text_buf: "some text a tricky ]> CDATA section]]" 814 815$�input: some text <![CDATA[a tricky ]> CDATA section]]> and more text 816$� ^ 817$�text_buf: "some text a tricky ]> CDATA section" 818 819$=$$<h4>$$$_Processing Instructions$_$$</h4>$$ 820 821The '<' may also be followed by a '?', indicating a processing instruction. 822Mode is switched from "PM_TAG_START" to "PM_INSTR". 823 824$�input: some text <?a fake? processing instruction??> more text 825$� ^ 826 827Processing instructions are also treated as comments. Any chars but '?' are 828ignored in "PM_INSTR". 829 830$�input: some text <?a fake? processing instruction??> more text 831$� ^ 832$�text_buf: "some text" 833 834A '?' switches to "PM_INSTR_END", indicating this *may* be the end of the 835processing instruction. 836 837$�input: some text <?a fake? processing instruction??> more text 838$� ^ 839 840If a normal char follows the '?', mode is switched back to "PM_INSTR". 841 842$�input: some text <?a fake? processing instruction??> more text 843$� ^ 844 845If a second '?' follows, "PM_INSTR_END" is kept, as the first one isn't the end 846of the processing instruction, but the new one could be. 847 848$�input: some text <?a fake? processing instruction??> more text 849$� ^ 850 851A '>' in PM_INSTR_END really ends the processing instruction, and switches to 852"prev_mode_tag". 853 854$�input: some text <?a fake? processing instruction??> more text 855$� ^ 856 857$=$$<h3>$$$_SGML Mode$_$$</h3>$$ 858 859When compiled without the "-DXHTML_ONLY" option, a few cases more are possible. 860 861$=$$<h4>$$$_Unclosed Tags$_$$</h4>$$ 862 863In SGML, not every element has to have an end tag. 864 865When an end tag is encountered, we ascend in the syntax tree not only once, but 866until an element is found that matches the end tag. Thus, all elements in 867between are automatically closed. 
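A minimal sketch of this ascent, which the figures below illustrate on a small example. The "Element" struct and close_element() are made up for this illustration and reduced to the two fields needed; netrik's real element nodes carry more data. We assume here that the global element is the only one without a parent and can't be closed by an end tag.

```c
#include <stddef.h>
#include <string.h>

/* Reduced, hypothetical element node: just a name and a parent link. */
struct Element {
    const char *name;
    struct Element *parent;     /* NULL only for the global element */
};

/* Handles an end tag "end_name" while "cur" is the innermost open element.
 * Ascends until a matching element is found, implicitly closing everything
 * in between. Returns the element that is current afterwards (the parent of
 * the matched one), or NULL if no open element matches. */
static struct Element *close_element(struct Element *cur, const char *end_name)
{
    struct Element *el;

    for (el = cur; el != NULL && el->parent != NULL; el = el->parent) {
        if (strcmp(el->name, end_name) == 0)
            return el->parent;
    }
    return NULL;
}
```

With the nesting body, p, hr and the end tag "</p>", the search starts at the hr, skips it because it doesn't match, finds the p, and makes body the current element again.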
868 869$�input: 870$�<body> 871$� <p> 872$� some text 873$� <hr> 874$� </p> 875$� ^ 876$� 877$�+------+ 878$�| body |-.<+ 879$�+------+ | + 880$� v + 881$� +---+ 882$� | p |-.<+ 883$� +---+ | + 884$� v + 885$� ,-------. 886$� (some text) 887$� `-+----+' 888$� --> | hr |->NULL 889$� ==> +----+ 890 891$�input: 892$�<body> 893$� <p> 894$� some text 895$� <hr> 896$� </p> 897$� ^ 898$� 899$�+------+ 900$�| body |-.<+ <== 901$�+------+ | + 902$� v + 903$� +---+ 904$� | p |-.<+ 905$� +---+ | + 906$� v + 907$� ,-------. 908$� (some text) 909$� `-+----+' 910$� --> | hr |->NULL 911$� +----+ 912 913$=$$<h4>$$$_Unquoted Attribute Values$_$$</h4>$$ 914 915When a normal char occurs in "PM_ATTR_VALUE" mode, "PM_ATTR_DATA_NOQUOTE" is entered. 916 917$�input: <sometag someattribute=somevalue minimized third="nothing"> 918$� ^ 919 920This mode is just like "PM_ATTR_DATA_QUOT" or "PM_ATTR_DATA_APOS", only it 921is ended by a blank or the tag end. 922 923$�input: <sometag someattribute=somevalue minimized third="nothing"> 924$� ^ 925 926$=$$<h4>$$$_Minimized Attributes$_$$</h4>$$ 927 928In SGML, attributes without a value are possible. This is recognized when a 929normal char or the tag end occurs in "PM_ATTR_NAME_END" mode instead of the 930'='. 931 932$�input: <sometag someattribute=somevalue minimized third="nothing"> 933$� ^ 934 935The attribute is ended immediately. "text_buf[]" (which is empty in this case) 936is stored just like at the end of an unquoted attribute value. Mode is set to 937"PM_TAG", and the current character (the tag end or beginning of next 938attribute) is processed in this mode. 939 940$=$$<h4>$$$_SGML Comments$_$$</h4>$$ 941 942Comments also allow more complicated syntax. For one, blank space is possible 943between the "--" ending the comment string and the '>' ending the declaration. 944Thus, blank space in PM_COMMENT_END2 is ignored. 945 946Moreover, another comment string may follow the end. 
Thus, a '-' in 947PM_COMMENT_END2 switches back to PM_COMMENT_START, similarly to the '-' after 948the "<!". 949 950$�input: <!--comment start-- --second comment string in same declaration-- > 951$� ^ 952 953Finally, SGML also allows empty declarations ("<!>"), which are also a kind of 954comment. Thus a '>' in PM_EXCLAM switches immediately to PM_COMMENT_END2 and 955recycles. 956 957$�input: <!> 958$� ^ 959 960$=$$<h4>$$$_Unclosed Tags$_$$</h4>$$ 961 962In SGML, tags needn't be closed by '>', if the tag end can be deduced from the 963context. In practice, this means that a tag can also be ended by a '<' 964character, which may be the beginning of a following tag. Thus we have to 965handle this in all situations where a '>' could also occur. 966 967$=$$<h4>$$$_Unhandled Constructs$_$$</h4>$$ 968 969SGML also allows some constructs that aren't recognized by any browser we know 970of. These include empty tags (<> and </>) and "net mode". 971 972Netrik recognizes these constructs and prints a warning, but doesn't handle 973them either -- there is no point in this, as authors couldn't use them anyway 974due to lack of support in other browsers. Handling them correctly would 975actually even break some pages, because netrik would then behave differently from all 976other browsers. 977 978$=$$<h4>$$$_Loose '&' and '<' Chars$_$$</h4>$$ 979 980If some illegal char occurs in an entity/character reference, it's not really a 981reference, but an unescaped '&'. We keep the whole sequence literally and 982switch back to "prev_mode_amp". 983 984$�input: x = a & b 985$� ^ 986$�text_buf: "x = a & " 987 988The same applies to illegal characters in "PM_TAG_START" (and some other PM_TAG* 989modes), which indicate an unescaped '<'. We store a '<' and switch back to 990"prev_mode_tag". 991 992$�input: if(a < b) 993$� ^ 994$�text_buf: "< " 995 996$#</a> <!-- parsing --> 997 998$=$$<h3>$$$_Finishing$_$$</h3>$$ 999 1000Parsing is ended by EOF. 
This should only appear in "PM_CONTENT" or "PM_BLANK" 1001mode (not inside some tag, comment, CDATA section or character/entity 1002reference), and only if the current element is the global one (not while 1003parsing some element's content). 1004 1005The "list_next" pointer of the last node is set to point back to the tree top. 1006This facilitates processing in the following steps. 1007 1008$=$$<h3>$$$_Error Handling$_$$</h3>$$ 1009 1010When using -DXHTML_ONLY, every syntax error encountered causes netrik to print 1011an error message and immediately quit. (The XML standard requires this.) 1012 1013Without -DXHTML_ONLY, netrik is more tolerant. 1014 1015$=$$<h4>$$$_Workarounds$_$$</h4>$$ 1016 1017Netrik uses simple workarounds for some of the most common cases of broken 1018HTML. 1019 1020Most notable is comment parsing: As SGML comments have a quite complicated 1021syntax, reasonable error handling is also quite complicated. 1022 1023If something other than '>' (end of comment declaration), '-' (beginning of 1024second comment string), or blank space follows in PM_COMMENT_END2 (after a 1025"--"), then the "--" was probably not intended to have any special meaning, but 1026simply to be part of the comment. Thus, mode is switched back to PM_COMMENT. 1027 1028$�input: <!-- some broken -- comment --> 1029$� ^ 1030 1031The same is done for unexpected characters in PM_COMMENT_RESTART mode, which is 1032most common for "---" inside a comment. 1033 1034$�input: <!-- some broken --- comment --> 1035$� ^ 1036 1037There is one exception to this, however: If a '>' follows in PM_COMMENT_START 1038mode, and it was preceded not only by one '-' (the one which started 1039PM_COMMENT_START) but two or more, then the '>' together with the last two 1040'-' was probably intended as an XML-like "-->" comment end. 
1041 1042$�input: <!--- anything ---> 1043$� ^ 1044$�parse_mode: PM_COMMENT_START 1045$�dash_count: 3 1046 1047The "dash_count" variable keeps track of how many dashes have been encountered 1048in a row; it is incremented every time a '-' appears in some of the comment 1049parsing modes, and is reset to 0 every time some other character is 1050encountered. 1051 1052"dash_count" is also used in another situation: If a '>' follows in PM_COMMENT 1053or PM_COMMENT_END1, normally it is part of the comment. The '>' is ignored and 1054mode stays PM_COMMENT. (Or is switched back from PM_COMMENT_END1.) 1055 1056$�input: <!-- comment with > and -> in it --> 1057$� ^ 1058 1059However, if there were two or more dashes in front of the '>', this "-->" 1060combination was probably also intended as a comment end. A comment consisting 1061of a series of dashes is a typical example: 1062 1063$�input: <!------> 1064$� ^ 1065$�parse_mode=PM_COMMENT 1066$�dash_count=6 1067 1068However, only a little warning can be printed in this case -- this is valid 1069SGML, and *has* to be treated as part of the comment, even if it's probably not 1070what the page author intended! Printing an error and using a workaround would 1071mean deliberately violating the standard in favour of broken pages, which is 1072probably not a very good idea... 1073 1074There is another trick however, which counteracts this in most situations: As 1075soon as any clear error is detected, a "broken" flag is set for the duration of 1076this comment. If the above situation occurs afterwards, we treat it as an error 1077and abort the comment -- as we are sure that the comment has errors, there is 1078no point in continuing as if the comment was correct. 1079 1080There are also a couple of specific workarounds for tags: 1081 1082Spurious quotes inside the attribute value are quite common when the author 1083forgets the opening quote but not the closing one. These have to be ignored. 
(A 1084warning is printed, but they aren't stored as part of the value or handled 1085otherwise.) 1086 1087Very often we find illegal characters in unquoted attribute values. (According 1088to the standard, only name characters are allowed here.) These produce a 1089warning, but are otherwise handled like legal chars -- as long as they are not 1090ambiguous. (A '<' is always an error for example, as it usually indicates 1091another tag start.) 1092 1093Other unexpected characters in tags (e.g. "a<b =" or something like that) are 1094handled by immediately aborting tag parsing and returning to normal mode. This 1095seems the surest bet, because such a situation usually indicates that the 1096construct wasn't intended as a tag at all, and only looked similar by accident. By 1097bailing out as soon as possible, we try to limit the damage -- staying in tag 1098mode might produce more critical problems, like hiding or misinterpreting 1099considerable parts of the remaining document. 1100 1101Note that it would be even better to store the whole preceding part of the 1102presumed tag literally as content. It would be much more complicated however; 1103we haven't bothered to implement this. 1104 1105Unexpected characters in any other mode are simply ignored, hoping for the 1106best. 1107 1108$=$$<h4>$$$_html_error()$_$$</h4>$$ 1109 1110Whenever some syntax error is detected (no matter whether workarounds are 1111available), html_error() is called, with several parameters describing the 1112error. This function is responsible for everything that needs to be done when 1113an error occurs. 1114 1115Before taking any action, the requested error message is tested against an 1116array with all errors printed so far. The function proceeds only if the message is new; 1117otherwise, an "ignored"-counter is incremented, and the function 1118returns early. 1119 1120Only now html_error() starts its normal operation: First, it prints the 1121message. 
The message text is passed from parse_syntax(), and used as the format 1122string for printf(). If the error message requires additional arguments, they 1123are passed at the end of the parameter list when calling html_error(). 1124 1125If the parsing mode requires that, html_error() quits immediately afterwards. 1126The mode is determined by the config variable "cfg.parser", which is an "enum 1127Parser_mode", with the possible values FUSSY_HTML, CLEAN_HTML, VALID_HTML, 1128BROKEN_HTML and IGNORE_BROKEN. The parser quits only in FUSSY_HTML mode, or 1129when -DXHTML_ONLY is enabled. If the input resource from which the page is 1130loaded is a pipe from wget (see 1131$$<a$+href="hacking-load.html">$$hacking-load.*$$</a>$$$), the pipe is closed 1132before quitting to ensure a cleaner exit. 1133 1134In all other modes, an additional message passed from parse_syntax() is printed 1135afterwards, stating how netrik will handle the error. (workaround, 1136ignore etc.) 1137 1138Finally the error level passed from parse_syntax() is compared against the 1139highest error level up to now, and the new highest level is returned. 1140 1141$#$$<a name="warn" id="warn"> 1142 1143$=$$<h4>$$$_Warning messages$_$$</h4> 1144 1145parse_syntax() keeps track of the most severe syntax error that was found while 1146parsing the page in "err_level", which is of type "enum Syntax_error" and can 1147have the following values: 1148 1149$#<ul> <li> 1150��$- SE_NO: No errors were found 1151$#</li> <li> 1152��$- SE_BREAK: The user issued an interrupt (SIGINT) while loading the document. 1153��$ This isn't really an error, but can be handled very conveniently this way... 1154$#</li> <li> 1155��$- SE_DISCOURAGED: Some constructs were found that are strictly speaking valid 1156��$ SGML, but explicitly discouraged in the HTML standard. These may be handled 1157��$ differently by other browsers -- especially comments. 
1158$#</li> <li> 1159��$- SE_UNIMPLEMENTED: Also valid SGML and discouraged in HTML, but handled 1160��$ correctly neither by netrik nor by any other popular browser. 1161$#</li> <li> 1162��$- SE_WORKAROUND: Real errors were found, but workarounds could be applied that 1163��$ work in most cases. 1164$#</li> <li> 1165��$- SE_CRITICAL: Something went terribly wrong: We have an error situation which 1166��$ we cannot make out, and thus no useful workaround could be applied. The page 1167��$ almost certainly will look broken, often with considerable parts of the 1168��$ content missing. (e.g. a misinterpreted comment or missing closing quote) 1169$#</li> <li> 1170��$- SE_FAIL: This isn't really a syntax error. It is not used inside 1171��$ parse_syntax() itself; it's only set before returning when a file loading 1172��$ error was detected, for the sake of the calling function. 1173$#</li> <li> 1174��$- SE_NODATA: Similar to SE_FAIL. This is set if EOF is returned by load() 1175��$ before *any* data has been read. 1176$#</li> </ul> 1177 1178After the whole page is parsed, a warning message is printed if some error was 1179found. The message text depends on the error level. The error level is also 1180passed back to main(), which then waits for a keypress before starting the 1181pager, so the message will be seen. 1182 1183In IGNORE_BROKEN mode the warning is suppressed, and "err_level" is reset. In 1184BROKEN_HTML mode, all but SE_CRITICAL errors are suppressed; and in VALID_HTML, 1185all but SE_CRITICAL and SE_WORKAROUND. 1186 1187SE_BREAK is set if EOF is returned, but at the same time "input->user_break" 1188has been set, indicating that it's not really EOF, but transfer was interrupted 1189by the user. Other errors are suppressed in this case, as a user break during 1190loading might cause several syntax errors (unclosed elements etc.) for which the 1191page itself is not to blame. 1192 1193SE_NODATA is set if EOF is returned by load() before any data has been read. 
1194(This can be caused by failure to open the resource, but also by an empty 1195file/HTTP response.) It's handled like a normal syntax error; the only 1196difference is that it can't be masked even by IGNORE_BROKEN. The syntax tree 1197consists only of the global element; it will be correctly rendered to an empty 1198page. 1199 1200SE_FAIL isn't set during parsing. Before returning, parse_syntax() checks 1201whether "input->type" is RES_FAIL; if it is, an error message is printed, and 1202SE_FAIL is set (so main() knows an error occurred). However, this test is only 1203necessary if SE_NODATA isn't set; otherwise, an error message has already been 1204printed and an error code would be returned anyway. As EOF is returned by 1205load() also when an error occurred, SE_NODATA is already set for most errors; 1206SE_FAIL is only used if the error occurs after some data could be read. 1207 1208$#</a> <!-- warn --> 1209 1210$#<a name="freeSyntax" id="freeSyntax"> 1211 1212$=$$<h3>$$$_free_syntax()$_$$</h3>$$ 1213 1214This function is responsible for freeing the memory used by the syntax tree 1215when it is no longer needed. 1216 1217The whole tree is traversed by "list_next", and the element nodes are freed one 1218by one. 1219 1220As the "list_next" pointer is necessary to find the next node, but no longer 1221available after freeing the current node, it is saved in "next_el" before 1222freeing. At the beginning of the next iteration this is copied to "cur_el". 1223 1224Before freeing the element node itself, all dynamic data belonging to the node 1225has to be freed. 1226 1227$#</a> <!--freeSyntax--> 1228 1229$#</a> <!--parseSyntax--> 1230 1231$#<a name="dumpTree" id="dumpTree"> 1232 1233$=$$<h2>$$$_3. dump-tree.c$_$$</h2>$$ 1234 1235dump_tree() is primarily used for dumping the syntax tree generated by 1236parse_syntax() for debugging purposes. 
The reason it resides in its own file is 1237that it could be easily modified to be a really useful function for dumping an 1238HTML document's structure. This may be implemented in the future, if someone 1239shows interest... 1240 1241The implementation of dump_tree() is quite straightforward, as the function 1242only needs to print every node in the order it occurred in the HTML file. 1243 1244For every node, first the text is printed. (If "dump_content" is given.) The 1245reason it is printed in front of the node itself is that it's also the 1246content in front of the element in the original HTML file. 1247 1248Next, the current tree depth is shown by a number of '|'. The current depth is 1249always stored in "depth". 1250 1251Afterwards the element name is printed, and all attributes with their values. 1252Depending on "elements_parsed", either the raw values extracted from the 1253document are printed, or the transformed values generated by parse_elements(). 1254 1255The next node is reached by the "list_next" pointer of the current node. Before 1256going to the next node the tree depth of the new node needs to be calculated. 1257This is done by assuming that the next node is below the current one, and then 1258going up, until we find the parent of the new node. This idea is explained in 1259more detail in $$<a$+href="#parseStruct">$$$_6. parse-struct.c$_$$</a>$$$ . 1260 1261$#</a> <!--dumpTree--> 1262 1263$#<a name="parseElements" id="parseElements"> 1264 1265$=$$<h2>$$$_4. parse-elements.c$_$$</h2>$$ 1266 1267parse_elements() is responsible for identifying the elements and attributes in the 1268syntax tree. (Extracted by parse_syntax().) 1269 1270All elements but the first one (which is always ELEMENT_GLOBAL) are processed one 1271by one; the tree is traversed via "list_next". For every element, the name is 1272looked up in "element_table[]" by comparing to all entries, in a loop. 
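The lookup can be sketched as follows. This is an illustration only: the three ordinary entries and the EL_* names are invented for the example, the real "element_table[]" holds more than just the names, and how netrik actually compares names (e.g. case handling) is not shown here. What the sketch does take from this manual is the table layout, with the ordinary names followed by "?" (ELEMENT_NO) and "!" (ELEMENT_GLOBAL), and the fact that the last two entries are never compared.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, tiny excerpt; only ELEMENT_NO and ELEMENT_GLOBAL are real names. */
enum Element_type { EL_HTML, EL_BODY, EL_P, ELEMENT_NO, ELEMENT_GLOBAL };

static const char *const element_table[] = { "html", "body", "p", "?", "!" };

/* Returns the table index of "name". If no ordinary entry matches, the loop
 * counter has run up to ELEMENT_NO, which is returned as "unknown element".
 * A missing name (dummy tag) also yields ELEMENT_NO. */
static enum Element_type lookup_element(const char *name)
{
    int type;

    if (name == NULL)
        return ELEMENT_NO;

    for (type = 0; type < ELEMENT_NO; ++type) {     /* ordinary entries only */
        if (strcmp(name, element_table[type]) == 0)
            break;
    }
    return (enum Element_type)type;
}
```

Because the loop stops before the "?" entry, an element literally named "?" or "!" in the document can never be mistaken for one of the two special types.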
1273 1274$#<a name="elementTable" id="elementTable"> 1275 1276"element_table[]" contains all names of ordinary elements, then the "?" 1277representing ELEMENT_NO, and finally "!" representing ELEMENT_GLOBAL. (It also contains 1278other properties of the elements; more on this in 1279$$<a$+href="#parseElementsProcessing">$$$_Processing$_$$</a>$$ in 1280$$<a$+href="#parseStruct">$$$_6. parse-struct.c$_$$</a>$$$ .) 1281 1282$#</a> <!--elementTable--> 1283 1284The last two entries aren't compared against the element name. 1285 1286As soon as a match is found, the loop is left and the entry number is stored in 1287"syntax_tree" in place of the string. The entry number is an "enum 1288Element_type", defined in syntax.h; it tells the element type in the following 1289processing passes. 1290 1291If none of the ordinary entries matched, the entry number, which now is ELEMENT_NO 1292(as this one follows after the ordinary entries), is stored anyhow, indicating 1293that the element is unknown. If no element name was stored ("cur_el->name.str" 1294is NULL), indicating a dummy tag, ELEMENT_NO is set also. 1295 1296After the element name, all attribute names are processed in a loop. They are 1297looked up in "attr_table[]" the same way the element name is. 1298 1299The attribute value isn't processed at all yet. 1300 1301$#</a> <!--parseElements--> 1302 1303$#<a name="sgmlRework" id="sgmlRework"> 1304 1305$=$$<h2>$$$_5. sgml_rework()$_$$</h2>$$ 1306 1307Before the syntax tree is further processed, sgml_rework() (from sgml.c) is 1308applied. (Unless compiled with -DXHTML_ONLY.) 1309 1310This function is responsible for fixing the problems arising from the fact that 1311SGML allows certain end tags to be left out; thus the syntax parser doesn't 1312recognize the elements' ends, and stores all following elements as children, 1313even if they should actually be at the same level. (e.g. the list items in a 1314list.) 
sgml_rework() goes over the complete (broken) tree, finds such 1315situations, and unnests the elements, thus creating a correct syntax tree. 1316 1317It won't be covered in too much detail here, as this is only a temporary 1318solution; it will become obsolete with the planned new parser(s). 1319 1320The recognition of the missing element ends is based on the "element_group" enum 1321in "$$<a$+href="#elementTable">$$element_table[]$$</a>$$". This enum has the 1322values GROUP_SINGLE for single tag elements (elements which mustn't have any 1323content), GROUP_OBLIGATE for all elements where the end tag can't be left out, 1324and several others for various kinds of elements with optional end tag. 1325 1326The whole tree is scanned element by element (using "list_next"), and each one 1327is tested for whether it fulfills one of the offending conditions. Two things have to be 1328handled: Unclosed single tag elements, and unclosed optional end tag elements. 1329 1330The second situation is more complicated. If the element is of some type with 1331optional end tag, it could terminate a previous (unclosed) element from the 1332same group; e.g. a <li> will terminate the previous <li>. It doesn't terminate 1333elements from other groups, though; a <td> inside a <tr> doesn't terminate the 1334<tr>, for example. Thus, the group of the current element needs to be tested 1335against the group of the parent (all elements following an unclosed one are 1336stored as its children by the parser!); if they are the same, the parent is 1337actually an element that should be terminated at the position where the child 1338starts, and the child should follow it, at the same depth. This means that the 1339child has to be "lifted" out of the parent. However, we don't do that 1340immediately; we only set the "closed" flag of the parent element for now, and 1341the lifting will be done later. 
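The group test on the immediate parent might look like this in a stripped-down form. GROUP_SINGLE and GROUP_OBLIGATE are the names from the text; the other group names, the reduced "Element" struct and mark_parent_closed() are assumptions made up for this sketch, and the actual lifting is not shown.

```c
#include <stddef.h>

enum Element_group {
    GROUP_SINGLE,       /* single tag elements, no content allowed */
    GROUP_OBLIGATE,     /* end tag can't be left out */
    GROUP_ITEM,         /* hypothetical: e.g. <li> */
    GROUP_ROW,          /* hypothetical: e.g. <tr> */
    GROUP_CELL          /* hypothetical: e.g. <td> */
};

/* Reduced, hypothetical element node. */
struct Element {
    enum Element_group group;
    int closed;                 /* set when a child turns out to terminate us */
    struct Element *parent;
};

/* If "el" has an optional end tag and its parent is from the same group,
 * the parent really ended where "el" starts: mark it closed. */
static void mark_parent_closed(struct Element *el)
{
    struct Element *parent = el->parent;

    if (parent == NULL)
        return;
    if (el->group == GROUP_SINGLE || el->group == GROUP_OBLIGATE)
        return;                 /* only optional end tag elements terminate others */
    if (parent->group == el->group)
        parent->closed = 1;     /* lifting happens later */
}
```

For a second <li> stored inside the first one, the first <li> gets its "closed" flag set; for a <td> inside a <tr>, nothing happens. The sketch only looks at the immediate parent.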
1342 1343However, it's not enough to test only the immediate parent: The element may 1344follow some other unclosed element, and thus be a child of it, e.g. a <tr> 1345following a <td>, which is inside the previous <tr>. This also needs to be 1346recognized, and *both* the previous <tr> and the <td> have to be closed. Thus, 1347not only the immediate parent's group is compared, but all ancestors are 1348scanned. The scanning only stops on an element with obligatory end tag -- as the 1349element's end is always known for these, nothing will ever be stored inside 1350that element that doesn't belong there, and nothing should be lifted out. (In 1351nested tables for example, the <tr>s and <td>s of the inner table shouldn't 1352mess with the ones of the outer table -- this is ensured by the scanning of the 1353inner table's rows and columns stopping at the inner <table> element.) The 1354"closed" flag is set for all the closed ancestors. 1355 1356Some more handling is necessary due to the fact that an element node always 1357stores the content which appears *before* the element. When an element is 1358lifted, the content mustn't be lifted also -- it appeared *before* the element, 1359and thus also before the previous element's end, so it has to stay where it is. 1360We therefore have to create a new dummy tag inside the closed element, taking 1361the place of the lifted element and storing its content. 1362 1363However, this isn't done when the parent was already closed. This happens if 1364the parent is a single tag element. These elements end right where they start, 1365not at the beginning of the next element; the content also has to be lifted 1366out. (Nothing is allowed to stay inside a single tag element!) 1367 1368The actual lifting is done after processing the element: If the parent is 1369closed, we have to "leave" it. 
(This is done by setting the "parent" to the 1370previous grandparent -- this way, the element is no longer a child of the old 1371parent, but a sibling.) Thus the element that closed its parent is lifted 1372right afterwards; all following elements of the parent will be lifted also, 1373after being processed. Any preceding elements (as well as the possibly created 1374new dummy) won't be lifted on the other hand, as they won't be processed 1375anymore. 1376 1377Single tag elements are handled more or less the other way round: They are not 1378closed by some child (which turns out to actually be a sibling), but close 1379*themselves*, as soon as they are encountered. This way all children will be 1380lifted out, no matter what. 1381 1382No other processing is necessary for single tag elements, as they won't ever 1383terminate some other element. 1384 1385$#</a> <!-- sgmlRework --> 1386 1387$#<a name="parseStruct" id="parseStruct"> 1388 1389$=$$<h2>$$$_6. parse-struct.c$_$$</h2>$$ 1390 1391After the syntax tree has been generated by parse_syntax(), we have to "understand" 1392it. This is done by parse_struct(), which is the central pass of the layouting 1393process. In this function the syntax tree, which contains a nearly 1:1 1394reproduction of the HTML file, is converted to an item tree, which contains a 1395representation of what will actually be shown as the output of the browser -- 1396text blocks, blank rows, boxes grouping several other items. 1397 1398$=$$<h3>$$$_Structure Tree$_$$</h3>$$ 1399 1400For 0.html, we have to convert the syntax tree: 1401 1402$� ++>NULL 1403$� + 1404$�+---+ 1405$�| ! |-. <++ 1406$�+---+ | + 1407$� v + 1408$� ,-----------. 1409$�("header text") 1410$� `-+------+--' 1411$� | html |-. <++++++++ 1412$� +------+ | <+ + 1413$� v + + 1414$� +------+ +------+ <+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1415$� | head |->| body |-. 
<++++++++++++++ + 1416$� +------+ +------+ | <+ + + 1417$� v + + + <+++++++++++++++++++++++++++++++++++++++++++++++++ 1418$� +----++ +---+ <++++++++++++++++++++ +---+ <++++++++++++++++++++++++++++++++ + 1419$� | h1 |-. <+ ,->| p |---. <++ + ,->| p |-. <++++++++++++++++++ + + (back 1420$� +----+ | + | +---+ | + + | +---+ | <+ + + + to top) 1421$� v + | v + + | | + + + + ^ 1422$� ,--------. | ,------------------. ,----. | | + ,----------. ,---------. ,---------------. | 1423$� (" heading") | (" first...newlines,") (" and") | v + (" this...em") ("ded...row") (" (a single tag)") | 1424$� `--+---+-' | `-----+----+-------' +-+----+-+ | +--------+ `---+---+--' `--+----+-' `-----+---+-----' | 1425$� | ? |----' | em |-. <++ ,->| strong |-. <++ | | center |-. <++ ,->| a |-. <+ ,->| br |---------->| ? |--------' 1426$� +---+ +----+ | + | +--------+ | + | +--------+ | + | +---+ | + | +----+ +---+ 1427$� v + | v + | v + | v + | 1428$� ,----------------. | ,------------. | ,----------------. | ,---.+ | 1429$� (" emphasized text") | (" strong text") | (" starting...tag,") | ("bed") | 1430$� `--------+---+---' | `-----+---+--' | `--------+---+---' | +---+ | 1431$� | ? |------' | ? |-----' | ? |------' | ? 
|--' 1432$� +---+ +---+ +---+ +---+ 1433 1434to this structure tree: 1435 1436$#<a name="itemTree" id="itemTree"> 1437 1438$�***> "string" (back <+++++++++ 1439$�xxx> "first_child" to first + 1440$�+++> "parent" item) +-----+ 1441$�===> "next" ,->| box |-->NULL 1442$�---> "list_next" | +-----+<==# 1443$� | x ^ #===# 1444$� xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx + 1445$� x | + 1446$� x ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1447$� x + + + + + + + + + | 1448$� v + + + + + + + + + | 1449$�+------+ +-------+ +------+ +-------+ +------+ +-------+ +------+ +------+ +------+ | 1450$�| text |-->| blank |-->| text |-->| blank |-->| text |-->| blank |-->| text |-->| text |-->| text |-' 1451$�+----*-+==>+-------+==>+----*-+==>+-------+==>+----*-+==>+-------+==>+----*-+==>+----*-+==>+----*-+==>NULL 1452$� x * x x * x x * x x * x * x * 1453$� v * v v * v v * v v * v * v * ,--------------. 1454$�NULL * NULL NULL * NULL NULL * NULL NULL * NULL * NULL **>("(a single tag)") 1455$� v v v v * `--------------' 1456$� ,-----------. ,-------. ,------------. ,---------------. * ,----------------. 
1457$�("header text") ("heading") ("first...text") ("starting...tag,") **>("this...blank row") 1458$� `-----------' `-------' `------------' `---------------' `----------------' 1459 1460$#</a> <!--itemTree--> 1461 1462which, in turn, is a representation of this output page: 1463 1464$#<a name="itemPage" id="itemPage"> 1465 1466$�+-------------------------------------------------------------------------------------------------+ 1467$�|+-----------+ | 1468$�||header text| | 1469$�|+-----------+ | 1470$�| | 1471$�|+-------+ | 1472$�||heading| | 1473$�|+-------+ | 1474$�| | 1475$�|+-----------------------------------------------------------------------------------------------+| 1476$�||first paragraph of text; includes multiple spaces and newlines, emphasized text and strong text|| 1477$�|+-----------------------------------------------------------------------------------------------+| 1478$�| | 1479$�|+---------------------------------+ | 1480$�||starting with an evil center tag,| | 1481$�|+---------------------------------+ | 1482$�|+-----------------------------------------------------------------------------------------------+| 1483$�||this very long second paragraph contains some special characters (including a simple space...):|| 1484$�|| &; <>"=/ plus a big gap###and two unicode escapes (decimal: � and hexal: �) but also an anchor|| 1485$�|| embedded inside a word (this anchor also is the only tag with parameters); and finally a blank|| 1486$�|| row || 1487$�|+-----------------------------------------------------------------------------------------------+| 1488$�|+--------------+ | 1489$�||(a single tag)| | 1490$�|+--------------+ | 1491$�+-------------------------------------------------------------------------------------------------+ 1492 1493$#</a> <!--itemPage--> 1494 1495Note that there are no actual sizes or postions for the items, and no line 1496breaks inside the text items; this is all done in a later processing step 1497(pre-render.c). 
The item tree at this point only represents the structure of 1498the output page. (The line breaks in the fifth text block aren't really there; 1499we have inserted them in the figure because the text block is a bit too long to 1500put in a single line...) 1501 1502The item tree looks complicated at first, but it's quite a trivial example when 1503taking a closer look. (This is because at the time of creating the 0.html file 1504used here, netrik wasn't able to do anything more complicated...) However, it 1505should be sufficient to get the idea... 1506 1507Every node of the item tree consists of an "Item" structure. This structure is 1508declared in "items.h". It contains: 1509 1510$#<ul> <li> 1511��$- The pointers "list_next", "next", "parent", and "first_child" connect the 1512��$ nodes inside the tree. "list_next" points to the next node in the order they 1513��$ are generated. Note that in contrast to the element tree, in the item tree 1514��$ any children are generated *before* the parent. "next" points to the next 1515��$ item at the same tree depth and in the same branch, i.e. the next sibling. 1516��$ "parent" points to the parent item, and "first_child" to the first sub-item. 1517��$ More pointers are necessary than for the element tree, because the item tree 1518��$ is traversed in several different ways while processing. 1519$#</li> <li> 1520��$- The "center" flag indicates whether the item is centered. As not all items 1521��$ use this, and the exact meaning varies between different item types, it may 1522��$ be reasonable to move this to the item specific data. We'll decide on this as 1523��$ soon as enough HTML facilities are implemented. 1524$#</li> <li> 1525��$- "x_start", "x_end", "y_start" and "y_end" define a rectangular area inside the 1526��$ layouted page, in which the item is displayed. In some processing steps 1527��$ "x_end" and "y_end" are also "abused" to store the minimal size of the item. 
1528$#</li> <li> 1529��$- "type" is an enum storing what kind of item this node represents. (Currently 1530��$ ITEM_TEXT, ITEM_BOX, ITEM_FORM, ITEM_BLANK, ITEM_BLOCK_ANCHOR or 1531��$ ITEM_INLINE_ANCHOR.) 1532$#</li> <li> 1533��$- "data" is a union storing all data specific to different item types. 1534��$ Currently this can be the pointer to a text string for text items, to a 1535��$ (block or inline) anchor struct, or to a form parameters struct. 1536$#</li> </ul> 1537 1538$=$$<h3>$$$_add_item()$_$$</h3>$$ 1539 1540New items are created by add_item(). This function allocates a new "Item" 1541structure, and sets some pointers. The item isn't inserted into the tree 1542directly; it's only inserted into a list of all items at the current tree 1543depth. This is a singly linked list maintained by the "first_item" and 1544"last_item" pointers of "state" (there is one such list for each tree depth), 1545and the "next" pointers of the item structures. The only other pointer set is 1546"list_next". The "parent" and "first_child" pointers aren't set; this is done 1547later when the items are actually inserted into the tree, while ascending from 1548the current depth. (For the first item in every tree depth the "parent" is 1549explicitly set to NULL, indicating that there is no parent yet.) 1550 1551This function is called directly to create box items, and from add_string() to 1552create text items. Under certain conditions it also creates a blank item before 1553the actual text item or box item; more on this later, under 1554$$<a$+href="#blankLines">$$$_Blank Lines$_$$</a>$$$ . 1555 1556When called with the "virtual" flag, this function behaves slightly differently: 1557No line break/blank line handling is done; the status remains unchanged. This 1558is for creating the $$<a$+href="#virtual">$$$_Virtual Boxes$_$$</a>$$ used for 1559anchors.
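The list handling described for add_item() can be sketched roughly as follows. This is only an illustration under assumptions, not the actual netrik code: the structures are cut-down stand-ins for those in "items.h", "sketch_add_item()" and "last_created" are hypothetical names, and the blank line handling and the "virtual" flag are left out entirely.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, cut-down stand-ins for the structures in items.h. */
enum Item_type { ITEM_TEXT, ITEM_BOX, ITEM_BLANK };

struct Item {
    enum Item_type type;
    struct Item *next;        /* next item at the same depth (sibling) */
    struct Item *list_next;   /* next item in order of creation */
    struct Item *parent;      /* filled in later, while ascending */
    struct Item *first_child; /* filled in later, while ascending */
};

struct State {
    struct Item *first_item;  /* head of the item list of this depth */
    struct Item *last_item;   /* tail of the item list of this depth */
};

/* Assumed helper: the most recently created item, at any depth. */
struct Item *last_created = NULL;

/* Create an item and append it to the item list of the given depth. */
struct Item *sketch_add_item(struct State *state, enum Item_type type)
{
    struct Item *item = calloc(1, sizeof *item); /* all pointers start as NULL */
    assert(item != NULL);
    item->type = type;

    if (state->first_item == NULL)
        state->first_item = item;       /* first item at this depth; no parent yet */
    else
        state->last_item->next = item;  /* chain to the previous sibling */
    state->last_item = item;

    if (last_created != NULL)
        last_created->list_next = item; /* chain in order of creation */
    last_created = item;

    return item;
}
```

Note that "parent" and "first_child" are deliberately left untouched here; as described above, they are only filled in when the items are inserted into the tree while ascending.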
1560 1561$#<a name="string" id="string"> 1562 1563$=$$<h3>$$$_String$_$$</h3>$$ 1564 1565The actual text data of text items is stored in a different place. Every text 1566item points to a "String" structure (also declared in "items.h"). This 1567structure consists of a normal C-string containing the text itself, and an 1568array of "Div" structures, containing all attribute information (color etc.). A 1569"String" can consist of several divisions with different attributes. Every 1570"Div" structure stores the attributes for one such division, and the ending 1571position of the division inside the string. (More precisely: the position *after* 1572the end of the div -- which is the starting position of the next div.) The end 1573of the last division is also used to find out the string length. This is quite 1574inefficient... 1575 1576The "String" structure also contains "line_table[]", which holds the positions 1577of all line breaks inside the string; more on this in $$<a 1578href="#preRender">$$$_7. pre-render.c$_$$</a>$$$ . 1579 1580Finally, it contains an array of 1581"$$<a$+href="hacking-links.html#linkStruct">$$Link$$</a>$$" structures, which 1582describe all the links (and form elements) inside this text block. See 1583$$<a$+href="hacking-links.html">$$hacking-links.*$$</a>$$ for this. 1584 1585$#</a> <!-- string --> 1586 1587$#<a name="parseElementsProcessing" id="parseElementsProcessing"> 1588 1589$=$$<h3>$$$_Processing$_$$</h3>$$ 1590 1591Many properties of the various element types are data-driven. This presently 1592includes the line break/blank line handling, elements creating a box around all 1593children, and elements whose content is not to be rendered. These properties are 1594stored in the same "$$<a$+href="#elementTable">$$element_table[]$$</a>$$" as 1595the name strings used by parse_elements().
In the future, more properties 1596will probably become data-driven; this makes the code simpler, and is also necessary for handling 1597style sheets, which allow changing almost all formatting properties. 1598 1599The item tree is generated while traversing the element tree. Processing of 1600each element is done in two steps: One step is done before entering an element 1601(descending in the element tree), and the second step is done after leaving the 1602element (ascending). Between those two steps, the same is done for all 1603sub-elements. You guessed it: This is a recursive algorithm. Only we haven't 1604implemented it recursively, as mentioned in chapter 0. In this function the 1605pseudo-recursive implementation is most evident; we even have to use a 1606pseudo-stack. 1607 1608The processing could also be split into a couple of much simpler passes using 1609some temporary data structures, e.g. one generating the "normal" items, one 1610generating blank items, one generating the strings, one "optimizing" the tree 1611(lifting items where possible). This would be much easier to understand, and 1612it would probably have been a good idea for a start. However, it would be 1613much less efficient; that's why we won't step back to such an implementation 1614after already having the present one. 1615 1616$=$$<h4>$$$_Pre-processing$_$$</h4>$$ 1617 1618The first action in every iteration of the outer loop is to do the first 1619processing step (see above) for the current element ("cur_el") -- in every iteration 1620exactly one element is pre-processed. 1621 1622$� ++>NULL 1623$� + 1624$�+---+ 1625$�| ! |-. <++ 1626$�+---+ | + 1627$� v + 1628$� ,-----------. 1629$�("header text") 1630$� `-+------+--' <-- cur_el <== depth 1631$� | html |-. <++++++++ 1632$� +------+ | <+ + 1633$� v + + 1634$� +------+ +------+ 1635$� | head |->| body |-. 1636$� +------+ +------+ | 1637 1638First, we append any text from the current element node to the currently open text 1639item, using add_string().
1640 1641$�state[0]->first_item -->NULL (depth 0) 1642$�state[0]->last_item **>NULL 1643$�--- 1644$�+------+ <-- first_item 1645$�| text |-->NULL <** last_item <== depth=1 1646$�+----*-+==>NULL 1647$� x * 1648$� v * 1649$�NULL * 1650$� v 1651$� ,-----------. 1652$�("header text") 1653$� `-----------' 1654 1655The next thing is processing of line breaks and paragraph breaks, depending 1656what kind of element we have. More on this later, in $$<a 1657href="#textBlocks">$$$_Text Blocks$_$$</a>$$ and $$<a 1658href="#blankLines">$$$_Blank Lines$_$$</a>$$$ . 1659 1660$=$$<h4>$$$_Recursing$_$$</h4>$$ 1661 1662Then we recurse into the element (descend in the element tree). 1663 1664$� ++>NULL 1665$� + 1666$�+---+ 1667$�| ! |-. <++ 1668$�+---+ | + 1669$� v + 1670$� ,-----------. 1671$�("header text") 1672$� `-+------+--' <-- cur_el 1673$� | html |-. <++++++++ 1674$� +------+ | <+ + 1675$� v + + 1676$� +------+ +------+ 1677$� | head |->| body |-. <== depth 1678$� +------+ +------+ | 1679 1680$�state[0]->first_item -->NULL (depth 0) 1681$�state[0]->last_item **>NULL 1682$�--- 1683$�+------+ <-- state[1].first_item (depth 1) 1684$�| text |-->NULL <** state[1].last_item 1685$�+----*-+==>NULL 1686$� x * 1687$� v * 1688$�NULL * 1689$� v 1690$� ,-----------. 1691$�("header text") 1692$� `-----------' 1693$�--- 1694$�first_item -->NULL <== depth=2 1695$�last_item **>NULL 1696 1697This is done by push_state(). This function doubles the top of stack, and 1698returns a pointer to the newly created entry, which is used as the current 1699state. ("first_item" and "last_item" aren't copied, but set to NULL; "id_attr" 1700and "link_type" are set to -1.) 1701 1702The stack stores all variables which are specific to every tree depth. Note 1703that the stack uses the depths from the element tree, not from the item tree. 1704The depths in the item tree are completely different, and change while 1705processing -- which is one of the most tricky parts about parse_struct(). 
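The push_state() behaviour described above can be sketched like this. It is a minimal illustration under assumptions, not the real implementation: the fixed-size array, the "MAX_DEPTH" limit, the field types and the names "sketch_push_state()" and "sketch_pop_state()" are all hypothetical; only the field names and the reset values (NULL and -1) are taken from the text.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 64              /* hypothetical limit; the real stack may grow */

struct Item { int dummy; };       /* placeholder; the real Item is in items.h */

struct State {
    int visible;                  /* visibility of the element's content */
    int text_mode;                /* text attributes */
    int high;
    int list_depth;               /* nesting depth of item lists */
    struct Item *first_item;      /* item list of this depth */
    struct Item *last_item;
    int link_type;                /* -1: no link */
    int id_attr;                  /* -1: no anchor attribute */
};

struct State stack[MAX_DEPTH];
int depth = 0;                    /* index of the current top of stack */

/* Duplicate the top of stack and reset the per-depth fields of the copy. */
struct State *sketch_push_state(void)
{
    struct State *s;

    assert(depth + 1 < MAX_DEPTH);
    stack[depth + 1] = stack[depth];  /* inherit everything from the parent depth */
    s = &stack[++depth];
    s->first_item = NULL;             /* the new depth starts with an empty item list */
    s->last_item = NULL;
    s->link_type = -1;
    s->id_attr = -1;
    return s;
}

/* Return to the state of the parent depth (used while ascending). */
struct State *sketch_pop_state(void)
{
    assert(depth > 0);
    return &stack[--depth];
}
```

Copying the parent state is what stands in for argument passing in a truly recursive implementation: an element modifies only its own copy, and the parent's values reappear unchanged after the pop.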
1706 1707Currently the data stored is: 1708 1709$#<ul> <li> 1710��$- Visibility of element's content ("visible") 1711$#</li> <li> 1712��$- Text attributes ("text_mode" and "high"). 1713$#</li> <li> 1714��$- The nesting depth "list_depth" of item lists. (Determines the indent of list 1715��$ items.) 1716$#</li> <li> 1717��$- The two pointers "first_item" and "last_item", necessary to maintain the list 1718��$ of all items at a given depth. 1719$#</li> <li> 1720��$- The type of the link or form control created by the element in the 1721��$ "link_type" enum (a value of -1 indicates there is no link at all) 1722$#</li> <li> 1723��$- The URL of links or value of form elements in the "link_value" string 1724$#</li> <li> 1725��$- The "form_enabled" flag indicates whether a form element is to be submitted to the 1726��$ server 1727$#</li> <li> 1728��$- The name of a possible <select> element, passed down to its <option> elements 1729$#</li> <li> 1730��$- The kind of the <select> element, also passed down 1731$#</li> <li> 1732��$- "link_start" stores the position where a link or inline anchor begins inside 1733��$ the current string 1734$#</li> <li> 1735��$- "link_item" stores the text item in which a link/anchor begins 1736$#</li> <li> 1737��$- For elements with an anchor, "id_attr" stores which attribute contains the 1738��$ anchor id (or name) 1739$#</li> </ul> 1740 1741After descending, first some generic processing is performed. 1742 1743This includes setting "cur_state->visible" depending on the parent element 1744type's "visible" property. 1745 1746Also, "link_start" is set to the current string end, so any text generated 1747inside this element will become part of the link or anchor, if the element 1748creates one. If there is no string item open, 0 is stored, so the link will 1749begin at the start of the string if a new string begins inside the element. 1750 1751Afterwards, some element type specific handling is done.
Mostly this is 1752outputting of special element indicators. (This could be made data-driven, and 1753probably will be soon.) 1754 1755For some element types, values of the current state are also modified; this 1756would be the argument passing in a real recursive implementation. 1757 1758$=$$<h4>$$$_Ascending$_$$</h4>$$ 1759 1760The last step of every outer loop iteration is returning from recursion 1761(ascending in the element tree). But in contrast to descending, ascending isn't 1762done once per outer loop iteration. Instead, there is an inner loop that 1763ascends as often as needed to reach the level of the next element -- this can 1764be zero times, once, or several times. 1765 1766We know how far we need to ascend by starting at the level of the current 1767element and looking for the parent of the new element; as soon as we find it, 1768we know we needn't ascend any more. "depth" is adjusted every time we ascend, 1769and all actions for leaving the element (unrecursing) are taken. ("depth" is 1770always one below the depth of "new_el".) 1771 1772If the next element is a child of the current one, no ascending is necessary. 1773We have descended one step in the pre-processing step of the current iteration, 1774and this is ok; we keep it. ("cur_el" is already the parent of "list_next".) 1775 1776$� ,-----------. 1777$�("header text") 1778$� `-+------+--' 1779$� | html |-. <++++++++ 1780$� +------+ | <+ + 1781$� v + + 1782$� +------+ +------+ 1783$� | head |->| body |-. <++++++++++++++ 1784$� +------+ +------+ | <+ + 1785$� v + + 1786$� +----++ +---+ 1787$� cur_el --> | h1 |-. <+ ,->| p |-> 1788$� new_el xx> +----+ | + | +---+ 1789$� v + | 1790$� ,--------. | 1791$� (" heading") | 1792$� `--+---+-' | <== depth 1793$� list_next **> | ?
|----' 1794$� +---+ 1795 1796If the next element is at the same level as the current one (single tags or 1797other elements with no sub-elements), we need to ascend exactly one time, to 1798get back to the level of the current element, after we have descended in the 1799pre-processing step. 1800 1801$� ,-----------. 1802$�("header text") 1803$� `-+------+--' 1804$� | html |-. <++++++++ 1805$� +------+ | <+ + 1806$� v + + 1807$� +------+ +------+ <** 1808$� --> | head |->| body |-. <++++++++++++++ 1809$� xx> +------+ +------+ | <+ + 1810$� v + + 1811$� +----++ +---+ 1812$� | h1 |-. <+ ,->| p |-> <== 1813$� +----+ | + | +---+ 1814$� v + | 1815$� ,--------. | 1816$� (" heading") | 1817$� `--+---+-' | 1818$� | ? |----' 1819$� +---+ 1820 1821$� ,-----------. 1822$�("header text") <xx 1823$� `-+------+--' 1824$� | html |-. <++++++++ 1825$� +------+ | <+ + 1826$� v + + 1827$� +------+ +------+ <** 1828$� --> | head |->| body |-. <++++++++++++++ <== 1829$� +------+ +------+ | <+ + 1830$� v + + 1831$� +----++ +---+ 1832$� | h1 |-. <+ ,->| p |-> 1833$� +----+ | + | +---+ 1834$� v + | 1835$� ,--------. | 1836$� (" heading") | 1837$� `--+---+-' | 1838$� | ? |----' 1839$� +---+ 1840 1841Of course it's a bit of overhead first to descend into an element, just to 1842ascend from it right after. But it saves a lot of code for special handling of 1843such childless elements. Many of the actions of both the first step and the 1844second step have to be done for them also -- putting them together and leaving 1845out the descending and ascending wouldn't save that much, while complicating 1846the code quite a lot. We may consider some way in the future if profiling shows 1847this would be rewarding. 1848 1849If the next element is above the current one, we have to ascend more than once. 1850(Once to get back to the level of the current one, the others to ascend to the 1851new level.) 1852 1853$� ,-----------. 1854$�("header text") 1855$� `-+------+--' 1856$� | html |-. 
<++++++++ 1857$� +------+ | <+ + 1858$� v + + 1859$� +------+ +------+ 1860$� | head |->| body |-. <++++++++++++++ 1861$� +------+ +------+ | <+ + 1862$� v + + 1863$� +----++ +---+ 1864$� | h1 |-. <+ ,->| p |-> <** 1865$� +----+ | + | +---+ 1866$� v + | 1867$� ,--------. | 1868$� (" heading") | 1869$� `--+---+-' | 1870$� --> | ? |----' 1871$� xx> +---+ 1872$� 1873$� <== 1874 1875$� ,-----------. 1876$�("header text") 1877$� `-+------+--' 1878$� | html |-. <++++++++ 1879$� +------+ | <+ + 1880$� v + + 1881$� +------+ +------+ 1882$� | head |->| body |-. <++++++++++++++ 1883$� +------+ +------+ | <+ + 1884$� v + + 1885$� +----++ +---+ 1886$� xx> | h1 |-. <+ ,->| p |-> <** 1887$� +----+ | + | +---+ 1888$� v + | 1889$� ,--------. | 1890$� (" heading") | 1891$� `--+---+-' | <== 1892$� --> | ? |----' 1893$� +---+ 1894 1895$� ,-----------. 1896$�("header text") 1897$� `-+------+--' 1898$� | html |-. <++++++++ 1899$� +------+ | <+ + 1900$� v + + 1901$� +------+ +------+ <xx 1902$� | head |->| body |-. <++++++++++++++ 1903$� +------+ +------+ | <+ + 1904$� v + + 1905$� +----++ +---+ 1906$� | h1 |-. <+ ,->| p |-> <** <== 1907$� +----+ | + | +---+ 1908$� v + | 1909$� ,--------. | 1910$� (" heading") | 1911$� `--+---+-' | 1912$� --> | ? |----' 1913$� +---+ 1914 1915In every ascending iteration, first we do some element type specific handling 1916again, mostly outputting element end indicators. Then we pop the previous state 1917from the stack. 1918 1919$� ++>NULL 1920$� + 1921$� + .--------------------------------. 1922$� + v | 1923$�+---+ <xx | 1924$�| ! |-. <++ <** | 1925$�+---+ | + | 1926$� v + | 1927$� ,-----------. | 1928$�("header text") | <== 1929$� `-+------+--' | 1930$� | html |-. <+++++++++++ | 1931$� +------+ | + | 1932$� | [...] [...] 1933$� [...] + | 1934$� | ,---------------. | 1935$� | (" (a single tag)") | 1936$� | `-----+---+-----' | <-- 1937$� `--------->| ?
|--------' 1938$� +---+ 1939 1940$�state[0]->first_item -->NULL (depth 0) 1941$�state[0]->last_item **>NULL 1942$�--- ,-- first_item 1943$�+------+ <--' +------+ +------+ <** last_item 1944$�| text |--[...]-->| text |-->| text |-->NULL <== depth=1 1945$�+----*-+==[...]==>+----*-+==>+----*-+==>NULL 1946$� x * x * x * 1947$� v * v * v * ,--------------. 1948$�NULL * NULL * NULL **>("(a single tag)") 1949$� v * `--------------' 1950$� ,-----------. * ,----------------. 1951$�("header text") **>("this...blank row") 1952$� `-----------' `----------------' 1953 1954$� ++>NULL <xx 1955$� + 1956$� + .--------------------------------. 1957$� + v | 1958$�+---+ | 1959$�| ! |-. <++ <** | <== 1960$�+---+ | + | 1961$� v + | 1962$� ,-----------. | 1963$�("header text") | 1964$� `-+------+--' | 1965$� | html |-. <+++++++++++ | 1966$� +------+ | + | 1967$� | [...] [...] 1968$� [...] + | 1969$� | ,---------------. | 1970$� | (" (a single tag)") | 1971$� | `-----+---+-----' | <-- 1972$� `--------->| ? |--------' 1973$� +---+ 1974 1975$�first_item -->NULL <== depth=0 1976$�last_item **>NULL 1977$�--- ,-- state[1].first_item 1978$�+------+ <--' +------+ +------+ <** state[1].last_item (depth 1) 1979$�| text |--[...]-->| text |-->| text |-->NULL 1980$�+----*-+==[...]==>+----*-+==>+----*-+==>NULL 1981$� x * x * x * 1982$� v * v * v * ,--------------. 1983$�NULL * NULL * NULL **>("(a single tag)") 1984$� v * `--------------' 1985$� ,-----------. * ,----------------. 1986$�("header text") **>("this...blank row") 1987$� `-----------' `----------------' 1988 1989$=$$<h4>$$$_Inserting into Item Tree$_$$</h4>$$ 1990 1991Afterwards comes what is probably the most interesting part: The sub-items created 1992inside the element we are just leaving are inserted into the item tree 1993properly. 1994 1995If the element we are leaving enforces a box (looked up in "element_table[]"), a new 1996box item is created.
(Box items are always created when leaving the element, 1997and thus *after* all items inside the box.) 1998 1999$� --> +-----+ 2000$� **> ,->| box |-->NULL <== depth=0 2001$� | +-----+==>NULL 2002$� | 2003$� | 2004$� | 2005$� | 2006$� | 2007$� | 2008$�+------+ <-- +------+ +------+ | <** (depth 1) 2009$�| text |--[...]-->| text |-->| text |-' 2010$�+----*-+==[...]==>+----*-+==>+----*-+==>NULL 2011$� x * x * x * 2012$� v * v * v * ,--------------. 2013$�NULL * NULL * NULL **>("(a single tag)") 2014$� v * `--------------' 2015$� ,-----------. * ,----------------. 2016$�("header text") **>("this...blank row") 2017$� `-----------' `----------------' 2018 2019The "parent" pointers of all immediate children are set to the new box item, 2020and "first_child" of the new item is set to the first of them. 2021 2022$� --> +-----+ 2023$� **> ,->| box |-->NULL <== depth=0 2024$� | +-----+==>NULL 2025$� | x ^ 2026$� xxxxxxxx[...]xxxxxxxxxxxxxxxxxxxxxxxxxxx + 2027$� x | + 2028$� x +++++[...]++++++++++++++++++++++++++++++ 2029$� x + + + | 2030$� v + + + | 2031$�+------+ <-- +------+ +------+ | <** (depth 1) 2032$�| text |--[...]-->| text |-->| text |-' 2033$�+----*-+==[...]==>+----*-+==>+----*-+==>NULL 2034$� x * x * x * 2035$� v * v * v * ,--------------. 2036$�NULL * NULL * NULL **>("(a single tag)") 2037$� v * `--------------' 2038$� ,-----------. * ,----------------. 2039$�("header text") **>("this...blank row") 2040$� `-----------' `----------------' 2041 2042If the element does not create a box, things are trickier: We have to "lift" 2043all sub-items to the new level. This is done by concatenating the list of 2044items of the depth we are leaving to the list of items of the depth we 2045are entering.
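Both cases described above (adopting the sub-items into a new box, and lifting them) can be sketched as follows. This is only an illustration under assumptions: the structures are reduced to the pointers involved, and "sketch_adopt_children()" and "sketch_lift_items()" are hypothetical names, not functions from the netrik sources.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-ins for the real structures; only the pointers used here. */
struct Item {
    struct Item *next;        /* next sibling at the same depth */
    struct Item *parent;
    struct Item *first_child;
};

struct State {
    struct Item *first_item;  /* item list of one tree depth */
    struct Item *last_item;
};

/* Box case: all items of the depth being left become children of the box. */
void sketch_adopt_children(struct Item *box, struct State *inner)
{
    struct Item *child;

    assert(box != NULL && inner != NULL);
    box->first_child = inner->first_item;
    for (child = inner->first_item; child != NULL; child = child->next)
        child->parent = box;
    inner->first_item = inner->last_item = NULL;
}

/* Non-box case: append the list of the depth being left ("inner")
 * to the list of the depth being entered ("outer"). */
void sketch_lift_items(struct State *outer, struct State *inner)
{
    assert(outer != NULL && inner != NULL);
    if (inner->first_item == NULL)
        return;                                      /* nothing to lift */
    if (outer->first_item == NULL)
        outer->first_item = inner->first_item;       /* outer list was empty */
    else
        outer->last_item->next = inner->first_item;  /* concatenate the lists */
    outer->last_item = inner->last_item;
    inner->first_item = inner->last_item = NULL;
}
```

Keeping both a head and a tail pointer per depth is what makes the lifting a constant-time operation; only the box case has to walk over the children, in order to set their "parent" pointers.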
2046 2047$�state[0]->first_item -->NULL (depth 0) 2048$�state[0]->last_item **>NULL 2049$�--- 2050$�+------+ <-- state[1].first_item (depth 1) 2051$�| text |-->NULL <** state[1].last_item 2052$�+----*-+==>NULL 2053$� x * 2054$� v * 2055$�NULL * 2056$� v 2057$� ,-----------. 2058$�("header text") 2059$� `-----------' 2060$�--- 2061$�state[2]->first_item -->NULL (depth 2) 2062$�state[2]->last_item **>NULL 2063$�--- ,-- state[3].first_item 2064$�+------+<-'+-------+ <** state[3].last_item (depth 3) 2065$�| text |-->| blank |-->NULL 2066$�+----*-+==>+-------+==>NULL 2067$� x * x 2068$� v * v 2069$�NULL * NULL 2070$� v 2071$� ,-------. 2072$�("heading") 2073$� `-------' 2074$�--- 2075$�+------+ <-- first_item 2076$�| text |-->NULL <** last_item <== depth=4 2077$�+----*-+==>NULL 2078$� x * 2079$� v * 2080$�NULL * 2081$� v 2082$� ,------------. 2083$�("first...text") 2084$� `------------' 2085 2086$� +------+ <+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2087$�->| body |-. <++++++++++++++ + 2088$� +------+ | <+ + + 2089$� xx^ v + + + 2090$� +----++ +---+ <++++++++++++++++++++ +---+ 2091$�==> | h1 |-. <+ ,->| p |---. <++ + **> ,->| p |-> 2092$�depth=3 +----+ | + | +---+ | + + | +---+ 2093$� v + | v + + | 2094$� ,--------. | ,------------------. ,----. | 2095$� (" heading") | (" first...newlines,") (" and") | 2096$� `--+---+-' | `-----+----+-------' +-+----+-+ | 2097$� | ? |----' | em |-. <++ ,->| strong |-. <++ | 2098$� +---+ +----+ | + | +--------+ | + | 2099$� v + | v + | 2100$� ,----------------. | ,------------. | 2101$� (" emphasized text") | (" strong text") | 2102$� `--------+---+---' | `-----+---+--' | 2103$� | ? |------' --> | ? |-----' 2104$� +---+ +---+ 2105 2106$�state[0]->first_item -->NULL (depth 0) 2107$�state[0]->last_item **>NULL 2108$�--- 2109$�+------+ <-- state[1].first_item (depth 1) 2110$�| text |-->NULL <** state[1].last_item 2111$�+----*-+==>NULL 2112$� x * 2113$� v * 2114$�NULL * 2115$� v 2116$� ,-----------. 
2117$�("header text") 2118$� `-----------' 2119$�--- 2120$�state[2]->first_item -->NULL (depth 2) 2121$�state[2]->last_item **>NULL 2122$�--- ,-- first_item 2123$�+------+<-'+-------+ +------+ <** last_item 2124$�| text |-->| blank |-->| text |-->NULL <== depth=3 2125$�+----*-+==>+-------+==>+----*-+==>NULL 2126$� x * x x * 2127$� v * v v * 2128$�NULL * NULL NULL * 2129$� v v 2130$� ,-------. ,------------. 2131$�("heading") ("first...text") 2132$� `-------' `------------' 2133 2134In the second case, we need an additional step if the element we are leaving 2135was a <center> element (or some element with an align="center" attribute): In 2136this case all sub-elements have to be centered one by one. (Elements creating a 2137box are centered as a whole.) 2138 2139Finally, we have to do some more line break/blank line handling. 2140 2141$#</a> <!--parseElementsProcessing--> 2142 2143$#<a name="textBlocks"> 2144 2145$=$$<h3>$$$_Text Blocks$_$$</h3>$$ 2146 2147One single text block can be created by several elements. Every series of text 2148parts not interrupted by elements requiring line breaks (or blank lines) 2149around them is stored in a single text item; it can even contain newline 2150characters created by <br> elements. This is done by calling add_string() every 2151time text data is encountered. 2152 2153The color for the text added by some element is determined from the current 2154"state->text_mode" and "state->high". The normal color for the text mode is 2155looked up in "color_map[]", and then its bit 3 is inverted if the text is 2156highlighted. This is more or less a hack; we'll have to replace this by some 2157serious attribute handling at some point... 2158 2159$=$$<h4>$$$_add_string()$_$$</h4>$$ 2160 2161This function concatenates the new text to the "string" of "string_item", which 2162is a global variable pointing to the currently open text item.
If the new text 2163has other attributes than the last division of the "string" so far, a new 2164division is created for the new text; otherwise, the text is simply added to 2165the last division. 2166 2167$�text_item->string: 2168$� div[0].end div[1].end div[2].end 2169$� v v v 2170$�text: "[...]newlines, emphasized text and" 2171$� `~~~~~v~~~~~~'`~~~~~~~v~~~~~~'`~v' 2172$� div[1].color=MAGENTA div[2].color=WHITE 2173$� div[0].color=WHITE 2174 2175$�+------------------------------------------------+ 2176$�|+-----------+ | 2177$�||header text| | 2178$�|+-----------+ | 2179$�| | 2180$�|+-------+ | 2181$�||heading| | 2182$�|+-------+ | 2183$�| | 2184$�|+----------------------------------+ | 2185$�||[...]newlines, emphasized text and| | 2186 2187$� ,------------------. ,----. 2188$�(" first...newlines,") (" and") 2189$� `-----+----+-------' +-+----+-+ 2190$� | em |-. <++ ,->| strong |-. <++ 2191$� +----+ | + | +--------+ | + 2192$� v + | v + 2193$� ,----------------. | ,------------. 2194$� (" emphasized text") | (" strong text") 2195$� `--------+---+---' | `-----+---+--' 2196$� | ? |------' --> | ? |--> 2197$� +---+ +---+ 2198 2199$�text_item->string: 2200$� div[0].end div[1].end div[2].end div[3].end 2201$� v v v v 2202$�text: "[...]newlines, emphasized text and strong text" 2203$� `~~~~~v~~~~~~'`~~~~~~~v~~~~~~'`~v'`~~~~~v~~~~' 2204$� div[1].color=MAGENTA div[3].color=STRONG WHITE 2205$� div[0].color=WHITE div[2].color=WHITE 2206 2207$�+------------------------------------------------+ 2208$�|+-----------+ | 2209$�||header text| | 2210$�|+-----------+ | 2211$�| | 2212$�|+-------+ | 2213$�||heading| | 2214$�|+-------+ | 2215$�| | 2216$�|+----------------------------------------------+| 2217$�||[...]newlines, emphasized text and strong text|| 2218 2219A new division can also be enforced, by an additional call of add_text() with 2220NULL as text before adding the next text part. 
(This is necessary to prevent 2221multiple consecutive links from being merged into one div, as link highlighting 2222is done div-wise.) 2223 2224Appending to an existing text item is only possible if the last created item 2225was a text item, and there was no breaking element. (Neither the last one, nor 2226the new one, nor anyone in between.) Otherwise, a new text item has to be 2227created, and a new "String" structure for it. This is the only place where new 2228text items are created. 2229 2230$�| | 2231$�|+---------------------------------+| 2232$�||starting with an evil center tag,|| 2233$�|+---------------------------------+| 2234 2235$� +---+ 2236$�->| p |-. <++++++++++++++++++ 2237$� +---+ | <+ + 2238$� | + + 2239$� | + ,----------. 2240$� v + (" this...em") 2241$� +--------+ `---+---+--' <-- 2242$� | center |-. <++ ,->| a |-> 2243$� +--------+ | + | +---+ 2244$� v + | 2245$� ,----------------. | 2246$� (" starting...tag,") | 2247$� `--------+---+---' | 2248$� | ? |------' 2249$� +---+ 2250 2251$�| | 2252$�|+---------------------------------+| 2253$�||starting with an evil center tag,|| 2254$�|+---------------------------------+| 2255$�|+---------------------+ | 2256$�||this very...anchor em| | 2257 2258If there is a space at the beginning of a string, it is discarded while 2259creating the string, for a text block always starts with a word. 2260 2261$#</a> <!--textBlocks--> 2262 2263$=$$<h3>$$$_Line Breaks$_$$</h3>$$ 2264 2265As soon as an element forcing a line break is either entered or left (there are 2266no elements creating a break only before or only behind it), "string_item" is 2267set to NULL, indicating that no more text can be added to the last text item, 2268and new text has to create a new one. 2269 2270$� +---+ 2271$�->| p |-. <++++++++++++++++++ 2272$� +---+ | <+ + 2273$� | + + 2274$� | + ,----------. 2275$� v + (" this...em") 2276$� +--------+ <xx `---+---+--' <** 2277$� | center |-. 
<++ ,->| a |-> 2278$� +--------+ | + | +---+ 2279$� v + | 2280$� ,----------------. | 2281$� (" starting...tag,") | 2282$� `--------+---+---' | <-- <== 2283$� | ? |------' 2284$� +---+ 2285 2286$�| | 2287$�|+---------------------------------+| 2288$�||starting with an evil center tag,|| 2289 2290$� +---+ <xx 2291$�->| p |-. <++++++++++++++++++ 2292$� +---+ | <+ + 2293$� | + + 2294$� | + ,----------. 2295$� v + (" this...em") 2296$� +--------+ `---+---+--' <** <== 2297$� | center |-. <++ ,->| a |-> 2298$� +--------+ | + | +---+ 2299$� v + | 2300$� ,----------------. | 2301$� (" starting...tag,") | 2302$� `--------+---+---' | <-- 2303$� | ? |------' 2304$� +---+ 2305 2306$�| | 2307$�|+---------------------------------+| 2308$�||starting with an evil center tag,|| 2309$�|+---------------------------------+| 2310 2311$#<a name="blankLines" id="blankLines"> 2312 2313$=$$<h3>$$$_Blank Lines$_$$</h3>$$ 2314 2315Probably the most tricky part is handling of blank lines. Similar to line 2316breaks, items needing blank lines have them before *and* after them. However, 2317when two such items meet, they have only *one* blank line between them. 2318Furthermore, a blank line is *never* inserted before the first or after the 2319last item inside a box. That's why blank lines cannot be simply inserted when 2320entering an element causing blank lines, or when leaving it. Instead, only 2321"requests" for blank lines are stored in "para_blank", and if certain 2322conditions are met, a blank line is inserted before the next item (inside 2323add_item()). 2324 2325Actually, the blank line is inserted not in front of the new item, but after 2326the last item, at the same tree depth as that one. This is important, to ensure 2327that a blank line generated in front of a box is actually inserted *outside* of 2328the box, not inside it, as would be the case if it was inserted at the current 2329depth. 
The global "blank_depth" variable is responsible for this, and is set to 2330the current depth every time a blank line request is generated. The blank item 2331is then inserted directly to the "state" structure at "blank_depth" in 2332add_item(). This is surely very bad programming style ;-) 2333 2334The blank line requests are managed by the global "para_blank" variable. A 2335value of 1 indicates that a blank line is needed, and will be generated by the 2336next add_item(). A value of 0 indicates that no blank line is needed in any 2337case; this situation can only occur when there are no items inside the current 2338box yet. A value of -1 indicates that a blank line *may* be necessary. That is 2339the case when we already have some items inside the current box, but the last 2340item does not need a blank line; a blank line needs to be inserted only when 2341the following item wants one. 2342 2343When entering an element needing a blank line, a request is generated 2344("para_blank" set to 1 and "blank_depth" is stored) if "para_blank" was -1, 2345indicating that there are already items in the current box. 2346 2347$� + 2348$� +------+ 2349$�->| body |-. <++++++++++++++ 2350$� +------+ | <+ + 2351$� v + + 2352$� +----++ +---+ 2353$� --> | h1 |-. <+ ,->| p |-> 2354$� cur_ +----+ | + | +---+ 2355$� tag v + | 2356$� ,--------. | 2357$� (" heading") | 2358$� `--+---+-' | 2359$� | ? |----' 2360$� +---+ 2361 2362$�+------------------+ 2363$�|+-----------+ | 2364$�||header text| | 2365$�|+-----------+ | 2366$�??? (para_blank=-1) 2367 2368$� + 2369$� +------+ 2370$�->| body |-. <++++++++++++++ 2371$� +------+ | <+ + 2372$� v + + 2373$� +----++ +---+ 2374$� --> | h1 |-. <+ ,->| p |-> 2375$� ==> +----+ | + | +---+ 2376$� blank_ v + | 2377$� depth ,--------. | 2378$� (" heading") | 2379$� `--+---+-' | 2380$� **> | ? 
|----' 2381$� list_next +---+ 2382$�(new "cur_el") 2383 2384$�+------------------+ 2385$�|+-----------+ | 2386$�||header text| | 2387$�|+-----------+ | 2388$�| (para_blank=1) | 2389$� ... 2390$�|+-------+ | (will be created later) 2391$�||heading| | 2392$�|+-------+ | 2393 2394If "para_blank" is already 1, the request is left unchanged; a blank line will 2395be inserted already. 2396 2397$�| | 2398$�|+-------+ | 2399$�||heading| | 2400$�|+-------+ | 2401$�| (para_blank=1) | 2402 2403$� | 2404$� v 2405$�+----+ <== +---+ <-- 2406$�| h1 |-. <+ ,->| p |---. <++ 2407$�+----+ | + | +---+ | + 2408$� v + | v + 2409$� ,--------. | ,------------------. 2410$� (" heading") | (" first...newlines,") 2411$� `--+---+-' | `-----+----+-------' 2412$� | ? |----' **> | em |-. 2413$� +---+ +----+ | 2414 2415$�| | 2416$�|+-------+ | 2417$�||heading| | 2418$�|+-------+ | 2419$�| (para_blank=1) | 2420$� ... 2421$�|+------------+ | 2422$�||first...text| | 2423$�|+------------+ | 2424 2425If "para_blank" is 0 nothing is done, even if the element would normally need a 2426blank line: No blank line is ever inserted in front of the first item inside a 2427box. 2428 2429$�+-----------+ 2430$�(para_blank=0) 2431 2432$�+---+ 2433$�| ! |-. <+ 2434$�+---+ | + 2435$� v + 2436$� +---++ 2437$�--> | p |-. <+ 2438$� +---+ | + 2439$� v + 2440$� ,---------. 2441$� ("some text") 2442$� `--+---+--' 2443$� **> | ? |-> 2444$� +---+ 2445 2446$�+-----------+ 2447$� ... 2448$�|+---------+| 2449$�||some text|| 2450$�|+---------+| 2451 2452When some item (text or box) is inserted by add_item(), "para_blank" is always 2453set to -1, as now there is at least one item in the current box. 2454 2455$�+------------------+ 2456$�|+-----------+ | 2457$�||header text| | 2458$�|+-----------+ | 2459$�| (para_blank=1) | 2460 2461$� + 2462$� +------+ 2463$�->| body |-. <++++++++++++++ 2464$� +------+ | <+ + 2465$� v + + 2466$� +----++ +---+ 2467$� | h1 |-. <+ ,->| p |-> 2468$� +----+ | + | +---+ 2469$� v + | 2470$� ,--------. 
| 2471$� (" heading") | 2472$� `--+---+-' | 2473$� --> | ? |----' 2474$� +---+ 2475 2476$�+------------------+ 2477$�|+-----------+ | 2478$�||header text| | 2479$�|+-----------+ | 2480$�| | 2481$�|+-------+ | 2482$�||heading| | 2483$�|+-------+ | 2484$�??? (para_blank=-1) 2485 2486When leaving an element needing a blank line, a request is generated also, and 2487will be handled in the next add_item(). Otherwise, the current state is kept. 2488 2489$� + 2490$� +------+ 2491$�->| body |-. <++++++++++++++ 2492$� +------+ | <+ + 2493$� v + + 2494$� +----++ +---+ 2495$� xx> | h1 |-. <+ ,->| p |-> <** 2496$�new_el +----+ | + | +---+ 2497$� v + | 2498$� ,--------. | 2499$� (" heading") | 2500$� `--+---+-' | 2501$� --> | ? |----' 2502$� +---+ 2503 2504$�+------------------+ 2505$�|+-----------+ | 2506$�||header text| | 2507$�|+-----------+ | 2508$�| | 2509$�|+-------+ | 2510$�||heading| | 2511$�|+-------+ | 2512$�??? (para_blank=-1) 2513 2514$� + 2515$� +------+ <xx 2516$�->| body |-. <++++++++++++++ 2517$� +------+ | <+ + 2518$� v + + 2519$� +----++ +---+ 2520$� ==> | h1 |-. <+ ,->| p |-> <** 2521$� +----+ | + | +---+ 2522$� v + | 2523$� ,--------. | 2524$� (" heading") | 2525$� `--+---+-' | 2526$� --> | ? |----' 2527$� +---+ 2528 2529$�+------------------+ 2530$�|+-----------+ | 2531$�||header text| | 2532$�|+-----------+ | 2533$�| | 2534$�|+-------+ | 2535$�||heading| | 2536$�|+-------+ | 2537$�| (para_blank=1) | 2538 2539When entering an element generating a box, "para_blank" is reset to 0. 2540 2541$�| | 2542$�|+---------+ | 2543$�||some text| | 2544$�|+---------+ | 2545$�??? (para_blank=-1) 2546 2547$� +--------+ +------+ 2548$�->| center |-. ,->| form |-. <-- 2549$� +--------+ | | +------+ | 2550$� v | | 2551$� ,---------. | | 2552$� ("some text") | v 2553$� `--+---+--' | +---+ 2554$� | ? 
|-----' **> | p |-> 2555$� +---+ +---+ 2556 2557$�| | 2558$�|+---------+ | 2559$�||some text| | 2560$�|+---------+ | 2561$�|+--------------+| 2562$�(para_blank=0) 2563 2564However, it's left unchanged if there is already a request, indicating that a 2565blank line will be inserted *in front* of the box by the next add_item(). 2566 2567$�| | 2568$�|+---------+ | 2569$�||some text| | 2570$�|+---------+ | 2571$�| (para_blank=1) | 2572 2573$� +---+ +------+ 2574$�->| p |-. <== ,->| form |-. <-- 2575$� +---+ | | +------+ | 2576$� v | | 2577$� ,---------. | | 2578$� ("some text") | v 2579$� `--+---+--' | +---+ 2580$� | ? |-----' **> | p |-> 2581$� +---+ +---+ 2582 2583$�| | 2584$�|+---------+ | 2585$�||some text| | 2586$�|+---------+ | 2587$�| (para_blank=1) | 2588$�|+--------------+| 2589 2590When leaving a box creating element, any request generated inside the box is 2591discarded. We never insert a blank line after the last item of a box. 2592 2593$� ++>NULL 2594$� + .--------------------------------. 2595$� + v | 2596$�+---+ <xx | 2597$�| ! |-. <++ <** | 2598$�+---+ | + | 2599$� v + | 2600$� ,-----------. | 2601$�("header text") | 2602$� `-+------+--' | 2603$� | html |-. <+++++++++++ | 2604$� +------+ | + | 2605$� | [...] [...] 2606$� [...] + | 2607$� | ,---------------. | 2608$� | (" (a single tag)") | 2609$� | `-----+---+-----' | <-- 2610$� `--------->| ? |--------' 2611$� +---+ 2612 2613$�|+--------------+ | 2614$�||(a single tag)| | 2615$�|+--------------+ | 2616$�| (para_blank=1) | 2617 2618$� ++>NULL <xx 2619$� + .--------------------------------. 2620$� + v | 2621$�+---+ | 2622$�| ! |-. <++ <** | 2623$�+---+ | + | 2624$� v + | 2625$� ,-----------. | 2626$�("header text") | 2627$� `-+------+--' | 2628$� | html |-. <+++++++++++ | 2629$� +------+ | + | 2630$� | [...] [...] 2631$� [...] + | 2632$� | ,---------------. | 2633$� | (" (a single tag)") | 2634$� | `-----+---+-----' | <-- 2635$� `--------->| ? 
|--------' 2636$� +---+ 2637 2638$�|+--------------+ | 2639$�||(a single tag)| | 2640$�|+--------------+ | 2641$�+------------------+ 2642$�??? (para_blank=-1) 2643 2644However, if the last request was generated *before* descending into the element 2645creating the box (only possible if the box is empty), it is kept, and the blank 2646is inserted when the box is added. 2647 2648$�| | 2649$�|+---------+ | 2650$�||some text| | 2651$�|+---------+ | 2652$�| (para_blank=1) | 2653$�|+--------------+| 2654 2655$� +---+ +------+ <xx +---+ 2656$�->| p |-. <== ,->| form |-. ,->| p |-> <** 2657$� +---+ | | +------+ | | +---+ 2658$� v | | | 2659$� ,---------. | | | 2660$� ("some text") | v | 2661$� `--+---+--' | +------+ | 2662$� | ? |-----' --> | span |-' 2663$� +---+ +------+ 2664 2665$�| | 2666$�|+---------+ | 2667$�||some text| | 2668$�|+---------+ | 2669$�| | 2670$�|+--------------+| 2671$�|+--------------+| 2672 2673$=$$<h4>$$$_<br> elements$_$$</h4>$$ 2674 2675<br> elements used to break text blocks as described above. This, however, turned 2676out to be a bug: some elements (e.g. <a>) can span over a <br>, and these were 2677handled incorrectly with that approach. 2678 2679Now a '\n' is simply added to the text block when <br> is encountered, and then 2680correctly handled while $$<a$+href="#lineBreaking">$$$_Breaking String into 2681Lines$_$$</a>$$$ . 2682 2683$#</a> <!-- blankLines --> 2684 2685$#<a name="links" id="links"> 2686 2687$=$$<h3>$$$_Links$_$$</h3>$$ 2688 2689When a link element is encountered (any <a> element having an "href"), 2690"cur_state->link_type" is set (to FORM_NO), indicating that the element 2691creates a link. The URL (from the "href" attribute) is saved in 2692"cur_state->link_value". 2693 2694The link is then stored when leaving the link element. (After processing all 2695sub-elements.) 
The link start is set to "cur_state->link_start", which was the 2696string end position while entering the link element; the link end is set to the 2697current string end position. This way the link spans all text generated inside 2698the element. 2699 2700There is some additional handling necessary to work around broken links, however. 2701(This used to be important to make forms work with SGML at all; now that full 2702SGML support is implemented, it probably only helps a few broken pages...) 2703 2704When a link doesn't end in the same string as it started (checked by 2705"cur_state->link_item", which was the current string item when entering the 2706link element), it is stored in the starting string, not the current one. 2707However, if there was no active text item when the link started (indicating the 2708link starts at the beginning of a text block), we can't determine the starting 2709string. In this case, we store it in the current string -- normally, this is 2710the right thing to do; if the link spans multiple strings, on the other hand, 2711at least the last part is stored this way. 2712 2713Some magic is necessary to handle nested links: The link which is stored later 2714is the outer one, i.e. it starts *before* the previously stored inner link... To 2715handle this in a useful fashion, we also have to *store* it before the inner 2716link. Thus, instead of simply appending it at the end of the string's link list, 2717we have to shift all inner links back by one position, and put the new (outer) 2718link at the free position created in front of them. 2719 2720Additionally, two hashes, one of the link URL and one of the text inside the link, are 2721stored for each link; this is necessary to recognize the right link to 2722reactivate when a page is revisited but its content has changed. See 2723$$<a$+href="hacking-page.html#reactivating">$$$_Reactivating Link$_$$</a>$$ in 2724hacking-page.* for details. 
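The shifting done for nested links, as described above, can be sketched roughly like this. It is only an illustration, not the actual netrik code: "struct DemoLink", "struct DemoLinkList", "MAX_LINKS" and store_link() are made-up names, the fixed-size array is a simplification, and deciding what counts as an "inner" link by comparing start positions is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_LINKS 16    /* fixed capacity, only to keep the sketch short */

/* Hypothetical, stripped-down link entry: only the string positions. */
struct DemoLink {
    int start;    /* string position where the link begins */
    int end;      /* string position where the link ends */
};

/* Hypothetical link list of one text string. */
struct DemoLinkList {
    struct DemoLink link[MAX_LINKS];
    int count;
};

/*
 * Store a new link. An outer link is stored *after* its inner links, but it
 * starts before them; so all already stored links that start at or behind
 * the new start position (assumed to be the inner ones) are shifted back by
 * one position, and the new link is put into the gap in front of them.
 *
 * Returns the position at which the link was stored, or -1 if the list is full.
 */
int store_link(struct DemoLinkList *list, int start, int end)
{
    int pos;

    assert(list != NULL);
    if (list->count >= MAX_LINKS)
        return -1;

    /* inner links were stored last, so look for them from the list end */
    pos = list->count;
    while (pos > 0 && list->link[pos - 1].start >= start)
        --pos;

    /* shift the inner links (if any) back by one position */
    memmove(&list->link[pos + 1], &list->link[pos],
            (size_t)(list->count - pos) * sizeof list->link[0]);

    list->link[pos].start = start;
    list->link[pos].end = end;
    ++list->count;

    return pos;
}
```

For a link that isn't nested, no stored link starts behind it, so nothing is shifted and the link simply ends up at the end of the list.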
2725 2726$#</a> <!-- links --> 2727 2728$=$$<h3>$$$_Forms$_$$</h3>$$ 2729 2730Form elements are handled very similarly, and by the same code. The only 2731difference is that "link_type" is set to some form type instead of FORM_NO, and 2732the initial "value" of the element is stored in the structure tree so it will 2733be submitted to the server, and (for some form control types) displayed on the 2734page via $$<a$+href="hacking-links.html#setForm">$$$_set_form()$_$$</a>$$ (see 2735hacking-links.*). 2736 2737Also, "form_enabled" is set for all elements that will be submitted to the 2738server. It is always set for elements that are submitted unconditionally 2739(text/password/hidden input fields), and set to the initial state for elements 2740that are submitted depending on their state (radio buttons, checkboxes, 2741<select> options). Submit buttons initially aren't enabled. 2742 2743Some special handling is necessary for <select> options: They do not have their 2744own "name" attribute to store in "Link->name" like other form elements; the 2745"name" for all options is given in the <select> element instead. Thus it needs 2746to be passed to all the options via the state stack, in "select_name". The same 2747goes for the link type (FORM_OPTION or FORM_MULTIOPTION) of the option links: It 2748depends on the presence of the "multiple" attribute in the <select> element, 2749and is passed in "select_type" for that reason. 2750 2751$#<a name="virtual" id="virtual"> 2752 2753$=$$<h3>$$$_Virtual Items$_$$</h3>$$ 2754 2755There are two kinds of "virtual" items: ITEM_BLOCK_ANCHOR and 2756ITEM_INLINE_ANCHOR. As the names suggest, both types are presently used for 2757anchors. The difference is that block anchors are created by block elements 2758with an "id" attribute, and span one or more block elements, while inline 2759anchors are created inside text blocks by the classical "a" element or by any 2760inline element with an "id", and span only part of a text. 
2761 2762"virtual" means that these items do not affect layouting like other items; they 2763only create some additional structure. 2764 2765$=$$<h4>$$$_Block Anchors$_$$</h4>$$ 2766 2767As block anchors span multiple other elements, they have to act as boxes, 2768containing a series of virtual children. In contrast to real box items, they do 2769not get their size assigned by the parent, and do not assign size to their 2770virtual children. The children get their size assigned by the real parent 2771directly, and the size of the virtual box is determined afterwards, from the 2772outer bounds of all virtual children. 2773 2774In the present implementation, virtual box items aren't inserted normally as 2775parents of their virtual children into the item tree. They are simply inserted 2776as normal items after their virtual children, at the same tree depth. The 2777virtual box is created only by special pointers, handled only in the necessary 2778places; the normal tree traversal functions do not know about them. This is a 2779hack; the idea was to integrate them into the existing system changing as 2780little as possible. Most probably this will be replaced by a clean 2781implementation using real parent/children relations in the future. 2782 2783$�virtual tree: 2784$� +-----+ 2785$� ,->| box | 2786$� | +-----+ 2787$� | x ^ 2788$� xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx + 2789$� x +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2790$� v + + + | 2791$�+------+ +--------------+ +------+ | 2792$�| text |-. 
,->| block anchor |-->| text |-' 2793$�+------+=|======================|=>+--------------+==>+------+==>NULL 2794$� | | x ^ 2795$� | xxxxxxxxxxxxxxxxxxxxxxxx + 2796$� | x +++++++++++++++++++++++++++ 2797$� | v + + | 2798$� | +------+ +------+ | 2799$� `->| text |-->| text |-' 2800$� +------+ +------+==>NULL 2801 2802$�real tree: 2803$� 2804$����> virtual child +-----+ 2805$� ,->| box | 2806$� virtual box | +-----+ 2807$� ,----------^----------. | x ^ 2808$� xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx + 2809$� x +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2810$� v + + + + + | 2811$�+------+ +------+ +------+ +--------------+ +------+ | 2812$�| text |-->| text |-->| text |-->| block anchor |-->| text |-' 2813$�+------+==>+------+==>+------+==>+--------------+==>+------+==>NULL 2814$� ^ � 2815$� � � 2816$� ���������������������������� 2817 2818$=$$<h4>$$$_Inline Anchors$_$$</h4>$$ 2819 2820As inline anchors are contained within text elements, they have to act as 2821children of text items. Text item however can't have children in the present 2822implementation; thus, inline anchors presently are also only virtual children. 2823 2824Similar to the virtual boxes created by block anchors, these virtual children 2825are stored after their virtual parent, at the same tree depth. This is a hack 2826just as the virtual boxes, and will be replaced also. 2827 2828Additionally to the virtual parent (and the name of the anchor), inline anchors 2829need to store the position of their start and end inside the parent string. 2830 2831$�virtual tree: 2832$� 2833$�%%%> anchor_start,anchor_end +-----+ 2834$� .->| box | 2835$� | +-----+ 2836$� | x ^ 2837$� xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx + 2838$� x +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2839$� v + + + | 2840$�+------+ +------+ +------+ | 2841$�| text |-. 
.->| text |-->| text |-' 2842$�+------+=|========================================|=>+------+==>+------+==>NULL 2843$� | | x ^ * 2844$� | | x + * ,--------------------. 2845$� | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx + **>("anchor1 and anchor2.") 2846$� | x ++++++++++++++++++++++++++++++++++++++ `--------------------' 2847$� | x + + | ^ ^ ^ ^ 2848$� | x + + | % % % % 2849$� | x + %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % % 2850$� | x + % + %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 2851$� | v + % + % | 2852$� | +---------------+ +---------------+ | 2853$� `->| inline anchor |-->| inline anchor |-' 2854$� +---------------+==>+---------------+==>NULL 2855 2856$�real tree: 2857$� 2858$����> virtual parent +-----+ 2859$� .->| box | 2860$� | +-----+ 2861$� | x ^ 2862$� xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx + 2863$� x +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2864$� v + + + + + | 2865$�+------+ +------+ +---------------+ +---------------+ +------+ | 2866$�| text |-->| text |-->| inline anchor |-->| inline anchor |-->| text |-' 2867$�+------+==>+------+==>+---------------+==>+---------------+==>+------+==>NULL 2868$� * ^ �% �% 2869$�%%%%%%%%%%%%%*%%�%%%%%%%%%%%%%�% �% 2870$�% * � � �% 2871$�% * �����������������������������������% 2872$�% v % 2873$�% ,--------------------. % 2874$�% ("anchor1 and anchor2.") % 2875$�% `--------------------' % 2876$�% ^ ^ ^ ^ % 2877$�% % % % % % 2878$�%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 2879 2880Links are very similar to inline anchors, and probably will be stored the same 2881way as anchors in the future, once ITEM_INLINE is cleanly implemented. The 2882present different handling of links and anchors is by purpose -- to see which 2883method turns out to be better. (Presently, it looks much like the anchor-method 2884is better.) 
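The data an inline anchor has to carry, as described above (virtual parent, name, and start/end position inside the parent string), could be sketched like this. This is only an illustration under some assumptions: "struct DemoText", "struct DemoInlineAnchor" and anchor_text() are made-up names, not the real structures, and "anchor_end" is assumed to point just behind the last char of the anchor.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a text item; only the string matters here. */
struct DemoText {
    const char *text;
};

/* Hypothetical inline anchor: a virtual child of a text item. */
struct DemoInlineAnchor {
    const char *name;                         /* name of the anchor */
    const struct DemoText *virtual_parent;    /* text item containing the anchor */
    int anchor_start;                         /* first char covered by the anchor */
    int anchor_end;                           /* assumed: position behind the last char */
};

/*
 * Copy the part of the parent string covered by the anchor into "buf"
 * (truncating if "buf" is too small). Returns the number of chars copied.
 */
size_t anchor_text(const struct DemoInlineAnchor *anchor, char *buf, size_t size)
{
    size_t len;

    assert(anchor != NULL && anchor->virtual_parent != NULL);
    assert(buf != NULL && size > 0);

    len = (size_t)(anchor->anchor_end - anchor->anchor_start);
    if (len > size - 1)
        len = size - 1;

    memcpy(buf, anchor->virtual_parent->text + anchor->anchor_start, len);
    buf[len] = '\0';

    return len;
}
```

With the string from the diagrams ("anchor1 and anchor2."), two such anchors would cover positions 0 to 7 and 12 to 19. The string itself is stored only once, in the text item; the anchors just point into it.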
2885 2886$#</a> <!-- virtual --> 2887 2888$=$$<h3>$$$_Finishing$_$$</h3>$$ 2889 2890After all elements of the syntax tree are traversed, the structure tree is 2891finalized by setting "next" of the tree top (the global item) to point back to 2892itself, and "parent" to point to the first item. 2893 2894parse_struct() returns a "struct Item" pointer to the tree top. This is passed 2895as an argument to the following passes by main(). 2896 2897It is a bit strange that "parent" of the tree top points to the first item. 2898However, it is important to have some simple way of accessing the first item, 2899as some of the following passes start with the first item; but only the pointer 2900to the tree top is passed on to the following passes. Maybe we should use some 2901function that finds the first item by following "first_child" instead -- as 2902this is done only once in every processing pass, it wouldn't be too big an 2903inefficiency. 2904 2905$#</a> <!--parseStruct--> 2906 2907$#<a name="freeItems" id="freeItems"> 2908 2909$=$$<h3>$$$_free_items()$_$$</h3>$$ 2910 2911This function is responsible for freeing the memory used by the item tree. 2912 2913Just as in $$<a$+href="#freeSyntax">$$$_free_syntax()$_$$</a>$$$ , the tree is 2914traversed by "list_next", and every node is freed, including all data belonging 2915to it. 2916 2917For text items, the associated strings are freed also; for anchor items, the 2918anchor data is freed. 2919 2920$#</a> <!--freeItems--> 2921 2922$#<a name="preRender" id="preRender"> 2923 2924$=$$<h2>$$$_7. pre-render.c$_$$</h2>$$ 2925 2926So far, the item tree only represents the structure of the page, i.e. the 2927dependencies of the items. The items have no actual sizes or positions yet. 
2928 2929$�# ################################################################################################ 2930$� 2931$�# +----------- 2932$�# |header text 2933$� 2934$�# *** 2935$�# * 2936$� 2937$�# +------- 2938$�# |heading 2939$� 2940$�# *** 2941$�# * 2942$� 2943$�# +----------------------------------------------------------------------------------------------- 2944$�# |first paragraph of text; includes multiple spaces and newlines, emphasized text and strong text 2945$� 2946$�# *** 2947$�# * 2948$� 2949$�# +--------------------------------- 2950$�# |starting with an evil center tag, 2951$� 2952$�# +----------------------------------------------------------------------------------------------- 2953$�# |this very long second paragraph contains some special characters (including a simple space...): 2954$�# | &; <>"=/ plus a big gap###and two unicode escapes (decimal: � and hexal: �) but also an anchor 2955$�# | embedded inside a word (this anchor also is the only tag with parameters); and finally a blank 2956$�# | row 2957$� 2958$�# +-------------- 2959$�# |(a single tag) 2960 2961pre_render() assigns coordinates inside the layouted page to all items. It also 2962breaks the text blocks into lines. 2963 2964This process is reversible (it doesn't alter any data stored before; it only 2965creates additional data), and thus resizing the output area of the viewer will 2966be possible without regenerating the item tree. (And without reloading the 2967file.) 2968 2969The coordinates are stored to "x_start", "x_end", "y_start" and "y_end" of the 2970item structures. For every text item, the positions of all line breaks are 2971stored in "line_table[]". This is the only way line breaks are indicated; the 2972text string itself isn't altered. Thus the rendering function has to look at 2973this table while generating output. ($ $$<a$+href="#renderC">$$$_8. 
render.c$_$$</a>$$$ ) 2974 2975Besides assigning the coordinates, pre_render() also generates the 2976"page_map[]". This is a structure telling which items occupy each line of the 2977output page. For every line, a table is generated, containing references to all 2978elements that show up in this line. It's described more thoroughly in 2979$$<a$+href="#createMap">$$$_create_map()$_$$</a>$$$ . 2980 2981The processing is split into five smaller passes, executed successively by 2982pre_render(). It could also be done in one single, pseudo-recursive pass. Most 2983probably this would be more efficient; it would be harder to understand, too. 2984Maybe we will change this at some point... 2985 2986$#<a name="calcWidth" id="calcWidth"> 2987 2988$=$$<h3>$$$_calc_width()$_$$</h3>$$ 2989 2990In the first pass, the minimal x-width of all items is calculated. This is done 2991by traversing the item tree by "list_next" (bottom to top). This ensures that 2992the sizes of all children are calculated before the parent. In every iteration 2993one item is processed. The action taken depends on the type of the item. 2994 2995For text items, the minimal width (stored in "x_end") is set to the width of 2996the longest word in the text block, as in other browsers. Note that we *can* 2997generate narrower text blocks; netrik can break words that do not fit on a 2998line, and scrolling of items will probably also be possible in the future. 2999(AFAIK no other browser does any of that, although it has been recommended by the 3000W3C for years...) 3001 3002$�# ###############[...] 
3020$� 3021$�# +------+ 3022$�# |header| 3023$� 3024$�# ** 3025$�# ** 3026$� ^x_end=0 3027$� 3028$�# +------- 3029$�# |heading 3030$�[...] 3031 3032Finding the longest word is quite easy. The whole text block is processed char 3033by char in a loop. For each char, the current word length "len" is incremented. 3034When the word ends (space or string end encountered), "len" is reset to 0. 3035"longest" keeps the current maximum, and is stored to "x_end" after the whole 3036text block has been processed. 3037 3038For box items, the minimal width is that of the widest sub-item. 3039 3040$�# ############## 3041$� 3042$�# +------+ # 3043$�# |header| # 3044$� 3045$�# ** # 3046$�# ** # 3047$� 3048$�# +-------+ # 3049$�# |heading| # 3050$� 3051$�# ** # 3052$�# ** # 3053$� 3054$�# +----------+ # 3055$�# |emphasized| # 3056$� 3057$�# ** # 3058$�# ** # 3059$� 3060$�# +--------+ # 3061$�# |starting| # 3062$� 3063$�# +------------# 3064$�# |parameters);# 3065$� 3066$�# +------+ # 3067$�# |single| # 3068$� ^ ^x_end 3069$� 0 3070 3071As all sub-items are processed before the parent, we already know the widths of 3072the sub-items when processing a box item. We simply go through all immediate 3073children (start with "first_child" and go on by "next") and look for the 3074maximum. 3075 3076$#</a> <!--calcWidth--> 3077 3078$#<a name="assignWidth" id="assignWidth"> 3079 3080$=$$<h3>$$$_assign_width()$_$$</h3>$$ 3081 3082The second pass assigns the x-coordinates (x_start and x_end) to all items. 3083(Presently, this is trivial: all items have the same coordinates as their 3084parent, which is the global item...) For text items, the positions of line 3085breaks are also calculated. 3086 3087$=$$<h4>$$$_Traversing Item Tree Top to Bottom$_$$</h4>$$ 3088 3089In this pass, the tree is traversed top to bottom, as the coordinates of the 3090sub-items depend on the coordinates of the parent. 
This is a bit more 3091complicated than traversing bottom to top, as there is no equivalent of the 3092"list_next" pointer for this. 3093 3094If the current item has children, we proceed with the first child. (Descend.) 3095 3096$� <+++ 3097$� + 3098$� +-----+ 3099$� ,->| box |-->NULL 3100$� | +-----+<==# 3101$� | x ^ #===# 3102$� xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx + 3103$� x +++++++++++++++++++++++++++++++++++ 3104$� v + + | 3105$� cur_item xx> +-----+ +-----+ | 3106$� ,->| box |-. ,->| box |-' 3107$� new cur_item | +-----+=|===========|=>+-----+==>NULL 3108$� | | x ^ | | x ^ 3109$� xxxxxxxx|xxxxxxxxxxxxxxx + | xxxxxxxxxxxxx + 3110$� x +++++|+++++++++++++++++ | x ++++++++++++ 3111$� v + | + | | v + | 3112$�+------+<-'+------+ | | +------+ | 3113$�| text |-->| text |-' `->| text |-' 3114$�+------+==>+------+==>NULL +------+==>NULL 3115$� x x x 3116$� v v v 3117$� NULL NULL NULL 3118 3119If the current item has no children, we go to the "next" item. 3120 3121$� <+++ 3122$� + 3123$� +-----+ 3124$� ,->| box |-->NULL 3125$� | +-----+<==# 3126$� | x ^ #===# 3127$� xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx + 3128$� x +++++++++++++++++++++++++++++++++++ 3129$� v + + | 3130$� +-----+ +-----+ | 3131$� ,->| box |-. ,->| box |-' 3132$� | +-----+=|===========|=>+-----+==>NULL 3133$� | x ^ | | x ^ 3134$� xxxxxxxxxxxxxxxxxxxxxxxx + | xxxxxxxxxxxxx + 3135$� x +++++++++++++++++++++++ | x ++++++++++++ 3136$� v + + | | v + | 3137$�+------+ +------+ | <-- | +------+ | 3138$�| text |-->| text |-' `->| text |-' 3139$�+------+==>+------+==>NULL +------+==>NULL 3140$� x ^xx x x 3141$� v v v 3142$� NULL NULL NULL 3143 3144If there is no "next" item (we are already at the last item of this depth), we 3145have to ascend before we can go to the next item. 
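The whole top-to-bottom walk (descend if possible, otherwise go to "next", ascending as often as necessary first) can be sketched as follows. This is only a rough sketch: "struct DemoNode" and traverse_top_down() are made-up names, and the only thing taken for granted about the real tree is that "next" of the tree top points back to the top itself, which is what terminates the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down item: only the pointers used for traversal. */
struct DemoNode {
    struct DemoNode *parent;
    struct DemoNode *first_child;
    struct DemoNode *next;    /* next item at the same depth; NULL for the last one */
    int id;                   /* only used to record the visiting order */
};

/*
 * Visit all items top to bottom, i.e. every parent before its children.
 * The ids of the visited items are written to "order" (at most "max" of
 * them). Returns the number of items visited.
 */
int traverse_top_down(struct DemoNode *top, int order[], int max)
{
    struct DemoNode *cur = top;
    int count = 0;

    /* the tree top is expected to have its "next" pointing back to itself */
    assert(top != NULL && top->next == top);

    do {
        /* "process" the current item */
        if (count < max)
            order[count] = cur->id;
        ++count;

        if (cur->first_child != NULL) {
            cur = cur->first_child;    /* descend */
        } else {
            while (cur->next == NULL)
                cur = cur->parent;     /* ascend until there is a "next" */
            cur = cur->next;           /* at the tree top, this stays at the top */
        }
    } while (cur != top);

    return count;
}
```

Note that the sketch never follows "parent" of the tree top: as "next" of the top is not NULL, ascending always stops there.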
3146 3147$� <+++ 3148$� + 3149$� +-----+ 3150$� ,->| box |-->NULL 3151$� | +-----+<==# 3152$� | x ^ #===# 3153$� xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx + 3154$� x +++++++++++++++++++++++++++++++++++ 3155$� v + + | 3156$� +-----+ +-----+ | 3157$� --> ,->| box |-. ,->| box |-' 3158$� | +-----+=|===========|=>+-----+==>NULL 3159$� | x ^ | | x ^ 3160$� xxxxxxxxxxxxxxxxxxxxxxxx + | xxxxxxxxxxxxx + 3161$� x +++++++++++++++++++++++ | x ++++++++++++ 3162$� v + + | | v + | 3163$�+------+ +------+ | <xx | +------+ | 3164$�| text |-->| text |-' `->| text |-' 3165$�+------+==>+------+==>NULL +------+==>NULL 3166$� x x x 3167$� v v v 3168$� NULL NULL NULL 3169 3170$� <+++ 3171$� + 3172$� +-----+ 3173$� ,->| box |-->NULL 3174$� | +-----+<==# 3175$� | x ^ #===# 3176$� xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx + 3177$� x +++++++++++++++++++++++++++++++++++ 3178$� v + + | 3179$� +-----+ --> +-----+ | 3180$� ,->| box |-. ,->| box |-' 3181$� | +-----+=|===========|=>+-----+==>NULL 3182$� | x ^ | | x ^ 3183$� xxxxxxxxxxxxxxxxxxxxxxxx + | xxxxxxxxxxxxx + 3184$� x +++++++++++++++++++++++ | x ++++++++++++ 3185$� v + + | | v + | 3186$�+------+ +------+ | <xx | +------+ | 3187$�| text |-->| text |-' `->| text |-' 3188$�+------+==>+------+==>NULL +------+==>NULL 3189$� x x x 3190$� v v v 3191$� NULL NULL NULL 3192 3193If we still do not have a "next" item after ascending, we ascend again. 3194 3195$� <+++ 3196$� + 3197$� +-----+ 3198$� ,->| box |-->NULL 3199$� | +-----+<==# 3200$� | x ^ #===# 3201$� xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx + 3202$� x +++++++++++++++++++++++++++++++++++ 3203$� v + + | 3204$� +-----+ --> +-----+ | 3205$� ,->| box |-. 
,->| box |-' 3206$� | +-----+=|===========|=>+-----+==>NULL 3207$� | x ^ | | x ^ 3208$� xxxxxxxxxxxxxxxxxxxxxxxx + | xxxxxxxxxxxxx + 3209$� x +++++++++++++++++++++++ | x ++++++++++++ 3210$� v + + | | v + | 3211$�+------+ +------+ | | +------+ | <xx 3212$�| text |-->| text |-' `->| text |-' 3213$�+------+==>+------+==>NULL +------+==>NULL 3214$� x x x 3215$� v v v 3216$� NULL NULL NULL 3217 3218$� <+++ 3219$� + 3220$� +-----+ 3221$� --> ,->| box |-->NULL 3222$� | +-----+<==# 3223$� | x ^ #===# 3224$� xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx + 3225$� x +++++++++++++++++++++++++++++++++++ 3226$� v + + | 3227$� +-----+ +-----+ | 3228$� ,->| box |-. ,->| box |-' 3229$� | +-----+=|===========|=>+-----+==>NULL 3230$� | x ^ | | x ^ 3231$� xxxxxxxxxxxxxxxxxxxxxxxx + | xxxxxxxxxxxxx + 3232$� x +++++++++++++++++++++++ | x ++++++++++++ 3233$� v + + | | v + | 3234$�+------+ +------+ | | +------+ | <xx 3235$�| text |-->| text |-' `->| text |-' 3236$�+------+==>+------+==>NULL +------+==>NULL 3237$� x x x 3238$� v v v 3239$� NULL NULL NULL 3240 3241After all items have been processed, we'll ascend until we get back to the top 3242item, which's "next" pointer points back to itself; Thus we stay at the tree 3243top after following "next", and the main loop is terminated. 3244 3245$� <+++ 3246$� + 3247$� +-----+ 3248$� --> ,->| box |-->NULL 3249$� | +-----+<==# 3250$� | x ^ #===# 3251$� xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx + 3252$� x +++++++++++++++++++++++++++++++++++ 3253$� v + + | 3254$� +-----+ +-----+ | 3255$� ,->| box |-. 
,->| box |-' 3256$� | +-----+=|===========|=>+-----+==>NULL 3257$� | x ^ | | x ^ 3258$� xxxxxxxxxxxxxxxxxxxxxxxx + | xxxxxxxxxxxxx + 3259$� x +++++++++++++++++++++++ | x ++++++++++++ 3260$� v + + | | v + | 3261$�+------+ +------+ | | +------+ | <xx 3262$�| text |-->| text |-' `->| text |-' 3263$�+------+==>+------+==>NULL +------+==>NULL 3264$� x x x 3265$� v v v 3266$� NULL NULL NULL 3267 3268$=$$<h4>$$$_Assigning x-Coordinates$_$$</h4>$$ 3269 3270When processing an item, it is not the coordinates of "cur_item" that are assigned, but 3271those of its *sub-items*. "cur_item" has had its coordinates assigned 3272while its parent was processed. This is an exact reversal of $$<a 3273href="#calcWidth">$$$_calc_width()$_$$</a>$$$ : Instead of "cur_item" getting 3274its size from its sub-items, the sub-items get their size from "cur_item". 3275 3276The coordinates for the global item are assigned before calling assign_width(). 3277"x_start" is set to 0, "x_end" is set to the output width passed from main(). (The 3278screen width when cfg.term_width is set; otherwise, either a constant value (in 3279--dump mode) or the maximum of screen width and page width.) Play around with 3280them if you like. (If set to a narrow box, you can see the word breaking work.) 3281 3282$�################################################################################## 3283$� 3284$�#------+ # 3285$�#header| # 3286$� 3287$�#* # 3288$�#* # 3289$�[...] 3290$�#------------+ # 3291$�#parameters);| # 3292$� 3293$�#------+ # 3294$�#single| # 3295$� ^x_start=0 ^x_end=80 3296 3297For box (and form) items, the coordinates of all immediate children are simply 3298set to those of the box. (This will have to change at some point, but it's 3299ok for now...) It's done in a simple loop processing the children by 3300"first_child" and "next", just as in 3301$$<a$+href="#calcWidth">$$$_calc_width()$_$$</a>$$$ . 
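The loop over the immediate children described above could look about like this. Again just a sketch: "struct WidthItem" and assign_children_x() are made-up names, and the struct only contains the fields needed here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down item: only what is needed for this step. */
struct WidthItem {
    struct WidthItem *first_child;
    struct WidthItem *next;    /* next item at the same depth */
    int x_start;
    int x_end;
};

/*
 * Give all immediate children of "box" the x-coordinates of the box itself.
 * The coordinates of "box" must have been assigned already (while its own
 * parent was processed).
 */
void assign_children_x(const struct WidthItem *box)
{
    struct WidthItem *child;

    assert(box != NULL);

    for (child = box->first_child; child != NULL; child = child->next) {
        child->x_start = box->x_start;
        child->x_end = box->x_end;
    }
}
```

Only the immediate children are touched; the children of the children get their coordinates when the traversal arrives at their own parent.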
3302 3303$�################################################################################## 3304$� 3305$�+--------------------------------------------------------------------------------+ 3306$�|header text | 3307$� 3308$�********************************************************************************** 3309$�* * 3310$�[...] 3311$�+--------------------------------------------------------------------------------+ 3312$�|this very long second paragraph contains some special characters (including a | 3313$�|simple space...): &; <>"=/ plus a big gap###and two unicode escapes (decimal: � | 3314$�|and hexal: �) but also an anchor embedded inside a word (this anchor also is the| 3315$�|only tag with parameters); and finally a blank row | 3316$� 3317$�+--------------------------------------------------------------------------------+ 3318$�|(a single tag) | 3319 3320Blank items have no children. Nothing has to be done. The same goes for anchor items. 3321 3322Text items have no children either; but they need to be broken into lines. (This 3323could be done in a separate pass...) 3324 3325$#<a name="lineBreaking" id="lineBreaking"> 3326 3327$=$$<h4>$$$_Breaking String into Lines$_$$</h4>$$ 3328 3329The breaking into lines is a bit complicated, because netrik has the ability to 3330break words that do not fit on a single line. And: It does this smartly, 3331avoiding line breaks whenever possible. 3332 3333The line breaking does not alter the string structure itself. The only output 3334is "line_table[]" (containing the positions of all line breaks inside the 3335string), and its length (the number of line breaks), which is stored as the 3336height of the text item in "y_end". 3337 3338Before starting, some constants are calculated: The width of the text block is 3339the difference of x_end and x_start. The pointer to the start of the text 3340string is taken from the string structure and stored in "string_start". 3341 3342The text block is processed word-wise. 
In each iteration of the outer loop one 3343word is processed. (There are some exceptions, though, for extremely long words; 3344this will be explained in a moment.) 3345 3346First, "word_end" is retrieved. It points to the space, newline or string 3347terminator that ends the word to be processed. "next_word_start" is set to 3348"word_end+1", assuming that we will proceed with the next word (which starts 3349after the space); however, this may be altered later. 3350 3351$� line_table[0] 3352$� v 3353$�string: "I am a very simple and stupid example sentence." 3354$� ^ ^ ^ ^^ 3355$� s l w en 3356$�string_start line_start next_word_start 3357$� word_end 3358$� word_start 3359$�output: 3360$� x_start x_end 3361$� |<- width ->| 3362$� v 12 v 3363$�+------------+ 3364$�|I am a very | 3365$�|simple and stupid example sentence. 3366$� ^ ^ ^^ 3367$� l w en 3368 3369Then we test if the whole word fits into the current open line (whose 3370starting position is stored in "line_start"). 3371 3372$� v 3373$�string: "I am a very simple and stupid example sentence." 3374$� ^ ^ ^ ^^^ 3375$� s l w enb 3376$� line_start+width (line end/Break position) 3377$� |<- width ->| 3378$� (12) 3379$�+------------+ 3380$�|I am a very | 3381$�|simple and stupid example sentence. 3382$� ^ ^ ^^^ 3383$� l w enb 3384$� e<b -> no wrap 3385 3386If it does, we do nothing, and simply proceed with the next word. "word_start" 3387is set to "next_word_start" before the next iteration. 3388 3389$� v 3390$�string: "I am a very simple and stupid example sentence." 3391$� ^ ^ ^ 3392$� s l n 3393$� ^ 3394$� w 3395 3396Otherwise, we have to generate a line wrap, and put the word on a new line. 3397 3398$� v 3399$�string: "I am a very simple and stupid example sentence." 3400$� ^ ^ ^^ ^^ 3401$� s l wb en 3402$�+------------+ 3403$�|I am a very | 3404$�|simple and stupid example sentence. 
3405$� ^ ^^ ^^ 3406$� l wb en 3407$� e>=b -> wrap 3408 3409Normally, we simply wrap the line at the current word, by setting "line_start" 3410to "word_start", and adding a line break at this position to "line_table[]". 3411 3412$� line_table[1] 3413$� v v 3414$�string: "I am a very simple and stupid example sentence." 3415$� ^ ^ ^^ 3416$� s w en 3417$� ^ 3418$� l 3419$�+------------+ 3420$�|I am a very | 3421$�|simple and | 3422$�|stupid example sentence. 3423$� ^ ^^ 3424$� w en 3425$� ^ 3426$� l 3427 3428However, things are more complicated due to the ability of breaking too long 3429words. We have to decide whether we simply put the word on a new line as 3430described above, or if it's necessary/better to break it at the line end. And 3431this is the tricky part. 3432 3433We can't just break all words, as it would cause many really unnecessary 3434breaks, which is really ugly and hard to read. 3435 3436$�+------------+ 3437$�|Having aRea\| 3438$�|llyLongWord | 3439$�|followed by | 3440$�|anotherLong\| 3441$�|Word, I am | 3442$�|quite inter\| <-- 3443$�|esting. | 3444$�+------------+ 3445 3446Wrapping before every word that doesn't fit on the line end is also not 3447optimal: 3448 3449$�+------------+ 3450$�|Having | <-- 3451$�|aReallyLong\| 3452$�|Word | <-- 3453$�|followed by | 3454$�|anotherLong\| 3455$�|Word, I am | 3456$�|quite | 3457$�|interesting.| 3458$�+------------+ 3459 3460There is lots of space wasted, and it looks quite ugly. 3461 3462A better solution is putting the beginning of a word on the current line always 3463as long as this doesn't introduce an additional (unnecessary) word break: 3464 3465$�+------------+ 3466$�|Having aRea\| <-- 3467$�|llyLongWord | 3468$�|followed by | 3469$�|anotherLong\| 3470$�|Word, I am | 3471$�|quite | <-- 3472$�interesting. 
| 3473$�+------------+ 3474 3475(The first arrow shows a case where the beginning of the word was put on the 3476line end to better fill the space, the second one a case where this is not done 3477to avoid an unnecessary word break.) 3478 3479The problem here is to decide whether we can put the beginning of the word on 3480the line end without introducing an additional line break. For this, we 3481truncate the word, chopping off as many whole line widths ("width-1", because 3482of the break chars) from the end of the word as possible, and test if the 3483remaining part fits on the current line. 3484 3485$� v v v v 3486$�string: "I am a second example sentence with aVeryLongWord in me." 3487$� ^ ^ ^ ^ ^^ 3488$� l w t b en 3489$� trunc_word_end=word_end-(width-1) 3490$� |<- ->| 3491$� width-1 (11) 3492$�+------------+ 3493$�|I am a | 3494$�|second | 3495$�|example | 3496$�|sentence | 3497$�|with a\ | 3498$� ^ ^ ^ ^ 3499$� l w t b 3500$� t<=b -> fill line 3501$� 3502$�|VeryLongWord| (truncated part) 3503$� ^ 3504$� e 3505 3506Or: 3507 3508$� v v v v v 3509$�string: "I am a third example sentence, with anEvenABitLongerWord in me." 3510$� ^ ^ ^ ^ ^^ 3511$� l w b t en 3512$�+------------+ 3513$�|I am a third| 3514$�|example | 3515$�|sentence, | 3516$�|with anEvenAB\ 3517$� ^ ^ ^ ^ 3518$� l w b t 3519$� t>b -> put word on new line 3520$� 3521$�|itLongerWord| 3522$� ^ 3523$� e 3524 3525For a longer word, the truncation is done more than once: 3526 3527$� v v v 3528$�string: "I am another example sentence, with aTerriblyLongWordThatDoesn'tSeemToEnd in me." 3529$� ^ ^ ^ ^ ^ ^ ^^ 3530$� l w T b t t en 3531$� (final) 3532$�+------------+ 3533$�|I am another| 3534$�|example | 3535$�|sentence, | 3536$�|with aTe\ | 3537$� ^ ^ ^ ^ 3538$� l w T b t<=b -> fill line 3539$� 3540$�|rriblyLongW\| 3541$� ^ 3542$� t 3543$�|ordThatDoes\| 3544$� ^ 3545$� t 3546$�|n'tSeemToEnd in me. 
3547$� ^^ 3548$� en 3549 3550or: 3551 3552$� v 3553$�string: "This example has again aTerriblyLongWordThatDoesn'tSeemToEnd in some other sentence." 3554$� ^ ^ ^ ^ ^ ^ ^^ 3555$� l w b T t t en 3556$�+------------+ 3557$�|This example| 3558$�|has again aTe\ 3559$� ^ ^ ^ ^ 3560$� l w b T 3561$� t>b -> don't fill 3562 3563This is done in a loop. We also could do it by a modulo division, but 3564evaluating a quite complicated expression is probably less efficient than a 3565simple loop, which is entered only in exceptional cases anyways. (Words wider 3566than a line are not very common, after all...) Also, it is easier to 3567understand, isn't it?... 3568 3569Note that this only takes effect for words that actually need to be broken -- 3570words shorter than the line width will never be truncated; the remaining part 3571will still not fit on the line, so the word is just put on the new line as 3572described above. 3573 3574If we decide on the first way, we fill up the current line with the beginning 3575of the word. (The last char is kept empty for the word break indicator 3576displayed in the output.) The "line_start" is set accordingly, so that the rest 3577of the word will be put on the next line. 3578 3579$� v v v v v 3580$�string: "I am a second example sentence with aVeryLongWord in me." 3581$� ^ ^^ ^^ 3582$� w lb en 3583$�+------------+ 3584$�|I am a | 3585$�|second | 3586$�|example | 3587$�|sentence | 3588$�|with aVeryL\| 3589$�|ongWord in me. 3590$� ^ ^^ 3591$� l en 3592 3593If the second way was chosen, things are quite easy: The line wrap is inserted 3594at the beginning of the word. Again, this is the same case as for short words 3595that aren't broken at all -- there is no extra handling for this. 3596 3597$� v v v v v 3598$�string: "I am a third example sentence, with anEvenABitLongerWord in me." 
3599$� ^ ^^ 3600$� w en 3601$� ^ 3602$� l 3603$�+------------+ 3604$�|I am a third| 3605$�|example | 3606$�|sentence, | 3607$�|with | 3608$�|anEvenABitLongerWord in me." 3609$� ^ ^^ 3610$� l en 3611 3612Now we need to test if the remainder fits on the new line; if it does not, we 3613have to adjust "next_word_start" to make sure processing in the next loop 3614iteration will not continue with the next word, but with the part of the 3615current word that does not fit on the new line. 3616 3617$� v v v v v 3618$�string: "I am a third example sentence, with anEvenABitLongerWord in me." 3619$� ^ ^ ^^ 3620$� l b en 3621$�+------------+ 3622$�|I am a third| 3623$�|example | 3624$�|sentence, | 3625$�|with | 3626$�|anEvenABitLongerWord in me." 3627$� ^ ^ ^^ 3628$� l b en 3629$� e>=b -> more breaks necessary 3630 3631$� v v v v v 3632$�string: "I am a third example sentence, with anEvenABitLongerWord in me." 3633$� ^ ^^ � 3634$� l nb n 3635$� (old value) 3636$�+------------+ 3637$�|I am a third| 3638$�|example | 3639$�|sentence, | 3640$�|with | 3641$�|anEvenABitLongerWord in me." 3642$� ^ ^^ � 3643$� l nb n 3644 3645This way we will scan for the word end again in the next iteration, but 3646starting with the part of the word that does not fit on the new line. (This is 3647inefficient of course, but this case is quite rare, and it's not worth adding 3648special code for handling this.) 3649 3650$� v v v v v 3651$�string: "I am a third example sentence, with anEvenABitLongerWord in me." 3652$� ^ ^ ^^ 3653$� l w en 3654$�+------------+ 3655$�|I am a third| 3656$�|example | 3657$�|sentence, | 3658$�|with | 3659$�|anEvenABitLongerWord in me." 3660$� ^ ^^ ^^ 3661$� l wb en 3662 3663As the new line is already filled up with the previous word part, a line break 3664will always be inserted just in front of the remainder, and the remainder will 3665be put into another line. 3666 3667$� v v v v v 3668$�string: "I am a third example sentence, with anEvenABitLongerWord in me." 
3669$� ^ ^^ ^^ 3670$� l wb en 3671$�+------------+ 3672$�|I am a third| 3673$�|example | 3674$�|sentence, | 3675$�|with | 3676$�|anEvenABitLongerWord in me." 3677$� ^ ^^ ^^ 3678$� l wb en 3679$� e>b -> wrap 3680 3681$� v v v v v 3682$�string: "I am a third example sentence, with anEvenABitLongerWord in me." 3683$� ^� ^^ 3684$� wb en 3685$� ^ 3686$� l 3687$�+------------+ 3688$�|I am a third| 3689$�|example | 3690$�|sentence, | 3691$�|with | 3692$�|anEvenABitL\| 3693$�|ongerWord in me." 3694$� ^ ^^ 3695$� l en 3696 3697Of course it's also possible that the remainder still doesn't fit on a line, 3698and has to be broken again. 3699 3700$� v v v v 3701$�string: "I am another example sentence, with aTerriblyLongWordThatDoesn'tSeemToEnd in me." 3702$� ^ ^^ ^^ 3703$� l wb en 3704$�+------------+ 3705$�|I am another| 3706$�|example | 3707$�|sentence, | 3708$�|with aTerri\| 3709$�|blyLongWordThatDoesn'tSeemToEnd in me. 3710$� ^ ^^ ^^ 3711$� l wb en 3712$� e>b -> wrap 3713 3714$� v v v v 3715$�string: "I am another example sentence, with aTerriblyLongWordThatDoesn'tSeemToEnd in me." 3716$� ^ ^^ ^ ^^ 3717$� l wb t en 3718$�+------------+ 3719$�|I am another| 3720$�|example | 3721$�|sentence, | 3722$�|with aTerri\| 3723$�|blyLongWordThatDoes\| 3724$� ^ ^^ ^ 3725$� l wb t 3726$� t>b -> don't fill 3727 3728$� v v v v v 3729$�string: "I am another example sentence, with aTerriblyLongWordThatDoesn'tSeemToEnd in me." 3730$� ^ ^^ ^ 3731$� w nb e 3732$� ^ 3733$� l 3734$�+------------+ 3735$�|I am another| 3736$�|example | 3737$�|sentence, | 3738$�|with aTerri\| 3739$�|blyLongWord\| 3740$�|ThatDoesn'tSeemToEnd in me. 3741$� ^ ^^ ^ 3742$� l nb e 3743 3744This will be repeated, until the whole word is stored. 3745 3746$� v v v v v 3747$�string: "I am another example sentence, with aTerriblyLongWordThatDoesn'tSeemToEnd in me." 
3748$� ^ ^^ ^^ 3749$� l wb en 3750$� e>b -> wrap 3751 3752$� v v v v v v 3753$�string: "I am another example sentence, with aTerriblyLongWordThatDoesn'tSeemToEnd in me." 3754$� ^ ^^ 3755$� w en 3756$� ^ 3757$� l 3758$�+------------+ 3759$�|I am another| 3760$�|example | 3761$�|sentence, | 3762$�|with aTerri\| 3763$�|blyLongWord\| 3764$�|ThatDoesn't\| 3765$�|SeemToEnd in me. 3766$� ^ ^^ 3767$� l en 3768 3769Of course, a line break is always inserted in front of the word no matter if it 3770would fit on the old line, if the word follows a newline character. 3771 3772Implementing this without considerable bloat and slowdown requires a little 3773trick: We test for the newline after normally scanning for the word end. Now if 3774there is a newline, we just set "word_end" to "word_start+width" -- we just 3775pretend that the current word is exactly as long as the line, so it will always 3776be put on a new line. 3777 3778$� v v v v v 3779$�string: "I am a stupid example sentence with a newline/in me." 3780$� ^ ^ ^ ^^^ 3781$� s l w enb 3782$�+------------+ 3783$�|I am a | 3784$�[...] 3785$�|newline/in me. 3786$� ^ ^ ^^^ 3787$� l w enb 3788$� ^newline in front of word 3789 3790$� v v v v v 3791$�string: "I am a stupid example sentence with a newline/in me." 3792$� ^ ^ ^ ^^ ^ 3793$� s l w nb e 3794$�+------------+ 3795$�|I am a | 3796$�[...] 3797$�|newline/in me. 3798$� ^ ^ ^^ ^ 3799$� l w nb e 3800$� e>b -> wrap 3801 3802$�+------------+ 3803$�|I am a | 3804$�[...] 3805$�|newline/ | 3806$�|in me. 3807$� ^ ^ ^ 3808$� w n e 3809$� ^ 3810$� l 3811 3812Note that "next_word_start" is *not* modified; this is important to ensure that 3813processing in the next iteration will continue with the following word 3814normally! 3815 3816$#</a> <!--lineBreaking--> 3817$#</a> <!--assignWidth--> 3818 3819$=$$<h3>$$$_calc_ywidth()$_$$</h3>$$ 3820 3821The third pass is simple again. calc_ywidth() calculates minimal heights 3822(y-widths) for all items. 
3823 3824Presently this is complete overkill; assign_ywidth() doesn't really need this. 3825However, it will be necessary as soon as interesting elements are 3826implemented (tables) -- which hopefully won't be too long now... 3827 3828Like $$<a$+href="#calcWidth">$$$_calc_width()$_$$</a>$$$ , it traverses the tree 3829bottom to top, and for every item it stores the minimal height to "y_end". 3830 3831Blank lines always have the height of 1. Anchor items have no height. (They are virtual...) 3832 3833Text items have their height assigned already in 3834$$<a$+href="#assignWidth">$$$_assign_width()$_$$</a>$$$ , while $$<a 3835href="#lineBreaking">$$$_Breaking String into Lines$_$$</a>$$$ . 3836 3837$�################################################################################## 3838$� 3839$�+--------------------------------------------------------------------------------+___ 3840$�|header text |__0 3841$�+--------------------------------------------------------------------------------+ 1<--y_end 3842$� 3843$�**********************************************************************************___ 3844$�* *__0 3845$�********************************************************************************** 1<-- 3846$�[...] 
3847$�+--------------------------------------------------------------------------------+___ 3848$�|this very long second paragraph contains some special characters (including a | 0 3849$�|simple space...): &; <>"=/ plus a big gap###and two unicode escapes (decimal: � | 1 3850$�|and hexal: �) but also an anchor embedded inside a word (this anchor also is the| 2 3851$�|only tag with parameters); and finally a blank row |__3 3852$�+--------------------------------------------------------------------------------+ 4<-- 3853$� 3854$�+--------------------------------------------------------------------------------+___ 3855$�|(a single tag) |__0 3856$�+--------------------------------------------------------------------------------+ 1<-- 3857 3858The height of a box item is calculated by summing up the height of all its 3859sub-items. This is done by "first_child" and "next", just as seeking the widest 3860sub-item in $$<a$+href="#calcWidth">$$$_calc_width()$_$$</a>$$$ . 3861 3862$�################################################################################## 3863$� 3864$�+--------------------------------------------------------------------------------+ 3865$�|header text | 3866$�+--------------------------------------------------------------------------------+ <-- 1 3867$� 3868$�********************************************************************************** 3869$�* * 3870$�********************************************************************************** <-- +1 3871$�[...] [...] 
3872$�+--------------------------------------------------------------------------------+ 3873$�|this very long second paragraph contains some special characters (including a | 3874$�|simple space...): &; <>"=/ plus a big gap###and two unicode escapes (decimal: � | 3875$�|and hexal: �) but also an anchor embedded inside a word (this anchor also is the| 3876$�|only tag with parameters); and finally a blank row | 3877$�+--------------------------------------------------------------------------------+ <-- +4 3878$� 3879$�+--------------------------------------------------------------------------------+ 3880$�|(a single tag) | 3881$�+--------------------------------------------------------------------------------+ <-- +1 3882$� 3883$�################################################################################## <## 13 3884$� item_tree->y_end 3885 3886$#<a name="assignYwidth" id="assignYwidth"> 3887 3888$=$$<h3>$$$_assign_ywidth()$_$$</h3>$$ 3889 3890The fourth pass is also fairly simple. It assigns y-coordinates to all items. 3891There is no initialisation necessary before calling it, as the y_end of the 3892global item is already set by calc_ywidth(), and y_start is always 0. 3893 3894Like in $$<a$+href="#assignWidth">$$$_assign_width()$_$$</a>$$$ , the tree is 3895traversed top to bottom, and the sizes of all sub-items are assigned while 3896processing the parent. 3897 3898Thus, text and blank items do not need any processing -- they do not have any 3899children. 3900 3901For box items, the coordinates of all sub-items are assigned in a loop, like in 3902$$<a$+href="#assignWidth">$$$_assign_width()$_$$</a>$$$ . Every item is put 3903immediately after the previous one, i.e. it starts where the previous one ends. 3904We keep track of the current position by "y_pos". At the beginning it is 3905initialized to "y_start" of the box. 
3906 3907$�################################################################################## 3908$� 0 <## <== y_pos 3909$� item_tree->y_start 3910 3911For every item, y_start is set to the current y_pos. 3912 3913$� y_end y_start 3914$� | | 3915$�+--###############################################################################____ v v 3916$�|header text |___0 <-- 3917$�* *___1 <-- <** 3918$�|heading |___2 <** <---- 3919$�* *___3 <---- <**** 3920$�|first paragraph of text; includes multiple spaces and newlines, emphasized text | 4 <**** <-- 3921$�|and strong text |___5 3922$�* *___6 <-- <** 3923$�|starting with an evil center tag, |___7 <** <---- 3924$�|this very long second paragraph contains some special characters (including a | 8 <---- <-- <== 3925$�|simple space...): &; <>"=/ plus a big gap###and two unicode escapes (decimal: � | cur_item->y_start=y_pos=8 3926$�|and hexal: �) but also an anchor embedded inside a word (this anchor also is the| 3927$�|only tag with parameters); and finally a blank row | 3928$�+--------------------------------------------------------------------------------+ <-- cur_item->y_end=4 3929$� 3930$�+--------------------------------------------------------------------------------+ 3931$�|(a single tag) | 3932$�+--------------------------------------------------------------------------------+ <---- 1 3933$� 3934$�################################################################################## <## 13 3935 3936"y_end" of an item is determined by adding the y-size (stored in "y_end" up to 3937now) to "y_start" of the item. 
3938 3939$�+--###############################################################################____ 3940$�|header text |___0 <-- 3941$�* *___1 <-- <** 3942$�|heading |___2 <** <---- 3943$�* *___3 <---- <**** 3944$�|first paragraph of text; includes multiple spaces and newlines, emphasized text | 4 <**** <-- 3945$�|and strong text |___5 3946$�* *___6 <-- <** 3947$�|starting with an evil center tag, |___7 <** <---- 3948$�|this very long second paragraph contains some special characters (including a | 8 <---- <-- <== 3949$�|simple space...): &; <>"=/ plus a big gap###and two unicode escapes (decimal: � | 9 3950$�|and hexal: �) but also an anchor embedded inside a word (this anchor also is the| 10 3951$�|only tag with parameters); and finally a blank row |__11 3952$�+--------------------------------------------------------------------------------+ 12 <-- cur_item->y_end=cur_item->ystart+cur_item->y_end=12 3953$� 3954$�+--------------------------------------------------------------------------------+ 3955$�|(a single tag) | 3956$�+--------------------------------------------------------------------------------+ <---- 1 3957$� 3958$�################################################################################## <## 13 3959 3960"y_pos" is adjusted to the end of the item. 
3961 3962$�+--###############################################################################____ 3963$�|header text |___0 <-- 3964$�* *___1 <-- <** 3965$�|heading |___2 <** <---- 3966$�* *___3 <---- <**** 3967$�|first paragraph of text; includes multiple spaces and newlines, emphasized text | 4 <**** <-- 3968$�|and strong text |___5 3969$�* *___6 <-- <** 3970$�|starting with an evil center tag, |___7 <** <---- 3971$�|this very long second paragraph contains some special characters (including a | 8 <---- <-- 3972$�|simple space...): &; <>"=/ plus a big gap###and two unicode escapes (decimal: � | 9 3973$�|and hexal: �) but also an anchor embedded inside a word (this anchor also is the| 10 3974$�|only tag with parameters); and finally a blank row |__11 3975$�+--------------------------------------------------------------------------------+ 12 <-- <== 3976$� 3977$�+--------------------------------------------------------------------------------+ 3978$�|(a single tag) | 3979$�+--------------------------------------------------------------------------------+<---- 1 3980$� 3981$�################################################################################## <## 13 3982 3983In the next iteration, "y_pos" -- which now points to the end of the current 3984item -- is used as the beginning of the new item. 
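The loop described in the steps above can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not netrik's actual code: the struct is a hypothetical, stripped-down stand-in for the real item structure, the name assign_ywidth_sketch() is made up, and using recursion to descend into nested boxes is an assumption about how the top-to-bottom traversal could be done.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, minimal item; the real netrik item has many more fields. */
struct Item {
    struct Item *first_child;  /* first sub-item, or NULL for leaf items */
    struct Item *next;         /* next sibling, or NULL */
    int y_start;               /* first page line of the item */
    int y_end;                 /* on entry: height from calc_ywidth();
                                  on exit: page line where the item ends */
};

/*
 * Assign y coordinates to all sub-items of "box".
 * box->y_start and box->y_end must already hold final page coordinates
 * (for the global item: y_start 0, y_end as set by the height pass).
 */
static void assign_ywidth_sketch(struct Item *box)
{
    int y_pos;
    struct Item *cur;

    assert(box != NULL);
    y_pos = box->y_start;               /* start at the top of the box */

    for (cur = box->first_child; cur != NULL; cur = cur->next) {
        cur->y_start = y_pos;           /* item starts where the previous ended */
        cur->y_end += cur->y_start;     /* turn stored height into end position */
        y_pos = cur->y_end;             /* next item follows immediately */

        assign_ywidth_sketch(cur);      /* no-op for items without children */
    }
}
```

With two children of heights 1 and 4 in a box starting at line 0, the first child ends up covering lines 0..1 and the second 1..5, matching the "starts where the previous one ends" rule.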
3985 3986$�+--###############################################################################____ 3987$�|header text |___0 <-- 3988$�* *___1 <-- <** 3989$�|heading |___2 <** <---- 3990$�* *___3 <---- <**** 3991$�|first paragraph of text; includes multiple spaces and newlines, emphasized text | 4 <**** <-- 3992$�|and strong text |___5 3993$�* *___6 <-- <** 3994$�|starting with an evil center tag, |___7 <** <---- 3995$�|this very long second paragraph contains some special characters (including a | 8 <---- <-- 3996$�|simple space...): &; <>"=/ plus a big gap###and two unicode escapes (decimal: � | 9 3997$�|and hexal: �) but also an anchor embedded inside a word (this anchor also is the| 10 3998$�|only tag with parameters); and finally a blank row |__11 3999$�|(a single tag) | 12 <-- <---- <== 4000$�+--------------------------------------------------------------------------------+<---- 1 4001$� 4002$�################################################################################## <## 13 4003 4004$#<a name="linkCoords" id="linkCoords"> 4005 4006$=$$<h4>$$$_Link Coordinates$_$$</h4>$$ 4007 4008Coordinates of links and anchors (both x and y!) are presently also assigned 4009here. Probably it would be a better idea to do that in extra pass... As soon as 4010the current link implementation is dropped and anchors are cleanly implemented, 4011the assignment can be cleanly and logically split between assign_xwidth() and 4012assign_ywidth(), just as for other item types. 4013 4014Link coordinates are assigned by the parent text item. The text block is 4015scanned, one link after the other, for the lines containing the link start and 4016end. (The first line that ends after the link start, and the first line that 4017ends at/after the link end.) As an optimization, the search for the next link 4018is started in the line where the previous ended, not from beginning -- links 4019can't be nested. 
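The line search for links can be illustrated with a small C sketch. Everything here is hypothetical: the LinkSpan struct, the line_end[] array (string position where each line ends, which is not necessarily how netrik's "line_table[]" is laid out) and the function name are assumptions made only for this example. The sketch relies on links being sorted by position and not nested, so the scan can resume at the line where the previous link ended.

```c
#include <assert.h>

/* Hypothetical link descriptor: string positions in, line numbers out. */
struct LinkSpan {
    int start;    /* position of the first character of the link */
    int end;      /* position just behind the last character of the link */
    int y_start;  /* result: line (relative to the text block) of the link start */
    int y_end;    /* result: line (relative to the text block) of the link end */
};

/*
 * line_end[i] is assumed to be the string position where line i ends
 * (exclusive). Links must be sorted by position and must not overlap.
 */
static void find_link_lines(const int *line_end, int line_count,
                            struct LinkSpan *links, int link_count)
{
    int line = 0;   /* kept across links: resume where the previous one ended */
    int i;

    assert(line_count > 0);

    for (i = 0; i < link_count; ++i) {
        /* first line that ends after the link start */
        while (line < line_count - 1 && line_end[line] <= links[i].start)
            ++line;
        links[i].y_start = line;

        /* first line that ends at/after the link end */
        while (line < line_count - 1 && line_end[line] < links[i].end)
            ++line;
        links[i].y_end = line;
    }
}
```

Because "line" is never reset, the total work is proportional to the number of lines plus the number of links, rather than their product.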
4020 4021Once the line is known (and thus also the y coordinate), the x coordinate is 4022calculated by adding the link's relative position inside the line 4023(link[].start-line_start) to the x coordinate of the line start. (Which is 4024equal to the item start for normal text, but has to be calculated separately 4025for every line in centered text items.) 4026 4027Inline anchors are very similar; however, they get their coordinates while 4028processing the anchor item, not the text item containing them. (The text item 4029doesn't know anything about the anchors.) Anchors can be nested if created by 4030<span> or so; thus the above optimization isn't possible. (It would be harder 4031anyhow, due to the anchors being processed every one on its own.) 4032 4033Block anchors need different processing, of course: If they are empty, they keep 4034the coordinates assigned to them by the parent box; otherwise, all (immediate) 4035virtual children are scanned for the minimum/maximum for each of the four 4036coordinates; these are assigned as the coordinates of the anchor virtual box. 4037 4038$#</a> <!-- linkCoords --> 4039 4040$#</a> <!-- assignYwidth --> 4041 4042$#<a name="createMap" id="createMap"> 4043 4044$=$$<h3>$$$_create_map()$_$$</h3>$$ 4045 4046$#<a name="pageMap" id="pageMap"> 4047 4048The last sub-pass generates the "page_map[]". (This could also be done inside 4049assign_ywidth()...) This map is necessary to quickly determine which elements 4050show up in the visible area of the page, when it is displayed in the viewer. 4051 4052The page usage map stores references to all items that show up in any given 4053line of the output page. This is a very simple approach, but it should be 4054perfectly sufficient as long as netrik has no full graphic mode -- and this 4055will probably be the case for quite a while... 4056 4057"page_map[]" is an array containing an "Item_list" structure for every line of 4058the output page. This structure is declared in items.h. 
It contains a count of 4059items in this line, and an array of pointers to the items. 4060 4061$#</a> <!--pageMap--> 4062 4063The whole line map is allocated at the beginning. Afterwards, all items are 4064processed in a loop. For every item that is visible on the screen (presently 4065this are only text items), a reference to this item is stored to every line of 4066"page_map[]" between "y_start" and "y_end" of the item. 4067 4068$� line|page_map[line] 4069$�+--###############################################################################__ ----+-------------- 4070$�|header text |__ 0 | t0 4071$�* *__ 1 | - 4072$�|heading |__ 2 | t1 4073$�* *______3_|_-____ 4074$�|first paragraph of text; includes multiple spaces and newlines, emphasized text | 4 | t2 <-- cur_item->y_start 4075$�|and strong text |______5_|_t2___ 4076$�* *__ 6 | - <-- cur_item->y_end 4077$�|starting with an evil center tag, |__ 7 | - 4078$�|this very long second paragraph contains some special characters (including a | 8 | - 4079$�|simple space...): &; <>"=/ plus a big gap###and two unicode escapes (decimal: � | 9 | - 4080$�|and hexal: �) but also an anchor embedded inside a word (this anchor also is the| 10 | - 4081$�|only tag with parameters); and finally a blank row |__ 1 | - 4082$�|(a single tag) |__ 12 | - 4083$�###############################################################################--+ 4084 4085"page_map[]" is returned to pre_render(), and from there to main(), where it is 4086passed to $$<a$+href="#dump">$$$_dump()$_$$</a>$$ or to 4087$$<a$+href="#render">$$$_render()$_$$</a>$$ (via 4088$$<a$+href="hacking-pager.html#display">$$$_display()$_$$</a>$$$ , see 4089$$<a$+href="hacking-pager.html">$$hacking-pager.*$$</a>$$) along with "item_tree". 4090 4091$#</a> <!--createMap--> 4092 4093$#<a name="freeMap" id="freeMap"> 4094 4095$=$$<h3>$$$_free_map()$_$$</h3>$$ 4096 4097This function frees the memory allocated for the page map. 
4098 4099First it goes through the table line by line, and frees the associated "list" 4100for each one. Afterwards, it frees the table itself. 4101 4102$#</a> <!--freeMap--> 4103 4104$#</a> <!--preRender--> 4105 4106$#<a name="renderC" id="renderC"> 4107 4108$=$$<h2>$$$_8. render.c$_$$</h2>$$ 4109 4110With the item tree and the page usage map prepared in pre_render(), we can now 4111actually render the page. There are two different rendering functions: 4112 4113dump() renders the whole page and dumps the output to the terminal. The output 4114is layouted correctly, using all the coordinates and text attributes. 4115 4116render() renders only a specified area of the page, and outputs it to the 4117curses screen. 4118 4119$#<a name="dump" id="dump"> 4120 4121$=$$<h3>$$$_dump()$_$$</h3>$$ 4122 4123The page is rendered line by line, using 4124"$$<a$+href="#pageMap">$$page_map[]$$</a>$$" (see 4125$$<a$+href="#createMap">$$$_create_map()$_$$</a>$$ above) to determine which 4126items we need to render in every line. Of course this isn't really necessary, 4127as we do not ever have more than one item in a line presently. However, using 4128"page_map[]" here is *not* overkill, just for a change ;-) On the contrary, 4129this is a pragmatic approach. Dumping the page item by item without using 4130"page_map[]" would be more efficient; however, it would also be more 4131complicated than dumping line by line. 4132 4133In each line, we process all items (from "page_map[line]") one after the other. 4134If there were actually more than one item (which is impossible presently...), 4135they would be printed one after the other -- with disastrous results... No 4136code is implemented for really handling this situation yet. 4137 4138The first action for each item is setting the cursor position to the 4139beginning of the text of this line. (Retrieved by 4140$$<a$+href="#linePos">$$$_line_pos()$_$$</a>$$ and stored in "x_start".) 
This 4141we do by advancing as many character positions as necessary, printing 4142that many space characters. 4143 4144Now we can output the text itself. To know what is actually to be printed in 4145this line, the start and end positions of this line's text inside the string 4146(text block) are retrieved with $$<a$+href="#lineStartEnd">$$$_line_start() and 4147line_end()$_$$</a>$$$ . (And stored in "text_start" and "text_end".) 4148 4149With these, we print all attribute divisions in a loop. But first we have to 4150find the first one that shows up in the current line. 4151 4152$� div[0].end ... div[3].end 4153$� v v v v 4154$�text: "first[...]multiple spaces and newlines, emphasized text and strong text" 4155$� | ^ | ^ 4156$� text_start text_end 4157 4158In every iteration we print all the text between "div_start" and "div_end", 4159which normally point to the start and end of the current division. For the 4160first iteration, "div_start" is set to "text_start" -- we only want the part of 4161the div that actually shows up in the line. 4162 4163$� v v v v 4164$�text: "first[...]multiple spaces and newlines, emphasized text and strong text" 4165$� ^ ^ ^ 4166$� div_start=text_start div_end=div[0].end text_end 4167$� 4168$�output: 4169$�first paragraph of text; contains multiple 4170$�spaces and newlines, <-- line 4171 4172The next div starts where the current one ends. 4173 4174$� v v v v 4175$�text: "first[...]multiple spaces and newlines, emphasized text and strong text" 4176$� ^ ^ ^ ^ 4177$� text_start div_start div_end text_end 4178 4179The last division is truncated to "text_end" -- as with the first one, we only 4180want the part that shows up in the current line. 
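The clipping of divisions to the current line can be sketched as follows. This is a simplified illustration, not the code from render.c: the Div and Segment structs, the "attr" field and the function name clip_divs() are assumptions, and instead of printing, the sketch only records which part of the string would be printed with which attribute.

```c
#include <assert.h>

/* Hypothetical attribute division: covers the string up to "end" (exclusive);
 * it starts where the previous division ends (position 0 for the first). */
struct Div {
    int end;
    int attr;   /* placeholder for colour/attribute information */
};

/* One piece of output: string range [start, end) printed with "attr". */
struct Segment {
    int start;
    int end;
    int attr;
};

/*
 * Collect the parts of all divisions that fall into [text_start, text_end).
 * Returns the number of segments written to "out" (at most out_max).
 */
static int clip_divs(const struct Div *div, int div_count,
                     int text_start, int text_end,
                     struct Segment *out, int out_max)
{
    int n = 0;
    int i = 0;
    int div_start = text_start;   /* first division is cut at the line start */

    /* find the first division that shows up in the current line */
    while (i < div_count && div[i].end <= text_start)
        ++i;

    for (; i < div_count && div_start < text_end && n < out_max; ++i) {
        /* last division is truncated to the line end */
        int div_end = div[i].end < text_end ? div[i].end : text_end;

        out[n].start = div_start;
        out[n].end = div_end;
        out[n].attr = div[i].attr;
        ++n;

        div_start = div_end;      /* next div starts where this one ends */
    }
    return n;
}
```

A real renderer would replace the "out" array with calls that set the text attribute and print the characters of each range.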
4181 4182$� v v v v 4183$�text: "first[...]multiple spaces and newlines, emphasized text and strong text" 4184$� ^ ^ ^ 4185$� div_start div_end 4186$� =text_end 4187$� 4188$�output: 4189$�first paragraph of text; contains multiple 4190$�spaces and newlines, emphasized text and strong 4191 4192Before actually printing the text, it is copied into a temporary string, but 4193replacing all non-breaking space ('\xa0') characters by normal spaces. (There are various problems 4194resulting from putting real '\xa0' characters on the screen: For one, with 4195fonts where this char isn't really blank (always the case if the charset isn't 4196iso-8859-x) it's unusable. Also, copying via screen/GPM/X clipboard usually has 4197undesired results.) 4198 4199Finally, we test if the line ends with a word break, and print the break 4200character if it does. We know it does when the character at the line end is a 4201word character (not a space, newline, or string end), because every word end is 4202followed by the space separating it from the next word; if there is no space at 4203the line end, we are inside a wrapped word. 4204 4205$� v v 4206$�text "Some sentence containing aVeryLongAndThusBrokenWord." 4207$� ^ 4208$�output: 4209$�Some sentence 4210$�containing aVeryLongAn\ 4211$�... 4212$�dThusBrokenWord. 4213 4214$#</a> <!--dump--> 4215 4216$#<a name="render" id="render"> 4217 4218$=$$<h3>$$$_render()$_$$</h3>$$ 4219 4220render() works similarly to $$<a$+href="#dump">$$$_dump()$_$$</a>$$$ . The main difference is how the output is 4221printed. However, there are also a couple of differences in screen position handling 4222and other calculations. 4223 4224render() takes the starting position of the rendered area inside the page, the 4225starting position on the screen, and the size of the rendered area as 4226arguments. 4227 4228The area is processed line by line, and every line is processed item by item, 4229just like in dump(). 
4230 4231"x_start" describes the starting column relative to the beginning of the 4232rendered area, not the screen. (dump() always dumps whole lines, and thus there 4233is no difference.) If it turns out that the line starts before the rendered 4234area, it has to be truncated. The ending position "x_end" is calculated in a 4235similar fashion. 4236 4237Before the line is printed (div by div, as in dump()), the cursor is set to the 4238start position of the line. The column is the starting position relative to the 4239area ("x_start"), plus the starting position of the area on the screen; 4240likewise the row. 4241 4242Before anything is printed, we test if some part of the line shows up inside 4243the rendered area at all. The word break indicator is also printed only if the 4244line ends inside the area. 4245 4246If render() is called with the "overpaint" flag, the requested area is cleared 4247before rendering anything, so any garbage will be removed. (Areas not 4248containing any text aren't affected otherwise.) This is done by overwriting the 4249desired part of each line with a string of spaces. 4250 4251$#</a> <!--render--> 4252 4253$#<a name="dumpItems" id="dumpItems"> 4254 4255$=$$<h3>$$$_dump_items()$_$$</h3>$$ 4256 4257dump_items() dumps the item tree, including the text of text items. The text is 4258printed with correct attributes, but ignoring any coordinates and line breaks. 4259This function is for debugging purposes, and may be called anywhere inside or 4260after pre_render() (anywhere after parse_struct()). 4261 4262The reason this function is in render.c is that it needs the same screen 4263handling functions as $$<a$+href="#dump">$$$_dump()$_$$</a>$$$ . Moreover, it 4264works in a very similar fashion. 4265 4266The difference is that it does not dump line by line, but item by item 4267(traversing the tree top to bottom). 
 After printing some information about the 4268item itself, it dumps the text division by division in the same way dump() 4269does, only it doesn't need to care about positions or line breaks; it dumps the 4270whole string at once. 4271 4272$#</a> <!--dumpItems--> 4273 4274$#</a> <!--renderC--> 4275 4276$#<a name="itemsC" id="itemsC"> 4277 4278$=$$<h2>$$$_10. items.c$_$$</h2>$$ 4279 4280items.c contains a few simple helper functions intended to simplify retrieving 4281some common data from the structure (item) tree. 4282 4283The advantage of using such helper functions, even if they are really simple, 4284is less code duplication -- which improves maintainability and probably also makes 4285the code easier to understand. (This is actually an approach toward 4286so-called object oriented programming...) On the other hand, these functions 4287are extremely inefficient, as they calculate intermediate values which could be 4288shared, and the calling overhead itself is fairly big for such simple 4289functions. (Effectively probably increasing code size.) Maybe we should try to 4290define them as macros, or just put them into an include file so they can be 4291inlined during optimized compilation. (Let the compiler decide...) 4292 4293$#<a name="lineStartEnd" id="lineStartEnd"> 4294 4295$=$$<h3>$$$_line_start() and line_end()$_$$</h3>$$ 4296 4297line_start() and line_end() are used to find out at which position some 4298specific text line starts/ends inside the string of a (wrapped) text block. 4299 4300The positions are read from the line_table[] (see 4301$$<a$+href="#lineBreaking">$$$_Breaking String into Lines$_$$</a>$$). This is 4302normally trivial, but there are exceptions for the first and last lines. 4303 4304line_end() has an additional quirk: If the line ends with a blank, the line end 4305position is decremented -- when a line wraps at a blank, this blank is always 4306discarded.
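
The first/last line exceptions and the blank-trimming rule described above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: "struct sketch_block", its "breaks" array, and the sketch_line_start()/sketch_line_end() names are hypothetical, lines are numbered relative to the block rather than the page, and none of this mirrors netrik's real line_table[] layout.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical simplified model (not netrik's actual structures):
 * breaks[i] is the string offset at which block line i+1 starts,
 * so a block with n_breaks entries has n_breaks+1 lines. */
struct sketch_block {
    const char *text;    /* the whole (unwrapped) string */
    const int  *breaks;  /* offsets where wrapped lines begin */
    int         n_breaks;
};

/* Offset of the first character of the given line. */
static int sketch_line_start(const struct sketch_block *b, int line)
{
    assert(line >= 0 && line <= b->n_breaks);
    /* Exception: the first line has no table entry; it starts at 0. */
    return line == 0 ? 0 : b->breaks[line - 1];
}

/* Offset just past the last printed character of the given line. */
static int sketch_line_end(const struct sketch_block *b, int line)
{
    int end;

    assert(line >= 0 && line <= b->n_breaks);
    /* Exception: the last line ends at the end of the string. */
    end = (line == b->n_breaks) ? (int)strlen(b->text) : b->breaks[line];

    /* A blank at which the line was wrapped is never printed. */
    if (end > sketch_line_start(b, line) && b->text[end - 1] == ' ')
        --end;
    return end;
}
```

With the text "Some sentence containing" and a single break at offset 14, the first line in this model runs from 0 to 13 (the blank after "sentence" is dropped) and the second from 14 to 24.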
4307 4308Note that the line number given to these functions is *not* the line number 4309relative to the start of the text block, but the page line. 4310 4311$#</a> <!-- lineStartEnd --> 4312 4313$#<a name="linePos" id="linePos"> 4314 4315$=$$<h3>$$$_line_pos()$_$$</h3>$$ 4316 4317The horizontal page position of a single text line can be retrieved using 4318line_pos(). This is useful because of centered text items, where the line 4319starts do not equal the text block start, and differ from line to line. 4320 4321The line number argument is in page coordinates just as in 4322$$<a$+href="#lineStartEnd">$$$_line_start() and line_end()$_$$</a>$$ above. 4323 4324$#</a> <!-- linePos --> 4325 4326$#</a> <!-- itemsC --> 4327 4328$#</body> 4329$#</html> 4330