.TH HTML 3
.SH NAME
parsehtml,
printitems,
validitems,
freeitems,
freedocinfo,
dimenkind,
dimenspec,
targetid,
targetname,
fromStr,
toStr
\- HTML parser
.SH SYNOPSIS
.nf
.PP
.ft L
#include <u.h>
#include <libc.h>
#include <html.h>
.ft P
.PP
.ta \w'\fLToken* 'u
.B
Item*	parsehtml(uchar* data, int datalen, Rune* src, int mtype,
.B
	int chset, Docinfo** pdi)
.PP
.B
void	printitems(Item* items, char* msg)
.PP
.B
int	validitems(Item* items)
.PP
.B
void	freeitems(Item* items)
.PP
.B
void	freedocinfo(Docinfo* d)
.PP
.B
int	dimenkind(Dimen d)
.PP
.B
int	dimenspec(Dimen d)
.PP
.B
int	targetid(Rune* s)
.PP
.B
Rune*	targetname(int targid)
.PP
.B
uchar*	fromStr(Rune* buf, int n, int chset)
.PP
.B
Rune*	toStr(uchar* buf, int n, int chset)
.SH DESCRIPTION
.PP
This library implements a parser for HTML 4.0 documents.
The parsed HTML is converted into an intermediate representation that
describes how the formatted HTML should be laid out.
.PP
.I Parsehtml
parses an entire HTML document contained in the buffer
.I data
and having length
.IR datalen .
The URL of the document should be passed in as
.IR src .
.I Mtype
is the media type of the document, which should be either
.B TextHtml
or
.BR TextPlain .
The character set of the document is described in
.IR chset ,
which can be one of
.BR US_Ascii ,
.BR ISO_8859_1 ,
.B UTF_8
or
.BR Unicode .
The return value is a linked list of
.B Item
structures, described in detail below.
As a side effect,
.BI * pdi
is set to point to a newly created
.B Docinfo
structure, containing information pertaining to the entire document.
.PP
The library expects two allocation routines to be provided by the
caller,
.B emalloc
and
.BR erealloc .
These routines are analogous to the standard malloc and realloc routines,
except that they should not return if the memory allocation fails.
In addition,
.B emalloc
is required to zero the memory.
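.PP
A minimal sketch of the two routines follows, written in portable C
rather than Plan 9 C.
The signatures are an assumption, since this page does not give them;
check
.B html.h
for the prototypes the library actually expects.
.PP
.EX
```c
#include <stdio.h>
#include <stdlib.h>

/* Assumed signature: allocate n zeroed bytes; never returns on failure. */
void*
emalloc(unsigned long n)
{
	void *p;

	p = calloc(1, n ? n : 1);	/* calloc zeroes the memory */
	if(p == NULL){
		fputs("emalloc: out of memory", stderr);
		exit(1);
	}
	return p;
}

/* Assumed signature: resize p to n bytes; never returns on failure. */
void*
erealloc(void *p, unsigned long n)
{
	p = realloc(p, n ? n : 1);
	if(p == NULL){
		fputs("erealloc: out of memory", stderr);
		exit(1);
	}
	return p;
}
```
.EE
.PP
Exiting is only one way of not returning; a long-running caller
might instead unwind to a recovery point of its own.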
.PP
For debugging purposes,
.I printitems
may be called to display the contents of an item list; individual items may
be printed using the
.B %I
print verb, installed on the first call to
.IR parsehtml .
.I Validitems
traverses the item list, checking that all of the pointers are valid.
It returns
.B 1
if everything is ok, and
.B 0
if an error was found.
Normally, one would not call these routines directly.
Instead, one sets the global variable
.I dbgbuild
and the library calls them automatically.
One can also set
.IR warn ,
to cause the library to print a warning whenever it finds a problem with the
input document, and
.IR dbglex ,
to print debugging information in the lexer.
.PP
When an item list is finished with, it should be freed with
.IR freeitems .
Then,
.I freedocinfo
should be called on the pointer returned in
.BI * pdi\f1.
.PP
.I Dimenkind
and
.I dimenspec
are provided to interpret the
.B Dimen
type, as described in the section
.IR "Dimension Specifications" .
.PP
Frame target names are mapped to integer ids via a global, permanent mapping.
To find the value for a given name, call
.IR targetid ,
which allocates a new id if the name hasn't been seen before.
The name of a given, known id may be retrieved using
.IR targetname .
The library predefines
.BR FTtop ,
.BR FTself ,
.B FTparent
and
.BR FTblank .
.PP
The library handles all text as Unicode strings (type
.BR Rune* ).
Character set conversion is provided by
.I fromStr
and
.IR toStr .
.I FromStr
takes
.I n
Unicode characters from
.I buf
and converts them to the character set described by
.IR chset .
.I ToStr
takes
.I n
bytes from
.IR buf ,
interpreted as belonging to character set
.IR chset ,
and converts them to a Unicode string.
Both routines null-terminate the result, and use
.B emalloc
to allocate space for it.
.SS Items
The return value of
.I parsehtml
is a linked list of variant structures,
with the generic portion described by the following definition:
.PP
.EX
.ta 6n +\w'Genattr* 'u
typedef struct Item Item;
struct Item
{
	Item*	next;
	int	width;
	int	height;
	int	ascent;
	int	anchorid;
	int	state;
	Genattr*	genattr;
	int	tag;
};
.EE
.PP
The field
.B next
points to the successor in the linked list of items, while
.BR width ,
.BR height ,
and
.B ascent
are intended for use by the caller as part of the layout process.
.BR Anchorid ,
if non-zero, gives the integer id assigned by the parser to the anchor that
this item is in (see section
.IR Anchors ).
.B State
is a collection of flags and values described as follows:
.PP
.EX
.ta 6n +\w'IFindentshift = 'u
enum
{
	IFbrk =	0x80000000,
	IFbrksp =	0x40000000,
	IFnobrk =	0x20000000,
	IFcleft =	0x10000000,
	IFcright =	0x08000000,
	IFwrap =	0x04000000,
	IFhang =	0x02000000,
	IFrjust =	0x01000000,
	IFcjust =	0x00800000,
	IFsmap =	0x00400000,
	IFindentshift =	8,
	IFindentmask =	(255<<IFindentshift),
	IFhangmask =	255
};
.EE
.PP
.B IFbrk
is set if a break is to be forced before placing this item.
.B IFbrksp
is set if a one-line space should be added to the break (in which case
.B IFbrk
is also set).
.B IFnobrk
is set if a break is not permitted before the item.
.B IFcleft
is set if left floats should be cleared (that is, if the list of pending left floats should be placed)
before this item is placed, and
.B IFcright
is set for right floats.
In both cases,
.B IFbrk
is also set.
.B IFwrap
is set if the line containing this item is allowed to wrap.
.B IFhang
is set if this item hangs into the left indent.
.B IFrjust
is set if the line containing this item should be right justified,
and
.B IFcjust
is set for center justified lines.
.B IFsmap
is used to indicate that an image is a server-side map.
The low 8 bits, represented by
.BR IFhangmask ,
indicate the current hang into left indent, in tenths of a tab stop.
The next 8 bits, represented by
.B IFindentmask
and
.BR IFindentshift ,
indicate the current indent in tab stops.
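.PP
The packed fields of
.B state
can be unpacked with a few masks and shifts.
The sketch below is illustrative only: the helper names
.BR stateindent ,
.BR statehang ,
.B setindent
and
.B breakkind
are hypothetical and are not part of the library, and the flag values are
restated as unsigned constants because
.B 0x80000000
does not fit in a signed int in portable C.
.PP
.EX
```c
/* Values copied from the enum above, as unsigned constants. */
#define IFbrk		0x80000000u
#define IFbrksp		0x40000000u
#define IFindentshift	8
#define IFindentmask	(255u<<IFindentshift)
#define IFhangmask	255u

/* Current indent, in tab stops. */
unsigned
stateindent(unsigned state)
{
	return (state & IFindentmask) >> IFindentshift;
}

/* Current hang into the left indent, in tenths of a tab stop. */
unsigned
statehang(unsigned state)
{
	return state & IFhangmask;
}

/* Return state with its indent field replaced by indent (0-255). */
unsigned
setindent(unsigned state, unsigned indent)
{
	return (state & ~IFindentmask) | ((indent & 255u) << IFindentshift);
}

/* 0: no forced break, 1: break, 2: break plus one line of space. */
int
breakkind(unsigned state)
{
	if(!(state & IFbrk))
		return 0;
	return (state & IFbrksp) ? 2 : 1;
}
```
.EE
.PP
Since the
.B state
field is declared as an int, a caller would cast it to unsigned before
passing it to helpers like these.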
.PP
The field
.B genattr
is an optional pointer to an auxiliary structure, described in the section
.IR "Generic Attributes" .
.PP
Finally,
.B tag
describes which variant type this item has.
It can have one of the values
.BR Itexttag ,
.BR Iruletag ,
.BR Iimagetag ,
.BR Iformfieldtag ,
.BR Itabletag ,
.B Ifloattag
or
.BR Ispacertag .
For each of these values, there is an additional structure defined, which
includes Item as an unnamed initial substructure, and then defines additional
fields.
.PP
Items of type
.B Itexttag
represent a piece of text, using the following structure:
.PP
.EX
.ta 6n +\w'Rune* 'u
struct Itext
{
	Item;
	Rune*	s;
	int	fnt;
	int	fg;
	uchar	voff;
	uchar	ul;
};
.EE
.PP
Here
.B s
is a null-terminated Unicode string of the actual characters making up this text item,
.B fnt
is the font number (described in the section
.IR "Font Numbers" ),
and
.B fg
is the RGB encoded color for the text.
.B Voff
measures the vertical offset from the baseline; subtract
.B Voffbias
to get the actual value (negative values represent a displacement down the page).
The field
.B ul
is the underline style:
.B ULnone
if no underline,
.B ULunder
for conventional underline, and
.B ULmid
for strike-through.
.PP
Items of type
.B Iruletag
represent a horizontal rule, as follows:
.PP
.EX
.ta 6n +\w'Dimen 'u
struct Irule
{
	Item;
	uchar	align;
	uchar	noshade;
	int	size;
	Dimen	wspec;
};
.EE
.PP
Here
.B align
is the alignment specification (described in the corresponding section),
.B noshade
is set if the rule should not be shaded,
.B size
is the height of the rule (as set by the size attribute),
and
.B wspec
is the desired width (see section
.IR "Dimension Specifications" ).
.PP
Items of type
.B Iimagetag
describe embedded images, for which the following structure is defined:
.PP
.EX
.ta 6n +\w'Iimage* 'u
struct Iimage
{
	Item;
	Rune*	imsrc;
	int	imwidth;
	int	imheight;
	Rune*	altrep;
	Map*	map;
	int	ctlid;
	uchar	align;
	uchar	hspace;
	uchar	vspace;
	uchar	border;
	Iimage*	nextimage;
};
.EE
.PP
Here
.B imsrc
is the URL of the image source,
.B imwidth
and
.BR imheight ,
if non-zero, contain the specified width and height for the image,
and
.B altrep
is the text to use as an alternative to the image, if the image is not displayed.
.BR Map ,
if set, points to a structure describing an associated client-side image map.
.B Ctlid
is reserved for use by the application, for handling animated images.
.B Align
encodes the alignment specification of the image.
.B Hspace
contains the number of pixels to pad the image with on either side, and
.B Vspace
the padding above and below.
.B Border
is the width of the border to draw around the image.
.B Nextimage
points to the next image in the document (the head of this list is
.BR Docinfo.images ).
.PP
For items of type
.BR Iformfieldtag ,
the following structure is defined:
.PP
.EX
.ta 6n +\w'Formfield* 'u
struct Iformfield
{
	Item;
	Formfield*	formfield;
};
.EE
.PP
This adds a single field,
.BR formfield ,
which points to a structure describing a field in a form, described in section
.IR Forms .
.PP
For items of type
.BR Itabletag ,
the following structure is defined:
.PP
.EX
.ta 6n +\w'Table* 'u
struct Itable
{
	Item;
	Table*	table;
};
.EE
.PP
.B Table
points to a structure describing the table, described in the section
.IR Tables .
.PP
For items of type
.BR Ifloattag ,
the following structure is defined:
.PP
.EX
.ta 6n +\w'Ifloat* 'u
struct Ifloat
{
	Item;
	Item*	item;
	int	x;
	int	y;
	uchar	side;
	uchar	infloats;
	Ifloat*	nextfloat;
};
.EE
.PP
The
.B item
points to a single item (either a table or an image) that floats (the text of the
document flows around it), and
.B side
indicates the margin that this float sticks to; it is either
.B ALleft
or
.BR ALright .
.B X
and
.B y
are reserved for use by the caller; these are typically used for the coordinates
of the top of the float.
.B Infloats
is used by the caller to keep track of whether it has placed the float.
.B Nextfloat
is used by the caller to link together all of the floats that it has placed.
.PP
For items of type
.BR Ispacertag ,
the following structure is defined:
.PP
.EX
.ta 6n +\w'Item; 'u
struct Ispacer
{
	Item;
	int	spkind;
};
.EE
.PP
.B Spkind
encodes the kind of spacer, and may be one of
.B ISPnull
(zero height and width),
.B ISPvline
(takes on height and ascent of the current font),
.B ISPhspace
(has the width of a space in the current font) and
.B ISPgeneral
(for all other purposes, such as between markers and lists).
.SS Generic Attributes
.PP
The genattr field of an item, if non-nil, points to a structure that holds
the values of attributes not specific to any particular
item type, as they occur on a wide variety of underlying HTML tags.
The structure is as follows:
.PP
.EX
.ta 6n +\w'SEvent* 'u
typedef struct Genattr Genattr;
struct Genattr
{
	Rune*	id;
	Rune*	class;
	Rune*	style;
	Rune*	title;
	SEvent*	events;
};
.EE
.PP
Fields
.BR id ,
.BR class ,
.B style
and
.BR title ,
when non-nil, contain values of correspondingly named attributes of the HTML tag
associated with this item.
.B Events
is a linked list of events (with corresponding scripted actions) associated with the item:
.PP
.EX
.ta 6n +\w'SEvent* 'u
typedef struct SEvent SEvent;
struct SEvent
{
	SEvent*	next;
	int	type;
	Rune*	script;
};
.EE
.PP
Here,
.B next
points to the next event in the list,
.B type
is one of
.BR SEonblur ,
.BR SEonchange ,
.BR SEonclick ,
.BR SEondblclick ,
.BR SEonfocus ,
.BR SEonkeypress ,
.BR SEonkeyup ,
.BR SEonload ,
.BR SEonmousedown ,
.BR SEonmousemove ,
.BR SEonmouseout ,
.BR SEonmouseover ,
.BR SEonmouseup ,
.BR SEonreset ,
.BR SEonselect ,
.B SEonsubmit
or
.BR SEonunload ,
and
.B script
is the text of the associated script.
.SS Dimension Specifications
.PP
Some structures include a dimension specification, used where
a number can be followed by a
.B %
or a
.B *
to indicate
percentage of total or relative weight.
This is encoded using the following structure:
.PP
.EX
.ta 6n +\w'int 'u
typedef struct Dimen Dimen;
struct Dimen
{
	int	kindspec;
};
.EE
.PP
Separate kind and spec values are extracted using
.I dimenkind
and
.IR dimenspec .
.I Dimenkind
returns one of
.BR Dnone ,
.BR Dpixels ,
.B Dpercent
or
.BR Drelative .
.B Dnone
means that no dimension was specified.
In all other cases,
.I dimenspec
should be called to find the absolute number of pixels, the percentage of total,
or the relative weight.
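.PP
As an illustration, a layout routine might resolve a dimension
specification to a pixel count along the following lines.
This is a sketch under stated assumptions: this page does not document
how kind and spec are packed into
.BR kindspec ,
so the encoding below (kind in bits 29-30, spec in the low 29 bits) is
assumed, and
.B xdimenkind
and
.B xdimenspec
are hypothetical stand-ins for the library routines so that the example
is self-contained.
.B Dimenpixels
and its parameters are likewise hypothetical.
.PP
.EX
```c
typedef struct Dimen Dimen;
struct Dimen
{
	int	kindspec;
};

/* Assumed encoding; the real values live in html.h. */
enum
{
	Dnone = 0,
	Dpixels = 1<<29,
	Dpercent = 2<<29,
	Drelative = 3<<29,
	Dkindmask = 3<<29,
	Dspecmask = (1<<29)-1
};

/* Stand-in for dimenkind: extract the kind bits. */
int
xdimenkind(Dimen d)
{
	return d.kindspec & Dkindmask;
}

/* Stand-in for dimenspec: extract the numeric part. */
int
xdimenspec(Dimen d)
{
	return d.kindspec & Dspecmask;
}

/*
 * Resolve d to a width in pixels.  avail is the total width
 * available, unit is the number of pixels granted per unit of
 * relative weight, and dflt is returned when no dimension was
 * specified.
 */
int
dimenpixels(Dimen d, int avail, int unit, int dflt)
{
	switch(xdimenkind(d)){
	case Dpixels:
		return xdimenspec(d);
	case Dpercent:
		return avail * xdimenspec(d) / 100;
	case Drelative:
		return unit * xdimenspec(d);
	}
	return dflt;
}
```
.EE
.PP
With the real library, the two stand-ins would be replaced by calls to
.I dimenkind
and
.IR dimenspec .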
.SS Background Specifications
.PP
It is possible to set the background of the entire document, and also
for some parts of the document (such as tables).
This is encoded as follows:
.PP
.EX
.ta 6n +\w'Rune* 'u
typedef struct Background Background;
struct Background
{
	Rune*	image;
	int	color;
};
.EE
.PP
.BR Image ,
if non-nil, is the URL of an image to use as the background.
If this is nil,
.B color
is used instead, as the RGB value for a solid fill color.
.SS Alignment Specifications
.PP
Certain items have alignment specifiers taken from the following
enumerated type:
.PP
.EX
.ta 6n
enum
{
	ALnone = 0, ALleft, ALcenter, ALright, ALjustify,
	ALchar, ALtop, ALmiddle, ALbottom, ALbaseline
};
.EE
.PP
These values correspond to the various alignment types named in the HTML 4.0
standard.
If an item has an alignment of
.B ALleft
or
.BR ALright ,
the library automatically encapsulates it inside a float item.
.PP
Tables, and the various rows, columns and cells within them, have a more
complex alignment specification, composed of separate vertical and
horizontal alignments:
.PP
.EX
.ta 6n +\w'uchar 'u
typedef struct Align Align;
struct Align
{
	uchar	halign;
	uchar	valign;
};
.EE
.PP
.B Halign
can be one of
.BR ALnone ,
.BR ALleft ,
.BR ALcenter ,
.BR ALright ,
.B ALjustify
or
.BR ALchar .
.B Valign
can be one of
.BR ALnone ,
.BR ALmiddle ,
.BR ALbottom ,
.B ALtop
or
.BR ALbaseline .
.SS Font Numbers
.PP
Text items have an associated font number (the
.B fnt
field), which is encoded as
.BR style*NumSize+size .
Here,
.B style
is one of
.BR FntR ,
.BR FntI ,
.B FntB
or
.BR FntT ,
for roman, italic, bold and typewriter font styles, respectively, and
.B size
is one of
.BR Tiny ,
.BR Small ,
.BR Normal ,
.B Large
or
.BR Verylarge .
The total number of possible font numbers is
.BR NumFnt ,
and the default font number is
.B DefFnt
(which is roman style, normal size).
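.PP
The encoding can be built and taken apart with integer arithmetic, as
in the sketch below.
The numeric values are an assumption: the constants are taken to count
up from zero in the order listed above, with
.B NumStyle
(a name not mentioned on this page) counting the styles.
The helper names
.BR mkfnt ,
.B fntstyle
and
.B fntsize
are hypothetical.
.PP
.EX
```c
/* Assumed numbering; the authoritative values are in html.h. */
enum { FntR, FntI, FntB, FntT, NumStyle };
enum { Tiny, Small, Normal, Large, Verylarge, NumSize };
enum
{
	NumFnt = NumStyle*NumSize,
	DefFnt = FntR*NumSize + Normal
};

/* Combine a style and a size into a font number. */
int
mkfnt(int style, int size)
{
	return style*NumSize + size;
}

/* Recover the style from a font number. */
int
fntstyle(int fnt)
{
	return fnt / NumSize;
}

/* Recover the size from a font number. */
int
fntsize(int fnt)
{
	return fnt % NumSize;
}
```
.EE
.PP
A caller that keeps one loaded font per combination could use the font
number directly as an index into an array of
.B NumFnt
entries.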
.SS Document Info
.PP
Global information about an HTML page is stored in the following structure:
.PP
.EX
.ta 6n +\w'DestAnchor* 'u
typedef struct Docinfo Docinfo;
struct Docinfo
{
	// stuff from HTTP headers, doc head, and body tag
	Rune*	src;
	Rune*	base;
	Rune*	doctitle;
	Background	background;
	Iimage*	backgrounditem;
	int	text;
	int	link;
	int	vlink;
	int	alink;
	int	target;
	int	chset;
	int	mediatype;
	int	scripttype;
	int	hasscripts;
	Rune*	refresh;
	Kidinfo*	kidinfo;
	int	frameid;

	// info needed to respond to user actions
	Anchor*	anchors;
	DestAnchor*	dests;
	Form*	forms;
	Table*	tables;
	Map*	maps;
	Iimage*	images;
};
.EE
.PP
.B Src
gives the URL of the original source of the document,
and
.B base
is the base URL.
.B Doctitle
is the document's title, as set by a
.B <title>
element.
.B Background
is as described in the section
.IR "Background Specifications" ,
and
.B backgrounditem
is set to be an image item for the document's background image (if given as a URL),
or else nil.
.B Text
gives the default foreground text color of the document,
.B link
the unvisited hyperlink color,
.B vlink
the visited hyperlink color, and
.B alink
the color for highlighting hyperlinks (all in 24-bit RGB format).
.B Target
is the default target frame id.
.B Chset
and
.B mediatype
are as for the
.I chset
and
.I mtype
parameters to
.IR parsehtml .
.B Scripttype
is the type of any scripts contained in the document, and is always
.BR TextJavascript .
.B Hasscripts
is set if the document contains any scripts.
Scripting is currently unsupported.
.B Refresh
is the contents of a
.B "<meta http-equiv=Refresh ...>"
tag, if any.
.B Kidinfo
is set if this document is a frameset (see section
.IR Frames ).
.B Frameid
is this document's frame id.
.PP
.B Anchors
is a list of hyperlinks contained in the document,
and
.B dests
is a list of hyperlink destinations within the page (see the following section for details).
.BR Forms ,
.B tables
and
.B maps
are lists of the various forms, tables and client-side maps contained
in the document, as described in subsequent sections.
.B Images
is a list of all the image items in the document.
.SS Anchors
.PP
The library builds two lists for all of the
.B <a>
elements (anchors) in a document.
Each anchor is assigned a unique anchor id within the document.
For anchors which are hyperlinks (the
.B href
attribute was supplied), the following structure is defined:
.PP
.EX
.ta 6n +\w'Anchor* 'u
typedef struct Anchor Anchor;
struct Anchor
{
	Anchor*	next;
	int	index;
	Rune*	name;
	Rune*	href;
	int	target;
};
.EE
.PP
.B Next
points to the next anchor in the list (the head of this list is
.BR Docinfo.anchors ).
.B Index
is the anchor id; each item within this hyperlink is tagged with this value
in its
.B anchorid
field.
.B Name
and
.B href
are the values of the correspondingly named attributes of the anchor
(in particular, href is the URL to go to).
.B Target
is the value of the target attribute (if provided) converted to a frame id.
.PP
Destinations within the document (anchors with the name attribute set)
are held in the
.B Docinfo.dests
list, using the following structure:
.PP
.EX
.ta 6n +\w'DestAnchor* 'u
typedef struct DestAnchor DestAnchor;
struct DestAnchor
{
	DestAnchor*	next;
	int	index;
	Rune*	name;
	Item*	item;
};
.EE
.PP
.B Next
is the next element of the list,
.B index
is the anchor id,
.B name
is the value of the name attribute, and
.B item
points to the item within the parsed document that should be considered
to be the destination.
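.PP
When the user clicks on an item, the caller typically needs to map the
item's
.B anchorid
back to its hyperlink.
A minimal sketch of that lookup follows.
The function name
.B findanchor
is hypothetical, and
.B Rune
is given a stand-in definition so that the example compiles outside
Plan 9.
.PP
.EX
```c
#include <stddef.h>

typedef unsigned int Rune;	/* stand-in for the Plan 9 Rune type */

typedef struct Anchor Anchor;
struct Anchor
{
	Anchor*	next;
	int	index;
	Rune*	name;
	Rune*	href;
	int	target;
};

/*
 * Return the hyperlink whose anchor id is anchorid, or NULL if
 * there is none.  anchors is the head of the list, as found in
 * Docinfo.anchors; anchorid comes from an item's anchorid field,
 * where zero means the item is not inside an anchor.
 */
Anchor*
findanchor(Anchor *anchors, int anchorid)
{
	Anchor *a;

	if(anchorid == 0)
		return NULL;
	for(a = anchors; a != NULL; a = a->next)
		if(a->index == anchorid)
			return a;
	return NULL;
}
```
.EE
.PP
The same loop, applied to
.B Docinfo.dests
and comparing
.B name
instead of
.BR index ,
would locate the destination item for a fragment reference.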
.SS Forms
.PP
Any forms within a document are kept in a list, headed by
.BR Docinfo.forms .
The elements of this list are as follows:
.PP
.EX
.ta 6n +\w'Formfield* 'u
typedef struct Form Form;
struct Form
{
	Form*	next;
	int	formid;
	Rune*	name;
	Rune*	action;
	int	target;
	int	method;
	int	nfields;
	Formfield*	fields;
};
.EE
.PP
.B Next
points to the next form in the list.
.B Formid
is a serial number for the form within the document.
.B Name
is the value of the form's name or id attribute.
.B Action
is the value of any action attribute.
.B Target
is the value of the target attribute (if any) converted to a frame target id.
.B Method
is one of
.B HGet
or
.BR HPost .
.B Nfields
is the number of fields in the form, and
.B fields
is a linked list of the actual fields.
.PP
The individual fields in a form are described by the following structure:
.PP
.EX
.ta 6n +\w'Formfield* 'u
typedef struct Formfield Formfield;
struct Formfield
{
	Formfield*	next;
	int	ftype;
	int	fieldid;
	Form*	form;
	Rune*	name;
	Rune*	value;
	int	size;
	int	maxlength;
	int	rows;
	int	cols;
	uchar	flags;
	Option*	options;
	Item*	image;
	int	ctlid;
	SEvent*	events;
};
.EE
.PP
Here,
.B next
points to the next field in the list.
.B Ftype
is the type of the field, which can be one of
.BR Ftext ,
.BR Fpassword ,
.BR Fcheckbox ,
.BR Fradio ,
.BR Fsubmit ,
.BR Fhidden ,
.BR Fimage ,
.BR Freset ,
.BR Ffile ,
.BR Fbutton ,
.B Fselect
or
.BR Ftextarea .
.B Fieldid
is a serial number for the field within the form.
.B Form
points back to the form containing this field.
.BR Name ,
.BR value ,
.BR size ,
.BR maxlength ,
.B rows
and
.B cols
each contain the values of corresponding attributes of the field, if present.
.B Flags
contains per-field flags, of which
.B FFchecked
and
.B FFmultiple
are defined.
.B Image
is only used for fields of type
.BR Fimage ;
it points to an image item containing the image to be displayed.
.B Ctlid
is reserved for use by the caller, typically to store a unique id
of an associated control used to implement the field.
.B Events
is the same as the corresponding field of the generic attributes
associated with the item containing this field.
.B Options
is only used by fields of type
.BR Fselect ;
it consists of a list of possible options that may be selected for that
field, using the following structure:
.PP
.EX
.ta 6n +\w'Option* 'u
typedef struct Option Option;
struct Option
{
	Option*	next;
	int	selected;
	Rune*	value;
	Rune*	display;
};
.EE
.PP
.B Next
points to the next element of the list.
.B Selected
is set if this option is to be displayed initially.
.B Value
is the value to send when the form is submitted if this option is selected.
.B Display
is the string to display on the screen for this option.
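.PP
A caller implementing a selection control might walk the option list
as sketched below.
The function names
.B initialoption
and
.B nselected
are hypothetical,
.B Rune
is given a stand-in definition, and the fallback to the first option
when none is marked selected is a common browser convention rather
than something this page specifies.
.PP
.EX
```c
#include <stddef.h>

typedef unsigned int Rune;	/* stand-in for the Plan 9 Rune type */

typedef struct Option Option;
struct Option
{
	Option*	next;
	int	selected;
	Rune*	value;
	Rune*	display;
};

/*
 * Return the first option marked selected.  If none is marked,
 * fall back to the first option in the list (an assumed policy).
 * Returns NULL only for an empty list.
 */
Option*
initialoption(Option *options)
{
	Option *o;

	for(o = options; o != NULL; o = o->next)
		if(o->selected)
			return o;
	return options;
}

/*
 * Count the options marked selected; more than one is plausible
 * for a field that has FFmultiple set.
 */
int
nselected(Option *options)
{
	Option *o;
	int n;

	n = 0;
	for(o = options; o != NULL; o = o->next)
		if(o->selected)
			n++;
	return n;
}
```
.EE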
.SS Tables
.PP
The library builds a list of all the tables in the document,
headed by
.BR Docinfo.tables .
Each element of this list has the following format:
.PP
.EX
.ta 6n +\w'Tablecell*** 'u
typedef struct Table Table;
struct Table
{
	Table*	next;
	int	tableid;
	Tablerow*	rows;
	int	nrow;
	Tablecol*	cols;
	int	ncol;
	Tablecell*	cells;
	int	ncell;
	Tablecell***	grid;
	Align	align;
	Dimen	width;
	int	border;
	int	cellspacing;
	int	cellpadding;
	Background	background;
	Item*	caption;
	uchar	caption_place;
	Lay*	caption_lay;
	int	totw;
	int	toth;
	int	caph;
	int	availw;
	Token*	tabletok;
	uchar	flags;
};
.EE
.PP
.B Next
points to the next element in the list of tables.
.B Tableid
is a serial number for the table within the document.
.B Rows
is an array of row specifications (described below) and
.B nrow
is the number of elements in this array.
Similarly,
.B cols
is an array of column specifications, and
.B ncol
the size of this array.
.B Cells
is a list of all cells within the table (structure described below)
and
.B ncell
is the number of elements in this list.
Note that a cell may span multiple rows and/or columns, thus
.B ncell
may be smaller than
.BR nrow*ncol .
.B Grid
is a two-dimensional array of cells within the table; the cell
at row
.B i
and column
.B j
is
.BR Table.grid[i][j] .
A cell that spans multiple rows and/or columns will
be referenced by
.B grid
multiple times; however, it will occur only once in
.BR cells .
.B Align
gives the alignment specification for the entire table,
and
.B width
gives the requested width as a dimension specification.
.BR Border ,
.B cellspacing
and
.B cellpadding
give the values of the corresponding attributes for the table,
and
.B background
gives the requested background for the table.
.B Caption
is a linked list of items to be displayed as the caption of the
table, either above or below depending on whether
.B caption_place
is
.B ALtop
or
.BR ALbottom .
Most of the remaining fields are reserved for use by the caller,
except
.BR tabletok ,
which is reserved for internal use.
The type
.B Lay
is not defined by the library; the caller can provide its
own definition.
.PP
The
.B Tablecol
structure is defined for use by the caller.
The library ensures that the correct number of these
is allocated, but leaves them blank.
The fields are as follows:
.PP
.EX
.ta 6n +\w'Point 'u
typedef struct Tablecol Tablecol;
struct Tablecol
{
	int	width;
	Align	align;
	Point	pos;
};
.EE
.PP
The rows in the table are specified as follows:
.PP
.EX
.ta 6n +\w'Background 'u
typedef struct Tablerow Tablerow;
struct Tablerow
{
	Tablerow*	next;
	Tablecell*	cells;
	int	height;
	int	ascent;
	Align	align;
	Background	background;
	Point	pos;
	uchar	flags;
};
.EE
.PP
.B Next
is only used during parsing; it should be ignored by the caller.
.B Cells
provides a list of all the cells in a row, linked through their
.B nextinrow
fields (see below).
.BR Height ,
.B ascent
and
.B pos
are reserved for use by the caller.
.B Align
is the alignment specification for the row, and
.B background
is the background to use, if specified.
.B Flags
is used by the parser; ignore this field.
.PP
The individual cells of the table are described as follows:
.PP
.EX
.ta 6n +\w'Background 'u
typedef struct Tablecell Tablecell;
struct Tablecell
{
	Tablecell*	next;
	Tablecell*	nextinrow;
	int	cellid;
	Item*	content;
	Lay*	lay;
	int	rowspan;
	int	colspan;
	Align	align;
	uchar	flags;
	Dimen	wspec;
	int	hspec;
	Background	background;
	int	minw;
	int	maxw;
	int	ascent;
	int	row;
	int	col;
	Point	pos;
};
.EE
.PP
.B Next
is used to link together the list of all cells within a table
.RB ( Table.cells ),
whereas
.B nextinrow
is used to link together all the cells within a single row
.RB ( Tablerow.cells ).
.B Cellid
provides a serial number for the cell within the table.
.B Content
is a linked list of the items to be laid out within the cell.
.B Lay
is reserved for the user to describe how these items have
been laid out.
.B Rowspan
and
.B colspan
are the number of rows and columns spanned by this cell,
respectively.
.B Align
is the alignment specification for the cell.
.B Flags
is some combination of
.BR TFparsing ,
.B TFnowrap
and
.B TFisth
or'd together.
Here
.B TFparsing
is used internally by the parser, and should be ignored.
.B TFnowrap
means that the contents of the cell should not be
wrapped if they don't fit the available width;
rather, the table should be expanded if need be
(this is set when the nowrap attribute is supplied).
.B TFisth
means that the cell was created by the
.B <th>
element (rather than the
.B <td>
element),
indicating that it is a header cell rather than a data cell.
.B Wspec
provides a suggested width as a dimension specification,
and
.B hspec
provides a suggested height in pixels.
.B Background
gives a background specification for the individual cell.
.BR Minw ,
.BR maxw ,
.B ascent
and
.B pos
are reserved for use by the caller during layout.
.B Row
and
.B col
give the indices of the row and column of the top left-hand
corner of the cell within the table grid.
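.PP
Because a spanning cell occupies several slots of
.BR grid ,
a caller that walks the grid usually wants to act on each cell only
once.
One way to do so is to compare each slot's position with the cell's own
.B row
and
.B col
fields, as in the sketch below.
The structures are trimmed to the fields used here, the function name
.B eachcell
is hypothetical, and treating an empty grid slot as a nil pointer is an
assumption.
.PP
.EX
```c
#include <stddef.h>

/* Trimmed copies of the structures above. */
typedef struct Tablecell Tablecell;
struct Tablecell
{
	int	cellid;
	int	rowspan;
	int	colspan;
	int	row;
	int	col;
};

typedef struct Table Table;
struct Table
{
	int	nrow;
	int	ncol;
	Tablecell***	grid;
};

/*
 * Call visit once for every cell, in row-major order of each
 * cell's top left-hand corner.  A slot is skipped unless it is
 * the corner of the cell it refers to, so spanning cells are
 * not visited twice.  Returns the number of cells visited.
 */
int
eachcell(Table *t, void (*visit)(Tablecell*, void*), void *arg)
{
	int i, j, n;
	Tablecell *c;

	n = 0;
	for(i = 0; i < t->nrow; i++)
		for(j = 0; j < t->ncol; j++){
			c = t->grid[i][j];
			if(c == NULL || c->row != i || c->col != j)
				continue;
			if(visit != NULL)
				visit(c, arg);
			n++;
		}
	return n;
}
```
.EE
.PP
Walking the
.B cells
list through the
.B next
fields reaches the same set of cells without the duplicate check, though
not necessarily in grid order.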
.SS Client-side Maps
.PP
The library builds a list of client-side maps, headed by
.BR Docinfo.maps ,
and having the following structure:
.PP
.EX
.ta 6n +\w'Rune* 'u
typedef struct Map Map;
struct Map
{
	Map*	next;
	Rune*	name;
	Area*	areas;
};
.EE
.PP
.B Next
points to the next element in the list,
.B name
is the name of the map (used to bind it to an image), and
.B areas
is a list of the areas within the image that comprise the map,
using the following structure:
.PP
.EX
.ta 6n +\w'Dimen* 'u
typedef struct Area Area;
struct Area
{
	Area*	next;
	int	shape;
	Rune*	href;
	int	target;
	Dimen*	coords;
	int	ncoords;
};
.EE
.PP
.B Next
points to the next element in the map's list of areas.
.B Shape
describes the shape of the area, and is one of
.BR SHrect ,
.B SHcircle
or
.BR SHpoly .
.B Href
is the URL associated with this area in its role as
a hypertext link, and
.B target
is the target frame it should be loaded in.
.B Coords
is an array of coordinates for the shape, and
.B ncoords
is the size of this array (number of elements).
.SS Frames
.PP
If the
.B Docinfo.kidinfo
field is set, the document is a frameset.
In this case, it is typical for
.I parsehtml
to return nil, as a document which is a frameset should have no actual
items that need to be laid out (such will appear only in subsidiary documents).
It is possible that items will be returned by a malformed document; the caller
should check for this and free any such items.
.PP
The
.B Kidinfo
structure itself reflects the fact that framesets can be nested within a document.
It is defined as follows:
.PP
.EX
.ta 6n +\w'Kidinfo* 'u
typedef struct Kidinfo Kidinfo;
struct Kidinfo
{
	Kidinfo*	next;
	int	isframeset;

	// fields for "frame"
	Rune*	src;
	Rune*	name;
	int	marginw;
	int	marginh;
	int	framebd;
	int	flags;

	// fields for "frameset"
	Dimen*	rows;
	int	nrows;
	Dimen*	cols;
	int	ncols;
	Kidinfo*	kidinfos;
	Kidinfo*	nextframeset;
};
.EE
.PP
.B Next
is only used if this structure is part of a containing frameset; it points to the next
element in the list of children of that frameset.
.B Isframeset
is set when this structure represents a frameset; if clear, it is an individual frame.
.PP
Some fields are used only for framesets.
.B Rows
is an array of dimension specifications for rows in the frameset, and
.B nrows
is the length of this array.
.B Cols
is the corresponding array for columns, of length
.BR ncols .
.B Kidinfos
points to a list of components contained within this frameset, each
of which may be a frameset or a frame.
.B Nextframeset
is only used during parsing, and should be ignored.
.PP
The remaining fields are used if the structure describes a frame, not a frameset.
.B Src
provides the URL for the document that should be initially loaded into this frame.
Note that this may be a relative URL, in which case it should be interpreted
using the containing document's URL as the base.
.B Name
gives the name of the frame, typically supplied via a name attribute in the HTML.
If no name was given, the library allocates one.
.BR Marginw ,
.B marginh
and
.B framebd
are the values of the marginwidth, marginheight and frameborder attributes, respectively.
.B Flags
can contain some combination of the following:
.B FRnoresize
(the frame had the noresize attribute set, and the user should not be allowed to resize it),
.B FRnoscroll
(the frame should not have any scroll bars),
.B FRhscroll
(the frame should have a horizontal scroll bar),
.B FRvscroll
(the frame should have a vertical scroll bar),
.B FRhscrollauto
(the frame should be automatically given a horizontal scroll bar if its contents
would not otherwise fit), and
.B FRvscrollauto
(the frame gets a vertical scroll bar only if required).
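.PP
Since framesets nest, code that processes a
.B Kidinfo
tree is naturally recursive.
The sketch below counts the individual frames beneath a frameset; the
structure is trimmed to the fields needed for the traversal, and the
function name
.B countframes
is hypothetical.
.PP
.EX
```c
#include <stddef.h>

/* Trimmed copy of the structure above. */
typedef struct Kidinfo Kidinfo;
struct Kidinfo
{
	Kidinfo*	next;
	int	isframeset;
	Kidinfo*	kidinfos;
};

/*
 * Count the individual frames in the list starting at k,
 * descending into any nested framesets.
 */
int
countframes(Kidinfo *k)
{
	int n;

	n = 0;
	for(; k != NULL; k = k->next){
		if(k->isframeset)
			n += countframes(k->kidinfos);
		else
			n++;
	}
	return n;
}
```
.EE
.PP
A caller that loads the subsidiary documents would follow the same
traversal, fetching
.B src
for each node that is not a frameset.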
.SH SOURCE
.B \*9/src/libhtml
.SH SEE ALSO
.MR fmt (1)
.PP
W3C World Wide Web Consortium,
``HTML 4.01 Specification''.
.SH BUGS
The entire HTML document must be loaded into memory before
any of it can be parsed.