Item* parsehtml(uchar* data, int datalen, Rune* src, int mtype,
	int chset, Docinfo** pdi)
void printitems(Item* items, char* msg)
int validitems(Item* items)
void freeitems(Item* items)
void freedocinfo(Docinfo* d)
int dimenkind(Dimen d)
int dimenspec(Dimen d)
Rune* targetname(int targid)
uchar* fromStr(Rune* buf, int n, int chset)
Rune* toStr(uchar* buf, int n, int chset)
This library implements a parser for HTML 4.0 documents.
The parsed HTML is converted into an intermediate representation that
describes how the formatted HTML should be laid out.
parses an entire HTML document contained in the buffer
The URL of the document should be passed in as
is the media type of the document, which should be either
The character set of the document is described in
The return value is a linked list of
structures, described in detail below.
is set to point to a newly created
structure, containing information pertaining to the entire document.
The library expects two allocation routines to be provided by the
These routines are analogous to the standard malloc and realloc routines,
except that they should not return if the memory allocation fails.
is required to zero the memory.
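As an illustration of the allocation contract described above, the following is a minimal sketch of two caller-supplied routines. The names
.B emalloc
and
.B erealloc
and their signatures are assumptions made for this example; the only properties taken from the text are that neither routine returns on failure and that the malloc-like one zeroes the memory.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Sketch only: names and signatures are assumed, not taken from the library.
 * Allocate n bytes of zeroed memory; exit instead of returning on failure.
 */
void *
emalloc(size_t n)
{
	void *p;

	if(n == 0)
		n = 1;	/* avoid implementation-defined zero-size allocation */
	p = calloc(1, n);	/* calloc supplies the required zeroing */
	if(p == NULL){
		fprintf(stderr, "emalloc: out of memory\n");
		exit(1);
	}
	return p;
}

/*
 * Resize a block previously obtained from emalloc or erealloc;
 * exit instead of returning on failure.  New bytes are not zeroed.
 */
void *
erealloc(void *p, size_t n)
{
	if(n == 0)
		n = 1;
	p = realloc(p, n);
	if(p == NULL){
		fprintf(stderr, "erealloc: out of memory\n");
		exit(1);
	}
	return p;
}
```

Exiting on failure is one way to satisfy the requirement that the routines do not return; a program with its own error recovery might instead jump to a handler.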
For debugging purposes,
may be called to display the contents of an item list; individual items may
print verb, installed on the first call to
traverses the item list, checking that all of the pointers are valid.
if everything is ok, and
if an error was found.
Normally, one would not call these routines directly.
Instead, one sets the global variable
and the library calls them automatically.
to cause the library to print a warning whenever it finds a problem with the
to print debugging information in the lexer.
When an item list is finished with, it should be freed with
should be called on the pointer returned in
are provided to interpret the
type, as described in the section
.IR "Dimension Specifications" .
Frame target names are mapped to integer ids via a global, permanent mapping.
To find the value for a given name, call
which allocates a new id if the name hasn't been seen before.
The name of a given, known id may be retrieved using
The library predefines
The library handles all text as Unicode strings (type
Character set conversion is provided by
Unicode characters from
and converts them to the character set described by
interpreted as belonging to character set
and converts them to a Unicode string.
Both routines null-terminate the result, and use
to allocate space for it.
is a linked list of variant structures,
with the generic portion described by the following definition:
.ta 6n +\w'Genattr* 'u
typedef struct Item Item;
points to the successor in the linked list of items, while
are intended for use by the caller as part of the layout process.
if non-zero, gives the integer id assigned by the parser to the anchor that
this item is in (see section
is a collection of flags and values described as follows:
.ta 6n +\w'IFindentshift = 'u
IFbrksp = 0x40000000,
IFnobrk = 0x20000000,
IFcleft = 0x10000000,
IFcright = 0x08000000,
IFrjust = 0x01000000,
IFcjust = 0x00800000,
IFindentmask = (255<<IFindentshift),
is set if a break is to be forced before placing this item.
is set if a 1 line space should be added to the break (in which case
is set if a break is not permitted before the item.
is set if left floats should be cleared (that is, if the list of pending left floats should be placed)
before this item is placed, and
is set for right floats.
In both cases, IFbrk is also set.
is set if the line containing this item is allowed to wrap.
is set if this item hangs into the left indent.
is set if the line containing this item should be right justified,
is set for center justified lines.
is used to indicate that an image is a server-side map.
The low 8 bits, represented by
indicate the current hang into left indent, in tenths of a tabstop.
The next 8 bits, represented by
indicate the current indent in tab stops.
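The two numeric fields packed into the low 16 bits can be pulled apart with masks and shifts. The sketch below assumes a shift of 8 for
.B IFindentshift
(which follows from the "next 8 bits" wording) and uses the name
.B IFhangmask
for the low 8 bits; that name, the helper functions and the use of an unsigned state word are assumptions made for illustration.

```c
/* Assumed constants: only IFindentmask = (255<<IFindentshift) appears in the text. */
enum {
	IFhangmask = 255,	/* assumed name for the low 8 bits */
	IFindentshift = 8,	/* assumed value: indent occupies the next 8 bits */
	IFindentmask = (255<<IFindentshift)
};

/* Current hang into the left indent, in tenths of a tabstop. */
int
itemhang(unsigned int state)
{
	return (int)(state & IFhangmask);
}

/* Current indent, in tab stops. */
int
itemindent(unsigned int state)
{
	return (int)((state & IFindentmask) >> IFindentshift);
}

/* Return state with the indent field replaced, leaving all other bits alone. */
unsigned int
setindent(unsigned int state, int indent)
{
	state &= ~(unsigned int)IFindentmask;
	state |= ((unsigned int)indent << IFindentshift) & (unsigned int)IFindentmask;
	return state;
}
```

Keeping the accessors separate from the flag tests means layout code can read the indent without caring which of the high flag bits are set.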
is an optional pointer to an auxiliary structure, described in the section
.IR "Generic Attributes" .
describes which variant type this item has.
It can have one of the values
For each of these values, there is an additional structure defined, which
includes Item as an unnamed initial substructure, and then defines additional
represent a piece of text, using the following structure:
is a null-terminated Unicode string of the actual characters making up this text item,
is the font number (described in the section
.IR "Font Numbers" ),
is the RGB encoded color for the text.
measures the vertical offset from the baseline; subtract
to get the actual value (negative values represent a displacement down the page).
is the underline style:
for conventional underline, and
represent a horizontal rule, as follows:
is the alignment specification (described in the corresponding section),
is set if the rule should not be shaded,
is the height of the rule (as set by the size attribute),
is the desired width (see section
.IR "Dimension Specifications" ).
describe embedded images, for which the following structure is defined:
.ta 6n +\w'Iimage* 'u
is the URL of the image source,
if non-zero, contain the specified width and height for the image,
is the text to use as an alternative to the image, if the image is not displayed.
if set, points to a structure describing an associated client-side image map.
is reserved for use by the application, for handling animated images.
encodes the alignment specification of the image.
contains the number of pixels to pad the image with on either side, and
the padding above and below.
is the width of the border to draw around the image.
points to the next image in the document (the head of this list is
.BR Docinfo.images ).
the following structure is defined:
.ta 6n +\w'Formfield* 'u
Formfield* formfield;
This adds a single field,
which points to a structure describing a field in a form, described in section
the following structure is defined:
points to a structure describing the table, described in the section
the following structure is defined:
.ta 6n +\w'Ifloat* 'u
points to a single item (either a table or an image) that floats (the text of the
document flows around it), and
indicates the margin that this float sticks to; it is either
are reserved for use by the caller; these are typically used for the coordinates
of the top of the float.
is used by the caller to keep track of whether it has placed the float.
is used by the caller to link together all of the floats that it has placed.
the following structure is defined:
encodes the kind of spacer, and may be one of
(zero height and width),
(takes on height and ascent of the current font),
(has the width of a space in the current font) and
(for all other purposes, such as between markers and lists).
.SS Generic Attributes
The genattr field of an item, if non-nil, points to a structure that holds
the values of attributes not specific to any particular
item type, as they occur on a wide variety of underlying HTML tags.
The structure is as follows:
.ta 6n +\w'SEvent* 'u
typedef struct Genattr Genattr;
when non-nil, contain values of correspondingly named attributes of the HTML tag
associated with this item.
is a linked list of events (with corresponding scripted actions) associated with the item:
.ta 6n +\w'SEvent* 'u
typedef struct SEvent SEvent;
points to the next event in the list,
is the text of the associated script.
.SS Dimension Specifications
Some structures include a dimension specification, used where
a number can be followed by a
percentage of total or relative weight.
This is encoded using the following structure:
typedef struct Dimen Dimen;
Separate kind and spec values are extracted using
means that no dimension was specified.
should be called to find the absolute number of pixels, the percentage of total,
or the relative weight.
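To make the kind/spec split concrete, here is a small sketch of one way a kind and a number could share a single word, together with a helper that turns a dimension into pixels. The bit layout, the
.B Ex
and
.B Kind
names, and the
.B exresolve
helper are all hypothetical; the library's own encoding is not given in this text and may differ.

```c
/* Hypothetical kinds; the library's own names and values may differ. */
enum {
	KindNone = 0,	/* no dimension was specified */
	KindPixels,	/* absolute number of pixels */
	KindPercent,	/* percentage of total */
	KindRelative	/* relative weight */
};

enum {
	KindShift = 29,			/* assumed: kind lives in the top 3 bits */
	SpecMask = (1<<KindShift) - 1	/* assumed: low 29 bits hold the number */
};

typedef struct ExDimen ExDimen;
struct ExDimen {
	unsigned int kindspec;
};

ExDimen
exmakedimen(int kind, int spec)
{
	ExDimen d;

	d.kindspec = ((unsigned int)kind << KindShift) | ((unsigned int)spec & SpecMask);
	return d;
}

int
exdimenkind(ExDimen d)
{
	return (int)(d.kindspec >> KindShift);
}

int
exdimenspec(ExDimen d)
{
	return (int)(d.kindspec & SpecMask);
}

/*
 * Resolve a dimension to pixels given the total space available.
 * Relative weights depend on sibling dimensions, so they (and
 * unspecified dimensions) simply yield the caller's fallback here.
 */
int
exresolve(ExDimen d, int total, int fallback)
{
	switch(exdimenkind(d)){
	case KindPixels:
		return exdimenspec(d);
	case KindPercent:
		return total * exdimenspec(d) / 100;
	default:
		return fallback;
	}
}
```

For example, under these assumptions a 50 percent dimension resolved against 640 available pixels gives 320.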
.SS Background Specifications
It is possible to set the background of the entire document, and also
for some parts of the document (such as tables).
This is encoded as follows:
typedef struct Background Background;
if non-nil, is the URL of an image to use as the background.
is used instead, as the RGB value for a solid fill color.
.SS Alignment Specifications
Certain items have alignment specifiers taken from the following
ALnone = 0, ALleft, ALcenter, ALright, ALjustify,
ALchar, ALtop, ALmiddle, ALbottom, ALbaseline
These values correspond to the various alignment types named in the HTML 4.0
If an item has an alignment of
the library automatically encapsulates it inside a float item.
Tables, and the various rows, columns and cells within them, have a more
complex alignment specification, composed of separate vertical and
horizontal alignments:
typedef struct Align Align;
Text items have an associated font number (the
field), which is encoded as
.BR style*NumSize+size .
for roman, italic, bold and typewriter font styles, respectively, and size is
The total number of possible font numbers is
and the default font number is
(which is roman style, normal size).
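The font number encoding can be illustrated with a short sketch. The formula
.B style*NumSize+size
and the four styles come from the text above; the enumerator names, the choice of five sizes and the helper functions are assumptions made for this example.

```c
/* Assumed style names, in the order roman, italic, bold, typewriter. */
enum {
	FntR,
	FntI,
	FntB,
	FntT,
	NumStyle
};

/* Assumed size names and count; the text above does not list them. */
enum {
	Tiny,
	Small,
	Normal,
	Large,
	Verylarge,
	NumSize
};

enum {
	NumFnt = NumStyle*NumSize,	/* total number of font numbers */
	DefFnt = FntR*NumSize + Normal	/* roman style, normal size */
};

/* Encode a (style, size) pair as a font number. */
int
fontnum(int style, int size)
{
	return style*NumSize + size;
}

/* Recover the style from a font number. */
int
fontstyle(int fnt)
{
	return fnt / NumSize;
}

/* Recover the size from a font number. */
int
fontsize(int fnt)
{
	return fnt % NumSize;
}
```

Because size is the low-order component, a caller can keep a flat table of NumFnt loaded fonts and index it directly with the font number.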
Global information about an HTML page is stored in the following structure:
.ta 6n +\w'DestAnchor* 'u
typedef struct Docinfo Docinfo;
// stuff from HTTP headers, doc head, and body tag
Background background;
Iimage* backgrounditem;
// info needed to respond to user actions
gives the URL of the original source of the document,
is the document's title, as set by a
is as described in the section
.IR "Background Specifications" ,
is set to be an image item for the document's background image (if given as a URL),
gives the default foreground text color of the document,
the unvisited hyperlink color,
the visited hyperlink color, and
the color for highlighting hyperlinks (all in 24-bit RGB format).
is the default target frame id.
is the type of any scripts contained in the document, and is always
is set if the document contains any scripts.
Scripting is currently unsupported.
.B "<meta http-equiv=Refresh ...>"
is set if this document is a frameset (see section
is this document's frame id.
is a list of hyperlinks contained in the document,
is a list of hyperlink destinations within the page (see the following section for details).
are lists of the various forms, tables and client-side maps contained
in the document, as described in subsequent sections.
is a list of all the image items in the document.
The library builds two lists for all of the
elements (anchors) in a document.
Each anchor is assigned a unique anchor id within the document.
For anchors which are hyperlinks (the
attribute was supplied), the following structure is defined:
.ta 6n +\w'Anchor* 'u
typedef struct Anchor Anchor;
points to the next anchor in the list (the head of this list is
.BR Docinfo.anchors ).
is the anchor id; each item within this hyperlink is tagged with this value
are the values of the correspondingly named attributes of the anchor
(in particular, href is the URL to go to).
is the value of the target attribute (if provided) converted to a frame id.
Destinations within the document (anchors with the name attribute set)
list, using the following structure:
.ta 6n +\w'DestAnchor* 'u
typedef struct DestAnchor DestAnchor;
is the next element of the list,
is the value of the name attribute, and
points to the item within the parsed document that should be considered
to be the destination.
Any forms within a document are kept in a list, headed by
The elements of this list are as follows:
.ta 6n +\w'Formfield* 'u
typedef struct Form Form;
points to the next form in the list.
is a serial number for the form within the document.
is the value of the form's name or id attribute.
is the value of any action attribute.
is the value of the target attribute (if any) converted to a frame target id.
is the number of fields in the form, and
is a linked list of the actual fields.
The individual fields in a form are described by the following structure:
.ta 6n +\w'Formfield* 'u
typedef struct Formfield Formfield;
points to the next field in the list.
is the type of the field, which can be one of
is a serial number for the field within the form.
points back to the form containing this field.
each contain the values of corresponding attributes of the field, if present.
contains per-field flags, of which
is only used for fields of type
it points to an image item containing the image to be displayed.
is reserved for use by the caller, typically to store a unique id
of an associated control used to implement the field.
is the same as the corresponding field of the generic attributes
associated with the item containing this field.
is only used by fields of type
it consists of a list of possible options that may be selected for that
field, using the following structure:
.ta 6n +\w'Option* 'u
typedef struct Option Option;
points to the next element of the list.
is set if this option is to be displayed initially.
is the value to send when the form is submitted if this option is selected.
is the string to display on the screen for this option.
The library builds a list of all the tables in the document,
.BR Docinfo.tables .
Each element of this list has the following format:
.ta 6n +\w'Tablecell*** 'u
typedef struct Table Table;
Background background;
uchar caption_place;
points to the next element in the list of tables.
is a serial number for the table within the document.
is an array of row specifications (described below) and
is the number of elements in this array.
is an array of column specifications, and
the size of this array.
is a list of all cells within the table (structure described below)
is the number of elements in this list.
Note that a cell may span multiple rows and/or columns, thus
is a two-dimensional array of cells within the table; the cell
.BR Table.grid[i][j] .
A cell that spans multiple rows and/or columns will
multiple times; however, it will only occur once in
gives the alignment specification for the entire table,
gives the requested width as a dimension specification.
give the values of the corresponding attributes for the table,
gives the requested background for the table.
is a linked list of items to be displayed as the caption of the
table, either above or below depending on whether
Most of the remaining fields are reserved for use by the caller,
which is reserved for internal use.
is not defined by the library; the caller can provide its
structure is defined for use by the caller.
The library ensures that the correct number of these
is allocated, but leaves them blank.
The fields are as follows:
typedef struct Tablecol Tablecol;
The rows in the table are specified as follows:
.ta 6n +\w'Background 'u
typedef struct Tablerow Tablerow;
Background background;
is only used during parsing; it should be ignored by the caller.
provides a list of all the cells in a row, linked through their
are reserved for use by the caller.
is the alignment specification for the row, and
is the background to use, if specified.
is used by the parser; ignore this field.
The individual cells of the table are described as follows:
.ta 6n +\w'Background 'u
typedef struct Tablecell Tablecell;
Tablecell* nextinrow;
Background background;
is used to link together the list of all cells within a table
.RB ( Table.cells ),
is used to link together all the cells within a single row
.RB ( Tablerow.cells ).
provides a serial number for the cell within the table.
is a linked list of the items to be laid out within the cell.
is reserved for the user to describe how these items have
are the number of rows and columns spanned by this cell,
is the alignment specification for the cell.
is some combination of
is used internally by the parser, and should be ignored.
means that the contents of the cell should not be
wrapped if they don't fit the available width,
rather, the table should be expanded if need be
(this is set when the nowrap attribute is supplied).
means that the cell was created by the
element (rather than the
indicating that it is a header cell rather than a data cell.
provides a suggested width as a dimension specification,
provides a suggested height in pixels.
gives a background specification for the individual cell.
are reserved for use by the caller during layout.
give the indices of the row and column of the top left-hand
corner of the cell within the table grid.
.SS Client-side Maps
The library builds a list of client-side maps, headed by
and having the following structure:
typedef struct Map Map;
points to the next element in the list,
is the name of the map (used to bind it to an image), and
is a list of the areas within the image that comprise the map,
using the following structure:
.ta 6n +\w'Dimen* 'u
typedef struct Area Area;
points to the next element in the map's list of areas.
describes the shape of the area, and is one of
is the URL associated with this area in its role as
a hypertext link, and
is the target frame it should be loaded in.
is an array of coordinates for the shape, and
is the size of this array (number of elements).
field is set, the document is a frameset.
In this case, it is typical for
to return nil, as a document which is a frameset should have no actual
items that need to be laid out (such will appear only in subsidiary documents).
It is possible that items will be returned by a malformed document; the caller
should check for this and free any such items.
structure itself reflects the fact that framesets can be nested within a document.
It is defined as follows:
.ta 6n +\w'Kidinfo* 'u
typedef struct Kidinfo Kidinfo;
// fields for "frame"
// fields for "frameset"
Kidinfo* nextframeset;
is only used if this structure is part of a containing frameset; it points to the next
element in the list of children of that frameset.
is set when this structure represents a frameset; if clear, it is an individual frame.
Some fields are used only for framesets.
is an array of dimension specifications for rows in the frameset, and
is the length of this array.
is the corresponding array for columns, of length
points to a list of components contained within this frameset, each
of which may be a frameset or a frame.
is only used during parsing, and should be ignored.
The remaining fields are used if the structure describes a frame, not a frameset.
provides the URL for the document that should be initially loaded into this frame.
Note that this may be a relative URL, in which case it should be interpreted
using the containing document's URL as the base.
gives the name of the frame, typically supplied via a name attribute in the HTML.
If no name was given, the library allocates one.
are the values of the marginwidth, marginheight and frameborder attributes, respectively.
can contain some combination of the following:
(the frame had the noresize attribute set, and the user should not be allowed to resize it),
(the frame should not have any scroll bars),
(the frame should have a horizontal scroll bar),
(the frame should have a vertical scroll bar),
(the frame should be automatically given a horizontal scroll bar if its contents
would not otherwise fit), and
(the frame gets a vertical scrollbar only if required).
.B /usr/local/plan9/src/libhtml
W3C World Wide Web Consortium,
``HTML 4.01 Specification''.
The entire HTML document must be loaded into memory before
any of it can be parsed.
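Since the whole document has to be in memory first, a caller would typically read the source into one buffer before handing it to the parser. The following is a minimal sketch using standard C I/O; the
.B readall
helper is hypothetical and is not part of the library.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: read everything remaining on f into one
 * heap buffer.  On success, store the byte count in *lenp and
 * return the buffer (to be released with free).  Return NULL on
 * read error or if memory runs out.
 */
unsigned char *
readall(FILE *f, size_t *lenp)
{
	unsigned char *buf = NULL, *nbuf;
	size_t len = 0, cap = 0, n;

	for(;;){
		if(len == cap){
			/* grow geometrically so total copying stays linear */
			cap = cap ? cap*2 : 8192;
			nbuf = realloc(buf, cap);
			if(nbuf == NULL){
				free(buf);
				return NULL;
			}
			buf = nbuf;
		}
		n = fread(buf+len, 1, cap-len, f);
		len += n;
		if(n == 0)
			break;	/* end of file or error */
	}
	if(ferror(f)){
		free(buf);
		return NULL;
	}
	*lenp = len;
	return buf;
}
```

The returned buffer and length correspond to the data and datalen arguments shown in the synopsis.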