Markdown spec

Introduction

md-toc aims to be as conformant as possible to each supported markdown parser. What follows is a list of parameters and rules used by md-toc to decide how to parse markdown files and to generate the table of contents.

Compatibility table

Key

Note

This color system is subjective.

Color     Meaning
-------   -------
unknown   unknown
none      none
low       low
partial   partial
most      most
full      full

Status

Parser              Status    Alias of   Supported parser version        Source
-----------------   -------   --------   -----------------------------   ------
cmark               most      -          Version 0.29 (2019-04-06)       https://github.com/commonmark/cmark
commonmarker        most      github     -                               https://github.com/gjtorikian/commonmarker
github              most      -          Version 0.29-gfm (2019-04-06)   https://github.com/github/cmark
goldmark            most      cmark      -                               https://github.com/yuin/goldmark
gitlab              partial   -          Latest unknown version          https://docs.gitlab.com/ee/user/markdown.html
redcarpet           low       -          Redcarpet v3.5.0                https://github.com/vmg/redcarpet
Gogs                unknown   -          -                               https://gogs.io/
NotABug Gogs fork   unknown   -          -                               https://notabug.org/hp/gogs/
Marked              unknown   -          -                               https://github.com/markedjs/marked
kramdown            unknown   -          -                               https://kramdown.gettalong.org/
GitLab Kramdown     unknown   -          -                               https://gitlab.com/gitlab-org/gitlab_kramdown
Notabug             unknown   -          -                               https://notabug.org/hp/gogs/

md-toc version table

Key

Word   Meaning
----   -------
L      latest version available at the time. Implies U.
N      not implemented
U      unavailable version number
C      Commonmark
R      Redcarpet
G      GitLab modified Redcarpet

Status history

md-toc   standard   cmark      commonmarker   github     gitlab   goldmark   redcarpet
------   --------   --------   ------------   --------   ------   --------   ---------
0.0.1    N          N          N              U          N        N          N
1.0.0    U          N          N              L          G        N          https://github.com/vmg/redcarpet/tree/26c80f05e774b31cd01255b0fa62e883ac185bf3
2.0.0    N          L          L              C          G        N          https://github.com/vmg/redcarpet/tree/e3a1d0b00a77fa4e2d3c37322bea66b82085486f
2.0.1    N          L          L              C          G        N          https://github.com/vmg/redcarpet/tree/e3a1d0b00a77fa4e2d3c37322bea66b82085486f
3.0.0    N          github     github         0.28-gfm   github   N          https://github.com/vmg/redcarpet/tree/94f6e27bdf2395efa555a7c772a3d8b70fb84346
3.1.0    N          github     github         0.28-gfm   github   N          https://github.com/vmg/redcarpet/tree/94f6e27bdf2395efa555a7c772a3d8b70fb84346
4.0.0    N          github     github         0.28-gfm   github   N          https://github.com/vmg/redcarpet/tree/94f6e27bdf2395efa555a7c772a3d8b70fb84346
5.0.0    N          github     github         0.28-gfm   github   N          https://github.com/vmg/redcarpet/tree/94f6e27bdf2395efa555a7c772a3d8b70fb84346
5.0.1    N          github     github         0.28-gfm   github   N          https://github.com/vmg/redcarpet/tree/94f6e27bdf2395efa555a7c772a3d8b70fb84346
6.0.0    N          github     github         0.28-gfm   github   N          https://github.com/vmg/redcarpet/tree/94f6e27bdf2395efa555a7c772a3d8b70fb84346
6.0.1    N          github     github         0.28-gfm   github   N          https://github.com/vmg/redcarpet/tree/94f6e27bdf2395efa555a7c772a3d8b70fb84346
6.0.2    N          github     github         0.28-gfm   github   N          https://github.com/vmg/redcarpet/tree/94f6e27bdf2395efa555a7c772a3d8b70fb84346
7.0.0    N          github     github         0.28-gfm   github   N          https://github.com/vmg/redcarpet/tree/94f6e27bdf2395efa555a7c772a3d8b70fb84346
7.0.1    N          github     github         0.28-gfm   github   N          https://github.com/vmg/redcarpet/tree/94f6e27bdf2395efa555a7c772a3d8b70fb84346
7.0.2    N          github     github         0.28-gfm   github   N          https://github.com/vmg/redcarpet/tree/94f6e27bdf2395efa555a7c772a3d8b70fb84346
7.0.3    N          github     github         0.28-gfm   github   N          https://github.com/vmg/redcarpet/tree/94f6e27bdf2395efa555a7c772a3d8b70fb84346
7.0.4    N          github     github         0.28-gfm   github   N          https://github.com/vmg/redcarpet/tree/94f6e27bdf2395efa555a7c772a3d8b70fb84346
7.0.5    N          github     github         0.28-gfm   github   N          https://github.com/vmg/redcarpet/tree/94f6e27bdf2395efa555a7c772a3d8b70fb84346
7.1.0    N          github     github         0.28-gfm   github   N          v3.5.0
7.2.0    N          0.28 [1]   0.28-gfm       0.28-gfm   github   N          v3.5.0
8.0.0    N          0.29       github         0.29-gfm   L        cmark      v3.5.0
8.0.1    N          0.29       github         0.29-gfm   L        cmark      v3.5.0
8.1.0    N          0.29       github         0.29-gfm   L        cmark      v3.5.0

[1] used alias github

Supported markdown parsers

  • cmark:

    • “CommonMark parsing and rendering library and program in C”.

  • commonmarker:

    • a “Ruby wrapper for libcmark (CommonMark parser)”.

    • as described on their website: “It also includes extensions to the CommonMark spec as documented in the GitHub Flavored Markdown spec, such as support for tables, strikethroughs, and autolinking.” For this reason we assume that commonmarker is an alias of github.

  • github:

    • uses a forked version of cmark with some added extensions. This language specification is called GitHub Flavored Markdown.

    • there are subtle differences from cmark, such as the disallowed raw HTML extension, which affects md-toc.

  • gitlab:

  • goldmark:

    • this parser claims to be compliant with CommonMark: “goldmark is compliant with CommonMark 0.29.” For this reason goldmark is an alias of cmark.

  • redcarpet:

    • “The safe Markdown parser, reloaded.”

Rules

Headers

Only ATX-style headings are supported in md-toc.
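
A simplified way to recognise ATX-style headings is sketched below. It is not md-toc's actual implementation: the name parse_atx_heading is hypothetical, the regular expression only loosely follows the CommonMark rules (up to 3 spaces of indentation, 1 to 6 # characters, then whitespace or the end of the line), and backslash escapes are ignored.

```python
import re

# Simplified ATX heading pattern (an assumption, loosely based on CommonMark):
# up to 3 spaces of indentation, 1 to 6 '#' characters, then either
# whitespace followed by the heading text or the end of the line.
ATX_HEADING = re.compile(r"^ {0,3}(#{1,6})(?:[ \t]+(.*?))?[ \t]*$")


def parse_atx_heading(line: str):
    """Return (level, text) for an ATX heading line, or None otherwise."""
    match = ATX_HEADING.match(line)
    if match is None:
        return None
    level = len(match.group(1))
    text = match.group(2) or ""
    # Drop an optional closing sequence such as the trailing '##' in "## title ##".
    text = re.sub(r"(^|[ \t]+)#+$", "", text).strip()
    return level, text
```

Lines such as #foo (no space after the marker), a run of seven # characters, or a heading indented by 4 spaces are rejected by this pattern.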

List item rules

Problems

We are interested in the sublist indentation rules for all types of lists, and in integer overflows in the case of ordered lists.

For ordered lists, we are not concerned with 0 or negative numbers as list markers, so these cases will not be considered. In fact, ordered lists generated by md-toc always start from 1.

As for indentation rules, the user is responsible for generating a correct markdown list according to the parser’s rules. Consider this example:

# foo
## bar
### baz

There is no problem here: md-toc, using github as the parser, renders this as:

- [foo](#foo)
  - [bar](#bar)
    - [baz](#baz)

Now, let’s take the previous example and reverse the order of the lines:

### baz
## bar
# foo

and this is what md-toc renders using github:

- [baz](#baz)
- [bar](#bar)
- [foo](#foo)

while the user might expect this:

    - [baz](#baz)
  - [bar](#bar)
- [foo](#foo)
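
The indentation seen in the two examples above can be approximated with a small stack-based sketch. This is not md-toc's actual algorithm: the function name toc_indentation and the stack representation are assumptions made for illustration. Each sublist is indented so that it starts under the text of its parent item, and a heading that is not deeper than any open item falls back to the indentation of an earlier item, or to 0.

```python
def toc_indentation(levels, marker="-"):
    """Return the number of leading spaces for each heading, in document order.

    levels is the list of heading levels (1 for '#', 2 for '##', ...).
    """
    stack = []  # (heading level, indentation) of the currently open list items
    indentation = []
    for level in levels:
        # Close every open item that is not a proper ancestor of this heading.
        while stack and stack[-1][0] >= level:
            stack.pop()
        if stack:
            # The parent's text starts after its marker and one space.
            indent = stack[-1][1] + len(marker) + 1
        else:
            indent = 0
        stack.append((level, indent))
        indentation.append(indent)
    return indentation
```

For the heading levels 1, 2, 3 this yields the indentations 0, 2, 4, while for 3, 2, 1 it yields 0, 0, 0, which is the flat list that md-toc produces.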

Indentation

  • cmark, github, gitlab: list indentation for sublists with these parsers is based on the previous state, as stated in section 5.2 of the GitHub Flavored Markdown document:

    “The most important thing to notice is that the position of the text after the list marker determines how much indentation is needed in subsequent blocks in the list item. If the list marker takes up two spaces, and there are three spaces between the list marker and the next non-whitespace character, then blocks must be indented five spaces in order to fall under the list item.”

    This is also true in the opposite case: if the new list element needs less indentation than the one currently being processed, we have to use the same number of indentation spaces that was used somewhere earlier in the list.

  • redcarpet:

    The following C function returns the offset of the first character after the list marker and the whitespace that follows it. A value of 0 is returned if the input line is not a list element. List item rules are explained in the single-line comments.

    /* prefix_uli • returns unordered list item prefix */
    static size_t
    prefix_uli(uint8_t *data, size_t size)
    {
        size_t i = 0;
    
        // There can be up to 3 whitespaces before the list marker.
        if (i < size && data[i] == ' ') i++;
        if (i < size && data[i] == ' ') i++;
        if (i < size && data[i] == ' ') i++;
    
        // The next non-whitespace character must be a list marker and
        // the character after the list marker must be a whitespace.
        if (i + 1 >= size ||
           (data[i] != '*' && data[i] != '+' && data[i] != '-') ||
            data[i + 1] != ' ')
            return 0;
    
        // Check that the next line is not a header
        // that uses the `-` or `=` characters as markers.
        if (is_next_headerline(data + i, size - i))
            return 0;
    
        // Return the first non whitespace character after the list marker.
        return i + 2;
    }
    

    As far as I can tell from this and other functions, the 4-space indentation rule applies on a new list block.

    This means that anything indented by more than 3 whitespaces is considered a sublist. The only exception seems to be the first sublist in a list block, in which case even a single whitespace counts as a sublist. The 4-space indentation rule applies nonetheless, so to keep things simple md-toc always uses 4 whitespaces for sublists. Apparently, ordered and unordered lists share the same properties.

    Let’s see this example:

    - I
     - am
         - foo
    
    stop
    
    - I
        - am
            - foo
    

    This is how redcarpet renders it once you run $ redcarpet:

    <ul>
    <li>I
    
    <ul>
    <li>am
    
    <ul>
    <li>foo</li>
    </ul></li>
    </ul></li>
    </ul>
    
    <p>stop</p>
    
    <ul>
    <li>I
    
    <ul>
    <li>am
    
    <ul>
    <li>foo</li>
    </ul></li>
    </ul></li>
    </ul>
    

    What follows is an extract of a C function in redcarpet that parses list items. I have added all the single line comments.

    /* parse_listitem • parsing of a single list item */
    /*  assuming initial prefix is already removed */
    static size_t
    parse_listitem(struct buf *ob, struct sd_markdown *rndr, uint8_t *data,
    size_t size, int *flags)
    {
        struct buf *work = 0, *inter = 0;
        size_t beg = 0, end, pre, sublist = 0, orgpre = 0, i;
        int in_empty = 0, has_inside_empty = 0, in_fence = 0;
    
        // This is the base case, usually of indentation 0 but it can be
        // from 0 to 3 spaces. If it was 4 spaces it would be a code
        // block.
        /* keeping track of the first indentation prefix */
        while (orgpre < 3 && orgpre < size && data[orgpre] == ' ')
            orgpre++;
    
        // Get the first index of string after the list marker. Try both
        // ordered and unordered lists
        beg = prefix_uli(data, size);
        if (!beg)
            beg = prefix_oli(data, size);
    
        if (!beg)
            return 0;
    
        /* skipping to the beginning of the following line */
        end = beg;
        while (end < size && data[end - 1] != '\n')
            end++;
    
        // Iterate line by line using the '\n' character as delimiter.
        /* process the following lines */
        while (beg < size) {
            size_t has_next_uli = 0, has_next_oli = 0;
    
            // Go to the next line.
            end++;
    
            // Find the end of the line.
            while (end < size && data[end - 1] != '\n')
                end++;
    
            // Skip the next line if it is empty.
            /* process an empty line */
            if (is_empty(data + beg, end - beg)) {
                in_empty = 1;
                beg = end;
                continue;
            }
    
            // Count up to 4 characters of indentation.
            // If we have 4 characters then it might be a sublist.
            // Note that this is an offset and does not point to an
            // index in the actual line string.
            /* calculating the indentation */
            i = 0;
            while (i < 4 && beg + i < end && data[beg + i] == ' ')
                i++;
    
            pre = i;
    
            /* Only check for new list items if we are **not** inside
             * a fenced code block */
            if (!in_fence) {
                has_next_uli = prefix_uli(data + beg + i, end - beg - i);
                has_next_oli = prefix_oli(data + beg + i, end - beg - i);
            }
    
            /* checking for ul/ol switch */
            if (in_empty && (
                ((*flags & MKD_LIST_ORDERED) && has_next_uli) ||
                (!(*flags & MKD_LIST_ORDERED) && has_next_oli))){
                *flags |= MKD_LI_END;
                break; /* the following item must have same list type */
            }
    
            // Determine if we are dealing with:
            // - an empty line
            // - a new list item
            // - a sublist
            /* checking for a new item */
            if ((has_next_uli && !is_hrule(data + beg + i, end - beg - i)) || has_next_oli) {
                if (in_empty)
                    has_inside_empty = 1;
    
                // The next list item's indentation (pre) must be the same as
                // the previous one (orgpre), otherwise it might be a
                // sublist.
                if (pre == orgpre) /* the following item must have */
                    break;             /* the same indentation */
    
                // If the indentation does not match the previous one then
                // assume that it is a sublist. Check later whether it is
                // or not.
                if (!sublist)
                    sublist = work->size;
            }
            /* joining only indented stuff after empty lines */
            else if (in_empty && i < 4 && data[beg] != '\t') {
                *flags |= MKD_LI_END;
                break;
            }
            else if (in_empty) {
                // Add a line delimiter to the next line if it is missing.
                bufputc(work, '\n');
                has_inside_empty = 1;
            }
    
            in_empty = 0;
            beg = end;
        }
    
        if (*flags & MKD_LI_BLOCK) {
            /* intermediate render of block li */
            if (sublist && sublist < work->size) {
                parse_block(inter, rndr, work->data, sublist);
                parse_block(inter, rndr, work->data + sublist, work->size - sublist);
            }
            else
                parse_block(inter, rndr, work->data, work->size);
        }
    }
    

    According to the code, parse_listitem is called indirectly by parse_block (via parse_list), but parse_block is called directly by parse_listitem so the code analysis is not trivial. For this reason I might be mistaken about the 4 spaces indentation rule.

    Here is an extract of the parse_block function with the calls to parse_list:

    /* parse_block • parsing of one block, returning next uint8_t to parse */
    static void
    parse_block(struct buf *ob, struct sd_markdown *rndr, uint8_t *data, size_t
    size)
    {
        while (beg < size) {
    
            else if (prefix_uli(txt_data, end))
              beg += parse_list(ob, rndr, txt_data, end, 0);
    
            else if (prefix_oli(txt_data, end))
              beg += parse_list(ob, rndr, txt_data, end, MKD_LIST_ORDERED);
        }
    }
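
To experiment with the prefix rule outside redcarpet, the unordered list check shown earlier can be mirrored in Python. This is only a sketch under stated assumptions: the is_next_headerline check is left out, and redcarpet_toc_line is a hypothetical helper that illustrates the choice of always using 4 whitespaces per sublist level.

```python
def prefix_uli(line: str) -> int:
    """Offset of the first character after an unordered list marker, or 0.

    Mirrors redcarpet's prefix_uli, without the is_next_headerline check.
    """
    i = 0
    # Up to 3 leading spaces are allowed before the list marker.
    while i < 3 and i < len(line) and line[i] == " ":
        i += 1
    # The marker must be '*', '+' or '-' and must be followed by a space.
    if i + 1 >= len(line) or line[i] not in "*+-" or line[i + 1] != " ":
        return 0
    return i + 2


def redcarpet_toc_line(text: str, depth: int, marker: str = "-") -> str:
    """Build one TOC line (hypothetical helper): 4 spaces per sublist level."""
    return " " * (4 * depth) + marker + " " + text
```

Note that prefix_uli returns 0 for a line that starts with 4 spaces. In parse_listitem this is not a problem, because up to 4 spaces of indentation are skipped before prefix_uli is called on the rest of the line.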
    

Overflows

Notes on ordered lists

Code fence

Code fences are sections of a markdown document that some parsers treat as verbatim text. They are usually used to display source code. Some programming languages use the # character to comment out a line. For this reason md-toc needs to ignore code fences, so that a # character inside them is not treated as an ATX-style heading and added to the TOC.
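
The idea can be sketched as a filter that drops every line belonging to a fenced block before headings are searched. This is a simplified illustration, not md-toc's implementation: the name lines_outside_code_fences is hypothetical, and the fence rules are an assumption loosely based on CommonMark (an opener is at least three backticks or three tildes with up to 3 spaces of indentation, and a closer uses the same character, is at least as long and has no info string).

```python
import re

# Simplified fence pattern (assumption): up to 3 spaces of indentation,
# then at least 3 backticks or at least 3 tildes.
FENCE = re.compile(r"^ {0,3}(`{3,}|~{3,})")


def lines_outside_code_fences(lines):
    """Yield only the lines that are not part of a fenced code block."""
    open_fence = None  # the marker that opened the current fence, if any
    for line in lines:
        match = FENCE.match(line)
        if open_fence is None:
            if match:
                # Opening fence: remember its marker and skip the line.
                open_fence = match.group(1)
                continue
            yield line
        elif (match
              and match.group(1)[0] == open_fence[0]
              and len(match.group(1)) >= len(open_fence)
              and line.strip() == match.group(1)):
            # Closing fence: same character, at least as long as the
            # opener, and nothing else on the line.
            open_fence = None
```

A line such as "# a comment" inside a fenced Python block is filtered out, so a later heading search never sees it.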

TOC marker

A TOC marker is a string that marks the start and the end of the table of contents in a markdown file.

Originally, [](TOC) was chosen as the default TOC marker because some markdown parsers render it as invisible. Other parsers, however, such as the one used by Gitea, still display it. HTML comments seem to be a better solution.
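
A minimal sketch of how a marker pair can delimit the TOC follows. The function name replace_toc and the HTML comment marker used as the default value are assumptions for illustration; md-toc's real behaviour may differ.

```python
def replace_toc(document: str, toc: str, marker: str = "<!--TOC-->") -> str:
    """Place toc between the first two markers, or after a single marker."""
    first = document.find(marker)
    if first == -1:
        return document  # no marker: leave the document untouched
    after_first = first + len(marker)
    second = document.find(marker, after_first)
    head = document[:after_first]
    if second == -1:
        # Only one marker: insert the TOC after it and add the closing marker.
        tail = document[after_first:]
        closing = marker
    else:
        # Two markers: whatever was between them is the old TOC, discard it.
        tail = document[second:]
        closing = ""
    return head + "\n\n" + toc + "\n\n" + closing + tail
```

Running the function again on its own output with the same TOC leaves the document unchanged, which is the property that makes a start and end marker useful for regenerating a TOC in place.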

Other markdown parsers

If you have a look at https://www.w3.org/community/markdown/wiki/MarkdownImplementations you will see that there are a ton of different markdown parsers out there. Moreover, that list has not been updated in a while.

Markdown parsers have different behaviours regarding anchor links. Some of them implement them while others don’t; some act on the duplicate entry problem while others don’t; some strip consecutive dash characters while others don’t. And it’s not just about anchor links, as you have read earlier. For example:

Steps to add an unsupported markdown parser

  1. Find the source code and/or documents.

  2. Find the rules for each section, such as anchor link generation, title detection, etc. Rely more on the source code than on the documentation, if possible.

  3. Add the relevant information on this page.

  4. Write or adapt an algorithm for that section.

  5. Write unit tests for it.

  6. Add the new parser to the CLI.

Curiosities