I will soon be writing a spec for a semantic markup language (likely a variant on HTML, but with hardly any presentation elements). The first use will be for a social media system, but it should be applicable to documents in general.

What prior art should I know about?

What else do I need to know, in order to not fuck this up?

(Boosts appreciated!)

At minimum, I need to look through the latest snapshot of the HTML5 spec, ignoring all the interactive and presentation elements.

I wonder if I should also look at ARIA and some other accessibility specs, but I'm not actually sure how relevant this will be.

Mine is going to be versioned, like HTML used to be (a "living standard" is no standard at all) and I expect to go through a couple of (mostly additive) iterations, but it would be nice to get it As Right As Possible on the first pass.

It's also a good chance to fix some oddities in HTML, such as the anchor tag.

is there a Semitic markup language, and does it flow from right to left?

@Nikolai_Kingsley You jest, but I wonder if there should be an element to delimit Unicode text directionality embeddings.

@Nikolai_Kingsley If you mean the display of LTR vs. RTL languages, Unicode seems to take care of that, and even provides characters that can define overrides and embeddings when autodetection of script is not sufficient. However, I have zero real-world experience with Hebrew, Arabic, etc. and would be curious to hear if there are places where Unicode (or Unicode + HTML) fail to properly represent them.
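
For anyone following along, here is a minimal Python sketch of the embedding and override characters mentioned above. The code points are the explicit directional formatting characters from the Unicode Bidirectional Algorithm; the helper names `embed_rtl` and `bidi_classes` are made up for illustration, and this only inspects character classes rather than performing any actual reordering.

```python
import unicodedata

# Explicit directional formatting characters defined by the Unicode
# Bidirectional Algorithm (UAX #9).
BIDI_CONTROLS = {
    "LRE": "\u202A",  # left-to-right embedding
    "RLE": "\u202B",  # right-to-left embedding
    "PDF": "\u202C",  # pop directional formatting (closes the above)
    "LRO": "\u202D",  # left-to-right override
    "RLO": "\u202E",  # right-to-left override
}


def embed_rtl(text: str) -> str:
    """Wrap text in a right-to-left embedding, closed with PDF (hypothetical helper)."""
    return BIDI_CONTROLS["RLE"] + text + BIDI_CONTROLS["PDF"]


def bidi_classes(text: str) -> list[str]:
    """Return the Unicode bidirectional class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


# Hebrew "shalom", written with escapes so the source itself stays LTR.
sample = embed_rtl("\u05E9\u05DC\u05D5\u05DD")
print(bidi_classes(sample))
```

A markup element for directionality would presumably play the same role as the RLE/PDF pair here, with the advantage that an unclosed embedding becomes a detectable syntax error instead of leaking into the following text.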

@varx have you looked at nroff, Groff and the other pre-html markups?

@celesteh I had not thought of those! But I also have the impression that they're almost entirely about presentation, rather than semantics. They're primarily for typesetting, right?

@alcinnz @varx Future-proofing note: The permanent location of the XHT specs is; I may move them to a different server at some point, so please use that URI.

@alexbuzzbee @alcinnz Heads-up, that link breaks because the redirect that adds a trailing slash on stomps on the URL scheme, and the server doesn't actually respond on port 80.

@alexbuzzbee Still getting a bad second redirect, unfortunately.

@varx In that case it's going to take a few hours again. I don't have the time to do that right now. I'll work on it tomorrow.

@xj9 That would be nice, but I don't think HTML5 is there yet. :-/ There are missing semantic distinctions, e.g. a figure/photo/other main-content image (such as might be displayed in a gallery UI element) vs. emoji and icons (things displayed in line with text content).

(Yes, I'm going to treat emoji as small images, rather than characters, because that's what they are.)

However, it should be straightforward to indicate how to *render* this markup to HTML5.
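
As a rough sketch of what such a rendering rule could look like, assuming two hypothetical node types (`Figure` for main-content images and `InlineImage` for emoji and icons), neither of which comes from any existing spec:

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Figure:
    """Hypothetical main-content image, e.g. something a gallery UI would show."""
    src: str
    description: str


@dataclass
class InlineImage:
    """Hypothetical small image displayed in line with text (emoji, icon)."""
    src: str
    fallback: str


def render_html5(node) -> str:
    """Render one node to an HTML5 fragment; the output shape is an assumption."""
    if isinstance(node, Figure):
        return (
            f'<figure><img src="{escape(node.src)}" '
            f'alt="{escape(node.description)}"></figure>'
        )
    if isinstance(node, InlineImage):
        # HTML5 has no dedicated element for this, so a class carries the distinction.
        return (
            f'<img class="inline" src="{escape(node.src)}" '
            f'alt="{escape(node.fallback)}">'
        )
    raise TypeError(f"unsupported node type: {type(node).__name__}")


print(render_html5(Figure("cat.jpg", "A cat asleep on a keyboard")))
print(render_html5(InlineImage("smile.png", ":)")))
```

The point is only that the distinction exists in the source markup and is flattened on the way out: both cases end up as `img` in HTML5, distinguished by nothing stronger than a wrapper or a class.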

@x_cli Oooh, DocBook! I had not considered that.

@varx I would suggest that accessibility will probably be relevant — things like textual image/media descriptions are semantic, and having that built in early would make a lot of people in leftist software circles more likely to look at it. Talking to people who use accessibility features regularly is definitely a worthwhile endeavour, rather than assuming the reasoning is obvious to those of us who don’t require it.

@sophistoche Yeah! I don't even know how to figure out who I need to talk to, though. I suspect I need to find someone who uses accessibility features, but is also technical. (...oh, just thought of someone!)

Being semantics-focused in the first place is of course a huge step up, but I don't know how people use things like headers and other "structural" things.

Naturally, I'll have a strongly-encouraged method for providing fallback text for images and videos, but I know that's just the start.

@varx I've a lay interest in this.

HTML5 comes pretty close to this, if you look at the structural elements.

Microformats should be on your list.

LaTeX as a semantic markup is useful.

You might want to look at DocBook and both its successes and failures (AFAICT it's almost completely abandoned now, though I may be wrong).

There's an inherent conflict between simplicity and completeness. I suspect the latter is a fatal draw, and that simpler is almost always better.

@varx I'd asked a few weeks back about typography generally and got an excellent book recommendation.

There's a general sense I have of there being a (rough) hierarchy of complexity in typographical elements, much of which is semantic. Starting with characters/glyphs, whitespace (word, paragraph boundaries), punctuation, document structure, emphasis, decorations, lists, tables, nontextual elements (figures, images, multimedia), footnotes/endnotes, equations, references, citations.

@varx The book: Robert Bringhurst, The Elements of Typographic Style (Hartley & Marks, Publishers, 2004)

I'll note and concede that typographic style is *not* semantic structure ...

... but it serves to illuminate semantic structure, and is useful in that regard.

Basic HTML frustrates me to no end for its failure to address equations and inter- and intra-document references _other_ than links -- notes and citations most especially. As well as metadata capture. Microformats do some of this.

@dredmorbius @varx This book looks excellent. I'm planning to buy it. Thanks!

@dredmorbius Someone mentioned DocBook earlier. It has an *intimidatingly* detailed semantic model, which I suppose speaks to your simplicity vs. completeness note.

I'll take another look at LaTeX. I've used it infrequently, and never learned much about the representation model.

@varx semantics without presentation makes me think of DITA and the older DocBook.

@varx Take a look at org mode and pandoc extended markdown.

A better question, I would say, is what is not nicely covered by either of those, and whether they could easily be extended to cover it.

@varx What don't you like about the anchor tag in HTML? I have my own list, but I'm interested to hear your reasons.

@ceratophrys It unnecessarily conflates "hyperlink to a different place" and "destination for a sub-document hyperlink", and that second usage is redundant with just putting IDs on arbitrary elements anyhow. I'm also not happy with the name "a"/anchor, although by itself that wouldn't be reason enough to deviate from HTML.
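
To make the conflation concrete, here is a small Python sketch using the standard library's `html.parser`. The class name `LinkAndTargetCollector` and the sample document are invented for illustration. It collects outgoing links and sub-document destinations separately, and shows that the `a` element turns up in both lists while an `id` on any element already serves as a destination.

```python
from html.parser import HTMLParser


class LinkAndTargetCollector(HTMLParser):
    """Collect hyperlinks and fragment destinations from an HTML string."""

    def __init__(self):
        super().__init__()
        self.links = []    # values of href on <a>
        self.targets = []  # names usable as #fragment destinations

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            # Role 1: hyperlink to a different place.
            self.links.append(attrs["href"])
        if tag == "a" and "name" in attrs:
            # Role 2: old-style destination anchor.
            self.targets.append(attrs["name"])
        if "id" in attrs:
            # Any element with an id is a destination too.
            self.targets.append(attrs["id"])


doc = (
    '<h2 id="intro">Intro</h2>'
    '<a name="old-style"></a>'
    '<p>See <a href="#intro">the intro</a>.</p>'
)
collector = LinkAndTargetCollector()
collector.feed(doc)
print(collector.links)
print(collector.targets)
```

A markup language that keeps only the `id` mechanism for destinations would need just the first branch for its link element, which is the simplification being argued for.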

@varx Personally, I think CommonMark with the GitHub Flavored Markdown extensions is more than enough for a social network.
