June 2026
When creating a graphical application on Linux, one option is to use Xlib (aka libX11) to set up the window. I’ve written code like this in the past, usually with the excellent luigi library as a reference. The resulting code isn’t nice, but at the end of the day I can’t say it is much worse than doing the same with the win32 API.
Last month I got curious what it would take to ditch Xlib/libX11 and implement the client side of the X11 protocol directly in my application. Probably not a good idea, but hey, worst case I’ll learn something.
This post is a collection of my notes from trying to write an X11 client from scratch.
My jumping-off point was this blog post on opening a window in X11 from assembly. I was using C, but the blog post showed enough of the basics of how X11 operates.
From there I needed to go looking for additional X11 documentation. I eventually found this documentation overview page. The overview page contains links to the X Window System Protocol, which describes most of the protocol. Initially I struggled with reading the document, since there is a lot going on. Having the X11-in-assembly blog post to guide me to the first steps was very helpful and eventually I became comfortable reading the format of the documentation and was able to figure out things on my own.
The first big hurdle I ran into was authorization. This is a
song-and-dance where I, the client, have to read a
.Xauthority file and pass some of the contents to the
server to prove that I’m authorized to access the server. The whole
thing seems silly. The client-server connection uses unix domain
sockets, which can provide access control either via file
permissions on the socket or via SO_PEERCRED. I assume that there either
is some legacy reason for still using .Xauthority, or I’m
missing something about why it is used (maybe related to X11 forwarding
in SSH?). At least Xwayland enforces .Xauthority, so I had
to implement this.
The format of .Xauthority is not documented in any of
the actual X11 documentation (at least not that I could find). The only
“official” description I found was in the README for libxau.
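As a rough illustration of that format: each entry is a 2-byte family followed by four counted byte strings (address, display number, authorization name, authorization data), with every 16-bit value stored most significant byte first. The sketch below parses entries out of a buffer that already holds the file contents. The struct and function names are made up for this example, and it assumes the layout described in the libxau README.

```c
#include <stddef.h>
#include <stdint.h>

/* One .Xauthority entry. The pointers refer into the caller's buffer;
 * nothing is copied. All names here are hypothetical. */
typedef struct {
    uint16_t family;
    uint16_t address_len; const uint8_t *address; /* host name or address */
    uint16_t number_len;  const uint8_t *number;  /* display number, ASCII */
    uint16_t name_len;    const uint8_t *name;    /* authorization name */
    uint16_t data_len;    const uint8_t *data;    /* authorization data */
} XauthEntry;

/* Read a big-endian 16-bit length followed by that many bytes.
 * Expects *pos <= size on entry. */
static int read_counted(const uint8_t *buf, size_t size, size_t *pos,
                        uint16_t *len, const uint8_t **bytes)
{
    if (size - *pos < 2) return 0;
    *len = (uint16_t)((buf[*pos] << 8) | buf[*pos + 1]);
    *pos += 2;
    if (size - *pos < *len) return 0;
    *bytes = buf + *pos;
    *pos += *len;
    return 1;
}

/* Parse the entry starting at *pos and advance *pos past it.
 * Returns 1 on success, 0 at end of buffer or on a truncated entry
 * (in which case *pos may point into the middle of that entry). */
static int xauth_next(const uint8_t *buf, size_t size, size_t *pos,
                      XauthEntry *e)
{
    if (*pos > size || size - *pos < 2) return 0;
    e->family = (uint16_t)((buf[*pos] << 8) | buf[*pos + 1]);
    *pos += 2;
    return read_counted(buf, size, pos, &e->address_len, &e->address)
        && read_counted(buf, size, pos, &e->number_len, &e->number)
        && read_counted(buf, size, pos, &e->name_len, &e->name)
        && read_counted(buf, size, pos, &e->data_len, &e->data);
}
```

A client would call `xauth_next` in a loop and pick the entry whose display number matches the display it is connecting to, then pass that entry's name and data to the server.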
As a sidenote, the way libX11 is structured is that libX11 depends on libxcb (which is a lower-level X11 client library), and libxcb in turn depends on libxau.
This is where I first got the feeling of “is this really a protocol?”: Parts of what a client needs to do are only documented in the one library which the main client implementation uses.
Either way, with authorization handled I was now able to open a blank window.
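For reference, the authorization name and data travel in the very first message the client sends. A minimal sketch of building that connection setup request, following the layout in the protocol document (byte-order byte, protocol version 11.0, two lengths, then the name and data each padded to a multiple of 4 bytes); the function names are hypothetical and the sketch always chooses least-significant-byte-first order:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Round up to a multiple of 4; X11 pads variable-length fields this way. */
static size_t pad4(size_t n) { return (n + 3) & ~(size_t)3; }

static void put_u16le(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)(v >> 8);
}

/* Write the connection setup request into out.
 * Returns the number of bytes written, or 0 if cap is too small. */
static size_t build_setup_request(uint8_t *out, size_t cap,
                                  const uint8_t *auth_name, uint16_t name_len,
                                  const uint8_t *auth_data, uint16_t data_len)
{
    size_t total = 12 + pad4(name_len) + pad4(data_len);
    if (total > cap) return 0;
    memset(out, 0, total);            /* zero the unused and padding bytes */
    out[0] = 'l';                     /* 0x6c: least significant byte first */
    put_u16le(out + 2, 11);           /* protocol-major-version */
    put_u16le(out + 4, 0);            /* protocol-minor-version */
    put_u16le(out + 6, name_len);     /* length of authorization name */
    put_u16le(out + 8, data_len);     /* length of authorization data */
    if (name_len) memcpy(out + 12, auth_name, name_len);
    if (data_len) memcpy(out + 12 + pad4(name_len), auth_data, data_len);
    return total;
}
```

The resulting bytes are what gets written to the socket right after `connect`; the server's reply then carries the setup data (including maximum-request-length and the list of visuals).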
I had previously heard that if you don’t use libX11 you effectively also cannot use graphics APIs.
The reason for this is that the graphics API and window API have to
tie together somehow. E.g. in the case of OpenGL via EGL this is done
via eglCreatePlatformWindowSurface (see this
example). But that function takes a libX11 Window * as a
parameter. Furthermore mesa, which contains one (the most commonly
used?) implementation of OpenGL and EGL, is using libxcb internally to
handle the Window * which got passed to
eglCreatePlatformWindowSurface.
So even if I were to implement the client side of the X11 protocol in
my application and create something which looks like a libX11
Window * (which is fine since libX11 Window is
just a uint32_t identifier, and I’ll have the equivalent
identifier in my code), under the hood I’d still be using libX11 (or
rather libxcb) because the graphics driver uses it.
So I guess I could use graphics APIs, but since libxcb would be dynamically loaded at that point I might as well have used it.
Ok, fine. But at least I can do software rendering. Writing software rendering code is good fun too, so it’s not all bad.
X11 includes a PutImage command for sending pixels to the server.
I ran into two hurdles when trying to use PutImage:
First, the data being submitted has to be in a format accepted by the
server. The X11 server lists a number of formats (“visuals”) which it
accepts. For now I’ve ignored this, because using one
uint32_t per pixel with 8+8+8 bit RGB seems to just work,
but I think I’m technically at the mercy of the server here. The server
lists supported visuals, but I’ve not really been able to make sense of
the data yet. xdpyinfo prints out the visuals supported by
the server in a readable format, but on my machine (using Xwayland) I
have >200 visuals, most of which contain duplicates of the same data
(class = TrueColor or DirectColor, depth = 24 or 32) over and over. No
idea what I’m supposed to do with this.
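One piece of the visual data that is at least easy to interpret: each visual entry in the protocol carries a red-mask, green-mask and blue-mask. For a TrueColor visual, a client that does not want to hardcode the 8+8+8 layout could pack pixels from those masks. This is only a sketch of that idea with hypothetical function names, not something the rest of this post relies on.

```c
#include <stdint.h>

/* Place an 8-bit channel value into the bits selected by mask.
 * Assumes the mask is one contiguous run of set bits. */
static uint32_t scale_to_mask(uint8_t v, uint32_t mask)
{
    unsigned shift = 0, width = 0;
    uint32_t scaled;
    if (mask == 0) return 0;
    while (((mask >> shift) & 1) == 0) shift++;
    while (shift + width < 32 && ((mask >> (shift + width)) & 1)) width++;
    /* Widen or narrow the 8-bit value to the channel width. */
    scaled = width >= 8 ? (uint32_t)v << (width - 8)
                        : (uint32_t)v >> (8 - width);
    return (scaled << shift) & mask;
}

/* Combine three 8-bit channels into one pixel value for a visual
 * with the given channel masks. */
static uint32_t pack_pixel(uint8_t r, uint8_t g, uint8_t b,
                           uint32_t red_mask, uint32_t green_mask,
                           uint32_t blue_mask)
{
    return scale_to_mask(r, red_mask)
         | scale_to_mask(g, green_mask)
         | scale_to_mask(b, blue_mask);
}
```

With masks 0xff0000, 0x00ff00 and 0x0000ff this reduces to the plain 8+8+8 packing described above.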
So I’ve ignored the first hurdle. On to the second one:
X11 limits how many bytes can be put in one command. The limit is
specified by the server in maximum-request-length when the
client connects. This field is a uint16_t and counts the number
of DWORDs (4-byte units), i.e. a PutImage command can contain at most
4*0xffff bytes, minus some protocol overhead. That means a
1920x1080 window requires 32 separate PutImage commands to fully update
the window. This is sort-of solved by the big
requests extension, which the server can optionally support. It’s
another song-and-dance to set up, but then my Xwayland version allows up
to ~16 MB of data in one request, enough for a full 1920x1080 window. I
think the proper way of working around this is to use the MIT-SHM
extension which allows putting the backbuffer in shared memory instead
of sending pixel data over a socket. I’ve not tried to use this yet.
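The arithmetic behind the 32 requests can be sketched as follows. The sketch assumes the core PutImage encoding with its 24-byte fixed header, whole rows per request, and rows that are already a multiple of 4 bytes (true for 4 bytes per pixel); the function names are hypothetical.

```c
#include <stdint.h>

/* Fixed part of a core PutImage request, in bytes. With the big
 * requests encoding the header is 4 bytes longer. */
enum { PUT_IMAGE_HEADER_BYTES = 24 };

/* How many whole rows fit in one PutImage request.
 * max_request_units is maximum-request-length, counted in 4-byte units.
 * Returns 0 if not even a single row fits. */
static uint32_t rows_per_put_image(uint32_t max_request_units,
                                   uint32_t width_px,
                                   uint32_t bytes_per_pixel)
{
    uint64_t max_bytes = (uint64_t)max_request_units * 4;
    uint64_t row_bytes = (uint64_t)width_px * bytes_per_pixel;
    if (row_bytes == 0 || max_bytes < PUT_IMAGE_HEADER_BYTES + row_bytes)
        return 0;
    return (uint32_t)((max_bytes - PUT_IMAGE_HEADER_BYTES) / row_bytes);
}

/* Number of PutImage requests needed to send height_px rows. */
static uint32_t put_image_request_count(uint32_t max_request_units,
                                        uint32_t width_px,
                                        uint32_t height_px,
                                        uint32_t bytes_per_pixel)
{
    uint32_t rows = rows_per_put_image(max_request_units, width_px,
                                       bytes_per_pixel);
    if (rows == 0) return 0;
    return (height_px + rows - 1) / rows;   /* round up */
}
```

For a limit of 0xffff units and a 1920-pixel-wide row at 4 bytes per pixel, 34 rows fit per request, so 1080 rows take 32 requests.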
The XPresent extension is probably useful as well (it allows doing vsync?), but this is another extension which I don’t think is documented except via the implementation (libxpresent).
None of this is the end of the world. But I’m starting to have a worrying amount of code in my application. And perhaps more worryingly, a lot of this is “at the mercy of the server” style code: If the server supports big requests / MIT-SHM / XPresent I want to use them, but if not I in theory need to include fallback code (or crash :^)). And maybe worse, if the server doesn’t support the “visual” I want to use I need to include code to convert my backbuffer to a different pixel format.
There is some keyboard input stuff defined in the base specification. But after I had implemented support for that, I realized it didn’t handle anything except the US keyboard layout. I don’t use the US layout and sometimes I switch between different keyboard layouts and I wanted that to work properly.
Enter the X Keyboard Extension (XKB).
I ended up writing ~500 lines of code to deal with XKB (but I still haven’t added code for dead keys…). Overall it is more code than I would have liked (but that is true for all of this project), but the code is fairly straightforward.
The code boils down to:
At first, I was overwhelmed by the amount of information in the keyboard map which the X11 server wanted to give me: Key types, key syms, modifier map, explicit components, key actions, etc. But at some point I realized that I only need key types and key syms to do event translation on the client side. All the other information is only used by the server. It is just exposed by the server because the client can not only get the keyboard map, but also set it in order to change keyboard configuration (though probably the only application doing that should be the keyboard config GUI shipped by the desktop environment?).
The client library documentation for XKB contains an overview of the parts that the client application needs to do.
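To make the roles of key types and key syms concrete, here is a simplified sketch of the lookup. A key type says which modifiers matter for a key and which shift level each modifier combination selects; the key syms table then holds one KEYSYM per group and level. The struct and function names are hypothetical, and the sketch makes simplifying assumptions that are marked in the comments (one key type shared by all groups of a key, no "preserve" handling, out-of-range groups fall back to group 0).

```c
#include <stddef.h>
#include <stdint.h>

/* One map entry of a key type: this exact modifier combination
 * selects this shift level. */
typedef struct { uint8_t mods; uint8_t level; } KeyTypeEntry;

typedef struct {
    uint8_t mods_mask;            /* modifiers this type cares about */
    const KeyTypeEntry *entries;
    size_t entry_count;
} KeyType;

typedef struct {
    const KeyType *type;   /* simplification: same type for every group */
    uint8_t group_count;
    uint8_t width;         /* number of levels stored per group */
    const uint32_t *syms;  /* group_count * width keysyms */
} KeyInfo;

/* Translate the state field of a key event into a keysym for this key.
 * Returns 0 (NoSymbol) if the key has no symbols. */
static uint32_t translate_key(const KeyInfo *key, uint16_t event_state)
{
    unsigned group, level = 0;
    uint8_t mods;
    size_t i;
    if (key->group_count == 0 || key->width == 0) return 0;
    /* With XKB active, bits 13-14 of the state carry the group. */
    group = (event_state >> 13) & 3;
    /* Simplification: real XKB wraps, clamps or redirects per key. */
    if (group >= key->group_count) group = 0;
    /* Ignore modifiers the key type does not care about. */
    mods = (uint8_t)(event_state & 0xff) & key->type->mods_mask;
    for (i = 0; i < key->type->entry_count; i++) {
        if (key->type->entries[i].mods == mods) {
            level = key->type->entries[i].level;
            break;
        }
    }
    if (level >= key->width) level = 0;
    return key->syms[group * key->width + level];
}
```

For a typical two-level key, the type has a mask containing only Shift and a single entry mapping Shift to level 1, so Lock or Control in the event state do not change the result.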
A lookup in the keyboard map converts a key event into a
KEYSYM. This can be a letter encoded in unicode, a
“function key” (there is a
table of all function keys), or a letter encoded in some legacy
encoding. I guess the legacy encodings are to be expected since X11 is
an old protocol. The base specification helpfully provides a
table mapping codepoints in the legacy encodings to the
corresponding unicode codepoint. But it is a bit unfortunate that the
client has to include this translation table (my normal keyboard layout
seems to produce legacy encodings for non-ascii letters, so I really
can’t skip having this table).
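The two cases that need no table can be handled with range checks, per the KEYSYM encoding appendix of the protocol document: Latin-1 keysyms have the same value as their Unicode codepoint, and keysyms in the range 0x01000100 to 0x0110ffff are a Unicode codepoint plus 0x01000000. A minimal sketch, with a hypothetical function name, that leaves function keys and the legacy encodings to the caller:

```c
#include <stdint.h>

/* Convert a keysym to a Unicode codepoint where that needs no table.
 * Returns -1 for everything else (function keys, legacy encodings),
 * which the caller has to handle via the translation table. */
static int32_t keysym_to_unicode(uint32_t keysym)
{
    /* Latin-1: keysym value equals the codepoint. */
    if ((keysym >= 0x20 && keysym <= 0x7e) ||
        (keysym >= 0xa0 && keysym <= 0xff))
        return (int32_t)keysym;
    /* Directly encoded Unicode: codepoint + 0x01000000. */
    if (keysym >= 0x01000100 && keysym <= 0x0110ffff)
        return (int32_t)(keysym - 0x01000000);
    return -1;
}
```

For example, keysym 0x010020ac yields U+20AC, while 0xff0d (Return) and 0x01a1 (a Latin-2 keysym) both yield -1.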
One thing I was wondering is why the server can’t do the key event to
KEYSYM translation. The server has all the information that
is needed and it already resolves/handles a lot of the parts of the
keyboard map, just leaving the key types and key syms tables for the
client to deal with. Why not go all the way? Maybe I’m missing
something, but I suspect it’s a matter of “this is how the spec was
defined 40 years ago”.
I don’t have a good conclusion. I learnt a bunch about how X11 works and I hope I can transfer the knowledge to other future work. I’m unsure whether I’ll continue working on my X11-without-Xlib code or whether I’ll leave it, but I had some fun writing the code either way.
There are a lot of other things I haven’t talked about
(e.g. apparently the format for the DISPLAY environment variable which
Xlib parses isn’t documented, and just defined by whatever
strchr/strrchr-spaghetti they have in the
library?). And also many things which I would have been curious to read
more about.
The title, X11 is not a protocol, comes from the feeling I got while working on this project that it isn’t sensible to write an X11 client without using Xlib or xcb. And even if it were feasible, a lot of things seemingly are only specified through the documentation of the libraries (as opposed to protocol documentation) or even de-facto by what is implemented in the libraries.
I wonder whether a protocol was ever something that was intended to come out of X11, or whether the fact that there is a protocol is just incidental to the fact that it is a server-client architecture?
Next up maybe I’ll try doing the same for Wayland. Surely that will be fun.