X11 is not a protocol (?)

Morten Hauke Solvang

June 2026

When creating a graphical application on Linux, one option is to use Xlib aka libX11 to set up the window. I’ve written code like this in the past, usually with the excellent luigi library as a reference. The resulting code isn’t nice, but at the end of the day I don’t know if I can say it is much worse than doing the same with the win32 API.

Last month I got curious what it would take to ditch Xlib/libX11 and implement the client side of the X11 protocol directly in my application. Probably not a good idea, but hey, worst case I’ll learn something.

This post is a collection of notes from me trying to write an X11 client from scratch.

Getting started

My jumping-off point was this blog post on opening a window in X11 from assembly. I was using C, but the blog post showed enough of the basics of how X11 operates.

From there I needed to go looking for additional X11 documentation. I eventually found this documentation overview page. The overview page contains links to the X Window System Protocol, which describes most of the protocol. Initially I struggled with reading the document, since there is a lot going on. Having the X11-in-assembly blog post to guide me to the first steps was very helpful and eventually I became comfortable reading the format of the documentation and was able to figure out things on my own.

Authorization

The first big hurdle I ran into was authorization. This is a song-and-dance where I, the client, have to read a .Xauthority file and pass some of the contents to the server to prove that I’m authorized to access the server. The whole thing seems silly. The client-server connection (at least the local one) uses Unix domain sockets, which can provide access control either via file permissions on the socket or via SO_PEERCRED. I assume that there either is some legacy reason for still using .Xauthority, or I’m missing something about why it is used (maybe related to X11 forwarding in SSH?). At least Xwayland enforces .Xauthority, so I had to implement this.
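To make "pass some of the contents to the server" concrete, here is a minimal sketch of the connection setup request that carries the authorization name and data. It assumes the layout given in the X Window System Protocol document and a little-endian client; `build_setup_request` and `pad4` are names invented for this sketch, not part of any library.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Round n up to a multiple of 4; variable-length fields are padded this way. */
static size_t pad4(size_t n) { return (n + 3) & ~(size_t)3; }

/* Hypothetical helper: writes a connection setup request into out.
 * Returns the request size in bytes, or 0 if out is too small. */
size_t build_setup_request(uint8_t *out, size_t cap,
                           const char *auth_name, uint16_t name_len,
                           const uint8_t *auth_data, uint16_t data_len)
{
    size_t total = 12 + pad4(name_len) + pad4(data_len);
    if (total > cap) return 0;
    memset(out, 0, total);                 /* also zeroes unused and pad bytes */

    out[0] = 'l';                          /* byte order: 'l' = LSB first */
    out[2] = 11;  out[3] = 0;              /* protocol-major-version = 11 */
    out[4] = 0;   out[5] = 0;              /* protocol-minor-version = 0 */
    out[6] = (uint8_t)(name_len & 0xff);   /* length of authorization name */
    out[7] = (uint8_t)(name_len >> 8);
    out[8] = (uint8_t)(data_len & 0xff);   /* length of authorization data */
    out[9] = (uint8_t)(data_len >> 8);

    memcpy(out + 12, auth_name, name_len);
    memcpy(out + 12 + pad4(name_len), auth_data, data_len);
    return total;
}
```

With the usual MIT-MAGIC-COOKIE-1 scheme, the name is that 18-byte string and the data is the 16-byte cookie taken from .Xauthority, which gives a 48-byte request.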

The format of .Xauthority is not documented in any of the actual X11 documentation (at least not that I could find). The only “official” description I found was in the README for libxau.
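For illustration, a minimal sketch of reading one entry, assuming the layout described in the libxau README: a 2-byte family followed by four counted strings (address, display number, authorization name, authorization data), where every length is 2 bytes, most significant byte first. The struct and function names are made up for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* A counted byte string that points into the caller's buffer (no copy). */
struct xauth_field { const uint8_t *ptr; uint16_t len; };

struct xauth_entry {
    uint16_t family;
    struct xauth_field address, display, name, data;
};

/* Reads a big-endian 16-bit value at *pos; returns 0 on truncated input. */
static int read_u16be(const uint8_t *buf, size_t size, size_t *pos, uint16_t *out)
{
    if (size - *pos < 2) return 0;
    *out = (uint16_t)((buf[*pos] << 8) | buf[*pos + 1]);
    *pos += 2;
    return 1;
}

/* Reads a 2-byte length followed by that many bytes. */
static int read_field(const uint8_t *buf, size_t size, size_t *pos,
                      struct xauth_field *f)
{
    if (!read_u16be(buf, size, pos, &f->len)) return 0;
    if (size - *pos < f->len) return 0;
    f->ptr = buf + *pos;
    *pos += f->len;
    return 1;
}

/* Parses one entry at the start of buf.
 * Returns the number of bytes consumed, or 0 if the entry is truncated. */
size_t xauth_parse_entry(const uint8_t *buf, size_t size, struct xauth_entry *e)
{
    size_t pos = 0;
    if (!read_u16be(buf, size, &pos, &e->family)) return 0;
    if (!read_field(buf, size, &pos, &e->address)) return 0;
    if (!read_field(buf, size, &pos, &e->display)) return 0;
    if (!read_field(buf, size, &pos, &e->name)) return 0;
    if (!read_field(buf, size, &pos, &e->data)) return 0;
    return pos;
}
```

A client would call this in a loop over the file contents and pick the entry whose address and display number match the display it is connecting to.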

As a sidenote, the way libX11 is structured is that libX11 depends on libxcb (which is a lower-level X11 client library), and libxcb in turn depends on libxau.

This is where I first got the feeling of “is this really a protocol?”: Parts of what a client needs to do are only documented in the one library which the main client implementation uses.

Either way, with authorization handled I was now able to open a blank window.

Hardware-accelerated graphics?

I had previously heard that if you don’t use libX11 you effectively also cannot use graphics APIs.

The reason for this is that the graphics API and window API have to tie together somehow. E.g. in the case of OpenGL via EGL this is done via eglCreatePlatformWindowSurface (see this example). But that function takes a pointer to the native window, which on the X11 platform means a libX11 Window *. Furthermore, Mesa, which contains one (the most commonly used?) implementation of OpenGL and EGL, uses libxcb internally to handle the Window * which got passed to eglCreatePlatformWindowSurface.

So even if I were to implement the client side of the X11 protocol in my application and create something which looks like a libX11 Window * (which is fine since a Window is just an integer identifier, 32 bits on the wire even though Xlib declares it as an unsigned long, and I’ll have the equivalent identifier in my code), under the hood I’d still be using libX11 (or rather libxcb) because the graphics driver uses it.

So I guess I could use graphics APIs, but since libxcb would be dynamically loaded at that point I might as well have used it.

Putting pixels on the screen anyways

Ok, fine. But at least I can do software rendering. Writing software rendering code is good fun too, so it’s not all bad.

X11 includes a PutImage command for sending pixels to the server.

I ran into two hurdles when trying to use PutImage:

First, the data being submitted has to be in a format accepted by the server. The X11 server lists a number of formats (“visuals”) which it accepts. For now I’ve ignored this, because using one uint32_t per pixel with 8+8+8 bit RGB seems to just work, but I think I’m technically at the mercy of the server here. The server lists supported visuals, but I’ve not really been able to make sense of the data yet. xdpyinfo prints out the visuals supported by the server in a readable format, but on my machine (using Xwayland) I have >200 visuals, most of which contain duplicates of the same data (class = TrueColor or DirectColor, depth = 24 or 32) over and over. No idea what I’m supposed to do with this.
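One piece of the visual data that is directly useful is the red-mask, green-mask and blue-mask fields, which say where each channel sits inside a pixel value. A minimal sketch of packing a pixel from those masks, assuming a TrueColor visual in which every mask is a contiguous run of 8 bits (`pack_pixel` and `mask_shift` are names made up here):

```c
#include <stdint.h>

/* Position of the lowest set bit in mask, or 0 for an empty mask. */
static unsigned mask_shift(uint32_t mask)
{
    unsigned shift = 0;
    if (mask == 0) return 0;
    while ((mask & 1u) == 0) {
        mask >>= 1;
        shift++;
    }
    return shift;
}

/* Packs 8-bit channels into a pixel value for a visual with the given masks.
 * Assumption: each mask covers exactly 8 contiguous bits. */
uint32_t pack_pixel(uint32_t red_mask, uint32_t green_mask, uint32_t blue_mask,
                    uint8_t r, uint8_t g, uint8_t b)
{
    return (((uint32_t)r << mask_shift(red_mask)) & red_mask)
         | (((uint32_t)g << mask_shift(green_mask)) & green_mask)
         | (((uint32_t)b << mask_shift(blue_mask)) & blue_mask);
}
```

With masks 0xff0000 / 0x00ff00 / 0x0000ff this produces the plain 0xRRGGBB layout that "just works"; a server reporting the masks in the opposite order would get BGR without any other code changing.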

So I’ve ignored the first hurdle. On to the second one:

X11 limits how many bytes can be put in one command. The limit is specified by the server in maximum-request-length when the client connects. This field is a uint16_t and counts 4-byte units, i.e. a PutImage command can contain at most 4*0xffff bytes (just under 256 KiB), minus some protocol overhead. That means a 1920x1080 window, which is about 8 MB at 4 bytes per pixel, requires 32 separate PutImage commands to fully update the window. This is sort-of solved by the big requests extension, which the server can optionally support. It’s another song-and-dance to set up, but then my Xwayland version allows up to ~16 MB of data in one request, enough for a full 1920x1080 window. I think the proper way of working around this is to use the MIT-SHM extension which allows putting the backbuffer in shared memory instead of sending pixel data over a socket. I’ve not tried to use this yet.
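The arithmetic behind the 32 requests can be sketched as follows, assuming whole rows are sent per request, 4 bytes per pixel (so rows need no extra scanline padding), and the 24-byte PutImage header from the core protocol encoding. The function names are made up for this sketch.

```c
#include <stdint.h>

/* Size of the fixed part of a core PutImage request. With the big requests
 * extension the length field is extended and the header is 4 bytes longer. */
enum { PUT_IMAGE_HEADER_BYTES = 24 };

/* How many full rows fit in one request; max_request_words is the server's
 * maximum-request-length, counted in 4-byte units. Returns 0 if not even a
 * single row fits. */
uint32_t put_image_rows_per_request(uint32_t width, uint32_t bytes_per_pixel,
                                    uint32_t max_request_words)
{
    uint64_t max_bytes = (uint64_t)max_request_words * 4;
    uint64_t row_bytes = (uint64_t)width * bytes_per_pixel;
    if (row_bytes == 0 || max_bytes < PUT_IMAGE_HEADER_BYTES + row_bytes)
        return 0;
    return (uint32_t)((max_bytes - PUT_IMAGE_HEADER_BYTES) / row_bytes);
}

/* How many PutImage requests a full update of the window takes. */
uint32_t put_image_request_count(uint32_t width, uint32_t height,
                                 uint32_t bytes_per_pixel,
                                 uint32_t max_request_words)
{
    uint32_t rows = put_image_rows_per_request(width, bytes_per_pixel,
                                               max_request_words);
    if (rows == 0) return 0;
    return (height + rows - 1) / rows;   /* round up */
}
```

For a 1920-pixel-wide window and a limit of 0xffff units, 34 rows fit per request, so 1080 rows take 32 requests.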

The XPresent extension is probably useful as well (it allows doing vsync?), but this is another extension which I don’t think is documented except via the implementation (libxpresent).

None of this is the end of the world. But I’m starting to have a worrying amount of code in my application. And perhaps more worryingly, a lot of this is “at the mercy of the server” style code: If the server supports big requests / MIT-SHM / XPresent I want to use them, but if not, in theory I need to include fallback code (or crash :^)). And maybe worse, if the server doesn’t support the “visual” I want to use I need to include code to convert my backbuffer to a different pixel format.

Keyboard input

There is some keyboard input stuff defined in the base specification. But after I had implemented support for that, I realized it didn’t handle anything except the US keyboard layout. I don’t use the US layout and sometimes I switch between different keyboard layouts and I wanted that to work properly.

Enter the X Keyboard Extension (XKB).

I ended up writing ~500 lines of code to deal with XKB (but I still haven’t added code for dead keys…). Overall it is more code than I would have liked (but that is true for all of this project), but the code is fairly straightforward.

The code boils down to:

  1. Get the keyboard map from the server (and re-get it if it changes, e.g. because the user changes keyboard layout).
  2. Whenever a key is pressed, use the keyboard map to look up what symbol it produces.
  3. Translate the symbol into either a Unicode codepoint or some “function key” (arrow keys, enter, backspace, etc.).
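Step 2 can be sketched roughly like this. The structs are a simplified, hypothetical in-memory form of the key types and key syms tables, not the wire format: a key type maps the active modifiers to a shift level, and the key's symbols are stored as one row of `width` keysyms per group.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry of a key type: "if exactly these modifiers are active, use this level". */
struct key_type_entry { uint8_t mods; uint8_t level; };

struct key_type {
    uint8_t mods_mask;                     /* modifiers this type looks at */
    size_t n_entries;
    const struct key_type_entry *entries;
};

struct key_sym_map {
    uint8_t n_groups;                      /* number of layouts on this key */
    uint8_t width;                         /* keysyms stored per group */
    const struct key_type *type[4];        /* key type used by each group */
    const uint32_t *syms;                  /* n_groups * width keysyms */
};

/* Returns the keysym for a key given the current group and modifier state,
 * or 0 (NoSymbol) if the key has no symbols. */
uint32_t lookup_keysym(const struct key_sym_map *key, uint8_t group, uint8_t mods)
{
    if (key->n_groups == 0) return 0;
    /* Simplification: XKB has per-key rules for out-of-range groups. */
    if (group >= key->n_groups) group = 0;

    const struct key_type *type = key->type[group];
    uint8_t active = mods & type->mods_mask;   /* ignore irrelevant modifiers */
    uint8_t level = 0;                         /* no matching entry: first level */
    for (size_t i = 0; i < type->n_entries; i++) {
        if (type->entries[i].mods == active) {
            level = type->entries[i].level;
            break;
        }
    }
    if (level >= key->width) level = 0;
    return key->syms[(size_t)group * key->width + level];
}
```

The group and modifier state come with each key event; the map itself only has to be rebuilt when the server reports that the keyboard map changed (step 1).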

At first, I was overwhelmed by the amount of information in the keyboard map which the X11 server wanted to give me: Key types, key syms, modifier map, explicit components, key actions, etc. But at some point I realized that I only need key types and key syms to do event translation on the client side. All the other information is only used by the server. It is just exposed by the server because the client can not only get the keyboard map, but also set it in order to change keyboard configuration (though probably the only application doing that should be the keyboard config GUI shipped by the desktop environment?).

The client library documentation for XKB contains an overview of the parts that the client application needs to do.

A lookup in the keyboard map converts a key event into a KEYSYM. This can be a letter encoded in Unicode, a “function key” (there is a table of all function keys), or a letter encoded in some legacy encoding. I guess the legacy encodings are to be expected since X11 is an old protocol. The base specification helpfully provides a table mapping codepoints in the legacy encodings to the corresponding Unicode codepoint. But it is a bit unfortunate that the client has to include this translation table (my normal keyboard layout seems to produce legacy encodings for non-ASCII letters, so I really can’t skip having this table).
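A minimal sketch of that translation, assuming the KEYSYM encoding from the protocol document: Latin-1 keysyms share their value with the Unicode codepoint, Unicode keysyms are the codepoint plus 0x01000000, and everything else needs the table. Only four table rows are included here as an excerpt; the function and table names are made up.

```c
#include <stddef.h>
#include <stdint.h>

struct legacy_keysym { uint32_t keysym; uint32_t codepoint; };

/* Tiny excerpt of the legacy table; a real client needs the full table. */
static const struct legacy_keysym legacy_table[] = {
    { 0x01a1, 0x0104 },   /* Aogonek */
    { 0x06c1, 0x0430 },   /* Cyrillic_a */
    { 0x07e1, 0x03b1 },   /* Greek_alpha */
    { 0x20ac, 0x20ac },   /* EuroSign */
};

/* Returns the Unicode codepoint for a keysym, or 0 if the keysym is not a
 * character (e.g. a function key) or is missing from the excerpt above. */
uint32_t keysym_to_unicode(uint32_t keysym)
{
    /* Latin-1: keysym value equals the codepoint. */
    if ((keysym >= 0x20 && keysym <= 0x7e) || (keysym >= 0xa0 && keysym <= 0xff))
        return keysym;

    /* Unicode keysyms: 0x01000000 + codepoint. */
    if (keysym >= 0x01000100 && keysym <= 0x0110ffff)
        return keysym - 0x01000000;

    /* Legacy encodings: table lookup. */
    for (size_t i = 0; i < sizeof legacy_table / sizeof legacy_table[0]; i++) {
        if (legacy_table[i].keysym == keysym)
            return legacy_table[i].codepoint;
    }
    return 0;
}
```

Function keys such as Return (0xff0d) fall through and return 0, so the caller can handle them separately before or after this step.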

One thing I was wondering is why the server can’t do the key event to KEYSYM translation. The server has all the information that is needed and it already resolves/handles a lot of the parts of the keyboard map, just leaving the key types and key syms tables for the client to deal with. Why not go all the way? Maybe I’m missing something, but I suspect it’s a matter of “this is how the spec was defined 40 years ago”.

Conclusion

I don’t have a good conclusion. I learnt a bunch about how X11 works and I hope I can transfer the knowledge to other future work. I’m unsure whether I’ll continue working on my X11-without-Xlib code or whether I’ll leave it, but I had some fun writing the code either way.

There are a lot of other things I haven’t talked about (e.g. apparently the format for the DISPLAY environment variable which Xlib parses isn’t documented, and just defined by whatever strchr/strrchr-spaghetti they have in the library?). And also many things which I would have been curious to read more about.

The title, X11 is not a protocol, comes from the feeling I got while working on this project that it isn’t sensible to write an X11 client without using Xlib or xcb. And even if it were feasible, a lot of things seemingly are only specified through the documentation of the libraries (as opposed to protocol documentation) or even de-facto by what is implemented in the libraries.

I wonder whether a protocol was ever something that was intended to come out of X11, or whether the fact that there is a protocol is just incidental to the fact that it is a server-client architecture?

Next up maybe I’ll try doing the same for Wayland. Surely that will be fun.