Implementation Details

1. XRender Overview:

XRender itself is not very well documented, so I'll give here a short overview about its concepts.
A more detailed documentation is the protocol specification, which is, however, not always accurat.
The most recent version I was able to find is located here.

Introduced 2001 with XFree86-4.0.1 as X11 Extension
Designed to fix limitations of X11 “core drawing”
Very flexible API, ~99% of Java2D's functionality without ugly workarrounds
(Almost) not documented
Hard to guess which features are accelerated, many corner-cases
Only rarely used in a direct way, only through Cairo or QT4
Chicken/Egg problem (no one uses it because its slow, no one tuned it because its not used)
Sometimes hard to accelerate with fixed-function GPUs

1.1 XRender compared to the X11 drawing model:
Instead of XLib's GCs, XRender works with „Pictures".
A Picture is quite equal to a GC or a Graphics2D object in Java, it stores the surface state - and one surface can have mutiple pictures assoziated with it, each holding its own state information.
However some states account only when used as source, some only when used as destination and some for both.
Support for A8, RGB24 and ARGB32 surfaces is guaranteed by the Render specification and furthermore XRender now allows composition of pictures with different depths.

1.2 Composition:
The central functionality of XRender is Composition with mask or without, which maps quite well with Java2Ds concept of paints.
Geometry is drawn into a mask which then is used for the composite operation.
The only two geometry types XRender provides are Trapezoids and Rectangles, however trapezoids are not accalerated by any driver for now.
Its limited geometry support is a real weakness of XRender.
Also text-rendering is interpreted as some kind of composition.
The glyphs are first blited to a temporary mask, following a composition with that mask:

Transformation can be applied to source and mask-pictures.

2. Pipeline design overview:

2.1 The "old" XRender pipeline:
The "old" XRender Java2D pipeline was a "traditional" pipeline which followed the design of the X11 pipeline.
Communication with the native code is done over JNI, and a lot of logic was scattered between Java and C code, sometimes in a redundant way.
This approach was hard to maintain, furthermore JNI-overhead showed up for small primitives.

Design of the "old" XRender pipeline

2.2 The rewritten Pipeline:
The goal of the rewrite was to reduce the JNI overhead as far as possible and to ease maintenance as well as further improvements.
This is archieved by generating the X11/XRender protocol directly with Java-Code, and using XCB's socket handoff functionality.

Java code
Native C code
XServer process

XCB is a low-level replacement for libX11, which also offers a libX11 compatibility library.
XCB's socket handoff functionality allows to send self-generated X11 protocol to the XServer.
It allows sharing the socket with native code, which is important considering the whole AWT code is still based on the native libX11/libxcb.
Unfourtunatly XCB's socket handoff functionality isn't stable and causes frequent deadlocks inside XCB code.

This led to the development of two different backends:

Pure Java backend
The pure Java backend generates X11 protocoll directly in Java-code and flushes the data to the native socket when either the buffer is full or a native library requests access to XCB.
All code executed is Java and can be highly optimized and inlined by the Java Runtime.
Native backend:
- Compatibility solution until XCB bugs are fixed
- Works on systems without (or old) XCB based libX11
- Lower performance, one JNI call per X-request (even more JNI overhead than "old" pipeline)

3. Java2D -> XRender mapping:

3.1 Aliased rendering:

Java2D transforms complex shapes into many zero-width lines and rectangles, therefor processing those lines and rectangles fast is critical for archieving high performance.
An early prototype composited rectangle-by-rectangle without using a mask but it turned out that composition for small areas has a too high setup costs.
The current design buffers all rects and lines, then draws them in one go to the mask-picture (instead of drawing scanline-by-scanline like the old pipeline did) and uses that mask for composition.
There are several optimizations implemented, like drawing rectangles directly to destination when a solid color is used as source.

Filled shape, rendered with 1px high rectangles (scanlines)
Different colors used for better illustration

XRender does not support zero-width lines, the only possibility to draw that often used primitive is to emulate it with Trapezoids, with a lot of complex geometric calculations going on to guarantee the line is really a zero-with-line (Hobby Pen Polygon).
The XRender pipeline works arround that issue by drawing diagonal lines to a Mask using a X11-GC.

EXA itself does not accalerate diagonal lines, so rendering the lines to the same mask as the rectangles does cause excessive migration.
The solution choosen was to draw diagonal lines to a seperate mask which is never marked to have a modified VRAM "copy", therefor only sysram->vram migration happens which is relativly fast.

UXA faces the same problem, however it doesn't provide a workarround, therefor all lines are converted to rectangles using a Bresenham's line rendering algorythm.

3.2 Antialiased Rendering:

Antialiasing is currently done in the same way as by the D3D/OGL pipeline:
Mask tiles are generated in software, uploaded to the GPU and composition is accalerated finally.
Although composition is now accalerated, direct image access using shm pixmaps is not possible when running on EXA, which means the mask-tiles have to be transported over the X11 network protocol which currently only results in performance compareble to the X11 pipeline running on XAA. However when running on EXA or over network performance is a lot better compared to the X11 pipeline.
For the pipeline there is almost no difference between MaskFills and MaskBlits, solid MaskFills simply use a 1x1 repeating picture to store the color-value.

Further optimizations could include tile-buffering for large AA'ed shapes or buffering the AA tiles in the MaskBuffer, avoiding many small composition operations.

Antiliased shape with Gradient source.
Composition with <= 32x32 tiles, covereage calculated by CPU.
No RAM->VRAM copy for fully covered tiles (blue).

3.3 Text Rendering:

XRender has a very flexible text API.
Glyphs are uploaded to the XServer, and later referenced by a unique ID.
Basic Layout-Information is stored per-glyph (XGlyphInfo), however its possible to influence positioning in a relative manner using the XGlyphElt structures at rendering time.
The correction done by XGlyphElt does not overwrite the positioning information stored in XGlyphInfo, but is added to.
All text antialiasing modes are supported and accalerated.

Different antialising modes supported.

3.4 Paint support:

Java2D supports the following paint-modes:

Solid Colors
Texture
Gradients
XOR

All paint-modes except XOR can be combined with Alpha blending.

Solid:

For a solid colors a 1x1 pixmap is first filled with the color and later used as source-picture in the composition step.
If no blending is required, the backend reduces this operation to a direct XRenderFillRectangle call.

Texture:

For TexturePaints a texture-picture is used as source:

Gradient:

Linear- and radial gradients are supported, with the exception of radial gradients where focus-point != center point.

Currently no XRender capable driver is able to accalerate gradients, therefor gradients cause software-fallbacks on the server-side.
In EXA the gradient "surface" is pinned to sysram, causing all other surfaces which are involved in the composition migrated back to sysram which is an relativly expensive operation.
The XRender pipeline is able to pre-generate a gradient to a surface never marked to have a modified VRAM "copy", which improves gradient performance on current EXA/driver combinations a lot.

However when gradients are accalerated, this would be unescessary overhead, so this workarround can be disabled:

-Dsun.java2d.gradcache=false

XOR:

Bitwise XOR (as expected by Java2D) is only supported by X11 core drawing, XRender only supports XOR specified by Porter-Duff.
XOR was introduced with Java-1.0, and is often used by legacy software to paint selected areas to save repaints.

The fact that the highest-rated bug for the new Direct3D pipeline was about XOR performance problems, seems to indicate still a lot of software is in use depending in this feature.
To avoid slow fallbacks, especially in the remote case, the pipline uses an X11-GC for aliased fills (lines, rectangles).

Texture & Radial Gradient Paint

3.5 Image/Blit support:

The XRender pipeline accalerates all various blit operations supported by Java2D, including:

Linear & Billinear interpolation
Transformation
Optimized image upload for images which can not be cached (so called Software-to-Surface Blits)

Billinear interpolation is a bit tricky, because XRender interpolates with the border-value by default - whereas Java2D specifies the image border as not interpolated. This was solved by setting the repeat-mode to RepeatPad (similar to GL_CLAMP in OpenGL) and by using a transformed mask with a rectangle rendered into it.
By adjusting the transformation of the mask, its often possible to save a lot of fillrate (by e.g. using a scale transformation).
This way additional overhead can be reduced to a minimum.

4x4 image with transformation applied
1. Java2D, 2. XRender

Image with non-interpolated border