The Source Code - Programming and Mathematics


This section gives an overview of the math in this system, and how we built it to be versatile while staying fast and efficient. The method we use may seem a bit weird at first, but it is very fast and works well.

NOTE: This document is still a draft. Still working on correcting typos etc.

Introduction
The basic single-qubit matrices we use are these.

The Pauli matrices:

$$ I = \left[\begin{matrix} 1 & 0 \\ 0 & 1 \end{matrix}\right] $$

$$ X = \left[\begin{matrix} 0 & 1 \\ 1 & 0 \end{matrix}\right] $$

$$ Y = \left[\begin{matrix} 0 & -i \\ i & 0 \end{matrix}\right] $$

$$ Z = \left[\begin{matrix} 1 & 0 \\ 0 & -1 \end{matrix}\right] $$

Hadamard and phase gates:

$$ H = \frac{1}{\sqrt{2}}\left[\begin{matrix} 1 & 1 \\ 1 & -1 \end{matrix}\right] $$

$$ S = \left[\begin{matrix} 1 & 0 \\ 0 & i \end{matrix}\right] $$

$$ T = \left[\begin{matrix} 1 & 0 \\ 0 & e^{\frac{i\pi}{4}} \end{matrix}\right] $$

We also have a programmable phase gate, where you set the angle yourself.

$$ P(\theta) = \left[\begin{matrix} 1 & 0 \\ 0 & e^{i \theta} \end{matrix}\right] $$

All of these gates, except for the Hadamard, are general permutation matrices. A general permutation matrix (in our system) is a matrix of the form G = DP, where P is a permutation matrix, and D is a diagonal matrix with complex entries (which are allowed to be 0).

$$ D = \left[\begin{matrix} z_0 & 0 \\ 0 & z_1 \end{matrix}\right] $$

This prompted us to add a data structure for general permutation matrices. But before getting into that, we will look at vectors.

Vectors
The vectors are qubit states, which means their dimension is always a power of two. Vectors with (at most) one non-zero element occur frequently in this application. For example, any vector that represents a (computational) basis state, regardless of dimension, is a vector of that kind. Also, any such vector that passes through a single- or multi-qubit gate composed only of general permutation matrices remains in that form. In our system, that means every gate where Hadamard gates are not involved. This holds regardless of whether the gates are controlled.

We call such a vector a ComplexVectorPerm, as it can also be used to represent a row in a general permutation matrix. Let 'z' be a complex value and |b> an N-dimensional computational basis state. All ComplexVectorPerm instances are then of the form $$z|b\rangle$$.

The data type used to represent a ComplexVectorPerm looks like this:

ComplexVectorPerm implements ComplexVector {

Variables:

byte type 

int dimension

int position

Complex value

Methods:

add, multiply, etc.

}
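To make the structure above concrete, here is a minimal, runnable Java sketch of the ComplexVectorPerm idea. The Complex record, the field types, and the multiply method are illustrative assumptions for this sketch, not the project's actual API.

```java
// Minimal sketch of the ComplexVectorPerm structure described above.
// The Complex record and method names are assumptions, not the real code.
public class ComplexVectorPermSketch {
    record Complex(double re, double im) {
        Complex times(Complex o) {
            return new Complex(re * o.re - im * o.im, re * o.im + im * o.re);
        }
    }

    static final byte GPWZ = 0, GEN_PERM = 1;

    byte type;       // GPWZ or genPerm
    int dimension;   // always a power of two
    int position;    // index of the single non-zero element (-1 if all-zero)
    Complex value;   // value of that element (null if all-zero)

    ComplexVectorPermSketch(byte type, int dimension, int position, Complex value) {
        this.type = type; this.dimension = dimension;
        this.position = position; this.value = value;
    }

    // Scaling z|b> by a complex scalar stays in the same form: (cz)|b>.
    ComplexVectorPermSketch multiply(Complex c) {
        if (type == GPWZ) return this; // the zero vector is unchanged
        return new ComplexVectorPermSketch(type, dimension, position, value.times(c));
    }

    public static void main(String[] args) {
        // The vector i|00> = (i,0,0,0) from the example below.
        ComplexVectorPermSketch cvp =
            new ComplexVectorPermSketch(GEN_PERM, 4, 0, new Complex(0, 1));
        // Multiplying by i gives -1|00>.
        ComplexVectorPermSketch r = cvp.multiply(new Complex(0, 1));
        System.out.println(r.position + " " + r.value.re() + " " + r.value.im());
    }
}
```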

We have an interface for vectors called ComplexVector, which all vectors must implement.

'type' is just a flag. There are three types of vectors at this point: GPWZ, genPerm, and sparse. GPWZ means "general permutation with zeroes", and is used both for all-zero vectors and for general permutation matrices that have at least one all-zero row. A ComplexVectorPerm can be of type GPWZ or genPerm.

'position' is the position of the non-zero element, and is an int.

'value' is the value of that element.

'dimension' is the dimension of the vector. An example vector would be:

ComplexVectorPerm cvp

cvp.type = genPerm

cvp.dimension = 4

cvp.position = 0

cvp.value = i

This would be the vector (i,0,0,0), or i|00>.

If the vector only contains zero elements, the type is "GPWZ", the position -1, and the value 'null'. Here is an example:

ComplexVectorPerm cvp

cvp.type = GPWZ

cvp.dimension = 8

cvp.position = -1

cvp.value = null

This would be the vector (0,0,0,0,0,0,0,0). It's not a valid qubit state, of course, but it's valid as a row in some matrices.

ComplexVectorSparse is the other vector type. It essentially just holds a (position, value) map and implements the vector interface. We use it even for fully dense vectors at this point, and it's just a regular hashmap, which brings great shame to me and my family. It will be replaced with a faster collection later (when external dependencies work better), and the dense vector will be re-introduced and used whenever the number of elements exceeds roughly 0.75 * dimension.
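As a rough illustration of the ComplexVectorSparse idea, here is a minimal map-backed vector. Real values are used for brevity, and the class and method names are assumptions for this sketch, not the actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a sparse vector as a (position -> value) map, as described above.
public class SparseVectorSketch {
    final int dimension;
    final Map<Integer, Double> entries = new HashMap<>();

    SparseVectorSketch(int dimension) { this.dimension = dimension; }

    void set(int pos, double v) {
        // Zero entries are simply not stored.
        if (v == 0.0) entries.remove(pos); else entries.put(pos, v);
    }

    double get(int pos) { return entries.getOrDefault(pos, 0.0); }

    // Addition only touches the stored (non-zero) entries of both vectors.
    SparseVectorSketch add(SparseVectorSketch other) {
        SparseVectorSketch r = new SparseVectorSketch(dimension);
        entries.forEach(r::set);
        other.entries.forEach((p, v) -> r.set(p, r.get(p) + v));
        return r;
    }

    public static void main(String[] args) {
        SparseVectorSketch a = new SparseVectorSketch(8);
        SparseVectorSketch b = new SparseVectorSketch(8);
        a.set(0, 1.0);
        b.set(0, -1.0);
        b.set(5, 0.5);
        SparseVectorSketch c = a.add(b);
        // The cancelled entry at position 0 is dropped; only position 5 remains.
        System.out.println(c.entries.size() + " " + c.get(5));
    }
}
```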

Matrices
There are only two matrix types at this time (2014-02-03). The first one is ComplexMatrixPerm.

ComplexMatrixPerm is a general permutation matrix, and is essentially just an array of ComplexVectorPerm.

ComplexMatrixPerm implements ComplexMatrix {

byte type 

int dimension

ComplexVectorPerm[] rows

int[] rowByPos

Methods:

add, multiply(vector), multiply(matrix), ...

}

'type' is the same as with vectors. There are three matrix types as well: GPWZ, genPerm and partition.

'dimension' is the dimension. All gate matrices are square, so only one value is needed.

'rows' is the row vectors, and the size of the array is 'dimension'.

'rowByPos' is an array of ints storing the row index by position, so that you can fetch the row based on the position of its non-zero element, and not just the position by fetching the row. It's not essential, but it allows for O(1) matrix-vector multiplication and O(n) matrix-matrix multiplication between genPerms.

General permutation matrices have (at most) one non-zero element per row and column, which makes them more akin to sparse vectors than matrices; you just have to associate each value with a row as well as a position. They are very light and very fast to work with, and are used a lot of the time. This in itself is not that useful, however; the performance would be pretty much the same with a generic sparse/dense system. The usefulness becomes clearer when working with larger and more complex matrices.
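The O(1) matrix-vector multiplication between a genPerm and a perm vector can be sketched as follows. Storing each row as a (column, value) pair and the matrix's rowByPos lookup are taken from the description above; the class and field names are assumptions for this sketch.

```java
// Sketch of O(1) genPerm x perm-vector multiplication via the rowByPos lookup.
// Real values are used for brevity.
public class GenPermMultiplySketch {
    record Result(int position, double value) {}

    final int dimension;
    final int[] colOf;    // colOf[r]: column of row r's single non-zero element
    final double[] valOf; // valOf[r]: value of that element
    final int[] rowByPos; // rowByPos[c]: the row whose non-zero element sits in column c

    GenPermMultiplySketch(int[] colOf, double[] valOf) {
        dimension = colOf.length;
        this.colOf = colOf;
        this.valOf = valOf;
        rowByPos = new int[dimension];
        for (int r = 0; r < dimension; r++) rowByPos[colOf[r]] = r;
    }

    // Multiply by the perm vector v = value * |position>.
    // Only the row whose non-zero column equals v's position can contribute,
    // and rowByPos finds that row in constant time.
    Result multiply(int position, double value) {
        int r = rowByPos[position];
        return new Result(r, valOf[r] * value);
    }

    public static void main(String[] args) {
        // X = [[0,1],[1,0]]: row 0 has its element in column 1, row 1 in column 0.
        GenPermMultiplySketch X = new GenPermMultiplySketch(new int[]{1, 0}, new double[]{1, 1});
        Result r = X.multiply(0, 1.0); // X|0> = |1>
        System.out.println(r.position() + " " + r.value());
    }
}
```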

Anyway, the second matrix type is ComplexMatrixPartition, which contains an array of genPerms. The genPerms are treated as a sum when performing operations. For instance, if a ComplexMatrixPartition holds the genPerms {A, B}, a matrix-vector multiplication is computed as (A + B)v = Av + Bv = u.

ComplexMatrixPartition implements ComplexMatrix {

ComplexMatrixPerm[] genPerms;

Methods:

add, multiply...

}

The point of the partition is to support matrices like the Hadamard. We have $$H = \frac{X}{\sqrt 2} + \frac{Z}{\sqrt 2}$$, which means it is the sum of two genPerms, and therefore a ComplexMatrixPartition. We can also use partition matrices to represent any dense 2x2 unitary, not just the Hadamard.
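The Hadamard-as-partition idea can be checked with a small numeric sketch: applying X/√2 and Z/√2 separately to |0> and summing the results gives H|0>. The helper names here are assumptions for this sketch; real values suffice because X, Z and H are real.

```java
// Sketch: H applied as a partition, (X/sqrt2 + Z/sqrt2)v = Xv/sqrt2 + Zv/sqrt2.
public class PartitionSketch {
    // Apply a genPerm given as (colOf, valOf) to a dense vector:
    // out[r] = valOf[r] * v[colOf[r]], since row r has one element, in column colOf[r].
    static double[] applyGenPerm(int[] colOf, double[] valOf, double[] v) {
        double[] out = new double[v.length];
        for (int r = 0; r < v.length; r++) out[r] = valOf[r] * v[colOf[r]];
        return out;
    }

    static double[] add(double[] a, double[] b) {
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++) out[i] = a[i] + b[i];
        return out;
    }

    public static void main(String[] args) {
        double s = 1.0 / Math.sqrt(2);
        // X/sqrt2 and Z/sqrt2 as genPerms.
        int[] xCol = {1, 0};  double[] xVal = {s, s};
        int[] zCol = {0, 1};  double[] zVal = {s, -s};
        double[] ket0 = {1, 0};
        // (X/sqrt2 + Z/sqrt2)|0> = H|0> = (|0> + |1>)/sqrt2
        double[] result = add(applyGenPerm(xCol, xVal, ket0), applyGenPerm(zCol, zVal, ket0));
        System.out.println(result[0] + " " + result[1]);
    }
}
```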

This is also useful when representing large tensor products containing one Hadamard with the rest genPerms, such as (H x Y x Y x Y). We often encounter these types of matrices, since they represent non-controlled N-qubit gates (a 4-qubit gate in this case). The tensor product is distributive over addition, so if we omit the square-root factor for brevity, this can be rewritten:

$$H \otimes Y \otimes Y \otimes Y = (X + Z) \otimes Y \otimes Y \otimes Y = (X \otimes Y \otimes Y \otimes Y) + (Z \otimes Y \otimes Y \otimes Y)$$

A tensor product of any number of genPerms is itself a genPerm, so what we have here is nothing other than the sum of two genPerms, and thus a ComplexMatrixPartition.

Dense Matrices
The result above makes the genPerm matrix practical, but its real strength shows when we must compute and apply very dense matrices.

Take the matrix M = (H x H x H x H), a fully dense 16x16 matrix. We could partition this matrix like before, but that would result in 16 genPerms. However, as you will see, we never have to consider terms with more than one "non-genPerm" factor in them. We start by using a property of the tensor product to write M as a matrix product of four terms:

$$H \otimes H \otimes H \otimes H = (H \otimes I \otimes I \otimes I)(I \otimes H \otimes I \otimes I)(I \otimes I \otimes H \otimes I)(I \otimes I \otimes I \otimes H)$$

Each factor is a 2-term partition matrix containing 32 elements (2 * 16), so we have to generate 128 elements in total. The 16x16 fully dense matrix requires 256. In fact, as the matrix dimension grows, the number of elements we have to calculate when partitioning shrinks exponentially relative to the dense version: the number of factors is linear in the number of gates (N), and each factor has only 2 elements per row, so we compute 2N elements per row in total, while the dense version has $$2^N$$ elements per row. Furthermore, we never have to multiply these factors together to create the dense matrix, because gates always act on qubit states, which means we can simply multiply the state vector by one factor at a time in succession. For example, we have:

$$M|\psi\rangle = |\phi\rangle$$

Without loss of generality, assume the state vector is fully dense. M has 16 elements per row, so we must do 16 multiplications and 15 additions per matrix row to get the product. This means 16 * 16 = 256 multiplications in all (plus the additions).

Now let $$M_k$$ be the matrix with H in place k, and I otherwise:

$$M|\psi\rangle = (H \otimes H \otimes H \otimes H)|\psi\rangle = M_0M_1M_2M_3|\psi\rangle = M_0M_1M_2|\psi_1\rangle = $$ $$M_0M_1|\psi_2\rangle = M_0|\psi_3\rangle = |\phi\rangle $$

The factors have only 2 elements per row, so each matrix-vector multiplication requires 2 * 16 = 32 element multiplications and 16 additions. There are four factors in total, so 128 multiplications in all. We have an exponential gain here as well.
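The successive application $$M_0M_1M_2M_3|\psi\rangle$$ can be sketched directly: applying the factor with H in place k only touches pairs of amplitudes that differ in qubit k's bit of the row index. The method and class names are assumptions for this sketch; real values suffice since H is real.

```java
// Sketch: apply M = H (x) H (x) H (x) H to a state by applying one factor
// M_k = I (x) ... (x) H (x) ... (x) I at a time. Each factor has 2 elements
// per row, so each application is cheap.
public class FactoredHadamardSketch {
    static final double S = 1.0 / Math.sqrt(2);

    // Apply H on qubit k (qubit 0 being the leftmost / most significant bit,
    // as elsewhere in this document) to a 2^n-dimensional state, in place.
    static void applyH(double[] state, int k, int n) {
        int bit = 1 << (n - 1 - k); // bit position of qubit k in the row index
        for (int i = 0; i < state.length; i++) {
            if ((i & bit) == 0) {
                double a = state[i], b = state[i | bit];
                state[i]       = S * (a + b);
                state[i | bit] = S * (a - b);
            }
        }
    }

    public static void main(String[] args) {
        int n = 4;
        double[] state = new double[1 << n];
        state[0] = 1.0; // |0000>
        for (int k = 0; k < n; k++) applyH(state, k, n); // M_0 M_1 M_2 M_3 |psi>
        // (H x H x H x H)|0000> is the uniform superposition: all 16 amplitudes 1/4.
        System.out.println(state[0] + " " + state[15]);
    }
}
```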

This adds to the speed, but it also reduces memory usage. We only compute one of these factors at a time, so we only have to allocate memory for one. In terms of the matrix dimension d, this reduces the memory needed from d*d to 2*d. This works for controlled gates as well, but to verify that we must look at the gate generation algorithm.

Gate generation
At the core of the gate matrix generation system is an algorithm DennisMcKinnon wrote to generate matrices for any possible combination of gates and controls. The input to the generation function 'G' is an array of gate indices (Java bytes, treated as unsigned), where each index represents either a control or a gate. The function returns the gate matrix:

M = G(arg0,arg1,...)

If 'C' is used for controls, a cNOT gate would be specified like this:

M = G(C,X)

The Toffoli gate and its permutations would be specified like this:

G(C,C,X), G(C,X,C), G(X,C,C).

This algorithm lets the user create multigates using any combination of single qubit gates.

When it comes to combining gates: those that have no controls in them are considered un-linked, and each single-qubit gate is applied locally. Let's say two qubits enter two Z gates. If they do not share their states with any other qubits, this is done simply by multiplying the state of each qubit by Z. If they have a shared state, the gate applied to qubit 0 would be Z ⊗ I, and the other I ⊗ Z. It works this way because tensoring all gates together would require us to tensor all qubit states together, and that should not be done unless absolutely necessary. If all qubits are already in a combined state, however, the matrices are all tensored into one.

There are cases when the qubits passing through a multi-gate are linked with other qubits that aren't affected by the gate. In cases like that, the gate is padded with identity matrices. Let's say qubits 1, 2 and 3 share a state, and they pass through a cNOT that acts only on qubits 1 and 2. Then the gate G(C,X,I) would be applied to the entire combined state, instead of G(C,X).

Finally, two qubits sharing a state (qubits 1 and 2) might enter a multi-gate in such a way that qubit 2 has position 1 in the gate, and vice versa:

Qubit 1 |X>

Qubit 2 |C>

There are ways to prevent ordering issues like that. One way would be to apply a swap gate to the qubits first, then run the cNOT, then do another swap. That is what we did at first. The method we use now applies an algorithm that switches and reorganizes the gate indices instead, before running the actual gate creation algorithm. This is more efficient, as it runs in O(N), where N is the number of gates (or qubits), while computing and applying a swap matrix takes O(2^N).

As an example of this, let's say three qubits are passed through a Toffoli gate, G(C,C,X). Qubits 1 and 3 have a shared state, and qubit 2 is free. In this case, the combined state is formed by tensoring the combined state of 1 and 3 with the state of 2, giving the internal qubit ordering (1, 3, 2). The gate is then switched to G(C,X,C). This is faster than splitting the state or performing swaps, and the effect is the same. After the state has been combined, the indices are changed to (1, 2, 3), and all qubits affected by this change are updated.

If qubit 2 above were in a combined state as well (say in position 1 of that state), the other qubit in that state would be unaffected by the gate, but we would have had to include it in the calculation. We would have created a 4-qubit combined state, and instead of the ordering (1, 3, 2) we would have had (1, 3, 2-1, 2-2), and the gate would have been G(C,X,C,I).

Controlled gate generation
The way we generate control matrices as fast as possible is to always sort the controls first, then the unitaries. For example, let's say the gate sorting step gave us the gate G(C,X,X,Z,C,Y). The first thing we do is move the gates so that all controls end up first. In this case that means swapping gates 1 and 4 (counting from 0). The result would be G(C,C,X,Z,X,Y).

During the first step, all the new mappings are stored, so that we know which gates were swapped. This 'pattern' is then used to do an optimized swap on the vector (more on this later), then apply the gate, then swap back. By optimized we mean the algorithm runs in O(ab) time, where a is the number of controls and b the number of non-zero elements in the vector, capped at O(n log n). Also, the swap algorithm is essentially just a series of bit operations.

NOTE: The optimized swap approach to controlled gate generation is new (February 2014), so I will probably be able to remove the old swap-avoiding gate-sorting code, as a swap (or rather, two swaps) will have to be done anyway. The reason we avoided swaps altogether at first was how controlled gates were generated (through a recursive algorithm), but that has since been replaced.

Factoring controlled gates

In terms of the gate generation function G, the factoring rule of non-controlled gates could be written:

$$G(U_1,U_2,\ldots,U_N) = G(U_1,I,\ldots,I)G(I,U_2,\ldots,I) \ldots G(I,\ldots,I,U_N)$$

Here, the U's are unitary 2x2 matrices. Technically it would work for larger unitary matrices as well, to some extent, but we'll not get into that now.

To show that this works for controlled gates as well, we will start with controlled gates of this form:

$$G(C_1,C_2,...,C_M,U_1,U_2,...,U_N)$$

The C's are control gates, and the U's are 2x2 unitary matrices. Technically the C gates do not need to be indexed, as they all do the same thing, but it's useful when discussing swaps, as we will do later.

What we want to show is that factoring is possible:

$$G(C_1,C_2,\ldots,C_M,U_1,U_2,\ldots,U_N) = $$

$$G(C_1,C_2,\ldots,C_M,U_1,I,\ldots,I) \ldots G(C_1,C_2,\ldots,C_M,I,\ldots,I,U_N)$$

First of all, the dimension of this matrix is $$2^{M + N}$$, where M is the number of controls and N the number of 2x2 unitary gates. If M = 0, this is just a regular non-controlled gate, for which we know that factoring works. If N = 0, the matrix is simply I x I x ... x I (M factors), and does nothing to the qubit state it's applied to. Such a matrix has no unitaries, and therefore does not even need to be factored. To show the general case (when neither M nor N is 0) we will need the following symbols:

$$I_N =\left[\begin{matrix}1 & 0 \\ 0 & 1 \end{matrix}\right]^{\otimes N}$$

$$0_N =\left[\begin{matrix}0 & 0 \\ 0 & 0 \end{matrix}\right]^{\otimes N} $$

$$U = U_1 \otimes U_2 \otimes \ldots \otimes U_N$$.

Using these symbols, The matrix for G can be expressed as a diagonal block matrix:

$$ \left[\begin{matrix} I_N & 0_N & \cdots & 0_N \\ 0_N & I_N & \cdots & 0_N \\ \vdots & \vdots & \ddots & \vdots \\ 0_N & 0_N & \cdots & U \end{matrix}\right] $$

There are $$2^M$$ blocks in each block row and column. This is the type of matrix we use to represent controlled gates.
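The block form can be checked with a small sketch: embedding U as the bottom-right block of an identity matrix. With a single control and U = X, this reproduces the usual cNOT matrix. The helper name `controlled` is an assumption for this sketch, and real entries are used for brevity.

```java
// Sketch of the block form above: a controlled gate is an identity with the
// unitary U in the bottom-right block.
public class ControlledBlockSketch {
    // Embed matrix u as the bottom-right block of a dim x dim identity.
    static double[][] controlled(double[][] u, int dim) {
        double[][] g = new double[dim][dim];
        for (int i = 0; i < dim; i++) g[i][i] = 1.0;
        int off = dim - u.length;
        for (int i = 0; i < u.length; i++)
            for (int j = 0; j < u.length; j++)
                g[off + i][off + j] = u[i][j]; // overwrites the identity block
        return g;
    }

    public static void main(String[] args) {
        double[][] X = {{0, 1}, {1, 0}};
        // One control, one unitary: dimension 2^(1+1) = 4. This is G(C, X), i.e. cNOT.
        double[][] cnot = controlled(X, 4);
        for (double[] row : cnot) System.out.println(java.util.Arrays.toString(row));
    }
}
```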

Now, for j = 1, ..., N, let $$T_j$$ be the tensor product factor with $$U_j$$ in place j:

$$T_j = I \otimes I \otimes \ldots \otimes U_j \otimes I \otimes \ldots \otimes I$$

We get:

$$ \left[\begin{matrix} I_N & 0_N & \cdots & 0_N \\ 0_N & I_N & \cdots & 0_N \\ \vdots & \vdots & \ddots & \vdots \\ 0_N & 0_N & \cdots & T_1 \end{matrix}\right] \left[\begin{matrix} I_N & 0_N & \cdots & 0_N \\ 0_N & I_N & \cdots & 0_N \\ \vdots & \vdots & \ddots & \vdots \\ 0_N & 0_N & \cdots & T_2 \end{matrix}\right] \ldots \left[\begin{matrix} I_N & 0_N & \cdots & 0_N \\ 0_N & I_N & \cdots & 0_N \\ \vdots & \vdots & \ddots & \vdots \\ 0_N & 0_N & \cdots & T_N \end{matrix}\right] = $$

$$ \left[\begin{matrix} I_N & 0_N & \cdots & 0_N \\ 0_N & I_N & \cdots & 0_N \\ \vdots & \vdots & \ddots & \vdots \\ 0_N & 0_N & \cdots & T_1 T_2 \ldots T_N \end{matrix}\right] = \left[\begin{matrix} I_N & 0_N & \cdots & 0_N \\ 0_N & I_N & \cdots & 0_N \\ \vdots & \vdots & \ddots & \vdots \\ 0_N & 0_N & \cdots & U \end{matrix}\right] $$

We can therefore factor G in this case as well.

The next step is to show that this applies to all controlled gates, regardless of how controls and unitaries are placed. We will use swap gates for that.

If we want to change the gate G(X,C) into G(C,X), for example, we can do that with swap gates. If $$S_{(0,1)}$$ is the matrix that swaps qubits 0 and 1, we have the following relation:

$$G(C,X) = S_{(0,1)} G(X,C) S_{(0,1)}$$

An individual qubit swap is self-inverse, that is, $$S_{(a,b)}S_{(a,b)} = I$$. We also have $$S_{(a,b)} = S_{(b,a)}$$.

We can create any permutation of qubits by applying a series of individual qubit swap gates in succession. The inverse of the operation is to apply the individual swaps in reversed order. If the indexed S's stand for arbitrary single-qubit swaps, we have:

$$S = S_0S_1 \ldots S_n$$

$$S^{-1} = S_nS_{n - 1} \ldots S_0$$

If we refer to gates that have all their controls before the unitaries as being "split", the following statement holds:

For any gate matrix $$H = G(A_1,A_2,\ldots,A_{M + N})$$, where M is the number of controls, N is the number of 2x2 unitaries, and $$A_k$$ represents either a 2x2 unitary or a control, there exists a matrix S that splits H. That is:

$$SHS^{-1} = G(C_0,C_1,...,C_M,U_1,U_2,...,U_N)$$

Furthermore, the matrix S is a swap matrix.

Using this result, the proof is straightforward. Let $$J = SHS^{-1}$$. Also, let $$H_k$$ be the version of H where every unitary except the k'th one is replaced with an I. S splits each $$H_k$$ as well, so we get $$SH_kS^{-1} = J_k$$ for all k.

J is split, so it can be factored. We therefore have $$H = S^{-1}JS = S^{-1}(J_1J_2 \ldots J_N)S$$. Since $$SS^{-1} = S^{-1}S = I$$, we can also write this:

$$H = S^{-1}JS = S^{-1}J_1(SS^{-1})J_2(SS^{-1}) \ldots (SS^{-1})J_NS = $$

$$(S^{-1}J_1S)(S^{-1}J_2S) \ldots (S^{-1}J_NS) = H_1H_2 \ldots H_N$$

Optimized swap
We do not apply swap gate matrices to the qubit states. Our swap instead works like this:

Imagine you have a vector that represents the combined state of three arbitrary qubits:

$$q_0 = \left[\begin{matrix} \alpha_0 \\ \alpha_1 \end{matrix}\right], q_1 = \left[\begin{matrix} \beta_0 \\ \beta_1 \end{matrix}\right], q_2 = \left[\begin{matrix} \gamma_0 \\ \gamma_1 \end{matrix}\right]$$

The combined state would be:

$$|\psi\rangle = \left[\begin{matrix} \alpha_0 \beta_0 \gamma_0 \\ \alpha_0 \beta_0 \gamma_1 \\ \alpha_0 \beta_1 \gamma_0 \\ \alpha_0 \beta_1 \gamma_1 \\ \alpha_1 \beta_0 \gamma_0 \\ \alpha_1 \beta_0 \gamma_1 \\ \alpha_1 \beta_1 \gamma_0 \\ \alpha_1 \beta_1 \gamma_1 \end{matrix}\right]$$

Now let us ignore the alphas, betas and gammas, and just use the indices. Keep in mind, this is still a vector: there is one element per row. The n'th number is just the index to use for the n'th qubit. For example, 0 0 1 means the element is $$\alpha_0 \beta_0 \gamma_1$$.

$$|\psi\rangle = \left[\begin{matrix} 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{matrix}\right]$$

Notice that these numbers form the binary representation of the row number, except that qubit 0 is farthest to the left instead of the right, which is simple to reverse. This extends to any number of qubits, of course.

If we want to swap qubits 0 and 1, we have to ensure that the indices of columns 0 and 1 are switched, whilst the index of column 2 remains the same. This can be done by swapping rows 2<->4 and 3<->5 (counting rows from 0), so it takes two swaps.

In the general case, if we want to swap qubits 'i' and 'j', and the total number of qubits is N, this is the algorithm:

1) Invert i and j using the formula bit_k = N - 1 - k.

2) Find the states where bit_i is 0 and bit_j is 1. There are $$2^{N - 2}$$ such states. This can be done quickly by holding bit_i = 0 and bit_j = 1 constant and creating all possible combinations of the other bits.

3) For each such state, get the decimal value of the bit string. Call it 's'. Then get the decimal value after switching bit i to 1 and bit j to 0. Call it 't'. Then swap row s with row t.

Note that this holds when we're dealing with an entangled state as well, for the same reason a swap gate works regardless of the type of state.
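The steps above can be sketched directly with bit operations. The class and method names are assumptions for this sketch; real amplitudes are used for brevity.

```java
// Sketch of the optimized swap: exchange qubits i and j of an N-qubit state
// vector directly with bit operations, instead of building a swap matrix.
public class QubitSwapSketch {
    static void swapQubits(double[] state, int i, int j, int n) {
        int bi = 1 << (n - 1 - i); // step 1: invert the qubit index to a bit position
        int bj = 1 << (n - 1 - j);
        for (int s = 0; s < state.length; s++) {
            // step 2: rows where bit i is 0 and bit j is 1
            if ((s & bi) == 0 && (s & bj) != 0) {
                int t = (s | bi) & ~bj; // step 3: the "opposite" row, with bits i and j flipped
                double tmp = state[s];
                state[s] = state[t];
                state[t] = tmp;
            }
        }
    }

    public static void main(String[] args) {
        // The 3-qubit example below: swapping qubits 1 and 2 should
        // exchange rows 1 <-> 2 and 5 <-> 6.
        double[] state = {0, 1, 2, 3, 4, 5, 6, 7};
        swapQubits(state, 1, 2, 3);
        System.out.println(java.util.Arrays.toString(state));
        // -> [0.0, 2.0, 1.0, 3.0, 4.0, 6.0, 5.0, 7.0]
    }
}
```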

Example

Let's use the 3-qubit case and swap qubits 1 and 2 (still counting from 0). We have i = 1, j = 2 and N = 3. This means there are 2^(3-2) = 2 rows we need to find:

001 = 1, 101 = 5

Note that with the bit reversal, bit 2 (which would represent qubit 0 when reading right-to-left) is the variable one. Bits 0 and 1 are fixed (representing qubits 2 and 1, respectively).

The "opposites" would be:

010 = 2, 110 = 6

This means we need to swap row 1 with row 2, and row 5 with row 6.

Let us confirm this by calculating the swap matrix 'M' that swaps qubits 1 and 2. Let S be the typical 4x4 swap matrix. We get:

$$ M = I \otimes S = \left[\begin{matrix} 1 & 0 \\ 0 & 1 \end{matrix}\right] \otimes \left[\begin{matrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{matrix}\right] = \left[\begin{matrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{matrix}\right] $$

As we can see, it prompts us to do the exact same switching of rows.

Example

Let's say we have a 4-qubit state, and we want to swap qubits 1 and 2 (counting from 0). We need to find the states where column 1 = 0 and column 2 = 1. There are $$2^{4-2} = 4$$ such states, which are easy to find by inspection:

0010 = 2, 0011 = 3, 1010 = 10, 1011 = 11

Now we need to find the rows with column 1 = 1 and column 2 = 0.

0100 = 4, 0101 = 5, 1100 = 12, 1101 = 13

If we swap these rows, so that each pair of swapped rows has the same values for bits 0 and 3, we will have swapped qubits 1 and 2. That means swapping row 2 with 4, 3 with 5, 10 with 12, and 11 with 13.

Example

Let's test this in the 2-qubit case, just to show it. There is only one swap possible: qubit 0 with qubit 1.

The four possible states are these:

$$|\psi\rangle = \left[\begin{matrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{matrix}\right]$$

Now, since N = 2 and 2^(N - 2) = 2^(2 - 2) = 1, we have one state with column 0 = 0 and column 1 = 1. That happens to be 01 = 1.

We now find the index for 10, which is 2. So we swap rows 1 and 2 (again - counting from 0), giving us the proper swap.

Let's look at the swap matrix:

$$ \left[\begin{matrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{matrix}\right] $$

This matrix of course tells us to swap rows 1 and 2 (still counting from 0).

The road ahead
At this point we are working on support for user-defined NxN gates. These are often hard to partition and fit into the system. We may deal with this by re-introducing fully dense and sparse matrix formats. Another option is to use other methods of partitioning/factoring; currently I'm looking at the cosine-sine decomposition, Householder reflections, and some other things.

This is not a big deal though. The worst case scenario is that we end up with "normal (sparse) matrix speed" for weird user-defined matrices, but can still perform massive hadamard transforms, QFTs etc. at a very high speed. The benefit will still remain, and work in most cases. I highly doubt that expert scientists will use minecraft to do serious research, so large user-defined matrices are probably not going to be that common....