### Forward-propagation model of a diffractive camera

For a diffractive camera with *N* diffractive layers, the forward propagation of the optical field can be modeled as a sequence of (1) free-space propagation between the *l*th and (*l* + 1)th layers (\(l=0, 1, 2, \dots , N\)), and (2) modulation of the optical field by the *l*th diffractive layer (\(l=1, 2, \dots , N)\), where the 0th layer denotes the input/object plane and the (*N* + 1)th layer denotes the output/image plane. The free-space propagation of the complex field is modeled following the angular spectrum approach [45]. The optical field \({u}^{l}\left(x, y\right)\) right after the *l*th layer, after propagating over a distance \(d\), can be written as [46]:

$$\begin{array}{c}{\mathbb{P}}_{\mathbf{d}}{ u}^{l}\left(x,y\right)={\mathcal{F}}^{-1}\left\{\mathcal{F}\left\{{u}^{l}\left(x,y\right)\right\}H({f}_{x},{f}_{y};d)\right\}\end{array}$$

(2)

where \({\mathbb{P}}_{\mathbf{d}}\) represents the free-space propagation operator, \(\mathcal{F}\) and \({\mathcal{F}}^{-1}\) are the two-dimensional Fourier transform and the inverse Fourier transform operations, and \(H({f}_{x}, {f}_{y};d)\) is the transfer function of free space:

$$H\left({f}_{x},{f}_{y};d\right)=\left\{\begin{array}{ll}\mathrm{exp}\left\{jkd\sqrt{1-{\left(\frac{2\pi {f}_{x}}{k}\right)}^{2}-{\left(\frac{2\pi {f}_{y}}{k}\right)}^{2}}\right\}, & {f}_{x}^{2}+{f}_{y}^{2}<\frac{1}{{\lambda }^{2}}\\ 0, & {f}_{x}^{2}+{f}_{y}^{2}\ge \frac{1}{{\lambda }^{2}}\end{array}\right.$$

(3)

where \(j=\sqrt{-1}\), \(k= \frac{2\pi }{\lambda }\) and \(\lambda\) is the wavelength of the illumination light. \({f}_{x}\) and \({f}_{y}\) are the spatial frequencies along the \(x\) and \(y\) directions, respectively.
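The propagation of Eqs. (2) and (3) can be sketched in a few lines of NumPy; the following is an illustrative implementation (function and variable names are ours), not the authors' code:

```python
import numpy as np

def angular_spectrum_propagate(u, d, wavelength, dx):
    """Free-space propagation of a complex field u over a distance d (Eqs. 2-3).

    u is sampled on a grid with pixel pitch dx; evanescent components
    (f_x^2 + f_y^2 >= 1/lambda^2) are zeroed, per Eq. (3).
    """
    ny, nx = u.shape
    fx = np.fft.fftfreq(nx, d=dx)               # spatial frequencies f_x
    fy = np.fft.fftfreq(ny, d=dx)               # spatial frequencies f_y
    FX, FY = np.meshgrid(fx, fy)
    k = 2 * np.pi / wavelength
    # (2*pi*f/k)^2 = (lambda*f)^2, so the cutoff is f_x^2 + f_y^2 < 1/lambda^2
    arg = 1 - (wavelength * FX) ** 2 - (wavelength * FY) ** 2
    H = np.where(arg > 0, np.exp(1j * k * d * np.sqrt(np.maximum(arg, 0))), 0)
    return np.fft.ifft2(np.fft.fft2(u) * H)
```

As a sanity check, a uniform plane wave only acquires the on-axis phase factor \(e^{jkd}\) under this operator, leaving its amplitude unchanged.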

We consider only the phase modulation of the transmitted field at each layer, where the transmittance coefficient \({t}^{l}\) of the *l*th diffractive layer can be written as:

$$\begin{array}{c}{t}^{l}\left(x,y\right)=\mathrm{exp}\left\{j{\phi }^{l}\left(x,y\right)\right\}\end{array}$$

(4)

where \({\phi }^{l}\left(x,y\right)\) denotes the phase modulation of the trainable diffractive neuron located at \(\left(x,y\right)\) position of the *l*th diffractive layer. Based on these definitions, the complex optical field at the output plane of a diffractive camera can be expressed as:

$$\begin{array}{c}o\left(x,y\right)={\mathbb{P}}_{{\mathbf{d}}_{N,N+1}}\left(\prod_{l=N}^{1}{t}^{l}\left(x,y\right)\cdot {\mathbb{P}}_{{\mathbf{d}}_{l-1,l}}\right)g\left(x,y\right)\end{array}$$

(5)

where \({d}_{l-1,l}\) represents the axial distance between the (*l* − 1)th and the *l*th layers, and \(g\left(x,y\right)\) is the input optical field, defined here by the amplitude of the input objects (handwritten digits) used in this work.
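Combining Eqs. (2)–(5), the end-to-end forward model simply alternates free-space propagation with phase-only modulation. A minimal, self-contained NumPy sketch under the equal-spacing assumption used in this work (names are ours, not the authors'):

```python
import numpy as np

def propagate(u, d, wavelength, dx):
    # angular spectrum propagation (Eqs. 2-3); evanescent waves are zeroed
    fx = np.fft.fftfreq(u.shape[1], d=dx)
    fy = np.fft.fftfreq(u.shape[0], d=dx)
    FX, FY = np.meshgrid(fx, fy)
    arg = 1 - (wavelength * FX) ** 2 - (wavelength * FY) ** 2
    k = 2 * np.pi / wavelength
    H = np.where(arg > 0, np.exp(1j * k * d * np.sqrt(np.maximum(arg, 0))), 0)
    return np.fft.ifft2(np.fft.fft2(u) * H)

def diffractive_forward(g, phases, d, wavelength, dx):
    """Eq. (5): propagate, then modulate by t^l = exp(j*phi^l), for l = 1..N."""
    u = g.astype(complex)
    for phi in phases:                          # N diffractive layers
        u = propagate(u, d, wavelength, dx)
        u = u * np.exp(1j * phi)                # phase-only transmittance, Eq. (4)
    return propagate(u, d, wavelength, dx)      # final hop to the output plane
```

With all-zero phase maps and a uniform input field, the output amplitude is unchanged, as expected for an undiffracted plane wave.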

### Training loss function

The reported diffractive camera systems were optimized by minimizing the loss functions that were calculated using the intensities of the input and output images. The input and output intensities \(G\) and \(O\), respectively, can be written as:

$$\begin{array}{c}G\left(x,y\right)={\left|g\left(x,y\right)\right|}^{2}\end{array}$$

(6)

$$\begin{array}{c}O\left(x,y\right)={\left|o\left(x,y\right)\right|}^{2}\end{array}$$

(7)

The loss function, calculated using a batch of training input objects \({\varvec{G}}\) and the corresponding output images \({\varvec{O}}\), can be defined as:

$$\begin{array}{c}Loss\left({\varvec{O}},{\varvec{G}}\right)=Los{s}_{+}\left({{\varvec{O}}}^{+},{{\varvec{G}}}^{+}\right)+Los{s}_{-}\left({{\varvec{O}}}^{-},{{\varvec{G}}}^{-},{G}_{k}^{+}\right)\end{array}$$

(8)

where \({{\varvec{O}}}^{+},{{\varvec{G}}}^{+}\) represent the output and input images from the target data class (i.e., desired object class), and \({{\varvec{O}}}^{-},{{\varvec{G}}}^{-}\) represent the output and input images from the other data classes (to be all-optically erased), respectively.

\(Los{s}_{+}\) is designed to reduce the NMSE and enhance the correlation between the output image \({O}^{+}\) and its corresponding target-class input object \({G}^{+}\), so that the diffractive camera learns to faithfully reconstruct the objects from the target data class, i.e.,

$$\begin{array}{c}Los{s}_{+}\left({O}^{+},{G}^{+}\right)={\alpha }_{1}\times NMSE\left({O}^{+}, { G}^{+}\right)+ {\alpha }_{2}\times \left(1-\mathrm{PCC}\left({O}^{+}, {G}^{+}\right)\right)\end{array}$$

(9)

where \({\alpha }_{1}\) and \({\alpha }_{2}\) are constants and NMSE is defined as:

$$\begin{array}{c}NMSE\left({O}^{+},{G}^{+}\right)=\frac{1}{MN}\sum_{m,n}{\left(\frac{{O}_{m,n}^{+}}{\mathrm{max}({O}^{+})}-{G}_{m,n}^{+}\right)}^{2}\end{array}$$

(10)

\(m\) and \(n\) are the pixel indices of the images, and \(MN\) represents the total number of pixels in each image. The output image \({O}^{+}\) was normalized by its maximum pixel value, \(\mathrm{max}({O}^{+})\). The PCC value between any two images \(A\) and \(B\) is calculated using [38]:

$$\begin{array}{c}PCC(A,B)=\frac{\sum \left(A-\overline{A }\right)\left(B-\overline{B }\right)}{\sqrt{\sum {\left(A-\overline{A }\right)}^{2}\sum {\left(B-\overline{B }\right)}^{2}}}\end{array}$$

(11)

The term \(\left(1-\mathrm{PCC}\left({O}^{+}, {G}^{+}\right)\right)\) was used in \(Los{s}_{+}\) in order to maximize the correlation between \({O}^{+}\) and \({G}^{+}\), as well as to ensure a non-negative loss value since the PCC value of any two images is always between − 1 and 1.
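The two fidelity terms follow directly from Eqs. (10) and (11); a minimal NumPy version (function names are ours) is:

```python
import numpy as np

def nmse(o, g):
    """Eq. (10): the output o is peak-normalized before comparison with g."""
    o_norm = o / o.max()
    return np.mean((o_norm - g) ** 2)       # averages over all M*N pixels

def pcc(a, b):
    """Eq. (11): Pearson correlation coefficient between images a and b."""
    a, b = a - a.mean(), b - b.mean()
    return np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))
```

With the coefficients given below, \(Los{s}_{+}\) of Eq. (9) would then be `alpha1 * nmse(o, g) + alpha2 * (1 - pcc(o, g))`, which is non-negative since PCC lies in [−1, 1].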

Different from \(Los{s}_{+}\), the \(Los{s}_{-}\) function is designed to *reduce* (1) the *absolute* correlation between the output \({O}^{-}\) and its corresponding input \({G}^{-}\), (2) the *absolute* correlation between \({O}^{-}\) and an arbitrary object \({G}_{k}^{+}\) from the target class, and (3) the correlation between \({O}^{-}\) and itself shifted by a few pixels, \({O}_{\mathrm{sft}}^{-}\), which can be formulated as:

$$\begin{array}{c}Los{s}_{-}\left({O}^{-},{G}^{-},{G}_{k}^{+}\right)={\beta }_{1}\times \left|\mathrm{PCC}\left({O}^{-}, {G}^{-}\right)\right|+{\beta }_{2}\times \left|\mathrm{PCC}\left({O}^{-}, {G}_{k}^{+}\right)\right|+ {\beta }_{3}\times PCC\left({O}^{-}, {O}_{\mathrm{sft}}^{-}\right)\end{array}$$

(12)

where \({\beta }_{1}\), \({\beta }_{2}\) and \({\beta }_{3}\) are constants. Here \({G}_{k}^{+}\) refers to an image of an object from the target data class in the training set, randomly selected for every training batch; the subscript \(k\) denotes this random index. In other words, within each training batch, \(\mathrm{PCC}\left({O}^{-}, {G}_{k}^{+}\right)\) was calculated using the output image from the non-target data class and a random ground-truth image from the target class. By adding this loss term, we prevent the diffractive camera from converging to a degenerate solution where all the output images resemble the target object. The \({O}_{\mathrm{sft}}^{-}\) was obtained using:

$$\begin{array}{c}{O}_{\mathrm{sft}}^{-} \left(x,y\right)={O}^{-}\left(x-{s}_{x},y-{s}_{y}\right)\end{array}$$

(13)

where \({s}_{x}={s}_{y}=5\) denote the number of pixels that \({O}^{-}\) is shifted in each direction. Intuitively, a natural image maintains a high correlation with a copy of itself shifted by a small amount, while an image of random noise does not. By minimizing \(\mathrm{PCC}\left({O}^{-}, {O}_{\mathrm{sft}}^{-}\right)\), we forced the diffractive camera to generate uninterpretable, noise-like output patterns for input objects that do not belong to the target data class.
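Equations (12) and (13) can be sketched as follows; this is an illustrative NumPy version (names are ours), and the boundary handling of the shift — implemented here as a circular roll — is our assumption, since Eq. (13) leaves it unspecified:

```python
import numpy as np

def pcc(a, b):
    # Pearson correlation coefficient, Eq. (11)
    a, b = a - a.mean(), b - b.mean()
    return np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))

def loss_minus(o_neg, g_neg, g_k_pos, betas=(6, 3, 2), s=5):
    """Eq. (12): suppress resemblance of O^- to its input, to a random
    target-class object, and to a shifted copy of itself (Eq. 13)."""
    b1, b2, b3 = betas
    o_sft = np.roll(o_neg, (s, s), axis=(0, 1))     # O^-_sft, s_x = s_y = 5
    return (b1 * abs(pcc(o_neg, g_neg))
            + b2 * abs(pcc(o_neg, g_k_pos))
            + b3 * pcc(o_neg, o_sft))
```

Note that only the self-similarity term enters without an absolute value, so strongly *anti*-correlated shifted copies are not penalized.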

The coefficients \(\left({\alpha }_{1}, {\alpha }_{2}, {\beta }_{1},{\beta }_{2},{\beta }_{3}\right)\) in the two loss functions were empirically set to (1, 3, 6, 3, 2).

### Digital implementation and training scheme

The diffractive camera models reported in this work were trained with the standard MNIST handwritten digit dataset under \(\lambda =0.75\; \mathrm{mm}\) illumination. Each diffractive layer has a pixel/neuron size of 0.4 mm, which only modulates the phase of the transmitted optical field. The axial distance between the input plane and the first diffractive layer, the distances between any two successive diffractive layers, and the distance between the last diffractive layer and the output plane are set to 20 mm, i.e., \({d}_{l-1,l}=20\;\mathrm{ mm }\,(l=1,2,\dots , N+1)\). For the diffractive camera models that take a single MNIST image as their input (e.g., reported in Figs. 2, 3), each diffractive layer contains 120 \(\times\) 120 diffractive pixels. During the training, each 28 \(\times\) 28 MNIST raw image was first linearly upscaled to 90 \(\times\) 90 pixels. Next, the upscaled training dataset was augmented with random image transformations, including a random rotation by an angle within \([-10^\circ , +10^\circ ]\), a random scaling by a factor within [0.9, 1.1], and a random shift in each lateral direction by an amount within \([-2.13\lambda , +2.13\lambda ]\).
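As a concrete check of the shift range: with \(\lambda = 0.75\) mm and a 0.4 mm neuron pitch, \(\pm 2.13\lambda\) corresponds to roughly \(\pm 4\) diffractive pixels. An illustrative sampling of one augmentation draw (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
wavelength, pitch = 0.75, 0.4                    # mm: illumination, neuron size
angle = rng.uniform(-10, 10)                     # rotation angle, degrees
scale = rng.uniform(0.9, 1.1)                    # physical scaling factor
shift = rng.uniform(-2.13, 2.13) * wavelength    # lateral shift, mm
shift_px = shift / pitch                         # ~ +/-4 diffractive pixels
```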

For the diffractive camera model reported in Fig. 4 that takes multiplexed objects as its input, each diffractive layer contains 300 \(\times\) 300 diffractive pixels. The MNIST training digits were first upscaled to 90 \(\times\) 90 pixels and then randomly transformed with \([-10^\circ , +10^\circ ]\) angular rotation, [0.9, 1.1] scaling, and \([-2.13\lambda , +2.13\lambda ]\) translation. Nine different handwritten digits were randomly selected and arranged into 3 \(\times\) 3 grids, generating a multiplexed input image with 270 \(\times\) 270 pixels for the diffractive camera training.

For the diffractive permutation camera reported in Fig. 5, each diffractive layer contains 120 \(\times\) 120 diffractive pixels. The design parameters of this class-specific permutation camera were kept the same as the five-layer diffractive camera reported in Fig. 3a, except that the handwritten digits were down-sampled to 15 \(\times\) 15 pixels considering that the required computational training resources for the permutation operation increase quadratically with the total number of input image pixels. The MNIST training digits were augmented using the same random transformations as described above. The 2D permutation matrix \({\varvec{P}}\) was generated by randomly shuffling the rows of a 225 \(\times\) 225 identity matrix. The inverse of \({\varvec{P}}\) was obtained by using the transpose operation, i.e., \({{\varvec{P}}}^{-1}={{\varvec{P}}}^{{\varvec{T}}}\). The training loss terms for the class-specific permutation camera remained the same as described in Eqs. (8), (9), and (12), except that the permuted input images (\({\varvec{P}}G\)) were used as the ground truth, i.e.,
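The permutation matrix construction and its transpose-based inverse can be sketched in NumPy (names are ours); for any permutation matrix, \({{\varvec{P}}}^{-1}={{\varvec{P}}}^{T}\) holds by orthogonality:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 225                               # 15 x 15 = 225 input pixels
P = np.eye(n)[rng.permutation(n)]     # randomly shuffle the rows of the identity
g = rng.random(n)                     # a flattened input image
g_perm = P @ g                        # permuted ground truth, P g
g_back = P.T @ g_perm                 # inversion via the transpose: P^-1 = P^T
```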

$$\begin{array}{c}Los{s}_{\mathrm{Permutation}}\left(O,{\varvec{P}}G\right)=Los{s}_{+}\left({O}^{+},{{\varvec{P}}G}^{+}\right)+Los{s}_{-}\left({O}^{-},{{\varvec{P}}G}^{-},{{\varvec{P}}G}_{k}^{+}\right)\end{array}$$

(14)

For the seven-layer diffractive linear transformation camera reported in Fig. 6, each diffractive layer contains 300 \(\times\) 300 diffractive neurons, and the axial distance between any two consecutive planes was set to 45 mm (i.e., \({d}_{l-1,l}=45\) mm, for \(l=1, 2, \dots , N+1\)). The 2D linear transformation matrix \({\varvec{T}}\) was generated by randomly creating an invertible matrix with each row having 20 non-zero random entries, normalized so that each row sums to 1 (to conserve energy); see Fig. 6 for the selected \({\varvec{T}}\). The invertibility of \({\varvec{T}}\) was validated by calculating its determinant. During the training, the loss functions were applied to the diffractive camera output and the ground truth after the inverse linear transformation, i.e., \({{\varvec{T}}}^{-1}O\) and \({{\varvec{T}}}^{-1}({\varvec{T}}G)\). The other details of the training loss terms for the class-specific linear transformation camera remained the same as described in Eqs. (8), (9), and (12).
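A possible construction of such a row-sparse, row-normalized, invertible matrix is sketched below (names are ours; the determinant check mirrors the validation described above, and a singular draw would simply be re-sampled in practice):

```python
import numpy as np

def random_row_sparse_transform(n, nnz=20, seed=0):
    """Random matrix with nnz non-zero entries per row, each row summing
    to 1 (energy conserving); invertibility checked via the determinant."""
    rng = np.random.default_rng(seed)
    T = np.zeros((n, n))
    for i in range(n):
        cols = rng.choice(n, size=nnz, replace=False)   # nnz random columns
        vals = rng.random(nnz)
        T[i, cols] = vals / vals.sum()                  # row-normalize to 1
    sign, _ = np.linalg.slogdet(T)                      # robust determinant sign
    assert sign != 0, "singular draw; re-sample in practice"
    return T
```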

The diffractive camera trained with the Fashion MNIST dataset (reported in Additional file 1: Fig. S2) contains seven diffractive layers, each with 300 \(\times\) 300 pixels/neurons. The axial distance between any two consecutive planes was set to 45 mm (i.e., \({d}_{l-1,l}=45\) mm, for \(l=1,2,\dots , N+1\)). During the training, each Fashion MNIST raw image was linearly upsampled to 90 \(\times\) 90 pixels and then augmented with random transformations of \([-10^\circ , +10^\circ ]\) angular rotation, [0.9, 1.1] physical scaling, and \([-2.13\lambda , +2.13\lambda ]\) lateral translation. The loss functions used for training remained the same as described in Eqs. (8), (9), and (12).

The spatial displacement-agnostic diffractive camera design with the larger input FOV (reported in Additional file 4: Movie S3) contains seven diffractive layers, each with 300 \(\times\) 300 pixels/neurons. The axial distance between any two consecutive planes was set to 45 mm (i.e., \({d}_{l-1,l}=45\) mm, for \(l=1,2,\dots , N+1\)). During the training, each MNIST raw image was linearly upsampled to 90 × 90 pixels, and then randomly placed within a larger input FOV of 140 × 140 pixels. The loss functions were the same as described in Eqs. (8), (9), and (12). During the blind testing shown in Additional file 4: Movie S3, input objects distributed within a FOV of 120 × 120 pixels were demonstrated.

The MNIST handwritten digit dataset was divided into training, validation, and testing datasets without any overlap, with each set containing 48,000, 12,000, and 10,000 images, respectively. For the diffractive camera trained with the Fashion MNIST dataset, five different classes (i.e., trousers, dresses, sandals, sneakers, and bags) were selected for the training, validation, and testing, with each set containing 24,000, 6000, and 5000 images without overlap, respectively.

The diffractive camera models reported in this paper were trained using the Adam optimizer [47] with a learning rate of 0.03. The batch size used for all training runs was 60. All models were trained and tested using PyTorch 1.11 with a GeForce RTX 3090 graphics processing unit (NVIDIA Inc.). The typical training time for a three-layer diffractive camera (e.g., in Fig. 2) is ~ 21 h for 1000 epochs.

### Experimental design

For the experimentally validated diffractive camera design shown in Fig. 7, an additional contrast loss \({\mathrm{L}}_{\mathrm{c}}\) was added to \(Los{s}_{+}\), i.e.,

$$\begin{array}{c}Los{s}_{+}\left({O}^{+},{G}^{+}\right)\,=\,{\alpha }_{1}\times NMSE\left({O}^{+}, { G}^{+}\right)+ {\alpha }_{2}\times \left(1-\mathrm{PCC}\left({O}^{+}, {G}^{+}\right)\right)+{\alpha }_{3}\times {\mathrm{L}}_{\mathrm{c}}\left({O}^{+}, {G}^{+}\right)\end{array}$$

(15)

The coefficients \(\left({\alpha }_{1}, {\alpha }_{2}, {\alpha }_{3}\right)\) were empirically set to (1, 3, 5) and \({\mathrm{L}}_{\mathrm{c}}\) is defined as:

$$\begin{array}{c}{\mathrm{L}}_{\mathrm{c}}\left({O}^{+}, {G}^{+}\right)=\frac{\sum \left({O}^{+}\cdot \left(1-\widehat{{G}^{+}}\right)\right)}{\sum \left({O}^{+}\cdot \widehat{{G}^{+}}\right)+\varepsilon }\end{array}$$

(16)

where \(\varepsilon ={10}^{-6}\) was added to the denominator to avoid a divide-by-zero error. \(\widehat{{G}^{+}}\) is a binary mask indicating the transmissive regions of the input object \({G}^{+}\), which is defined as:

$$\widehat{{G}^{+}}\left(m,n\right)=\left\{\begin{array}{ll}1, & {G}^{+}\left(m,n\right)>0.5\\ 0, & \mathrm{otherwise}\end{array}\right.$$

(17)

By adding this image-contrast-related training loss term, the output images of the target objects exhibit enhanced contrast, which is especially helpful under non-ideal experimental conditions.
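Equations (16) and (17) amount to penalizing output energy that falls outside the object's transmissive regions, relative to the energy inside them. A minimal NumPy sketch (names are ours):

```python
import numpy as np

def contrast_loss(o_pos, g_pos, eps=1e-6):
    """Eq. (16): ratio of output energy outside vs. inside the binary
    mask of Eq. (17); eps guards against a zero denominator."""
    mask = (g_pos > 0.5).astype(float)          # transmissive regions of G^+
    return np.sum(o_pos * (1 - mask)) / (np.sum(o_pos * mask) + eps)
```

The loss is zero when all output intensity lies inside the mask and grows without bound as energy leaks outside it, which is what drives the contrast enhancement.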

In addition, the MNIST training images were first linearly downsampled to 15 × 15 pixels and then upscaled to 90 × 90 pixels using nearest-neighbor interpolation. Then, the resulting input objects were augmented using the same parameters as described before and were fed into the diffractive camera for training. Each diffractive layer had 120 × 120 trainable diffractive neurons.
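For integer factors, the nearest-neighbor upscaling step (15 × 15 → 90 × 90, so each object pixel spans a 6 × 6 block of diffractive pixels) can be reproduced by index repetition; a small sketch under that assumption (names are ours):

```python
import numpy as np

def nearest_upscale(img, factor):
    """Nearest-neighbor upscaling: repeat each pixel factor x factor times."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

small = np.random.default_rng(0).random((15, 15))
big = nearest_upscale(small, 6)       # 15 x 15 -> 90 x 90
```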

To overcome the challenges posed by fabrication inaccuracies and mechanical misalignments during the experimental validation of the diffractive camera, we vaccinated our diffractive model during training by deliberately introducing random displacements to the diffractive layers [41]. During the training process, a 3D displacement \({\varvec{D}}= \left({D}_{x},{ D}_{y},{ D}_{z}\right)\) was randomly added to each diffractive layer, with each component drawn from a uniform (\({\text{U}}\)) random distribution:

$$\begin{array}{c}{D}_{x} \sim {\text{U}}\left(-{\Delta }_{x, tr}, {\Delta }_{x, tr}\right)\end{array}$$

(18)

$$\begin{array}{c}{D}_{y} \sim {\text{U}}\left(-{\Delta }_{y, tr}, {\Delta }_{y,tr}\right)\end{array}$$

(19)

$$\begin{array}{c}{D}_{z} \sim {\text{U}}\left(-{\Delta }_{z, tr}, {\Delta }_{z,tr}\right)\end{array}$$

(20)

where \({D}_{x}\) and \({D}_{y}\) denote the random lateral displacement of a diffractive layer in \(x\) and \(y\) directions, respectively. \({D}_{z}\) denotes the random displacement added to the axial distances between any two consecutive diffractive layers. \({\Delta }_{*, tr}\) represents the maximum amount of shift allowed along the corresponding axis, which was set as \({\Delta }_{x,tr}={\Delta }_{y,tr}=\) 0.4 mm (~ 0.53\(\lambda\)), and \({\Delta }_{z, tr}=\) 1.5 mm (2\(\lambda\)) throughout the training process. \({D}_{x},{ D}_{y}\), and \({D}_{z}\) of each diffractive layer were independently sampled from the given uniform random distributions. The diffractive camera model used for the experimental validation was trained for 50 epochs.
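The vaccination sampling of Eqs. (18)–(20) can be sketched as follows (names are ours; the per-layer, per-iteration independence matches the description above):

```python
import numpy as np

rng = np.random.default_rng(0)
wavelength = 0.75                      # mm
delta_xy = 0.4                         # ~0.53*lambda lateral range, Eqs. (18)-(19)
delta_z = 1.5                          # 2*lambda axial range, Eq. (20)

def sample_displacement():
    # one independent 3D displacement D = (Dx, Dy, Dz) per diffractive layer
    return (rng.uniform(-delta_xy, delta_xy),
            rng.uniform(-delta_xy, delta_xy),
            rng.uniform(-delta_z, delta_z))

displacements = [sample_displacement() for _ in range(5)]   # e.g., a 5-layer model
```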

### Experimental THz imaging setup

We validated the fabricated diffractive camera design using a THz continuous wave scanning system. The phase values of the diffractive layers were first converted into height maps using the refractive index of the 3D printer material. Then, the layers were printed using a 3D printer (Pr 110, CADworks3D). A layer holder that sets the positions of the input plane, output plane, and each diffractive layer was also 3D printed (Objet30 Pro, Stratasys) and assembled with the printed layers. The test objects were 3D printed (Objet30 Pro, Stratasys) and coated with aluminum foil to define the transmission areas.

The experimental setup is illustrated in Fig. 7a. The THz source used in the experiment was a WR2.2 modular amplifier/multiplier chain (AMC) with a compatible diagonal horn antenna (Virginia Diode Inc.). The input of the AMC was a 10 dBm RF signal at 11.1111 GHz (*f*_{RF1}); after 36× frequency multiplication, the output radiation was at 0.4 THz. The AMC was also modulated with a 1 kHz square wave for lock-in detection. The output plane of the diffractive camera was scanned with a 1 mm step size using a single-pixel Mixer/AMC (Virginia Diode Inc.) detector mounted on an XY positioning stage that was built by combining two linear motorized stages (Thorlabs NRT100). A 10 dBm RF signal at 11.083 GHz (*f*_{RF2}) was sent to the detector as a local oscillator to down-convert the signal to 1 GHz. The down-converted signal was amplified by a low-noise amplifier (Mini-Circuits ZRL-1150-LN+) and filtered by a 1 GHz (± 10 MHz) bandpass filter (KL Electronics 3C40-1000/T10-O/O). The signal then passed through a tunable attenuator (HP 8495B) for linear calibration and a low-noise power detector (Mini-Circuits ZX47-60) for absolute power detection. The detector output was measured by a lock-in amplifier (Stanford Research SR830) with the 1 kHz square wave used as the reference signal, and the lock-in amplifier readings were calibrated into a linear scale. A digital 2 × 2 binning was applied to each measurement of the intensity field to match the training feature size used in the design phase.
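The final 2 × 2 binning step can be sketched as below; the text does not specify whether blocks are summed or averaged, so averaging within each non-overlapping block is our assumption (names are ours):

```python
import numpy as np

def bin2x2(intensity):
    """Digital 2 x 2 binning: average each non-overlapping 2 x 2 block.

    Assumes even height and width; intensity is the scanned field.
    """
    h, w = intensity.shape
    return intensity.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
```

For example, binning a 4 × 4 scan yields a 2 × 2 image whose top-left value is the average of the four top-left measurements.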