The Essential UART Guide: Transmitter, Receiver, and Testbench From Scratch

Every FPGA tutorial tells you to open the IP catalog and drag in the UART core. Double-click to configure, connect the ports, done. But IP cores are black boxes — and black boxes fail in ways you cannot debug, cannot optimize, and cannot understand.

Building a UART from scratch takes fewer than 150 lines of SystemVerilog. More importantly, it forces you to understand a concept that underlies dozens of serial protocols: how two devices communicate without sharing a clock. Once this clicks, I2C, SPI, and USB start to make sense at a much deeper level.

1. The UART Frame: A Clockless Protocol

UART (Universal Asynchronous Receiver-Transmitter) has no clock line. Instead, both devices agree on a baud rate — the number of bits per second — beforehand. The transmitter drives bits at exactly that rate, and the receiver samples them at the same rate using its own local clock.

The most common configuration is 8N1: 8 data bits, No parity bit, 1 stop bit. The line is idle high. A transmission always begins with a low start bit, which the receiver uses to synchronize. Data is sent LSB-first. A high stop bit ends the frame and returns the line to idle.

uart_frame.sv
// UART 8N1 frame: one character transmission
//
//         START   D0   D1   D2   D3   D4   D5   D6   D7   STOP
//  ______        ____ ____ ____ ____ ____ ____ ____ ____ __________
//   IDLE |      |    |    |    |    |    |    |    |    |     IDLE
//        |______|____|____|____|____|____|____|____|____|
//
// The start bit is always low and the stop bit always high;
// each data cell D0..D7 is high or low depending on the byte.
//
// Each cell above = exactly (CLK_FREQ / BAUD_RATE) clock cycles wide.
// For a 100 MHz clock at 115200 baud: 100_000_000 / 115_200 = 868 cycles per bit
// (integer division).

localparam int CLK_FREQ  = 100_000_000;
localparam int BAUD_RATE = 115_200;
localparam int BAUD_DIV  = CLK_FREQ / BAUD_RATE;  // 868
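
The frame layout and the divider arithmetic can also be written down as two small functions. This is only a sketch: the package name `uart_frame_pkg` and the functions `make_frame` and `baud_div` are hypothetical names introduced here for illustration and are not used by the modules in this guide.

```systemverilog
// Sketch only: uart_frame_pkg, make_frame and baud_div are hypothetical
// names for illustration; the modules in this guide do not depend on them.
package uart_frame_pkg;

    // Pack one byte into a 10-bit 8N1 frame.
    // Bit 0 goes on the wire first: start (0), then D0..D7, then stop (1).
    function automatic logic [9:0] make_frame(input logic [7:0] data);
        return {1'b1, data, 1'b0};
    endfunction

    // Clock cycles per bit, using the same integer division as BAUD_DIV.
    function automatic int baud_div(input int clk_freq, input int baud_rate);
        return clk_freq / baud_rate;
    endfunction

endpackage

module uart_frame_demo;
    import uart_frame_pkg::*;

    initial begin
        logic [9:0] frame;
        frame = make_frame(8'hA5);
        $display("frame (stop ... start) = %b", frame);   // 1101001010
        $display("cycles per bit         = %0d", baud_div(100_000_000, 115_200));  // 868
    end
endmodule
```

Because the divider is truncated to 868, the actual rate is 100 MHz / 868, roughly 115207 baud, which is less than 0.01% away from the nominal 115200.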

2. UART TX: The Transmitter

The transmitter is a four-state FSM. It waits in IDLE until the host asserts tx_start for one clock cycle. It then drives the start bit, shifts out all 8 data bits LSB-first using a baud counter to time each bit, drives the stop bit, and returns to idle. The tx_busy flag prevents the host from writing new data mid-frame.

uart_tx.sv
module uart_tx #(
    parameter int CLK_FREQ  = 100_000_000,
    parameter int BAUD_RATE = 115_200
)(
    input  logic       clk,
    input  logic       rst,
    input  logic [7:0] tx_data,   // byte to transmit
    input  logic       tx_start,  // pulse high for 1 cycle to send
    output logic       tx_busy,   // high while transmission in progress
    output logic       tx_line    // connect to FPGA pin
);

    localparam int BAUD_DIV = CLK_FREQ / BAUD_RATE;

    typedef enum logic [1:0] {
        IDLE  = 2'b00,
        START = 2'b01,
        DATA  = 2'b10,
        STOP  = 2'b11
    } state_t;

    state_t      state;
    logic [15:0] baud_cnt;   // 16 bits: BAUD_DIV must not exceed 65_535
    logic [2:0]  bit_idx;
    logic [7:0]  shift_reg;

    always_ff @(posedge clk or posedge rst) begin
        if (rst) begin
            state     <= IDLE;
            tx_line   <= 1'b1;  // idle high
            tx_busy   <= 1'b0;
            baud_cnt  <= '0;
            bit_idx   <= '0;
            shift_reg <= '0;
        end else begin
            case (state)
                IDLE: begin
                    tx_line <= 1'b1;
                    tx_busy <= 1'b0;
                    if (tx_start) begin
                        shift_reg <= tx_data;  // latch data before shifting
                        tx_busy   <= 1'b1;
                        baud_cnt  <= '0;
                        state     <= START;
                    end
                end

                START: begin
                    tx_line <= 1'b0;  // drive start bit low
                    if (baud_cnt == BAUD_DIV - 1) begin
                        baud_cnt <= '0;
                        bit_idx  <= '0;
                        state    <= DATA;
                    end else
                        baud_cnt <= baud_cnt + 1;
                end

                DATA: begin
                    tx_line <= shift_reg[0];  // LSB first
                    if (baud_cnt == BAUD_DIV - 1) begin
                        baud_cnt  <= '0;
                        shift_reg <= shift_reg >> 1;  // shift right, expose next bit
                        if (bit_idx == 3'd7)
                            state <= STOP;
                        else
                            bit_idx <= bit_idx + 1;
                    end else
                        baud_cnt <= baud_cnt + 1;
                end

                STOP: begin
                    tx_line <= 1'b1;  // stop bit high
                    if (baud_cnt == BAUD_DIV - 1) begin
                        state    <= IDLE;
                        baud_cnt <= '0;
                    end else
                        baud_cnt <= baud_cnt + 1;
                end
            endcase
        end
    end

endmodule
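
To illustrate the tx_start / tx_busy handshake from the host side, here is a minimal sketch of a module that sends an incrementing byte whenever the transmitter is free. The module name `tx_feeder` and its internal `next_byte` counter are hypothetical. It assumes, as in uart_tx above, that tx_busy rises on the clock edge that samples tx_start.

```systemverilog
// Sketch only: tx_feeder is a hypothetical host, not part of this guide's UART.
// It sends 0x00, 0x01, 0x02, ... back to back, one byte per free slot.
module tx_feeder (
    input  logic       clk,
    input  logic       rst,
    input  logic       tx_busy,   // from uart_tx
    output logic [7:0] tx_data,   // to uart_tx
    output logic       tx_start   // to uart_tx, one-cycle pulse
);

    logic [7:0] next_byte;

    always_ff @(posedge clk or posedge rst) begin
        if (rst) begin
            tx_start  <= 1'b0;
            tx_data   <= '0;
            next_byte <= '0;
        end else if (tx_start) begin
            // The pulse has been visible for one cycle: drop it.
            // tx_busy is not high yet, so this branch must take priority.
            tx_start <= 1'b0;
        end else if (!tx_busy) begin
            // Transmitter is idle: present the byte and pulse tx_start.
            tx_data   <= next_byte;
            tx_start  <= 1'b1;
            next_byte <= next_byte + 8'd1;
        end
    end

endmodule
```

The priority of the `tx_start` branch is the design choice worth noting: tx_busy only becomes visible one cycle after the pulse, so without that branch the feeder would hold tx_start high for a second cycle and skip a byte.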

3. UART RX: Sampling at the Right Moment

The receiver is where most implementations go wrong. The naive approach samples the line at the beginning of each bit period. The problem: the transmitter’s clock and the receiver’s clock are never perfectly in sync. By the time you reach bit 7, the accumulated drift can push you close to a bit boundary, where even a small error reads the wrong value.

The industry-standard solution is mid-bit sampling. When the receiver detects the falling edge of the start bit, it waits for half a bit period before taking its first sample. This aligns all subsequent samples to the center of each bit — as far from both edges as possible, where the signal is most stable.

Why the start bit half-wait matters: The falling edge that triggers the receiver could arrive at any point in the receiver’s clock cycle. Waiting BAUD_DIV/2 cycles after detection places the sample point at the center of the start bit. From that point, waiting exactly BAUD_DIV cycles between each subsequent sample keeps all 8 data bit samples centered as well.
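
The benefit of centering can be estimated with a little arithmetic. The following is a rough sketch under simplifying assumptions: the relative baud-rate mismatch between the two ends, written ε here, is constant over the frame, and edge-detection delay, rise and fall times, and noise are ignored. T is the nominal bit period and k the data bit index.

```latex
\begin{align*}
t_k &= (k + 1.5)\,T\,(1 + \varepsilon), \qquad k = 0, \dots, 7
      && \text{receiver's sample time for data bit } k \\
\Delta_k &= t_k - (k + 1.5)\,T = (k + 1.5)\,\varepsilon\,T
      && \text{offset from the true bit center} \\
\Delta_{\mathrm{stop}} &= 9.5\,\varepsilon\,T
      && \text{offset at the stop-bit sample} \\
|\Delta_{\mathrm{stop}}| < \tfrac{T}{2}
      &\;\Longrightarrow\; |\varepsilon| < \frac{0.5}{9.5} \approx 5.3\%
\end{align*}
```

Under these assumptions the combined mismatch can approach 5% before the last sample leaves its bit, whereas sampling at the bit edge leaves no margin at all. With the numbers used in this guide, BAUD_DIV/2 is 434, so data bit k is sampled roughly 434 + 868(k + 1) cycles after the start bit is first seen low, and the stop bit near cycle 8246.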
uart_rx.sv
module uart_rx #(
    parameter int CLK_FREQ  = 100_000_000,
    parameter int BAUD_RATE = 115_200
)(
    input  logic       clk,
    input  logic       rst,
    input  logic       rx_line,   // connect to FPGA pin
    output logic [7:0] rx_data,   // received byte
    output logic       rx_valid   // pulses high for 1 cycle when rx_data is ready
);

    localparam int BAUD_DIV      = CLK_FREQ / BAUD_RATE;
    localparam int BAUD_DIV_HALF = BAUD_DIV / 2;  // mid-bit offset

    typedef enum logic [1:0] {
        IDLE  = 2'b00,
        START = 2'b01,
        DATA  = 2'b10,
        STOP  = 2'b11
    } state_t;

    state_t      state;
    logic [15:0] baud_cnt;
    logic [2:0]  bit_idx;
    logic [7:0]  shift_reg;

    always_ff @(posedge clk or posedge rst) begin
        if (rst) begin
            state     <= IDLE;
            rx_valid  <= 1'b0;
            rx_data   <= '0;
            baud_cnt  <= '0;
            bit_idx   <= '0;
            shift_reg <= '0;
        end else begin
            rx_valid <= 1'b0;  // default: pulse for one cycle only
            case (state)
                IDLE: begin
                    if (!rx_line) begin  // line went low: possible start bit
                        baud_cnt <= '0;
                        state    <= START;
                    end
                end

                START: begin
                    // wait BAUD_DIV/2 to reach mid-point of start bit
                    if (baud_cnt == BAUD_DIV_HALF - 1) begin
                        if (!rx_line) begin  // still low? confirmed start bit
                            baud_cnt <= '0;
                            bit_idx  <= '0;
                            state    <= DATA;
                        end else
                            state <= IDLE;  // it was a glitch, discard
                    end else
                        baud_cnt <= baud_cnt + 1;
                end

                DATA: begin
                    if (baud_cnt == BAUD_DIV - 1) begin
                        baud_cnt  <= '0;
                        // shift in from MSB; the LSB arrives first and ends up in bit 0
                        shift_reg <= {rx_line, shift_reg[7:1]};
                        if (bit_idx == 3'd7)
                            state <= STOP;
                        else
                            bit_idx <= bit_idx + 1;
                    end else
                        baud_cnt <= baud_cnt + 1;
                end

                STOP: begin
                    if (baud_cnt == BAUD_DIV - 1) begin
                        if (rx_line) begin  // valid stop bit = line is high
                            rx_data  <= shift_reg;
                            rx_valid <= 1'b1;  // pulse: data is ready
                        end
                        state    <= IDLE;
                        baud_cnt <= '0;
                    end else
                        baud_cnt <= baud_cnt + 1;
                end
            endcase
        end
    end

endmodule
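
One practical note on the listing above: rx_line comes straight from a pin and is asynchronous to clk, so on real hardware it is common practice to pass it through two flip-flops before the state machine looks at it. The sketch below shows one way to do that. The module name `rx_sync` and its port names are hypothetical; it is an optional addition, not part of the receiver as written.

```systemverilog
// Sketch only: rx_sync is a hypothetical helper placed between the pin
// and uart_rx.rx_line. It is a plain two-flip-flop synchronizer.
module rx_sync (
    input  logic clk,
    input  logic rst,
    input  logic rx_async,   // raw signal from the FPGA pin
    output logic rx_sync_o   // synchronized copy, feed this to rx_line
);

    logic meta;  // first stage: may go metastable, never used directly

    always_ff @(posedge clk or posedge rst) begin
        if (rst) begin
            // Reset to idle-high so reset release does not look like a start bit
            meta      <= 1'b1;
            rx_sync_o <= 1'b1;
        end else begin
            meta      <= rx_async;
            rx_sync_o <= meta;
        end
    end

endmodule
```

The two-cycle delay this adds is small compared with the 868 cycles per bit at 100 MHz and 115200 baud, so the mid-bit sample point barely moves.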

4. Putting It All Together

The top-level module exposes two separate interfaces: the physical UART pins that connect to a board-level USB-UART chip, and discrete data ports that the rest of your design uses to send and receive bytes. There is no internal loopback: application logic above this module decides what to do with received data. To implement loopback externally, simply wire rx_data → tx_data and rx_valid → tx_start.

uart_top.sv
module uart_top #(
    parameter int CLK_FREQ  = 100_000_000,
    parameter int BAUD_RATE = 115_200
)(
    input  logic       clk,
    input  logic       rst,

    // Physical UART pins: connect to board I/O constraints
    input  logic       uart_rxd,
    output logic       uart_txd,

    // TX data interface: driven by application logic
    input  logic [7:0] tx_data,
    input  logic       tx_start,  // pulse high 1 cycle to send
    output logic       tx_busy,

    // RX data interface: read when rx_valid pulses
    output logic [7:0] rx_data,
    output logic       rx_valid
);

    uart_rx #(CLK_FREQ, BAUD_RATE) rx_inst (
        .clk      (clk),
        .rst      (rst),
        .rx_line  (uart_rxd),
        .rx_data  (rx_data),
        .rx_valid (rx_valid)
    );

    uart_tx #(CLK_FREQ, BAUD_RATE) tx_inst (
        .clk      (clk),
        .rst      (rst),
        .tx_data  (tx_data),
        .tx_start (tx_start),
        .tx_busy  (tx_busy),
        .tx_line  (uart_txd)
    );

endmodule
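
The external loopback wiring described above can be sketched as a thin wrapper. The module name `uart_loopback` and the instance name `u_uart` are hypothetical. The wrapper instantiates the uart_top module from this section and does nothing beyond the two connections named in the text.

```systemverilog
// Sketch only: uart_loopback is a hypothetical echo wrapper around uart_top.
module uart_loopback #(
    parameter int CLK_FREQ  = 100_000_000,
    parameter int BAUD_RATE = 115_200
)(
    input  logic clk,
    input  logic rst,
    input  logic uart_rxd,
    output logic uart_txd
);

    logic [7:0] rx_data;
    logic       rx_valid;

    uart_top #(CLK_FREQ, BAUD_RATE) u_uart (
        .clk      (clk),
        .rst      (rst),
        .uart_rxd (uart_rxd),
        .uart_txd (uart_txd),
        .tx_data  (rx_data),    // echo: received byte becomes the byte to send
        .tx_start (rx_valid),   // echo: the "byte ready" pulse starts the send
        .tx_busy  (),           // ignored in this minimal echo
        .rx_data  (rx_data),
        .rx_valid (rx_valid)
    );

endmodule
```

Because this wiring ignores tx_busy, a byte that arrives while the transmitter is still finishing the previous frame can be dropped, which is a real possibility when the sender streams bytes back to back at the same baud rate. A small buffer between the two sides would be the usual remedy.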

5. The Testbench: Two Nodes, One Channel

With discrete data ports on uart_top, the testbench becomes a direct driver, with no bit-banging required. It drives nodeA's tx_data and tx_start ports as if from an upstream state machine. nodeA.uart_txd connects to nodeB.uart_rxd over the physical UART line. The testbench reads nodeB.rx_data directly when nodeB.rx_valid pulses and asserts that the byte is intact.

Single-hop latency: Each byte now traverses one UART frame — nodeA TX to nodeB RX. At 115200 baud that is approximately 87 µs per test vector (~350 µs total). Never reduce BAUD_DIV to speed up simulation; it hides the timing bugs you are trying to catch.
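
For reference, the latency figures follow directly from the frame length. The arithmetic below assumes the 100 MHz clock and 115200 baud used throughout, one 10-bit 8N1 frame per test vector, and four test vectors.

```latex
\begin{align*}
t_{\mathrm{bit}}   &= \frac{1}{115200\ \mathrm{s^{-1}}} \approx 8.68\ \mu\mathrm{s} \\
t_{\mathrm{frame}} &= 10\, t_{\mathrm{bit}} \approx 86.8\ \mu\mathrm{s}
                      \quad (\text{in simulation: } 10 \times 868 \text{ cycles} \times 10\ \mathrm{ns} = 86.8\ \mu\mathrm{s}) \\
t_{\mathrm{total}} &\approx 4\, t_{\mathrm{frame}} \approx 347\ \mu\mathrm{s}
\end{align*}
```

The receiver raises rx_valid in the middle of the stop bit, about 9.5 bit periods in, but the next vector still has to wait for tx_busy to drop, so each vector costs roughly one full frame.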
uart_tb.sv
`timescale 1ns/1ps

module uart_tb;

    localparam int CLK_FREQ   = 100_000_000;
    localparam int BAUD_RATE  = 115_200;
    localparam int CLK_PERIOD = 10;  // 10 ns = 100 MHz

    logic clk, rst;
    logic nodeA_txd, nodeB_txd;

    // nodeA discrete TX (driven by testbench)
    logic [7:0] nodeA_tx_data;
    logic       nodeA_tx_start, nodeA_tx_busy;

    // nodeB discrete RX (read by testbench)
    logic [7:0] nodeB_rx_data;
    logic       nodeB_rx_valid;

    // nodeA: testbench drives tx_data/tx_start → UART line → nodeB
    uart_top #(CLK_FREQ, BAUD_RATE) nodeA (
        .clk      (clk),
        .rst      (rst),
        .uart_rxd (nodeB_txd),
        .uart_txd (nodeA_txd),
        .tx_data  (nodeA_tx_data),
        .tx_start (nodeA_tx_start),
        .tx_busy  (nodeA_tx_busy),
        .rx_data  (),
        .rx_valid ()
    );

    // nodeB: receives on UART line, testbench reads rx_data/rx_valid
    uart_top #(CLK_FREQ, BAUD_RATE) nodeB (
        .clk      (clk),
        .rst      (rst),
        .uart_rxd (nodeA_txd),  // cross-connect: nodeA → nodeB
        .uart_txd (nodeB_txd),
        .tx_data  (8'h00),
        .tx_start (1'b0),
        .tx_busy  (),
        .rx_data  (nodeB_rx_data),
        .rx_valid (nodeB_rx_valid)
    );

    always #(CLK_PERIOD / 2) clk = ~clk;

    // Task: drive nodeA TX, wait for nodeB RX, assert byte integrity
    task automatic test_tx_rx(input logic [7:0] data);
        wait (!nodeA_tx_busy);  // guard: do not pulse tx_start while busy
        @(posedge clk);
        // Nonblocking assignments keep the stimulus from racing the DUT's
        // sampling on the same clock edge.
        nodeA_tx_data  <= data;
        nodeA_tx_start <= 1'b1;
        @(posedge clk);
        nodeA_tx_start <= 1'b0;
        @(posedge nodeB_rx_valid);  // block until nodeB receives the byte
        @(posedge clk);
        assert (nodeB_rx_data == data)
            $display("PASS: nodeA sent 0x%02h, nodeB received 0x%02h", data, nodeB_rx_data);
        else
            $error("FAIL: nodeA sent 0x%02h, nodeB received 0x%02h", data, nodeB_rx_data);
    endtask

    initial begin
        clk            = 1'b0;
        rst            = 1'b1;
        nodeA_tx_data  = 8'h00;
        nodeA_tx_start = 1'b0;
        repeat (4) @(posedge clk);
        rst = 1'b0;
        repeat (4) @(posedge clk);

        test_tx_rx(8'hA5);  // mixed:       1010_0101
        test_tx_rx(8'h00);  // all zeros:   0000_0000
        test_tx_rx(8'hFF);  // all ones:    1111_1111
        test_tx_rx(8'h55);  // alternating: 0101_0101 (worst case)

        repeat (10) @(posedge clk);
        $display("--- All tests complete ---");
        $finish;
    end

endmodule
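
One weakness of waiting on rx_valid with a bare event control is that a broken receiver makes the simulation hang instead of fail. A watchdog around the wait is one way to handle that. The following self-contained sketch shows the pattern on a stand-in signal: the module `rx_timeout_demo`, the task `wait_valid_or_timeout`, and the `valid` signal are all hypothetical names. In the real testbench, `valid` would be nodeB_rx_valid, and a budget of around two frame times (2 × 10 × BAUD_DIV clocks) would be one reasonable choice.

```systemverilog
`timescale 1ns/1ps

// Sketch only: every name in this module is hypothetical.
module rx_timeout_demo;

    logic clk   = 1'b0;
    logic valid = 1'b0;  // stands in for nodeB_rx_valid

    always #5 clk = ~clk;

    // Wait for a rising edge on valid, giving up after max_cycles clocks.
    task automatic wait_valid_or_timeout(input int max_cycles, output bit timed_out);
        timed_out = 1'b0;
        // The outer fork/join isolates the two threads below, so that
        // "disable fork" cannot kill unrelated threads of the caller.
        fork
            begin
                fork
                    @(posedge valid);
                    begin
                        repeat (max_cycles) @(posedge clk);
                        timed_out = 1'b1;
                    end
                join_any
                disable fork;  // stop whichever thread lost the race
            end
        join
    endtask

    initial begin
        bit timed_out;

        // Case 1: valid pulses after 20 clocks, inside a 100-clock budget
        fork
            begin
                repeat (20) @(posedge clk);
                valid <= 1'b1;
                @(posedge clk);
                valid <= 1'b0;
            end
        join_none
        wait_valid_or_timeout(100, timed_out);
        if (timed_out) $error("case 1: unexpected timeout");
        else           $display("case 1: valid seen in time");

        // Case 2: nothing arrives, so the watchdog is expected to fire
        wait_valid_or_timeout(100, timed_out);
        if (timed_out) $display("case 2: timed out as expected");
        else           $error("case 2: valid seen unexpectedly");

        $finish;
    end

endmodule
```

In the testbench task, the result of such a helper could drive an `$error` and an early return, so a missing byte shows up as a clear failure message.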

Final Thoughts: The IP Core Was Hiding This

A UART is about 130 lines of SystemVerilog. That is all. Every debug print statement from a microcontroller, every sensor reading streamed over a serial port runs on this same logic: a baud counter, a shift register, and four states.

Now open the datasheet for I2C. You will see start conditions, ACK bits, and clock stretching — but underneath it all you will recognize the same pattern: a state machine counting clock cycles to drive and sample a line at exactly the right moment. Building a UART from scratch does not just give you a UART. It gives you a template for every serial protocol you will ever implement.


Happy coding.
fpgawizard.com
