Every FPGA tutorial tells you to open the IP catalog and drag in the UART core. Double-click to configure, connect the ports, done. But IP cores are black boxes — and black boxes fail in ways you cannot debug, cannot optimize, and cannot understand.
Building a UART from scratch takes about 150 lines of SystemVerilog. More importantly, it forces you to understand a concept that underlies dozens of serial protocols: how two devices communicate without sharing a clock. Once this clicks, protocols such as I2C and SPI (which carry a clock line) and USB (which does not) start to make sense at a much deeper level.
1. The UART Frame: A Clockless Protocol
UART (Universal Asynchronous Receiver-Transmitter) has no clock line. Instead, both devices agree on a baud rate — the number of bits per second — beforehand. The transmitter drives bits at exactly that rate, and the receiver samples them at the same rate using its own local clock.
The most common configuration is 8N1: 8 data bits, No parity bit, 1 stop bit. The line is idle high. A transmission always begins with a low start bit, which the receiver uses to synchronize. Data is sent LSB-first. A high stop bit ends the frame and returns the line to idle.
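As a minimal sketch of the 8N1 frame described above, the function below packs one byte into the 10 bits that appear on the wire. The names uart_frame_pkg, frame_8n1 and frame_demo are illustrative and are not used by the modules in this article; bit 0 of the result is the first bit transmitted.

```systemverilog
// Hypothetical helper, for illustration only.
package uart_frame_pkg;
    // Returns the 10-bit 8N1 frame for one byte.
    // frame[0] is sent first (start bit), frame[9] is sent last (stop bit).
    function automatic logic [9:0] frame_8n1(input logic [7:0] data);
        return {1'b1, data, 1'b0};  // {stop, data[7:0], start}
    endfunction
endpackage

module frame_demo;
    import uart_frame_pkg::*;

    initial begin
        logic [9:0] f;
        f = frame_8n1(8'hA5);
        // 0xA5 = 1010_0101, so the wire order is:
        // 0 (start), 1,0,1,0,0,1,0,1 (data, LSB first), 1 (stop)
        for (int i = 0; i < 10; i++)
            $write("%b", f[i]);
        $write("\n");  // prints 0101001011
        $finish;
    end
endmodule
```

Reading the printed string left to right gives the line level for each successive bit period, which is the order the transmitter FSM has to reproduce.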
localparam int CLK_FREQ  = 100_000_000;           // system clock in Hz
localparam int BAUD_RATE = 115_200;               // bits per second
localparam int BAUD_DIV  = CLK_FREQ / BAUD_RATE;  // clock cycles per bit: 868 (868.05 truncated)
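Because BAUD_DIV is an integer, the rate the hardware actually produces differs slightly from the nominal one. The sketch below computes that error for the values above; the module name baud_check and the constants ACTUAL_BAUD and ERROR_PCT are assumptions introduced for this example.

```systemverilog
// Hypothetical elaboration-time check of the integer divider error.
module baud_check;
    localparam int CLK_FREQ  = 100_000_000;
    localparam int BAUD_RATE = 115_200;
    localparam int BAUD_DIV  = CLK_FREQ / BAUD_RATE;  // 868

    // Rate actually produced by the integer divider, and its error in percent.
    localparam real ACTUAL_BAUD = real'(CLK_FREQ) / BAUD_DIV;
    localparam real ERROR_PCT   = 100.0 * (ACTUAL_BAUD - BAUD_RATE) / BAUD_RATE;

    initial begin
        $display("BAUD_DIV    = %0d", BAUD_DIV);
        $display("actual baud = %0.1f", ACTUAL_BAUD);
        $display("error       = %0.4f%%", ERROR_PCT);
        $finish;
    end
endmodule
```

At 100 MHz and 115200 baud the error works out to well under 0.01%, which is negligible. With a slow clock or a high baud rate the truncation can become significant, so it is worth checking whenever either parameter changes.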
2. UART TX: The Transmitter
The transmitter is a four-state FSM. It waits in IDLE until the host asserts tx_start for one clock cycle. It then drives the start bit, shifts out all 8 data bits LSB-first using a baud counter to time each bit, drives the stop bit, and returns to IDLE. The tx_busy flag tells the host that a frame is in flight; tx_start is ignored until the FSM is back in IDLE, so new data cannot corrupt a frame mid-transmission.
module uart_tx #(
    parameter int CLK_FREQ  = 100_000_000,
    parameter int BAUD_RATE = 115_200
)(
    input  logic       clk,
    input  logic       rst,
    input  logic [7:0] tx_data,
    input  logic       tx_start,
    output logic       tx_busy,
    output logic       tx_line
);
    // Clock cycles per bit period (must fit in the 16-bit baud_cnt)
    localparam int BAUD_DIV = CLK_FREQ / BAUD_RATE;

    typedef enum logic [1:0] {
        IDLE  = 2'b00,
        START = 2'b01,
        DATA  = 2'b10,
        STOP  = 2'b11
    } state_t;

    state_t      state;
    logic [15:0] baud_cnt;   // clock cycles elapsed in the current bit
    logic [2:0]  bit_idx;    // index of the data bit being sent
    logic [7:0]  shift_reg;  // byte being sent; bit 0 goes out next

    always_ff @(posedge clk or posedge rst) begin
        if (rst) begin
            state     <= IDLE;
            tx_line   <= 1'b1;  // line idles high
            tx_busy   <= 1'b0;
            baud_cnt  <= '0;
            bit_idx   <= '0;
            shift_reg <= '0;
        end else begin
            case (state)
                IDLE: begin
                    tx_line <= 1'b1;
                    tx_busy <= 1'b0;
                    if (tx_start) begin
                        shift_reg <= tx_data;  // latch the byte
                        tx_busy   <= 1'b1;
                        baud_cnt  <= '0;
                        state     <= START;
                    end
                end
                START: begin
                    tx_line <= 1'b0;  // start bit
                    if (baud_cnt == BAUD_DIV - 1) begin
                        baud_cnt <= '0;
                        bit_idx  <= '0;
                        state    <= DATA;
                    end else
                        baud_cnt <= baud_cnt + 1;
                end
                DATA: begin
                    tx_line <= shift_reg[0];  // LSB first
                    if (baud_cnt == BAUD_DIV - 1) begin
                        baud_cnt  <= '0;
                        shift_reg <= shift_reg >> 1;
                        if (bit_idx == 3'd7)
                            state <= STOP;
                        else
                            bit_idx <= bit_idx + 1;
                    end else
                        baud_cnt <= baud_cnt + 1;
                end
                STOP: begin
                    tx_line <= 1'b1;  // stop bit
                    if (baud_cnt == BAUD_DIV - 1) begin
                        state    <= IDLE;
                        baud_cnt <= '0;
                    end else
                        baud_cnt <= baud_cnt + 1;
                end
            endcase
        end
    end
endmodule
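To illustrate the tx_start / tx_busy handshake from the host side, here is a minimal sketch of a driver that sends the three bytes "OK\n" once after reset. The module name tx_hello, the MSG constant and the idx counter are hypothetical; the sketch assumes the uart_tx module above with its default parameters.

```systemverilog
// Hypothetical host-side driver for uart_tx; sends "OK\n" once after reset.
module tx_hello (
    input  logic clk,
    input  logic rst,
    output logic uart_txd
);
    localparam int          N   = 3;
    localparam logic [23:0] MSG = {8'h0A, 8'h4B, 8'h4F};  // "\n", "K", "O" (byte 0 is lowest)

    logic [7:0] tx_data;
    logic       tx_start, tx_busy;
    logic [1:0] idx;  // next byte to send; N means finished

    uart_tx tx_inst (
        .clk      (clk),      .rst     (rst),
        .tx_data  (tx_data),  .tx_start(tx_start),
        .tx_busy  (tx_busy),  .tx_line (uart_txd)
    );

    always_ff @(posedge clk or posedge rst) begin
        if (rst) begin
            idx      <= '0;
            tx_start <= 1'b0;
            tx_data  <= '0;
        end else begin
            tx_start <= 1'b0;  // default keeps tx_start a one-cycle pulse
            // tx_busy rises one cycle after tx_start is seen, so also
            // check !tx_start to avoid issuing a second pulse in that gap.
            if (!tx_busy && !tx_start && idx < N) begin
                tx_data  <= MSG[8*idx +: 8];
                tx_start <= 1'b1;
                idx      <= idx + 1'b1;
            end
        end
    end
endmodule
```

The extra !tx_start term matters because tx_busy is a registered output: without it, the driver would see tx_busy still low on the cycle after its own pulse and skip a byte.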
3. UART RX: Sampling at the Right Moment
The receiver is where most implementations go wrong. The naive approach samples the line at the beginning of each bit period. The problem: the transmitter’s clock and the receiver’s clock are never perfectly in sync. By the time you reach bit 7, the accumulated drift can push you close to a bit boundary, where even a small error reads the wrong value.
The industry-standard solution is mid-bit sampling. When the receiver detects the falling edge of the start bit, it waits for half a bit period before taking its first sample. This aligns all subsequent samples to the center of each bit — as far from both edges as possible, where the signal is most stable.
Why the start bit half-wait matters: The falling edge that triggers the receiver could arrive at any point in the receiver’s clock cycle. Waiting BAUD_DIV/2 cycles after detection places the sample point at the center of the start bit. From that point, waiting exactly BAUD_DIV cycles between each subsequent sample keeps all 8 data bit samples centered as well.
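As a rough, idealized estimate of how much clock mismatch mid-bit sampling tolerates, consider the derivation below. It assumes perfect signal edges and ignores the uncertainty of up to one clock cycle in detecting the start edge; the symbols T_bit, ε and k are introduced here for illustration only.

```latex
% Slot index k: k = 0 is the start bit, k = 1..8 the data bits, k = 9 the stop bit.
% Ideal sample time, measured from the falling edge of the start bit:
t_k = \left(k + \tfrac{1}{2}\right) T_{\text{bit}}

% With a relative clock mismatch \varepsilon, the receiver actually samples at
t_k' = (1 + \varepsilon)\left(k + \tfrac{1}{2}\right) T_{\text{bit}}

% The sample stays inside slot k while the shift is less than half a bit:
|\varepsilon| \left(k + \tfrac{1}{2}\right) T_{\text{bit}} < \tfrac{1}{2}\, T_{\text{bit}}

% The worst case is the stop bit, k = 9:
|\varepsilon| < \frac{1}{2 \cdot 9.5} = \frac{1}{19} \approx 5.3\%
```

That budget is shared between the two ends of the link, and practical designs leave margin for finite rise times and for the start-edge detection uncertainty, so the usable tolerance is noticeably tighter than this ideal bound.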
module uart_rx #(
    parameter int CLK_FREQ  = 100_000_000,
    parameter int BAUD_RATE = 115_200
)(
    input  logic       clk,
    input  logic       rst,
    input  logic       rx_line,   // asynchronous to clk; synchronize it upstream on real hardware
    output logic [7:0] rx_data,
    output logic       rx_valid   // one-cycle pulse when rx_data holds a new byte
);
    localparam int BAUD_DIV      = CLK_FREQ / BAUD_RATE;  // clock cycles per bit
    localparam int BAUD_DIV_HALF = BAUD_DIV / 2;          // offset to the bit center

    typedef enum logic [1:0] {
        IDLE  = 2'b00,
        START = 2'b01,
        DATA  = 2'b10,
        STOP  = 2'b11
    } state_t;

    state_t      state;
    logic [15:0] baud_cnt;   // clock cycles since the last sample point
    logic [2:0]  bit_idx;    // index of the data bit being received
    logic [7:0]  shift_reg;  // fills from the MSB end, so bit 0 ends up as the first bit received

    always_ff @(posedge clk or posedge rst) begin
        if (rst) begin
            state     <= IDLE;
            rx_valid  <= 1'b0;
            rx_data   <= '0;
            baud_cnt  <= '0;
            bit_idx   <= '0;
            shift_reg <= '0;
        end else begin
            rx_valid <= 1'b0;  // default: rx_valid is a one-cycle pulse
            case (state)
                IDLE: begin
                    if (!rx_line) begin  // line went low: possible start bit
                        baud_cnt <= '0;
                        state    <= START;
                    end
                end
                START: begin
                    // Wait half a bit, then confirm the line is still low.
                    if (baud_cnt == BAUD_DIV_HALF - 1) begin
                        if (!rx_line) begin
                            baud_cnt <= '0;
                            bit_idx  <= '0;
                            state    <= DATA;
                        end else
                            state <= IDLE;  // glitch, not a real start bit
                    end else
                        baud_cnt <= baud_cnt + 1;
                end
                DATA: begin
                    // One full bit period after the previous sample: bit center.
                    if (baud_cnt == BAUD_DIV - 1) begin
                        baud_cnt  <= '0;
                        shift_reg <= {rx_line, shift_reg[7:1]};  // LSB arrives first
                        if (bit_idx == 3'd7)
                            state <= STOP;
                        else
                            bit_idx <= bit_idx + 1;
                    end else
                        baud_cnt <= baud_cnt + 1;
                end
                STOP: begin
                    if (baud_cnt == BAUD_DIV - 1) begin
                        if (rx_line) begin  // valid stop bit
                            rx_data  <= shift_reg;
                            rx_valid <= 1'b1;
                        end
                        // A low stop bit is a framing error; the byte is dropped silently.
                        state    <= IDLE;
                        baud_cnt <= '0;
                    end else
                        baud_cnt <= baud_cnt + 1;
                end
            endcase
        end
    end
endmodule
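One practical caveat: rx_line comes from outside the FPGA and is asynchronous to clk, so on real hardware it is usually passed through a synchronizer before any FSM looks at it. The article's receiver does not include one; the sketch below is a common two-flop arrangement, with the module name rx_sync and its port names chosen here purely for illustration.

```systemverilog
// Hypothetical two-flop synchronizer for the asynchronous RX pin.
module rx_sync (
    input  logic clk,
    input  logic rst,
    input  logic rx_async,   // raw pin, asynchronous to clk
    output logic rx_synced   // safe to use inside the clk domain
);
    logic meta;  // first stage; may briefly go metastable

    always_ff @(posedge clk or posedge rst) begin
        if (rst) begin
            meta      <= 1'b1;  // UART line idles high
            rx_synced <= 1'b1;
        end else begin
            meta      <= rx_async;
            rx_synced <= meta;
        end
    end
endmodule
```

Placed between the pin and uart_rx, this adds two clock cycles of delay, which is tiny compared with the 868 cycles of one bit period at the settings used here, so the mid-bit sample point is essentially unaffected.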
4. Putting It All Together
The top-level module exposes two separate interfaces: the physical UART pins that connect to a board-level USB-UART chip, and discrete data ports that the rest of your design uses to send and receive bytes. There is no internal loopback: application logic above this module decides what to do with received data. To implement loopback externally, wire rx_data to tx_data and rx_valid to tx_start.
module uart_top #(
    parameter int CLK_FREQ  = 100_000_000,
    parameter int BAUD_RATE = 115_200
)(
    input  logic       clk,
    input  logic       rst,
    // Physical UART pins
    input  logic       uart_rxd,
    output logic       uart_txd,
    // Transmit data port
    input  logic [7:0] tx_data,
    input  logic       tx_start,
    output logic       tx_busy,
    // Receive data port
    output logic [7:0] rx_data,
    output logic       rx_valid
);
    uart_rx #(CLK_FREQ, BAUD_RATE) rx_inst (
        .clk      (clk),
        .rst      (rst),
        .rx_line  (uart_rxd),
        .rx_data  (rx_data),
        .rx_valid (rx_valid)
    );

    uart_tx #(CLK_FREQ, BAUD_RATE) tx_inst (
        .clk      (clk),
        .rst      (rst),
        .tx_data  (tx_data),
        .tx_start (tx_start),
        .tx_busy  (tx_busy),
        .tx_line  (uart_txd)
    );
endmodule
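The external loopback wiring described above can be sketched as a thin wrapper around uart_top. The module name uart_echo is hypothetical, and the sketch assumes bytes arrive with at least a short gap between frames.

```systemverilog
// Hypothetical echo wrapper: every received byte is sent straight back.
module uart_echo #(
    parameter int CLK_FREQ  = 100_000_000,
    parameter int BAUD_RATE = 115_200
)(
    input  logic clk,
    input  logic rst,
    input  logic uart_rxd,
    output logic uart_txd
);
    logic [7:0] rx_data;
    logic       rx_valid;
    logic       tx_busy;  // not used here; a buffered design would watch it

    uart_top #(CLK_FREQ, BAUD_RATE) uart (
        .clk      (clk),
        .rst      (rst),
        .uart_rxd (uart_rxd),
        .uart_txd (uart_txd),
        .tx_data  (rx_data),   // loop received byte back to the transmitter
        .tx_start (rx_valid),  // rx_valid is already a one-cycle pulse
        .tx_busy  (tx_busy),
        .rx_data  (rx_data),
        .rx_valid (rx_valid)
    );
endmodule
```

This works because rx_data and rx_valid update on the same clock edge, so the transmitter latches a stable byte. If the sender streams frames back to back, rx_valid can arrive while the previous echo is still finishing its stop bit, and that byte would be ignored; a small FIFO gated by tx_busy is the usual remedy.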
5. The Testbench: Two Nodes, One Channel
With discrete data ports on uart_top, the testbench becomes a direct driver, with no bit-banging required. The testbench drives nodeA's tx_data and tx_start ports as if from an upstream state machine. nodeA.uart_txd connects to nodeB.uart_rxd over the physical UART line. The testbench reads nodeB.rx_data directly when nodeB.rx_valid pulses and asserts that the byte is intact.
Single-hop latency: Each byte traverses one UART frame, from nodeA TX to nodeB RX. At 115200 baud that is approximately 87 µs per test vector (about 350 µs for the four vectors in this testbench). Avoid shrinking BAUD_DIV just to speed up simulation; an unrealistically small divisor can hide the timing bugs you are trying to catch.
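The latency figure above follows directly from the divider. As a small worked example (the module name frame_time and the constant names are assumptions made for this sketch):

```systemverilog
// Hypothetical calculation of the simulated time per 8N1 frame.
module frame_time;
    localparam int CLK_FREQ       = 100_000_000;
    localparam int BAUD_RATE      = 115_200;
    localparam int CLK_PERIOD_NS  = 10;                    // 100 MHz clock
    localparam int BAUD_DIV       = CLK_FREQ / BAUD_RATE;  // 868 cycles per bit
    localparam int BITS_PER_FRAME = 10;                    // start + 8 data + stop
    localparam int FRAME_NS       = BITS_PER_FRAME * BAUD_DIV * CLK_PERIOD_NS;

    initial begin
        $display("one frame   = %0d ns", FRAME_NS);      // 86800 ns
        $display("four frames = %0d ns", 4 * FRAME_NS);  // 347200 ns
        $finish;
    end
endmodule
```

So each vector costs 8,680 clock cycles of simulation, which is why the full run takes roughly a third of a millisecond of simulated time.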
`timescale 1ns/1ps  // CLK_PERIOD below is in nanoseconds

module uart_tb;
    localparam int CLK_FREQ   = 100_000_000;
    localparam int BAUD_RATE  = 115_200;
    localparam int CLK_PERIOD = 10;  // ns, i.e. 100 MHz

    logic       clk, rst;
    logic       nodeA_txd, nodeB_txd;
    logic [7:0] nodeA_tx_data;
    logic       nodeA_tx_start, nodeA_tx_busy;
    logic [7:0] nodeB_rx_data;
    logic       nodeB_rx_valid;

    // nodeA transmits; its receive side is left unconnected.
    uart_top #(CLK_FREQ, BAUD_RATE) nodeA (
        .clk      (clk),
        .rst      (rst),
        .uart_rxd (nodeB_txd),
        .uart_txd (nodeA_txd),
        .tx_data  (nodeA_tx_data),
        .tx_start (nodeA_tx_start),
        .tx_busy  (nodeA_tx_busy),
        .rx_data  (),
        .rx_valid ()
    );

    // nodeB receives; its transmitter is tied off and stays idle.
    uart_top #(CLK_FREQ, BAUD_RATE) nodeB (
        .clk      (clk),
        .rst      (rst),
        .uart_rxd (nodeA_txd),
        .uart_txd (nodeB_txd),
        .tx_data  (8'h00),
        .tx_start (1'b0),
        .tx_busy  (),
        .rx_data  (nodeB_rx_data),
        .rx_valid (nodeB_rx_valid)
    );

    always #(CLK_PERIOD / 2) clk = ~clk;

    // Send one byte from nodeA and check that nodeB receives it intact.
    task automatic test_tx_rx(input logic [7:0] data);
        wait (!nodeA_tx_busy);
        @(posedge clk);
        // Nonblocking assignments avoid a race with the DUT's always_ff blocks.
        nodeA_tx_data  <= data;
        nodeA_tx_start <= 1'b1;
        @(posedge clk);
        nodeA_tx_start <= 1'b0;
        @(posedge nodeB_rx_valid);
        @(posedge clk);
        assert (nodeB_rx_data == data)
            $display("PASS: nodeA sent 0x%02h, nodeB received 0x%02h", data, nodeB_rx_data);
        else
            $error("FAIL: nodeA sent 0x%02h, nodeB received 0x%02h", data, nodeB_rx_data);
    endtask

    initial begin
        clk            = 1'b0;
        rst            = 1'b1;
        nodeA_tx_data  = 8'h00;
        nodeA_tx_start = 1'b0;
        repeat (4) @(posedge clk);
        rst = 1'b0;
        repeat (4) @(posedge clk);
        test_tx_rx(8'hA5);
        test_tx_rx(8'h00);
        test_tx_rx(8'hFF);
        test_tx_rx(8'h55);
        repeat (10) @(posedge clk);
        $display("--- All tests complete ---");
        $finish;
    end
endmodule
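One weakness of waiting on @(posedge nodeB_rx_valid) is that the simulation hangs forever if the receiver rejects a frame, since rx_valid never pulses. A possible guard, sketched here as a standalone demo, is to race the wait against a timeout. The module timeout_demo, the task wait_valid_or_timeout and the two-frame timeout value are all assumptions for illustration; in this demo rx_valid is a stand-in signal that never pulses, so the timeout branch is the one that fires.

```systemverilog
`timescale 1ns/1ps

// Hypothetical demo of a timeout guard around a wait for rx_valid.
module timeout_demo;
    localparam int TIMEOUT_NS = 2 * 86_800;  // two 8N1 frame times at 115200 baud

    logic rx_valid = 1'b0;  // stand-in for nodeB_rx_valid; never pulses here

    task automatic wait_valid_or_timeout(output bit timed_out);
        timed_out = 1'b0;
        fork
            @(posedge rx_valid);     // branch 1: the byte arrived
            begin                    // branch 2: gave up waiting
                #(TIMEOUT_NS);
                timed_out = 1'b1;
            end
        join_any
        disable fork;  // stop whichever branch is still pending
    endtask

    initial begin
        bit timed_out;
        wait_valid_or_timeout(timed_out);
        if (timed_out) $error("timeout waiting for rx_valid");
        else           $display("rx_valid seen");
        $finish;
    end
endmodule
```

In the real testbench the same task body could replace the bare @(posedge nodeB_rx_valid), turning a silent hang into an explicit failure message.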
Final Thoughts: The IP Core Was Hiding This
A UART is about 150 lines of SystemVerilog. That is all. Every byte a serial terminal sends to your PC, every debug print statement from a microcontroller, every sensor reading streamed over a serial port: all of it runs on essentially this logic, a baud counter, a shift register, and four states.
Now open the I2C specification. You will see start conditions, ACK bits, and clock stretching, but underneath it all you will recognize the same pattern: a state machine counting clock cycles to drive and sample a line at exactly the right moment. Building a UART from scratch does not just give you a UART. It gives you a template for every serial protocol you will ever implement.
Happy coding.
fpgawizard.com