RNNs vs Transformers: The Engineering Reality

Swopnil Acharya

What "One Word" Actually Means

When we say "processing one word," we're really talking about:

python
# A word is represented as a vector (array of numbers)
word_vector = [0.2, -0.1, 0.8, 0.3, -0.5, ...]  # 512 or 1024 dimensions typically

# "cat" might be: [0.1, 0.9, -0.2, 0.4, ...]
# "dog" might be: [0.2, 0.8, -0.1, 0.5, ...]
# "sat" might be: [-0.3, 0.1, 0.7, -0.2, ...]

So when I say "process one word," I mean: feed one vector through the neural network.
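If it helps to make that concrete, here is a minimal sketch of the lookup that turns a word into a vector. The tiny vocabulary, the 4-dimensional embeddings, and the random values are made up for illustration; real models learn these tables during training.

python
import numpy as np

# Toy embedding table: one row of numbers per word in the vocabulary
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
embedding_dim = 4                      # real models use 512, 1024, or more
embedding_table = np.random.randn(len(vocab), embedding_dim)

def word_to_vector(word):
    # "Processing one word" starts with this lookup
    return embedding_table[vocab[word]]

print(word_to_vector("cat"))           # e.g. [ 0.12 -0.87  0.33  0.05]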

The Old Way: RNNs - Step-by-Step Processing

The Data Structure:

python
# Input sentence: "The cat sat on mat"
input_sequence = [
    [0.1, 0.2, 0.3, ...],  # "The" as vector
    [0.4, 0.5, 0.6, ...],  # "cat" as vector
    [0.7, 0.8, 0.9, ...],  # "sat" as vector
    [0.2, 0.1, 0.4, ...],  # "on" as vector
    [0.3, 0.6, 0.2, ...]   # "mat" as vector
]

The RNN Processing Loop:

python
def process_with_rnn(input_sequence):
    hidden_state = initialize_hidden_state()  # [0, 0, 0, ..., 0]

    for timestep, word_vector in enumerate(input_sequence):
        # CRITICAL: You can only process ONE vector at a time
        # You MUST wait for previous timestep to complete

        print(f"Timestep {timestep}: Processing word vector {word_vector}")

        # The RNN function takes:
        # 1. Current word vector
        # 2. Previous hidden state (memory from all previous words)
        hidden_state = rnn_cell(word_vector, hidden_state)

        print(f"Updated hidden state: {hidden_state}")
        # This hidden state now contains "compressed memory"
        # of all words seen so far

    return hidden_state  # Final understanding of entire sentence
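The loop above calls rnn_cell and initialize_hidden_state without defining them. A minimal sketch of a vanilla (Elman-style) RNN cell could look like this; the weight matrices here are random placeholders, whereas a real model learns them:

python
import numpy as np

embedding_dim, hidden_dim = 512, 256

# Random stand-ins for learned parameters
W_input  = np.random.randn(hidden_dim, embedding_dim) * 0.01
W_hidden = np.random.randn(hidden_dim, hidden_dim) * 0.01
bias     = np.zeros(hidden_dim)

def initialize_hidden_state():
    return np.zeros(hidden_dim)        # no memory yet

def rnn_cell(word_vector, hidden_state):
    # Squash the new word and the compressed memory of all previous words
    # into one fixed-size vector - this is where the lossy compression happens
    return np.tanh(W_input @ word_vector + W_hidden @ hidden_state + bias)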

What Actually Happens in Memory:

python
# Timestep 0: "The"
hidden_state = [0.1, 0.2, 0.05, ...]  # Just "The" information

# Timestep 1: "cat"
# RNN tries to combine "cat" + memory of "The"
hidden_state = [0.3, 0.1, 0.4, ...]   # "The cat" compressed

# Timestep 2: "sat"
# RNN tries to combine "sat" + memory of "The cat"
hidden_state = [0.2, 0.7, 0.1, ...]   # "The cat sat" compressed

# Problem: The original "The" information is getting weaker and weaker
# It's like lossy compression - you lose details from early words

The New Way: Transformers - Parallel Processing

Same Input, Different Processing:

python
import numpy as np

def process_with_transformer(input_sequence):
    # KEY INSIGHT: Process ALL word vectors simultaneously

    # Convert sequence to matrix (all words at once)
    input_matrix = np.array([
        [0.1, 0.2, 0.3, ...],  # "The"
        [0.4, 0.5, 0.6, ...],  # "cat"
        [0.7, 0.8, 0.9, ...],  # "sat"
        [0.2, 0.1, 0.4, ...],  # "on"
        [0.3, 0.6, 0.2, ...]   # "mat"
    ])

    # Shape: (5 words, 512 dimensions)
    print(f"Processing matrix of shape: {input_matrix.shape}")

    # ALL words are processed in parallel - no loop!
    output_matrix = transformer_block(input_matrix)

    return output_matrix

The Attention Mechanism (The Core Innovation):

python
def transformer_block(input_matrix):
    # input_matrix shape: (num_words, embedding_dim)
    # Let's say: (5, 512) for our 5-word sentence

    # Step 1: Create Query, Key, Value matrices
    Q = input_matrix @ W_query    # (5, 512) @ (512, 64) = (5, 64)
    K = input_matrix @ W_key      # (5, 512) @ (512, 64) = (5, 64)
    V = input_matrix @ W_value    # (5, 512) @ (512, 64) = (5, 64)

    # Step 2: Calculate attention scores
    # This is WHERE THE MAGIC HAPPENS
    attention_scores = Q @ K.T    # (5, 64) @ (64, 5) = (5, 5)

    # This 5x5 matrix tells us how much each word should
    # pay attention to every other word:
    #
    #           The   cat   sat   on    mat
    #    The  [[0.1,  0.2,  0.05, 0.1,  0.05],
    #    cat   [0.3,  0.8,  0.4,  0.1,  0.2 ],
    #    sat   [0.1,  0.9,  0.6,  0.7,  0.8 ],  # "sat" pays attention to cat(0.9) and mat(0.8)
    #    on    [0.05, 0.2,  0.7,  0.4,  0.9 ],
    #    mat   [0.1,  0.3,  0.8,  0.6,  0.7 ]]

    # Step 3: Apply attention to values
    attention_weights = softmax(attention_scores)  # Normalize to probabilities
    output = attention_weights @ V                 # (5, 5) @ (5, 64) = (5, 64)

    return output
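To check that the shapes actually line up, here is a self-contained version of the same computation with random projection weights and an explicit softmax. One detail the sketch above leaves out: real transformers divide the scores by the square root of the key dimension before the softmax so it doesn't saturate.

python
import numpy as np

num_words, embedding_dim, head_dim = 5, 512, 64

# Random stand-ins for the learned projection matrices
W_query = np.random.randn(embedding_dim, head_dim) * 0.01
W_key   = np.random.randn(embedding_dim, head_dim) * 0.01
W_value = np.random.randn(embedding_dim, head_dim) * 0.01

def softmax(x):
    # Row-wise softmax, with the usual max-subtraction for numerical stability
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

input_matrix = np.random.randn(num_words, embedding_dim)   # our 5-word sentence

Q = input_matrix @ W_query                                  # (5, 64)
K = input_matrix @ W_key                                    # (5, 64)
V = input_matrix @ W_value                                  # (5, 64)

attention_scores  = (Q @ K.T) / np.sqrt(head_dim)           # (5, 5)
attention_weights = softmax(attention_scores)               # each row sums to 1
output = attention_weights @ V                              # (5, 64)

print(attention_weights.shape, output.shape)                # (5, 5) (5, 64)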

The Key Engineering Differences

1. Computational Complexity:

python
# RNN: Sequential - cannot parallelize the timestep loop
for i in range(sequence_length):
    hidden_state = rnn_cell(input[i], hidden_state)  # BLOCKS here

# Transformer: Parallel - matrix operations can use all GPU cores
output = transformer_block(input_matrix)  # Single matrix operation
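A rough sketch of the same contrast in runnable form (numpy, random data): the RNN loop has a data dependency between iterations, while the transformer's core work is a few large matrix multiplies with no cross-row dependency. The price of that parallelism is the all-pairs comparison, which grows quadratically with sequence length.

python
import numpy as np

n, d = 5, 512                              # sequence length, model dimension
x = np.random.randn(n, d)                  # the whole sentence as a matrix
W_h = np.random.randn(d, d) * 0.01

# RNN: iteration t needs the result of iteration t-1, so the n steps
# run one after another no matter how many cores are available
h = np.zeros(d)
for t in range(n):
    h = np.tanh(x[t] + W_h @ h)

# Transformer: every word compared against every other word in one shot -
# n*n score entries, but nothing waits on anything else
scores = x @ x.T                           # (n, n)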

2. Memory Access Pattern:

python
# RNN: Each word can only access compressed memory of previous words
# Word 5 cannot directly "see" word 1 - it's been compressed away

# Transformer: Each word has direct access to every other word
# attention_scores[4][0] = how much word 5 attends to word 1
# No information loss!

3. Gradient Flow (Training):

python
# RNN: Gradients must flow backward through the entire sequence
# Error from word 5 → word 4 → word 3 → word 2 → word 1
# Gets weaker at each step (vanishing gradient problem)

# Transformer: Direct connections between all words
# Error from word 5 can flow directly to word 1
# Much better gradient flow = better training
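A toy illustration of the vanishing part, assuming each backward step through the sequence scales the error signal by roughly 0.9:

python
gradient = 1.0
for step in range(50):
    gradient *= 0.9                 # one multiplication per timestep walked backward

print(gradient)                     # ~0.005 - almost nothing reaches word 1
                                    # in a 50-word sentence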

What "All Sentence at Once" Really Means

Instead of this (RNN):

python
result = rnn_cell(word1, empty_memory)
result = rnn_cell(word2, result)
result = rnn_cell(word3, result)
result = rnn_cell(word4, result)
result = rnn_cell(word5, result)

You do this (Transformer):

python
# All words in a single matrix operation
input_matrix = np.stack([word1, word2, word3, word4, word5])
result = transformer_block(input_matrix)  # Processes all words simultaneously

The "attention" mechanism is what allows each word to selectively focus on relevant other words, even though they're all processed in parallel.

Analogy for Engineers

RNN = Processing a linked list - you must traverse sequentially

python
current = head
while current:
    process(current.data)
    current = current.next  # Must wait for each step

Transformer = Processing an array with random access + a smart indexing system

python
# Process all elements in parallel
results = parallel_map(process_function, array)

# But with a sophisticated attention mechanism that lets each element
# dynamically decide which other elements are relevant to it

The breakthrough insight: You don't need sequential processing for language understanding. You just need a way for each word to figure out which other words matter to it - and that's exactly what attention provides.