protobuf.kmcd.dev

Advanced

03_TRACK_A

Schema Modeling

These sections cover the schema features used to model larger APIs: imports, dynamic payloads, partial updates, and service definitions.

You can use definitions from other .proto files using the import statement. However, managing these paths correctly is one of the most common points of friction in Protobuf development.

Import Resolution

The standard protoc compiler requires you to manually specify every include directory via -I (or --proto_path) flags. The compiler resolves files based on the current working directory combined with these flags.

If your import paths are inconsistent across your project (e.g., one file uses import "proto/user.proto" and another uses import "user.proto"), the compiler treats them as two different files that each define the same types. This commonly leads to baffling duplicate-symbol errors, where a type is reported as already defined in another file.

To avoid these issues, always import using the fully qualified path from the root of your project or your --proto_path.

Buf avoids this problem by using a buf.yaml file to define a deterministic module root. It resolves imports relative to that root and supports remote dependencies (similar to npm or Cargo).
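To see why inconsistent import strings cause trouble, here is a minimal Go sketch of include-path lookup. It is not how protoc is implemented; resolveImport and the files map are hypothetical stand-ins, and the sketch assumes the compiler identifies a file by its path relative to an include root.

```go
package main

import (
	"fmt"
	"path"
)

// resolveImport tries each include root in order and returns the first
// candidate that exists. files is a stand-in for the file system.
func resolveImport(roots []string, importPath string, files map[string]bool) (string, bool) {
	for _, root := range roots {
		candidate := path.Join(root, importPath)
		if files[candidate] {
			return candidate, true
		}
	}
	return "", false
}

func main() {
	files := map[string]bool{"proto/user.proto": true}
	roots := []string{".", "proto"} // like: protoc -I . -I proto

	for _, imp := range []string{"proto/user.proto", "user.proto"} {
		onDisk, _ := resolveImport(roots, imp, files)
		fmt.Printf("import %q -> disk file %q\n", imp, onDisk)
	}
}
```

Both import strings reach the same file on disk, yet they are two distinct names from the compiler's point of view, so the types inside appear to be defined twice. Using a single root and one spelling per file removes the ambiguity.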

COMMON/V1/USER.PROTO
edition = "2023";
package common.v1;

message User {
  string id = 1;
  string name = 2;
}
AUTH/V1/SERVICE.PROTO
edition = "2023";
package auth.v1;

import "common/v1/user.proto";

message LoginResponse {
  common.v1.User user = 1;
  string session_token = 2;
}
TERMINAL
# Set the root as the import path (-I .)
# This forces imports to use fully qualified paths
protoc -I . \
  --go_out=. \
  auth/v1/service.proto

The Any type allows you to include messages where the schema isn't known at compile time.

google.protobuf.Any embeds an arbitrary serialized Protobuf message along with a URL that identifies its type (e.g., type.googleapis.com/mypackage.MyMessage).

When serialized to ProtoJSON, this type identifier is rendered as a special @type property alongside the standard JSON fields of the embedded message, allowing parsers to route the payload correctly.

ANY_PAYLOAD
// In Proto:
import "google/protobuf/any.proto";

message Event {
  google.protobuf.Any payload = 1;
}

// In ProtoJSON:
// {
//   "payload": {
//     "@type": "type.googleapis.com/demo.User",
//     "name": "Hiro"
//   }
// }
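The routing idea can be sketched in plain Go without any Protobuf library. The anyMessage struct, the handlers registry, and the sample bytes are all assumptions made for illustration; real code would use the generated Any type and unmarshal the payload into a generated message.

```go
package main

import (
	"fmt"
	"strings"
)

// anyMessage mirrors the two fields of google.protobuf.Any.
type anyMessage struct {
	TypeURL string // e.g. "type.googleapis.com/demo.User"
	Value   []byte // serialized bytes of the embedded message
}

// messageName returns the fully qualified message name, which is the
// part of the type URL after the last slash.
func messageName(typeURL string) string {
	if i := strings.LastIndex(typeURL, "/"); i >= 0 {
		return typeURL[i+1:]
	}
	return typeURL
}

// route picks a handler by message name. The handlers map is a
// hypothetical registry, not a library API.
func route(m anyMessage, handlers map[string]func([]byte) string) (string, error) {
	h, ok := handlers[messageName(m.TypeURL)]
	if !ok {
		return "", fmt.Errorf("no handler for %q", m.TypeURL)
	}
	return h(m.Value), nil
}

func main() {
	handlers := map[string]func([]byte) string{
		"demo.User": func(b []byte) string {
			return fmt.Sprintf("demo.User payload, %d bytes", len(b))
		},
	}
	// Sample bytes only; the handler above does not decode them.
	msg := anyMessage{
		TypeURL: "type.googleapis.com/demo.User",
		Value:   []byte{0x12, 0x04, 'H', 'i', 'r', 'o'},
	}
	fmt.Println(route(msg, handlers))
}
```

A receiver that has no handler for a type URL gets an error instead of a silent mis-decode, which is the main reason the type identifier travels with the payload.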

If you are working with dynamic Protobuf messages, use Any. However, if you are working with arbitrary structured JSON data that you don't want to model or that is completely dynamic (like a schema-less JSON object), use google.protobuf.Value or google.protobuf.Struct.

A Value represents a dynamically typed value which can be either a null, a number, a string, a boolean, a recursive struct (object), or a list of values. It perfectly maps to any valid JSON structure.

Use this sparingly, as it defeats the purpose of Protobuf's strong typing, but it's essential for integrating with schemaless NoSQL databases or passing untyped metadata blocks.

VALUE_PAYLOAD
// In Proto:
import "google/protobuf/struct.proto";

message Event {
  // Represents any arbitrary JSON value
  google.protobuf.Value metadata = 1;
  
  // Represents specifically a JSON object
  google.protobuf.Struct custom_attributes = 2;
}

// In ProtoJSON:
// {
//   "metadata": "simple string or object",
//   "custom_attributes": {
//     "dynamic_key": [1, 2, 3],
//     "enabled": true
//   }
// }
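The correspondence between JSON values and the variants of Value can be sketched with Go's standard JSON decoder. The helper names kindOf and classify are hypothetical; the returned strings are the field names of the oneof inside google.protobuf.Value.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// kindOf names the google.protobuf.Value variant that a decoded JSON
// value corresponds to.
func kindOf(v any) string {
	switch v.(type) {
	case nil:
		return "null_value"
	case float64:
		return "number_value"
	case string:
		return "string_value"
	case bool:
		return "bool_value"
	case map[string]any:
		return "struct_value"
	case []any:
		return "list_value"
	default:
		return "unsupported"
	}
}

// classify decodes raw JSON and reports which variant it maps to.
func classify(raw string) (string, error) {
	var v any
	if err := json.Unmarshal([]byte(raw), &v); err != nil {
		return "", err
	}
	return kindOf(v), nil
}

func main() {
	for _, raw := range []string{`null`, `42`, `"hi"`, `true`, `{"enabled": true}`, `[1, 2, 3]`} {
		kind, _ := classify(raw)
		fmt.Printf("%-20s -> %s\n", raw, kind)
	}
}
```

Note that every JSON number lands in number_value, which is a double, so very large integers can lose precision when carried this way.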

google.protobuf.FieldMask is a well-known type used to identify a subset of fields in a request.

It is extremely useful for partial updates (PATCH), allowing a client to send only the modified fields instead of the entire object.

Beyond updates, FieldMask is a powerful tool for tuning read responses. You can design a single List or Get response that supports many optional fields and associations (e.g., user.profile, user.settings). The client passes a read_mask to tell the server exactly which subset of data to return, eliminating "over-fetching" without needing multiple specialized endpoints.

Important: FieldMasks are not automatic. They are just a list of strings. The server must explicitly use the mask to filter database queries or prune the response message before sending.
READ_UPDATE_MASKS
import "google/protobuf/field_mask.proto";

message GetUserRequest {
  string id = 1;
  // Client requests only specific fields
  // e.g. ["name", "email", "metadata.last_login"]
  google.protobuf.FieldMask read_mask = 2;
}

message UpdateUserRequest {
  User user = 1;
  // Client identifies which fields to update
  google.protobuf.FieldMask update_mask = 2;
}
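Because the server has to apply the mask itself, it can help to see what that work looks like. The following Go sketch prunes a response, modeled here as nested maps rather than generated messages, down to the paths in a read mask. applyMask and copyPath are hypothetical helpers, and silently ignoring unknown paths is a choice made for brevity; a real server might reject them.

```go
package main

import (
	"fmt"
	"strings"
)

// applyMask returns a copy of src containing only the fields named by
// paths. Paths use dots for nesting, as in "metadata.last_login".
func applyMask(src map[string]any, paths []string) map[string]any {
	out := map[string]any{}
	for _, p := range paths {
		copyPath(src, out, strings.Split(p, "."))
	}
	return out
}

// copyPath copies the value at parts from src into dst, creating
// intermediate maps in dst as needed.
func copyPath(src, dst map[string]any, parts []string) {
	val, ok := src[parts[0]]
	if !ok {
		return // unknown path: ignored in this sketch
	}
	if len(parts) == 1 {
		dst[parts[0]] = val
		return
	}
	child, ok := val.(map[string]any)
	if !ok {
		return // path descends into a non-message field
	}
	sub, ok := dst[parts[0]].(map[string]any)
	if !ok {
		sub = map[string]any{}
		dst[parts[0]] = sub
	}
	copyPath(child, sub, parts[1:])
}

func main() {
	user := map[string]any{
		"id":    "u1",
		"name":  "Hiro",
		"email": "hiro@example.com",
		"metadata": map[string]any{
			"last_login": "2024-01-01",
			"ip":         "10.0.0.1",
		},
	}
	fmt.Println(applyMask(user, []string{"name", "metadata.last_login"}))
}
```

An update mask works the same way in reverse: the server walks the mask and copies only the named fields from the request into the stored object, leaving everything else untouched.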

The service keyword is used to define RPC (Remote Procedure Call) interfaces. Frameworks like gRPC or Connect use these definitions to generate client and server code.

Services support four types of communication:

  • Unary: One request, one response.
  • Server streaming: One request, a sequence of responses.
  • Client streaming: A sequence of requests, one response.
  • Bidirectional streaming: Both sides send sequences of messages in a single call.

Note: While Protobuf provides the language to define these interfaces, the underlying networking protocols and implementation frameworks (like gRPC or Connect) are a broad topic and are out of scope for this guide.

SERVICE_DEFINITION
service UserService {
  // Unary: One request, one response
  rpc GetUser(GetUserRequest) returns (User);

  // Server Stream: One request, many responses
  rpc ListUsers(ListUsersRequest) returns (stream User);

  // Bidirectional Stream: Real-time chat
  rpc Chat(stream Message) returns (stream Message);
}

03_TRACK_B

Schema Evolution

Compatibility is the difficult part of long-lived Protobuf systems. These sections focus on what can change, what must be reserved, and how presence affects API behavior.

Protobuf is designed for forward and backward compatibility, but that guarantee depends on strict rules about what you cannot change.

As long as you follow the rules, old clients can read new messages (ignoring unknown fields), and new clients can read old messages (using default values for missing fields).
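The "ignoring unknown fields" behavior follows from the wire format: every field carries its number and wire type, so a reader can skip what it does not recognize. The Go sketch below is a hand-rolled reader for illustration only. readNameV1 is a hypothetical name, and the sketch handles just the varint and length-delimited wire types; generated code handles all of them.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// readNameV1 reads a message using an "old" schema that only knows
// field 1 (a string). Other field numbers are skipped by wire type.
func readNameV1(buf []byte) (string, error) {
	var name string
	for len(buf) > 0 {
		tag, n := binary.Uvarint(buf)
		if n <= 0 {
			return "", errors.New("bad tag")
		}
		buf = buf[n:]
		fieldNum, wireType := tag>>3, tag&7

		switch wireType {
		case 0: // varint: read and discard
			_, n := binary.Uvarint(buf)
			if n <= 0 {
				return "", errors.New("bad varint")
			}
			buf = buf[n:]
		case 2: // length-delimited: length prefix, then payload
			size, n := binary.Uvarint(buf)
			if n <= 0 || size > uint64(len(buf)-n) {
				return "", errors.New("bad length")
			}
			payload := buf[n : n+int(size)]
			buf = buf[n+int(size):]
			if fieldNum == 1 {
				name = string(payload)
			}
		default:
			return "", fmt.Errorf("wire type %d not handled in this sketch", wireType)
		}
	}
	return name, nil
}

func main() {
	// Written with a "new" schema: field 1 = "Hiro", field 3 = varint 30.
	msg := []byte{0x0a, 0x04, 'H', 'i', 'r', 'o', 0x18, 0x1e}
	fmt.Println(readNameV1(msg))
}
```

The old reader recovers the name and steps over field 3 without knowing what it means. This only works if field numbers and wire types stay stable, which is why changing a field's type or reusing a number is a breaking change.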

Automated Enforcement

Tools like buf breaking automate these checks by comparing your local changes against a previous version (e.g., your main branch) and failing if any wire-breaking changes are detected.

CLI_BREAKING_CHECK
// Check for breaking changes against main branch
$ buf breaking --against .git#branch=main

// Example failure output:
// user.proto:10:3: Field "1" changed type
//   from "string" to "int32".
// user.proto:12:3: Previously present
//   field "3" deleted.

Protobuf Editions unifies proto2 and proto3, allowing features to be toggled individually rather than through major syntax version upgrades.

Editions allows for smooth migrations and fine-grained control over behaviors:

  • Field Presence: Choose between IMPLICIT (proto3 default) or EXPLICIT (proto2 default).
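  • Enum Type: OPEN enums accept unknown numeric values in the field, while CLOSED enums do not: an unrecognized value is kept with the unknown fields instead of being stored in the enum field.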
  • Repeated Encoding: Standardize on PACKED (for efficiency) or EXPANDED (for compatibility).

This shift represents a fundamental change in the Protobuf lifecycle. By decoupling features from syntax versions, Editions provides a path for the ecosystem to evolve more rapidly. This approach allows new features to be introduced as optional behaviors without the disruption of a global "proto4" release.

EDITION_CONFIG
edition = "2023";

// Set the file-level default for field presence
// (EXPLICIT is already the edition 2023 default)
option features.field_presence = EXPLICIT;

message User {
  // Explicit presence: "unset" and "" are distinguishable
  string name = 1;
  
  // Mixed behavior in one file!
  int32 age = 2 [features.field_presence = IMPLICIT];
}
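Feature settings are inherited from the enclosing scope, so a field uses its own setting if it has one, otherwise the file's, otherwise the edition default. A minimal Go sketch of that lookup order follows. The features type and resolve function are hypothetical and only model the idea; the edition 2023 defaults listed are the ones for the three features discussed above.

```go
package main

import "fmt"

// features maps a feature name to its setting,
// e.g. "field_presence" -> "EXPLICIT".
type features map[string]string

// resolve merges scopes from outermost to innermost; inner scopes win.
func resolve(scopes ...features) features {
	out := features{}
	for _, s := range scopes {
		for k, v := range s {
			out[k] = v
		}
	}
	return out
}

func main() {
	edition2023 := features{
		"field_presence":          "EXPLICIT",
		"enum_type":               "OPEN",
		"repeated_field_encoding": "PACKED",
	}
	file := features{"field_presence": "IMPLICIT"}  // option features.field_presence = IMPLICIT;
	field := features{"field_presence": "EXPLICIT"} // [features.field_presence = EXPLICIT]

	fmt.Println(resolve(edition2023, file)["field_presence"])        // IMPLICIT
	fmt.Println(resolve(edition2023, file, field)["field_presence"]) // EXPLICIT
	fmt.Println(resolve(edition2023, file, field)["enum_type"])      // OPEN
}
```

Features that no scope overrides simply fall through to the edition default, which is what lets a new edition change defaults without touching files that pin their own settings.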

Implicit vs. Explicit

Field presence determines whether a receiver can distinguish between a field that was never set and one that was set to its default value (like 0 or ""). In short, implicit presence saves space by never sending default values, while explicit presence includes extra tracking to definitively tell you if a field was populated.

The Historical Context

In proto2, all singular fields had explicit presence. In proto3, the optional keyword was initially removed to simplify the language and generated code. This meant all scalars had implicit presence: if the sender didn't set a value, the receiver saw the default and could not tell the difference. Message-typed fields kept explicit presence throughout.

The Modern Solution

Due to widespread demand, the optional keyword was re-introduced in later versions of proto3 (v3.15+). Today, Protobuf Editions provides the most robust solution by allowing you to globally or locally toggle field_presence between IMPLICIT and EXPLICIT.

PRESENCE_COMPARISON

File-Level Default

edition = "2023";
// Set EXPLICIT presence for the entire file
option features.field_presence = EXPLICIT;

message Profile {
  string bio = 1;   // Explicit (tracked)
  int32 views = 2; // Explicit (tracked)
}

Field-Level Overrides

message LegacyData {
  // Override to IMPLICIT for specific fields
  int32 raw_id = 1 [features.field_presence = IMPLICIT];
  
  // Follows file-level default (EXPLICIT)
  string note = 2;
}
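What presence tracking means for application code can be sketched with plain Go types. The profile struct below mirrors the idea, not the exact shape of any generated code: a plain value models implicit presence, and a pointer models explicit presence. The field names are made up for the example.

```go
package main

import "fmt"

// profile models the two presence styles with ordinary Go types.
type profile struct {
	Views  int32  // implicit: 0 and "never set" look the same
	Visits *int32 // explicit: nil means "never set"
}

// describe reports what a receiver can learn about each field.
func describe(p profile) string {
	visits := "unset"
	if p.Visits != nil {
		visits = fmt.Sprintf("set to %d", *p.Visits)
	}
	return fmt.Sprintf("views=%d (cannot tell if set), visits %s", p.Views, visits)
}

func main() {
	zero := int32(0)
	fmt.Println(describe(profile{}))
	fmt.Println(describe(profile{Visits: &zero}))
}
```

The second call shows the case that implicit presence cannot express: a sender deliberately setting a field to 0. This distinction matters for partial updates, where "set to 0" and "leave unchanged" must not be confused.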

The Evolution of Required

The required keyword was famously removed in proto3. This was a deliberate architectural decision to ensure that schemas could evolve safely without breaking backward compatibility.

Why was it removed?

If a field is marked required, it must be present in every message. If you later decide to stop sending that field, every older client that still treats it as required will fail to parse the new message. For this reason, required fields are considered harmful for long-term schema evolution.

Modern Best Practices

  • Application Validation: Use generated getters that return zero values if the field is missing (e.g., Go's GetField()) and check for those zero values in your business logic.

  • Metadata Validation: Use extensions like protovalidate to declare constraints (including required) in the IDL without breaking wire compatibility.

METADATA_VALIDATION
import "buf/validate/validate.proto";

message CreateUserRequest {
  // Required at the validation layer
  // but optional at the wire layer.
  string email = 1 [
    (buf.validate.field).string.email = true,
    (buf.validate.field).required = true
  ];
}
APPLICATION_VALIDATION (Go)
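// Assumes imports of google.golang.org/grpc/codes and
// google.golang.org/grpc/status.
// GetEmail is nil-safe: it returns "" even if req is nil.
email := req.GetEmail()
if email == "" {
    return status.Error(codes.InvalidArgument, "email is required")
}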

The Hard Limit

The absolute maximum size of a serialized protobuf message is 2 GiB. This is a hard limit in practice because implementations commonly use 32-bit signed integers for byte lengths and offsets. If a payload exceeds this size, standard parsers and serializers will report an error and refuse to process it.

The Typical Size

Protobuf is optimized for small, fast payloads. The official recommendation is to keep messages under a few megabytes. In practice, the ideal size is typically under 1 MB.

Once a message grows beyond 10 MB, the CPU and memory costs of parsing become highly noticeable. For moving large datasets, the standard pattern is to chunk the data into a stream of smaller messages.
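The chunking pattern itself is simple. Here is a minimal Go sketch, where chunk is a hypothetical helper and the 1 MiB chunk size is an assumption chosen to match the guidance above rather than a required value.

```go
package main

import "fmt"

// chunk splits data into pieces of at most size bytes. Each piece
// would be wrapped in its own small message and sent on a stream.
func chunk(data []byte, size int) [][]byte {
	if size <= 0 {
		return nil
	}
	var out [][]byte
	for len(data) > 0 {
		n := size
		if len(data) < n {
			n = len(data)
		}
		out = append(out, data[:n])
		data = data[n:]
	}
	return out
}

func main() {
	payload := make([]byte, 2_500_000) // about 2.5 MB of data
	parts := chunk(payload, 1<<20)     // 1 MiB per chunk
	for i, p := range parts {
		fmt.Printf("chunk %d: %d bytes\n", i, len(p))
	}
}
```

The chunks share the original backing array, so no data is copied. The receiver only ever holds one chunk-sized message in its parser at a time, which keeps peak memory bounded regardless of the total dataset size.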

MEMORY_BEHAVIOR

Full Graph Parsing

Protobuf is fundamentally designed around the expectation that you will load the entire message into memory at once.

When you deserialize a payload, the parser reads the entire binary stream and instantiates a complete object graph.

In-Memory Expansion

As with most serialization formats, the resulting in-memory representation is significantly larger than the serialized binary. Pointers, object overhead, and data structure padding can cause memory usage to be several times the size of the original payload.