Cache consistency with logical streaming replication

3 minute read

The fundamental theorem of software engineering is that we can solve any problem by introducing an extra level of indirection. To paraphrase, if a typical n-tier application is slow, put a cache (indirection) in front of the database and the problem is solved.

However, there are two hard things in computer science: cache invalidation, naming things and off-by-1 errors. When a value in the database is updated, any cached value becomes inconsistent with the database. Two plus two is now five.

This article explores solving the cache invalidation problem with PGEC a real-time in memory database cache released under the Apache license. It uses PostgreSQL logical streaming replication to remain consistent with the database.

Background

Firstly a review of one of the most common database caching strategies.

Cache-aside (Lazy Loading)

When an application needs to read data, a common pattern is to check a local cache first, and only on a cache miss to then query the database.

On a cache miss, the database is queried for the current value, which is then used to populate the cache and return to the client. This strategy can result in the database running hot until the cache populates.

When a value in the database is updated, any cached value becomes inconsistent with the database. A mitigating strategy can be to evict values in the cache, ensuring that they are frequently refreshed from database. The eviction can be driven by a time to live (TTL), or on a least recently used (LRU) basis, or a combination of strategies.

PostgreSQL and Logical Replication

PostgreSQL is a powerful, open source object-relational database system, and in development since 1986. Logical replication is a method of replicating data objects and their changes, based upon their replication identity (either a primary or composite key).

Logical replication uses a publish and subscribe pattern with one or more subscribers subscribing to one or more publications.

The replication process creates a transaction checkpoint ensuring data integrity. Once the initial data has been collected, logical streaming replication starts, receiving changes that have been applied subsequent to the checkpoint. Changes on the publisher are sent to the subscriber as they occur in real time. They are applied in the order in which commits were made on the publisher so that transaction consistency is guaranteed.

All data from published tables will be replicated. Since PostgreSQL 15, the replicated data can be reduced by using a row filter or a column list.

Cache consistency with logical streaming replication

PGEC is an in-memory cache updated in real-time using logical streaming replication with a HTTP and memcached API. When data changes it is updated in the cache in real-time. Data is available immediately in the cache without requiring lazy loading. Values are removed as they are deleted from the database without TTL or LRU expiry.

High Level Architecture

PGEC is available as a docker container for both AMD64 and ARM64 processors. Lets take a look at simple deployment by cloning the repository containing the docker compose.yaml with sample PostgreSQL data installed and Grafana setup ready to go!

git clone https://github.com/shortishly/pgec.git

Alternatively, with the Github CLI installed use:

gh repo clone shortishly/pgec

demo

Change to the newly cloned directory:

cd pgec

Start up everything with:

docker compose --profile all up --detach --remove-orphans

Sample data is populated from the scripts in this directory, using this publication. The compose includes a small load generator using table randload. Grafana dashboards: http://localhost:3000/.

Replication Dashboard

The grades table is populated with data from:

"Last name","First name","SSN","Test1","Test2","Test3","Test4","Final","Grade"
"Alfalfa","Aloysius","123-45-6789",40.0,90.0,100.0,83.0,49.0,"D-"
"Alfred","University","123-12-1234",41.0,97.0,96.0,97.0,48.0,"D+"
"Gerty","Gramma","567-89-0123",41.0,80.0,60.0,40.0,44.0,"C"
"Android","Electric","087-65-4321",42.0,23.0,36.0,45.0,47.0,"B-"
"Bumpkin","Fred","456-78-9012",43.0,78.0,88.0,77.0,45.0,"A-"
"Rubble","Betty","234-56-7890",44.0,90.0,80.0,90.0,46.0,"C-"
"Noshow","Cecil","345-67-8901",45.0,11.0,-1.0,4.0,43.0,"F"
"Buff","Bif","632-79-9939",46.0,20.0,30.0,40.0,50.0,"B+"
"Airpump","Andrew","223-45-6789",49.0,1.0,90.0,100.0,83.0,"A"
"Backus","Jim","143-12-1234",48.0,1.0,97.0,96.0,97.0,"A+"
"Carnivore","Art","565-89-0123",44.0,1.0,80.0,60.0,40.0,"D+"
"Dandy","Jim","087-75-4321",47.0,1.0,23.0,36.0,45.0,"C+"
"Elephant","Ima","456-71-9012",45.0,1.0,78.0,88.0,77.0,"B-"
"Franklin","Benny","234-56-2890",50.0,1.0,90.0,80.0,90.0,"B-"
"George","Boy","345-67-3901",40.0,1.0,11.0,-1.0,4.0,"B"
"Heffalump","Harvey","632-79-9439",30.0,1.0,20.0,30.0,40.0,"C"

Betty Rubble’s grades are http://localhost:8080/pub/grades/234-56-7890:

curl -s http://localhost:8080/pub/grades/234-56-7890 | jq
{
  "final": 46,
  "first": "Betty",
  "grade": "C-",
  "last": "Rubble",
  "ssn": "234-56-7890",
  "test1": 44,
  "test2": 90,
  "test3": 80,
  "test4": 90
}

A ‘C-‘ seems harsh, lets give her a ‘C’ instead:

docker compose exec \
    --no-TTY \
    postgres \
    psql \
    --command="update grades set grade='C' where ssn='234-56-7890'"

Fetching the same row, but with the memcached API instead:

echo 'get pub.grades.234-56-7890' | nc -c localhost 11211

Will return:

VALUE pub.grades.234-56-7890 0 120
{"final":46,"first":"Betty","grade":"C","last":"Rubble","ssn":"234-56-7890","test1":44,"test2":90,"test3":80,"test4":90}
END

To retrieve the whole table via the REST API:

curl -s http://localhost:8080/pub/grades | jq

Primary keys:

curl -s http://localhost:8080/pub/deniro/Casino | jq
{
  "score": 80,
  "title": "Casino",
  "year": 1995
}

Composite keys:

curl -s http://localhost:8080/pub/cities/Tulsa/OK | jq
{
  "city": "Tulsa",
  "ew": "W",
  "lat_d": 36,
  "lat_m": 9,
  "lat_s": 35,
  "lon_d": 95,
  "lon_m": 54,
  "lon_s": 36,
  "ns": "N",
  "state": "OK"
}

Summary

pgec is an in-memory cache with a REST and memcached compatible API, released under the Apache license. It remains consistent with the underlying PostgreSQL database in real-time by using streaming replication.