5.2. Retention and cleanup

The TOA server enforces a single, time-based retention policy on imports: any per-date folder older than toa-server.dataRetention is deleted automatically by a background sweep. In the common case, operators do not need to write cleanup scripts: the sweep covers it.

5.2.1. How the sweep works

A daemon thread inside the server runs the sweep:

  • once when the application context finishes starting up (so a long downtime catches up immediately on restart),

  • and then once every 24 h for as long as the process lives.
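
The schedule above can be sketched as follows. This is a minimal Python illustration of the timing only, not the server's actual code; the names start_sweep_thread, sweep and stop_event are hypothetical and exist only for this example.

```python
import threading

DAY_SECONDS = 24 * 60 * 60


def start_sweep_thread(sweep, interval_seconds=DAY_SECONDS, stop_event=None):
    """Run `sweep` once immediately, then once per interval, on a daemon thread.

    `sweep` is a hypothetical zero-argument callable standing in for the
    retention sweep. Returns the thread and the event that stops it.
    """
    if stop_event is None:
        stop_event = threading.Event()

    def loop():
        while True:
            sweep()  # the first pass runs right at startup
            # wait() returns True as soon as stop_event is set,
            # and False when the interval elapses normally.
            if stop_event.wait(interval_seconds):
                break

    # daemon=True: the thread never keeps the process alive on its own.
    thread = threading.Thread(target=loop, name="retention-sweep", daemon=True)
    thread.start()
    return thread, stop_event
```

Because the first pass runs before the first wait, a process that was down for longer than 24 h catches up as soon as it starts rather than one interval later.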

For each configured domain, the sweep lists the entries directly under <dataRoot>/<code>/ and deletes every subfolder whose name parses as yyyy-MM-dd and whose date is strictly before the cutoff date (today minus dataRetention). A folder dated exactly on the cutoff is kept. Deletion is recursive: the entire date folder is removed, including all its imports, document folders, page binaries and metadata sidecars. Today’s date folder is never touched.
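
To make the selection rule concrete, here is a hedged Python sketch that only lists the folders a sweep would pick, without deleting anything. It is an illustration written from the description above, not the server's implementation; the function name expired_date_folders and its parameters are hypothetical, and the retention is assumed to be a whole number of days.

```python
import re
from datetime import date, timedelta
from pathlib import Path

DATE_NAME = re.compile(r"\d{4}-\d{2}-\d{2}")


def expired_date_folders(domain_dir, retention_days, today=None):
    """Return the date folders under `domain_dir` that are past retention.

    Only direct subfolders named yyyy-MM-dd are considered, and only those
    dated strictly before (today - retention_days) are returned.
    """
    if retention_days <= 0:
        return []  # a retention of 0d disables the sweep
    if today is None:
        today = date.today()
    cutoff = today - timedelta(days=retention_days)

    expired = []
    for entry in sorted(Path(domain_dir).iterdir()):
        if not entry.is_dir() or not DATE_NAME.fullmatch(entry.name):
            continue  # files and non-date names are ignored
        try:
            folder_date = date.fromisoformat(entry.name)
        except ValueError:
            continue  # looks like a date but is not one, e.g. 2024-13-01
        if folder_date < cutoff:
            expired.append(entry)
    return expired
```

With today = 2024-05-31 and a 30-day retention, the cutoff is 2024-05-01: a folder named 2024-04-30 is selected, while 2024-05-01 itself is kept.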

5.2.2. Tuning the window

The window is set in Server settings (toa-server.dataRetention, default 30 days). Choose it from the longest interval over which you realistically need to investigate a failed import or re-export an already submitted one. After the window expires, the data is gone - plan accordingly.

Setting dataRetention: 0d disables the sweep. Use this only if a separate process (e.g. a snapshotting backup tool) takes over the cleanup; otherwise dataRoot will grow without bound.

5.2.3. What the sweep does not delete

  • The cached templates XML (<dataRoot>/<code>/cmserver2.xml) - it has no date in the name and is regenerated by Template catalogue.

  • Date folders under domains that are no longer in toa-server.domains[]. If you remove a domain from the configuration, its data folder stays on disk in full. Delete it manually if you want the space back.

  • Anything outside dataRoot.
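
Since data for removed domains is never swept, it can help to list such leftovers from time to time. The following Python sketch does that under one assumption: every directory directly under dataRoot is a domain folder named after its code. The function name orphaned_domain_folders and its parameters are hypothetical; the list of configured codes has to be supplied by the caller, for example copied from toa-server.domains[].

```python
from pathlib import Path


def orphaned_domain_folders(data_root, configured_codes):
    """Return directories under `data_root` whose name is not a configured code.

    Nothing is deleted here; the result is only a list to review by hand.
    """
    configured = set(configured_codes)
    return sorted(
        entry
        for entry in Path(data_root).iterdir()
        if entry.is_dir() and entry.name not in configured
    )
```

Review the result before removing anything: a folder may be missing from the configuration by mistake rather than by intent.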

5.2.4. Manual cleanup

Because the on-disk layout is plain dated folders (see On-disk layout), an operator can supplement or replace the automatic sweep with standard tools:

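# Approximates a one-shot sweep with N=14 (GNU find). Note that -mtime tests each
# folder's modification time, not the date in its name, so results can differ.
find /var/lib/toa-server/data/<domain>/ -maxdepth 1 -type d \
     -regextype posix-extended \
     -regex '.*/[0-9]{4}-[0-9]{2}-[0-9]{2}' \
     -mtime +14 -exec rm -rf {} +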

Standard tools are also the way to recover space after a domain has been removed from the configuration: the automatic sweep no longer touches that path, so a one-shot rm -rf on the whole domain folder is the right tool.

5.2.5. Backups

The retention policy is destructive and runs without confirmation. If you need a longer-term archive, snapshot dataRoot to external storage at an interval shorter than dataRetention, so that every date folder is captured at least once before the sweep deletes it. Filesystem-level snapshots (LVM, ZFS, btrfs) are the cheapest option because the on-disk layout is append-only at the per-import level: there is no database to flush, and the only consistent-cut concern is import.json and the adjacent page-N.bin files in the same folder.