.. _operate-data-layout:

================
On-disk layout
================

Everything the TOA server keeps for itself lives under
``toa-server.dataRoot`` (see :ref:`configure-server`). The layout is plain
folders and files - no database - so an operator with shell access can answer
support questions and do manual cleanup with ordinary tools.

Top-level structure
===================

::

    /
      /                       one folder per configured domain
        cmserver2.xml         cached templates (URL-based domains only)
        yyyy-MM-dd/           one folder per server-local calendar day
          HHmm-xxxxxxxx/      one folder per import (HHmm + 8 hex chars)
            import.json
            doc-1/
              pages/
                page-1.bin
                page-1.meta.json    EML pages only
                page-2.bin
                ...
            doc-2/            sibling document, see below
              pages/
                page-1.bin
                ...

The per-domain folder is created on startup. The ``yyyy-MM-dd`` folder is
created the first time an import lands on that calendar day. The
``HHmm-xxxxxxxx`` folder is created when the add-in calls ``POST /import/``.
Nothing else writes into ````.

The ``yyyy-MM-dd`` and ``HHmm`` parts use the **server-local time zone**. The
same zone governs the retention cutoff, so a misconfigured zone has visible
consequences both here and in :ref:`operate-retention`. Pin the JVM time zone
explicitly on every TOA server instance - see
:ref:`configure-server-timezone`.

Import identifier
=================

The API-level import id has the form::

    yyyy-MM-dd_HHmm-xxxxxxxx

The two halves correspond directly to the date folder and the import folder on
disk. Given an id, an operator can locate the import on disk without
searching::

    ////

The ``xxxxxxxx`` suffix is 8 random hex characters; it makes the id
unguessable for download URLs and keeps imports unique within the same minute.

What each file means
====================

``import.json``
    Single source of truth for an import. Operator-relevant fields:

    * ``status`` - ``DRAFT``, ``PENDING``, ``SUBMITTED`` or ``FAILED``.
      ``FAILED`` carries an ``error`` message; ``SUBMITTED`` carries
      ``damisBatchId`` for cross-referencing the storage server.
    * ``userName`` / ``userEmail`` - whoever created the import in Outlook.
      ``userEmail`` is also the ownership key enforced on subsequent
      mutations.
    * ``documents[]`` - the list of documents in this import; each entry
      carries its template id, attribute values and ``pages[]`` metadata
      (filename, byte size, content type, sidecar filename if any).

    The file is rewritten atomically (write to ``import.json.tmp``, then
    ``ATOMIC_MOVE``). If you ever see ``import.json.tmp`` left over, the
    server crashed mid-rewrite - it is safe to delete; the previous
    ``import.json`` is intact.

``doc-N/``
    One folder per document inside the import. ``doc-1`` always exists and
    corresponds to the original message the user uploaded. ``doc-2``,
    ``doc-3``, ... are sibling documents created by the "extract attachments"
    flow - each holds the attachments split out of an EML page in ``doc-1``
    (or a later sibling).

``doc-N/pages/page-M.bin``
    Raw page payload. The byte stream is whatever the client posted -
    typically an EML message for ``page-1`` of ``doc-1``, or a single
    extracted attachment for sibling-document pages. The ``.bin`` extension
    is intentional; the semantic content type lives in ``import.json`` and
    (for EML) in the sidecar.

``doc-N/pages/page-M.meta.json``
    Sidecar produced for ``message/rfc822`` pages only. Contains the decoded
    ``from`` / ``subject`` and the list of MIME attachments (filename,
    content type, decoded size). It is a convenience index - if the sidecar
    is missing or unreadable, the ``.bin`` is still the source of truth and
    the server falls back to re-parsing on demand. Sidecars are rewritten
    when attachments are extracted, so they always match the on-disk EML.

``cmserver2.xml``
    Only present for domains whose templates are loaded from a URL (see
    :ref:`configure-templates`). Cached copy of the last successfully
    downloaded catalogue; used as the fallback when the next refresh fails.
    Safe to delete - the next refresh re-downloads it. If you delete it
    while the remote URL is also unreachable, the domain has no catalogue
    until either the URL recovers or you drop in a copy by hand.

Atomicity guarantees
====================

The layout is designed so that an operator's mental model matches the
filesystem state without race conditions:

* A page binary file existing on disk implies the upload completed.
  Interrupted uploads leave no ``page-N.bin`` at all - never a half-written
  one. The controller streams the request body to a sibling ``.tmp`` file
  and then moves it into place with ``ATOMIC_MOVE``.
* ``import.json`` and the page binaries inside the same import folder are
  mutated under a per-import lock held by ``DomainStorage``, so concurrent
  ``addPage`` / ``createDocumentFromAttachments`` / ``submit`` calls cannot
  interleave their writes.
* Atomicity is per-import. Two different imports under the same date folder
  are independent; backing up or deleting one never affects the other.

Manual operations
=================

Because the layout has no database, the following are all safe shell
operations, as long as the server is not actively writing to the target
import:

* **Inspect** an import: ``cat /import.json``, ``ls /doc-*/pages/``.
* **Archive** an import: ``tar`` or ``zip`` the ``HHmm-xxxxxxxx`` folder.
  The server will not notice it is gone until the next API call references
  it.
* **Delete** a single import: ``rm -rf`` the ``HHmm-xxxxxxxx`` folder. The
  corresponding API id will then return 404.
* **Bulk delete** old date folders: see :ref:`operate-retention`.

Do **not** rename folders or hand-edit ``import.json`` while the server is
running - the per-import lock is in-process only, so it does not guard
against external changes, and the server may observe a rename or edit
mid-operation. Stop the server first.
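
The mapping from an import id to its folder, described under *Import
identifier*, can also be scripted for read-only inspection. The following is
a minimal Python sketch, not part of the TOA server: ``import_dir`` and
``read_status`` are hypothetical helper names, the data root and the domain
folder name are supplied by the caller, and accepting both upper- and
lower-case hex digits in the suffix is an assumption.

.. code-block:: python

    import json
    import re
    from pathlib import Path

    # Id shape from the "Import identifier" section: yyyy-MM-dd_HHmm-xxxxxxxx.
    # Assumption: the 8 hex characters may be upper or lower case.
    IMPORT_ID = re.compile(r"(\d{4}-\d{2}-\d{2})_(\d{4}-[0-9a-fA-F]{8})")


    def import_dir(data_root, domain, import_id):
        """Map an API-level import id to its folder (hypothetical helper)."""
        match = IMPORT_ID.fullmatch(import_id)
        if match is None:
            raise ValueError(f"not a valid import id: {import_id!r}")
        # The two halves of the id are the date folder and the import folder.
        date_folder, import_folder = match.groups()
        return Path(data_root) / domain / date_folder / import_folder


    def read_status(folder):
        """Return the ``status`` field of import.json without modifying it."""
        with open(Path(folder) / "import.json", encoding="utf-8") as fh:
            return json.load(fh)["status"]

Under these assumptions, ``read_status(import_dir(root, domain, import_id))``
returns one of the ``status`` values listed under ``import.json``, provided
the import folder still exists. The sketch only reads and takes no lock; it
relies on ``import.json`` being replaced by an atomic move, so a reader sees
either the previous or the new file rather than a partial one.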