Files
seismo-relay/docs/idf_protocol_reference.md

342 lines
16 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# IDF Protocol Reference — Thor / Micromate Series IV
Starting-point reference for reverse-engineering Instantel's Micromate
Series IV event-file format. Sibling to
[instantel_protocol_reference.md](instantel_protocol_reference.md) (the
Series III "Rosetta Stone") — this doc holds what we know so far and
the open questions still to crack.
**Status (2026-05-28):** ASCII text sidecar fully decoded (1,014
sample files round-trip). **Thor IDFW** binary now decodes via
`micromate.idf_file.read_idf_file()` — reuses the BW segment-rotated
block codec verbatim at fixed body offset `0x0f1f`; metadata (serial,
timestamp, sample_rate, record_time, calibration_date) extracted from
the binary header. Sample fidelity is 8799% byte-exact on quiet
events; loud events hit the BW codec's known walker-stops-early
limitation. Residual ~3% drift on per-sample deltas (likely a
Thor-specific 12-bit delta refinement not yet modelled).
**Thor IDFH histograms also decoded.** Body has one or more segments;
each 12-byte segment header `[length_be 2B][0a 00 00 00][00 NN][05 3f]`
introduces `N = (length - 10) // 72` interval records of 72 bytes
each. Each interval = 4 × 16-byte per-channel records:
`[int16 min][int16 max][int16 ??][uint16 halfp][2B 00][uint16 ??][2B 00][uint16 ??]`.
Geo peak `= max(|min|, |max|) / 32768 × 10` in/s (matches sidecar
~1.8%); freq `= 512 / halfp` Hz (None for halfp ≤ 5 → ">100"
sentinel). Corpus: **all 859 Thor IDFH files decode, 181,071
intervals**. Wired through `read_idf_file()`
`save_imported_idf()` → sidecar's `extensions.idf_intervals`.
**Note on the BE9439 outliers in the example corpus:** Two files
(`BE9439_20200713131747.IDFW` and `BE9439_20200713124251.IDFH`) are
**Series III Blastware** binaries, not Thor. Provenance: TMI tried
to use Thor to manage auto-call-homes for Series III units; the
experiment didn't work out, but it did leave a few BW event files
in Thor's per-serial directory structure with `.IDFW`/`.IDFH`
extensions — Thor's forwarder applied its own naming convention to
the BW bodies it was relaying. Their header `10 00 01 80 00 00
Instantel STRT ff fe <end_key> <start_key>` is the BW SUB 5A STRT
record, not a Thor body preamble. The reader detects them by
signature and raises `NotImplementedError` pointing callers at
`read_blastware_file()`, which extracts BW-format peaks from them.
**Still NYI for Thor IDFH:** per-channel `int16 field4` (possibly
time-of-peak); the two uint16 fields (probably PVS contributions);
8-byte interval tail (PVS data); mic dB(L) exact conversion constant.
### Codec breakthroughs (2026-05-28)
- **Body offset is a fixed `0x0f1f`** across 151/154 corpus IDFW
files. Preceded by a 4-byte record-type marker (`46 00 00 00`)
+ magic preamble `00 02 00 [Tran[0] BE] [Tran[1] BE]`.
- **Sample stream is BW's segment-rotated block codec verbatim.**
Thor reuses `10 NN` (nibble), `20 NN` (int8), `00 NN` (RLE),
`30 NN` (packed12), `40 02` (segment header) tags with the same
semantics. Channel rotation Tran→Vert→Long→MicL.
- **Geo LSB = 0.0003 in/s** (not BW's 0.005), because Thor's 16-bit
ADC range maps to 10 in/s without the 16-count BW quantization step.
- **Mic ≈ 2.14×10⁻⁶ psi/count** (rough scale; refine after channel
block calibration constants are decoded).
- **BW compliance anchor `\xbe\x80\x00\x00\x00\x00` reappears at
IDFW offset 0x952** — sample_rate at anchor6 (uint16 BE),
record_time at anchor+6 (float32 BE), same layout as BW.
- **Event timestamp at offset 0x97A** — 8 bytes `[day][month]
[year_be][unk][hour][min][sec]`. Stop-time mirrors at 0x982.
- **Serial as null-terminated ASCII at 0x14E**.
- **Calibration date** at 0x1940x197 (day, month, year_be).
- Per-sample residual drift of ~3% suggests Thor encodes int8/nibble
deltas with an extra refinement bit that BW doesn't carry —
unsolved; errors resync within a few samples so cumulative impact
is small.
---
## File model
### Filename convention
```
<SERIAL>_<YYYYMMDDHHMMSS>.<KIND>
```
- **SERIAL** — literal device serial, two-letter prefix + numeric
suffix. Examples seen: `UM11719`, `UM13981`, `UM20147`, `BE9439`.
Unlike Series III BW filenames (`M529LK44.AB0`, base-36 stem),
Series IV filenames carry the serial in plain text.
- **YYYYMMDDHHMMSS** — 14-char ASCII timestamp in **device local
time** (no timezone marker).
- **KIND** — `IDFH` for histograms, `IDFW` for waveforms.
The `.IDFH.txt` / `.IDFW.txt` ASCII sidecar lives in a `TXT/`
**subfolder** of the unit's directory, not alongside the binary.
This pairing convention is encoded in
`event_forwarder.idf_report_path()`.
### Directory layout
```
C:\THORDATA\
└── <Project>\
└── <UM####>\ ← unit serial dir
├── UM12345_20260520100000.MLG ← monitor log (not events)
├── UM12345_20260520100000.IDFH ← histogram event (binary)
├── UM12345_20260520100000.IDFW ← waveform event (binary)
├── UM12345_20260520100000.IDFW.CDB ← cache-DB variant (skip)
├── TXT\
│ ├── UM12345_20260520100000.IDFH.txt ← histogram ASCII sidecar
│ └── UM12345_20260520100000.IDFW.txt ← waveform ASCII sidecar
├── CSV\, HTML\, PDF\, XML\ ← operator-facing derived exports
└── ...
```
The `.IDFW.CDB` files share the binary's basename but appear to be a
separate cache/database variant. Their first 8 bytes match the
**old**-firmware Thor signature (see below) regardless of which
signature the paired `.IDFW` uses. Purpose unknown; sizes vary
wildly (observed 123 B → 40,491 B). Thor-watcher's forwarder
deliberately skips them.
### Sample corpus
The `thor-watcher/example-data/THORDATA_example/` tree carries
**1,014 paired .IDFW / .IDFH + .txt files** spanning 20202023
across nine units (UM11719, UM13981, UM20147, …, plus BE9439 from
2020). This is the reverse-engineering ground truth.
---
## ASCII sidecar (`.IDFW.txt` / `.IDFH.txt`) — fully decoded
Shape: plain text, one `"Key : Value"` line per metadata field,
followed for waveforms by a tab-separated sample table headed by
the literal line `Waveform Data Channels`. Parsed by
[`micromate/idf_ascii_report.py`](../micromate/idf_ascii_report.py).
See [`micromate/models.py`](../micromate/models.py) for the typed
`IdfReport` shape.
### Notable conventions
- **Units are native to Thor** — geophone in **in/s**, microphone in
**dB(L)** (not psi like Series III BW reports), frequency in Hz,
acceleration in g, displacement in in.
- **Below-threshold readings** appear as the literal string
`<0.005 in/s` (155 occurrences in the sample corpus) — the parser
strips the `<` and treats the numeric remainder as the value.
- **Out-of-range / not-measured** values appear as `N/A` — parser
drops the field rather than letting the string leak into a numeric
column.
- **Firmware string** observed: `Micromate ISEE 11.0AK`.
- **TitleString1..4** are operator-defined free-text slots; Thor's
default labels map them to Location / Client / Company / Notes,
which the parser surfaces as `project` / `client` / `operator` /
`notes`.
- **Histogram sidecars** use `HistogramStartDate` / `HistogramStartTime`
in place of waveform's `EventDate` / `EventTime`. Parser falls
through to either.
- **Histogram tabular block** lacks the `Waveform Data Channels`
marker; instead it's a multi-line column header followed by
per-interval rows (`<date> <time> <tran-ppv> <freq> ...`). Parser
silently ignores lines after the metadata block since they lack a
colon-separated `key : value` shape (the timestamps DO contain
colons but produce garbage keys that don't collide with any
recognised field).
---
## Binary header signatures (observed)
Hex dump of the first 32 bytes across 1,014 sample files reveals
**two distinct file signatures**, both anchored by the literal
ASCII string `"\x00Instantel\x00"` at offset 616:
### Signature A — newer firmware (1,012 files, 99.8% of corpus)
```
00000000: 0012 0100 0000 496e 7374 616e 7465 6c00 ......Instantel.
00000010: 0000 a695 002e b500 4f70 6572 6174 6f72 ........Operator
^^^^^^^^^^^^^^^^
operator/title string starts at 0x18
```
Header bytes 05: `00 12 01 00 00 00`. Followed immediately by the
8-byte ASCII tag, then 6 unknown bytes, then ASCII operator-supplied
strings (Operator name, etc.) and on through the project / client /
title strings. No `STRT` record observed in this layout.
### Signature B — older firmware (2 files: BE9439 from 2020)
```
00000000: 1000 0180 0000 496e 7374 616e 7465 6c00 ......Instantel.
00000010: 072c 0012 0300 5354 5254 fffe 0111 2340 .,....STRT....#@
^^^^^^^^^ ^^^^^^^^^
STRT magic 4-byte end_key
00000020: 0111 0000 2e5f 00ac 4600 0000 0200 0000 ....._..F.......
^^^^^^^^^ ^^^
4-byte start_key 0x46 (BW WAVEHDR record-type marker)
```
Header bytes 05: `10 00 01 80 00 00`. The structure after the
`Instantel` magic is **byte-for-byte identical to a BW SUB 5A
probe-response STRT record** as documented in
[instantel_protocol_reference.md → "SUB 5A — STRT record encodes
end_offset"](instantel_protocol_reference.md). Specifically:
| Offset | Bytes | Meaning (per BW reference) |
|--------|---------------------|--------------------------------------|
| 0x14 | `53 54 52 54` | `STRT` magic |
| 0x18 | `ff fe` | STRT sentinel |
| 0x1A | `01 11 23 40` | `end_key` (4 bytes) |
| 0x1E | `01 11 00 00` | `start_key` (4 bytes) |
| 0x26 | `46` | `0x46` waveform-record type marker |
**Hypothesis:** Older Micromate firmware writes a wrapped BW-format
event into the `.IDFW` file — essentially the same on-disk shape as
a Series III device, with the new filename convention applied at
export time. Newer firmware (signature A) abandoned the
BW-compatible layout for an Instantel-specific format.
If that hypothesis holds, the 2 signature-B files can already be
parsed via `minimateplus/event_file_io.read_blastware_file()` — worth
testing. The 1,012 signature-A files are the real reverse-engineering
target.
### `.IDFW.CDB` cache files
Always carry signature B (`10 00 01 80 ...`), even when the paired
`.IDFW` carries signature A. Plausible explanation: the CDB is an
internal Thor cache-database export that retains the legacy BW-style
record layout regardless of the user-facing `.IDFW` format version.
Not currently consumed by the forwarder.
---
## File-size patterns (Signature A, the main target)
Survey of 1,012 signature-A files:
| Event type | Typical size | Source of variance |
|--------------|-------------------|----------------------------------------------|
| `.IDFW` 2-sec | 9,200 10,500 B | Operator-supplied strings (TitleString1..4) of varying length |
| `.IDFH` | 2,944 4,076 B | Histogram interval count (record duration / interval) |
**Naive arithmetic for 2-sec waveform:**
- 4 channels × 2 sec × 1024 sps = 8,192 samples
- At 2 bytes/sample (int16) = 16,384 sample bytes → file would be > 16 KB
- Observed: ~910 KB
- → samples are likely **1 byte each** (int8 quantised), **or** stored
with bit-packing / delta encoding, **or** only one channel's
full-rate samples are stored with the others reconstructed
arithmetically. Verifying this is the **first RE milestone**.
Project-stringlength variance (~1 KB across the corpus) is consistent
with the file carrying a single copy of each TitleString1..4 plus
operator + setup-name as null-padded ASCII regions.
---
## Open questions
The reverse-engineering targets, roughly in dependency order:
1. **Sample encoding (signature A)** — int8? int16 LE/BE? Bit-packed?
Delta-coded? Per-channel interleaved or sequential blocks?
2. **Header field layout (signature A)** — where do sample_rate,
record_time, channel count, and per-channel peaks live in the
binary? The ASCII sidecar gives the device-authoritative values,
so binary fields can be confirmed by diff.
3. **Operator-string offsets** — `Operator` at 0x18 is the first
visible string in signature-A files; the rest (project, client,
notes, setup) follow. Need to map exact offsets and null-padding
conventions.
4. **Signature-B → BW codec compatibility** — does
`minimateplus/event_file_io.read_blastware_file()` actually parse
the 2 BE9439 signature-B files as-is? If yes, the OLD-format
ingest is free.
5. **`.IDFW.CDB` purpose** — is it an internal Thor cache, a
ring-buffer dump, or something else? Worth a single small effort
to characterise so we know what we're skipping.
6. **Footer / checksum** — every BW event file has a footer; does
IDF? Where does the per-channel sample block end?
---
## Reverse-engineering playbook (when we start)
The Series III BW codec took ~2 months of MITM wire captures
because we didn't have ground-truth metadata. Thor's situation is
**substantially better**:
- **Ground truth is on disk.** Every binary in `example-data/`
has a paired `.IDFW.txt` carrying the full decoded sample table
(`Waveform Data Channels` block — see any sample file in
`thor-watcher/example-data/.../TXT/`). Aligning binary bytes
to the table's float-per-row values gives an immediate per-byte
hypothesis test.
- **Cross-event diffing.** 1,012 signature-A samples from 9 units
spanning 4 years means any field that varies between events is
immediately localisable. Fields that are constant across all
files (firmware ID, channel labels, format-version word) are also
immediately localisable by complementary search.
- **No protocol surface.** Files at rest, not a wire dialect. No
DLE stuffing, no inner-frame parsing, no probe/data two-step.
Suggested first session (2-4 hours): hand-decode `UM11719_20231219162723.IDFW`
(10,290 bytes) against its `TXT/UM11719_20231219162723.IDFW.txt`
sample table (the 2-sec waveform at 1024 sps × 4 channels = 8,192
sample rows). Find the first per-channel sample value (`0.0003` in
the Tran column at t=0) in the binary. Confirms sample encoding.
Everything else flows from there.
---
## Code seams ready to receive the codec
When the codec lands, it goes into
[`micromate/idf_file.py`](../micromate/idf_file.py) (currently a
stub raising `NotImplementedError`). Public API:
```python
from micromate import IdfEvent
from micromate.idf_file import read_idf_file
event: IdfEvent = read_idf_file(Path("UM11719_20231219163444.IDFW"))
# event.peaks.transverse_ips, event.timestamp, event.raw_samples, ...
```
The ingest pipeline (`WaveformStore.save_imported_idf`) currently
builds the `IdfEvent` from the `.txt` parser only. Once
`read_idf_file()` works, the binary becomes authoritative; the
`.txt` parser drops to fast-path metadata cross-check. Operators
who don't enable Thor's TXT exporter still get fully populated
events.
---
## See also
- [instantel_protocol_reference.md](instantel_protocol_reference.md) — Series III BW protocol reference (the Rosetta Stone). STRT record format, DLE framing, BW filename encoding.
- [`micromate/idf_ascii_report.py`](../micromate/idf_ascii_report.py) — `.txt` sidecar parser.
- [`micromate/models.py`](../micromate/models.py) — `IdfEvent`, `IdfReport` typed dataclasses.
- [`micromate/idf_file.py`](../micromate/idf_file.py) — placeholder for the binary codec.
- [`thor-watcher/example-data/THORDATA_example/`](../../thor-watcher/example-data/) — 1,014 paired binary + .txt files for codec validation.