How We Compressed 63.5 GB of Financial Tick Data to 5.5 GB

Algo Trading Specifics
Parth Khare
Parth is Co-Founder and CEO of AlphaBots, where he is responsible for research and new product initiatives. He is also an expert algo trader and educator in this space.
May 9th, 2026 | 10 mins

At AlphaBots, we run an algorithmic trading platform that processes live market data across Indian equity and derivatives markets. Every second, we capture 1-second snapshot data and full tick data across Nifty, BankNifty, and equity instruments. It adds up fast — we're talking gigabytes of new data every single trading day, and it compounds.

We store this data for backtesting, strategy validation, and compliance. After a few months of live operation, the storage bill started hurting. More importantly, loading large Parquet files for backtesting runs was slow. We were spending more time moving data around than actually running strategies.

We tried compressing with Parquet's built-in ZSTD. It helped, but not enough. The files were still large, the storage costs were still climbing, and the fundamental problem remained: general-purpose compression isn't optimised for the specific mathematical structure of financial tick data.

So we built our own.

The Insight: Tick Data Has Exploitable Structure

Financial tick data is not random. It has properties that general-purpose compressors ignore:

Prices move in tiny increments. A Nifty futures price might go from 22,450.25 to 22,450.50 to 22,450.25. The raw float64 values look very different. But the differences — +0.25, -0.25 — are tiny and repetitive. If you store differences instead of raw values, the data collapses dramatically.
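
A small illustration of this (plain NumPy here, not TSC's internal code):

import numpy as np

# Three consecutive prices: the raw 64-bit patterns share almost nothing,
# but the tick-to-tick differences are tiny and repetitive.
prices = np.array([22450.25, 22450.50, 22450.25], dtype=np.float64)
print([hex(p) for p in prices.view(np.uint64)])  # three very different bit patterns
print(np.diff(prices))                           # [ 0.25 -0.25]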

Columns are homogeneous. All prices are floats in a similar range. All volumes are integers. All timestamps are sequential. Columnar storage exploits this — you compress each column independently rather than row-by-row, which means the compressor sees 8 million prices together instead of interleaved with volumes and symbols.

Data is written once and read rarely. Unlike a hot database, tick data archives are almost never updated after writing. They get queried in batch for backtesting. This means we can afford to spend more time on compression, because decompression only happens a handful of times.

These three properties together suggested a compression pipeline that general-purpose tools weren't exploiting.

The Approach: Four Steps Before ZSTD Sees Anything

We built TSC (Time-Series Compressor) as a Rust-native engine. The pipeline works like this:

Step 1 — Columnar layout. Split the dataset into individual columns before doing anything else. Process each column independently. This ensures the compressor sees homogeneous data — all prices together, all timestamps together, all volumes together.
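
A minimal sketch of the idea (pandas/NumPy, illustrative only; not TSC's internal code):

import pandas as pd

def split_columns(df: pd.DataFrame) -> dict:
    # Each downstream stage (delta encoding, bit-packing, ZSTD) then works
    # on one homogeneous, contiguous array at a time.
    return {name: df[name].to_numpy() for name in df.columns}

columns = split_columns(pd.DataFrame({
    "ts":    [1_700_000_000, 1_700_000_001, 1_700_000_002],
    "price": [22450.25, 22450.50, 22450.25],
    "qty":   [50, 75, 50],
}))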

Step 2 — Delta encoding. For each column, store the difference between consecutive values instead of the raw value. For a price column going 22450.25, 22450.50, 22450.25, we store: 22450.25 (baseline), +0.25, -0.25. The differences are tiny integers. For timestamp columns, tick-to-tick differences are often exactly 1 second — they compress to almost nothing.
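
A sketch of the encode/decode pair (illustrative NumPy; it assumes prices are first scaled to integer ticks, e.g. multiples of the 0.05 tick size, so that deltas become exact small integers. That scaling is one common approach, not necessarily TSC's exact scheme):

import numpy as np

def delta_encode(values: np.ndarray):
    # Keep the first value as a baseline, then store consecutive differences.
    return values[0], np.diff(values)

def delta_decode(baseline, deltas):
    return np.concatenate(([baseline], baseline + np.cumsum(deltas)))

# Prices scaled to integer ticks (assuming a 0.05 tick size).
ticks = np.round(np.array([22450.25, 22450.50, 22450.25]) / 0.05).astype(np.int64)
baseline, deltas = delta_encode(ticks)        # baseline 449005, deltas [ 5 -5]

# Timestamps one second apart collapse to a run of 1s.
ts = np.arange(1_700_000_000, 1_700_000_010, dtype=np.int64)
print(delta_encode(ts)[1])                    # [1 1 1 1 1 1 1 1 1]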

Step 3 — Bit-packing. After delta encoding, each value fits in far fewer bits than the original float64 or int64. We pack them into the minimum number of bits required. A sequence of small deltas that fit in 8 bits gets stored in 8 bits, not 64.
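
A rough sketch of what bit-packing buys (illustrative Python; the zig-zag step for negative deltas and the exact packing layout shown here are one common approach, not necessarily TSC's on-disk format):

import numpy as np

def pack_bits(values, width: int) -> bytes:
    # Pack non-negative integers into `width` bits each, LSB-first.
    buf, acc, nbits = bytearray(), 0, 0
    for v in values:
        acc |= int(v) << nbits
        nbits += width
        while nbits >= 8:
            buf.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        buf.append(acc & 0xFF)
    return bytes(buf)

# Zig-zag maps signed deltas to small unsigned values before packing.
deltas = np.array([5, -5, 5, -5, 0, 5], dtype=np.int64)
zigzag = (deltas << 1) ^ (deltas >> 63)
width = max(int(zigzag.max()).bit_length(), 1)   # 4 bits here instead of 64
packed = pack_bits(zigzag, width)                # 6 values in 3 bytes, not 48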

Step 4 — ZSTD as the final pass. Only now does ZSTD see the data — and it's working on already-small integers packed tightly, not raw floats. This is the key insight: ZSTD on pre-processed data significantly outperforms ZSTD on raw data. The pre-processing step is what beats Parquet's built-in ZSTD, not a better ZSTD configuration.
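
The effect is easy to reproduce with the zstandard Python package on a synthetic random-walk price series (illustrative numbers only, not the benchmarks in the next section):

import numpy as np
import zstandard as zstd

rng = np.random.default_rng(0)
# Synthetic futures-style path: one million prices moving in 0.05 ticks.
ticks = 449_005 + np.cumsum(rng.integers(-2, 3, size=1_000_000))
prices = ticks * 0.05

cctx = zstd.ZstdCompressor(level=3)
raw = len(cctx.compress(prices.tobytes()))                          # ZSTD on raw float64
pre = len(cctx.compress(np.diff(ticks).astype(np.int8).tobytes()))  # ZSTD after delta + narrow ints
print(raw, pre)   # the pre-processed stream comes out far smaller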

The entire pipeline runs in O(1) memory — we process data in fixed-size chunks, so the RAM usage stays constant regardless of input size.
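
A rough sketch of that chunked flow (hypothetical column name and tick size; a real format would also record baselines and chunk framing, this only shows why memory stays flat):

import numpy as np
import pandas as pd
import zstandard as zstd

def compress_price_column(csv_path, out_path, chunk_rows=1_000_000):
    # Only one chunk of rows is ever in memory, no matter how big the file is.
    cctx = zstd.ZstdCompressor(level=3)
    prev_tick = None
    with open(out_path, "wb") as out:
        for chunk in pd.read_csv(csv_path, usecols=["price"], chunksize=chunk_rows):
            ticks = np.round(chunk["price"].to_numpy() / 0.05).astype(np.int64)
            base = ticks[0] if prev_tick is None else prev_tick   # carry across chunks
            deltas = np.diff(np.concatenate(([base], ticks)))
            out.write(cctx.compress(deltas.tobytes()))            # delta + ZSTD per chunk
            prev_tick = ticks[-1]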

The Results

We tested on real financial datasets from our own infrastructure and public market data. All tests are 100% lossless — every row and column verified after full round-trip.

Nifty historical tick data (~15M rows): 63.5 GB → 5.5 GB — 91.6% smaller than Parquet ZSTD

EQY US ALL BBO (8.8M rows): 118.92 MB → 30.09 MB — 74.7% smaller

Options Greeks (1M rows): 66.6% smaller than Parquet baseline

On the 63.5 GB Nifty dataset, we also compared against gzip directly: TSC produced 5.5 GB vs gzip's 7.5 GB — 27% smaller than gzip on the same data, processed with under 7 GB RAM throughout.

For AlphaBots specifically, this translated directly into meaningful storage cost reduction. Months of tick data that previously required significant S3 capacity now fits in a fraction of that space. Backtesting data loads faster. The daily ingestion pipeline runs leaner.

Honest Trade-offs

TSC is not a Parquet replacement. There are workloads where you should use Parquet instead:

Random access queries — TSC is optimised for sequential batch reads. If you're doing point queries on individual rows, the chunk-based design means you decompress more data than you need. Parquet handles this better.

Write speed — TSC's compression pipeline takes more time than Parquet's. On the 63.5 GB dataset, write time is higher than Parquet. This is a deliberate trade-off — we optimise for the archive, not the ingest.

Mixed-type data — Delta encoding doesn't help strings, categorical data, or sparse columns with many nulls. TSC falls back to ZSTD-only for those, so gains are minimal on wide tables with lots of non-numeric columns.

The sweet spot is clear: dense numeric time-series, write-once, read in batch. Financial tick data. IoT sensor telemetry. Metrics archives. If your data fits that description, TSC will outperform Parquet significantly.

Using TSC

TSC is built in Rust with Python bindings via PyO3. Zero-copy Arrow/Polars/Pandas integration. Pre-built wheels for Linux and Windows (Python 3.11 and 3.12).

The high-level API is simple:

import pandas as pd
import tsc

# Compress any DataFrame
df = pd.read_parquet("tick_data.parquet")
payload = tsc.compress(df, mode="balanced", sort_key="auto")

# Decompress back
restored = tsc.decompress(payload)
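
Since the archives only matter if the round trip is exact, a quick check on your own frame might look like this (assuming decompress returns an identical DataFrame, as above):

import pandas as pd
import tsc

df = pd.read_parquet("tick_data.parquet")
restored = tsc.decompress(tsc.compress(df, mode="balanced", sort_key="auto"))
pd.testing.assert_frame_equal(df, restored)   # raises if any value changed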

For Parquet/CSV/DuckDB workflows, the wrapper handles format conversion automatically:

from alphabots_tsc_wrapper import TSCompressor, TSDecompressor

TSCompressor(profile="balanced").compress_file("data.parquet", "data.tsc")
df = TSDecompressor().decompress_polars("data.tsc")

Try it on your own data: We have a hosted web app where you can upload a Parquet or CSV file (up to 200 MB) and see the compression results on your actual data in about two minutes — no install, no setup.

TSC Compression Service

Pre-built wheels and documentation: GitHub — adminalphabots/alphabots-tsc-engine

What's Next

We built TSC for AlphaBots' internal use. It solved our problem and the benchmarks are strong enough that we think it has broader applicability — particularly for any platform storing large volumes of financial or IoT time-series data.

We're exploring commercial licensing and IP transfer to the right home. If you're working on a TSDB, market data platform, or storage infrastructure where compression ratio matters, reach out.

TSC is free for evaluation and non-commercial use. Commercial licensing inquiries: parth.k@alphabots.in
