2024-10-13

line-based reading and parallel processing

here we outline a strategy for processing newline-separated data from streams or files using multiple threads.

use cases

reading and processing large volumes of newline-separated data, with:

  • efficient handling of large inputs
  • multi-threaded processing
  • input from both streams and multiple large files

the algorithm

  • buffers:

    • two buffers, data[0] and data[1], alternate during the read process.
    • data[0] is filled with data from the input.
  • newline search:

    • after reading data into data[0], the algorithm searches backwards for the last newline in the second half of the buffer.
    • if no newline is found, the buffer is either expanded to accommodate more data, or an error is returned if the line exceeds a predefined maximum length.
  • line handling:

    • the incomplete portion of the line (after the last newline) is moved to the start of data[1].

    • subsequent reads continue into data[1], with data[0] becoming the target for the next incomplete fragment. the buffers keep alternating so that every line is handled completely.

input is assumed to be line-aligned: the end of the file or stream counts as a newline.
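
to make the scheme concrete, below is a minimal single-threaded c sketch of the double-buffer read loop, under a few simplifying assumptions: it reads from stdin, searches the whole filled region for the last newline instead of only the second half, and bails out rather than growing the buffer when a line exceeds BUF_SIZE. process_lines() is a placeholder for whatever per-line work the application does.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define BUF_SIZE 65536

    /* placeholder handler: receives a block of complete lines */
    static void process_lines(const char *start, size_t len) {
        fwrite(start, 1, len, stdout);
    }

    /* find the last '\n' in buf[0..len), or NULL if there is none */
    static char *last_newline(char *buf, size_t len) {
        while (len > 0)
            if (buf[--len] == '\n')
                return buf + len;
        return NULL;
    }

    int main(void) {
        static char data[2][BUF_SIZE];
        int cur = 0;        /* buffer currently being filled */
        size_t carry = 0;   /* bytes of an incomplete line at the buffer start */

        for (;;) {
            ssize_t n = read(STDIN_FILENO, data[cur] + carry, BUF_SIZE - carry);
            if (n < 0)
                return 1;                  /* i/o error */
            size_t filled = carry + (size_t)n;
            if (n == 0) {                  /* eof counts as a newline */
                if (filled > 0)
                    process_lines(data[cur], filled);
                break;
            }
            char *nl = last_newline(data[cur], filled);
            if (nl == NULL)
                return 1;   /* line longer than BUF_SIZE: grow or skip here */
            size_t complete = (size_t)(nl - data[cur]) + 1;
            process_lines(data[cur], complete);
            /* move the incomplete tail to the start of the other buffer */
            carry = filled - complete;
            memcpy(data[1 - cur], nl + 1, carry);
            cur = 1 - cur;
        }
        return 0;
    }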

thread coordination

  • buffer chains: multiple threads process data in a chain of buffers. each thread operates on one buffer, but only one thread can read from the input stream at a time.
  • file metadata for optimization when dealing with files:

    • use file size metadata for efficient pre-clustering of small files and partitioning large ones.

    • threads can seek to specific file offsets to process files of various sizes in parallel (a minimal sketch follows this list).
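
a rough sketch of the partitioning idea, assuming posix i/o; the chunk struct and the alignment rule are illustrative, not a fixed interface. every worker except the one starting at offset 0 skips forward past the first newline, because the worker owning the previous chunk processes the line that crosses the boundary.

    #define _FILE_OFFSET_BITS 64
    #include <unistd.h>

    struct chunk { off_t begin, end; };   /* hypothetical per-thread work item */

    /* split total bytes into nthreads roughly equal chunks */
    void partition(off_t total, int nthreads, struct chunk *out) {
        off_t step = total / nthreads;
        for (int i = 0; i < nthreads; i++) {
            out[i].begin = step * i;
            out[i].end = (i == nthreads - 1) ? total : step * (i + 1);
        }
    }

    /* a worker not starting at offset 0 discards bytes up to and including
       the first newline; a real implementation would read a block here
       instead of single bytes */
    off_t align_to_line(int fd, off_t begin) {
        char c;
        if (begin == 0)
            return 0;
        lseek(fd, begin, SEEK_SET);
        while (read(fd, &c, 1) == 1) {
            begin++;
            if (c == '\n')
                break;
        }
        return begin;
    }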

understanding the overhead of fgets()

in c, fgets() can be used to repeatedly request lines from a stream (a FILE *, not a raw file descriptor). it is part of the standard library and designed for simplicity and ease of use: it reads a line from the specified stream and stores it in the caller-supplied buffer. however, there are some inherent overheads:

  • buffer size limitation: fgets() requires a predefined buffer size, which may not accommodate very long lines, necessitating multiple calls for a single line (see the loop sketched after this list).
  • data copying: fgets() copies data from the internal stream buffer to the user-provided buffer, which can introduce overhead, especially with large amounts of data.
  • per-call overhead: each call to fgets() involves function call overhead and checks for newline characters, which can be inefficient when processing large data streams line by line.
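
for reference, the typical fgets() loop looks like this; a line longer than the buffer arrives in several fragments, and the caller has to check for the trailing '\n' to know where the line actually ends. the fragment handling is left as a stub.

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        char buf[256];
        /* fgets() stops at a newline or when the buffer is full, so a line
           longer than the buffer arrives in several fragments; the caller
           must check for a trailing '\n' to know where the line ends */
        while (fgets(buf, sizeof buf, stdin) != NULL) {
            size_t len = strlen(buf);
            int complete = len > 0 && buf[len - 1] == '\n';
            (void)complete;   /* stub: stitch fragments until complete */
        }
        return 0;
    }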

modern c library implementations often optimize functions like fgets() using techniques such as:

  • buffering: internal buffering to reduce system calls.
  • simd instructions: using simd (single instruction, multiple data) instructions to accelerate character searches.
  • thread-safe implementations: ensuring thread safety with minimal locking.

however, these optimizations may not cover all use cases, especially for specialized applications dealing with extremely large data streams or requiring fine-grained control over memory and performance.

potential improvements with our algorithm

our algorithm addresses these overheads by:

  • reading large chunks: by reading larger chunks of data at once, we reduce the number of system calls and function call overhead, which can significantly improve performance.
  • minimizing data copying: providing pointers directly into the read buffers avoids unnecessary data copying, reducing memory bandwidth usage and improving cache efficiency.
  • efficient line parsing: by searching for newline characters in large data blocks, we can process many lines per scan, which is more efficient than going line by line (a sketch follows this list).
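
a small sketch of that block-level parsing, assuming the buffer was already filled by one large read; memchr() is commonly vectorized in modern c libraries, so scanning a big block amortizes the search cost. the callback shape is illustrative. any trailing bytes after the last newline are the incomplete fragment that the double-buffer scheme carries into the next buffer.

    #include <stdio.h>
    #include <string.h>

    /* invoke fn once per complete line in buf[0..len), without copying */
    static void for_each_line(const char *buf, size_t len,
                              void (*fn)(const char *, size_t)) {
        const char *p = buf, *end = buf + len;
        while (p < end) {
            const char *nl = memchr(p, '\n', (size_t)(end - p));
            if (nl == NULL)
                break;                   /* trailing fragment: carry it over */
            fn(p, (size_t)(nl - p));     /* line without its newline */
            p = nl + 1;
        }
    }

    static void print_line(const char *s, size_t n) {
        printf("%.*s\n", (int)n, s);
    }

    int main(void) {
        char demo[] = "alpha\nbeta\ngamma\n";
        for_each_line(demo, sizeof demo - 1, print_line);
        return 0;
    }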

handling critical cases

  • long line handling: our algorithm can dynamically expand the buffer if no newline is found, allowing it to handle arbitrarily long lines without predefined limits. alternatively, a maximum line length can be enforced and overlong lines skipped, avoiding any further allocation overhead (see the sketch after this list).
  • incomplete line handling: by copying incomplete line fragments to the beginning of the next buffer, we ensure that lines are correctly reconstructed across buffer boundaries.
  • thread coordination: using multiple buffers and assigning each to a separate thread can improve throughput by parallelizing the processing of different data chunks.
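
one possible shape for the grow-or-skip policy, with illustrative names: the cap bounds allocation, and a NULL return tells the caller to stop growing and instead discard bytes until the next newline.

    #include <stdlib.h>

    /* grow the read buffer when no newline was found, doubling up to max_cap;
       a NULL return tells the caller to stop allocating and instead skip
       bytes until the next newline (names are illustrative) */
    char *grow_buffer(char *buf, size_t *cap, size_t max_cap) {
        if (*cap >= max_cap)
            return NULL;                  /* cap reached: switch to skip mode */
        size_t ncap = (*cap * 2 < max_cap) ? *cap * 2 : max_cap;
        char *nbuf = realloc(buf, ncap);
        if (nbuf != NULL)
            *cap = ncap;
        return nbuf;
    }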

scalability and efficiency considerations

  • synchronization overhead: in a multithreaded environment, coordinating access to shared resources (such as file descriptors) can introduce synchronization overhead. since each thread operates on its own buffer, however, this can be kept small.
  • i/o bound vs. cpu bound: if our application is i/o bound, reading larger chunks can significantly improve performance. if it is cpu bound, the benefits may be less pronounced.
  • error handling: implementing robust error handling for partial reads, i/o errors, and buffer overflows is essential for a reliable system (a helper for partial reads is sketched after this list).
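
on the partial-read point, a common posix helper is a loop that retries on EINTR and tolerates short reads, along these lines:

    #include <errno.h>
    #include <unistd.h>

    /* read until want bytes have arrived or eof, retrying on EINTR;
       short reads are normal on pipes and sockets, so a single read()
       must never be assumed to fill the buffer */
    ssize_t read_full(int fd, char *buf, size_t want) {
        size_t got = 0;
        while (got < want) {
            ssize_t n = read(fd, buf + got, want - got);
            if (n < 0) {
                if (errno == EINTR)
                    continue;            /* interrupted: just retry */
                return -1;               /* genuine i/o error */
            }
            if (n == 0)
                break;                   /* eof */
            got += (size_t)n;
        }
        return (ssize_t)got;
    }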

conclusion

the described algorithm has the potential to improve upon fgets() in scenarios involving:

  • large data volumes: processing large files or data streams where reducing system calls and data copying has a significant impact.
  • long lines: dealing with very long lines that exceed typical buffer sizes.
  • parallel processing: leveraging multithreading to process data chunks in parallel.