2025-07-18

fast delimiter-based path extraction from standard input

an algorithm for extracting delimiter-separated path strings

use case

reading a sequence of path strings from standard input where:

  • each path ends with a delimiter character
  • newlines or other separator characters may be used as delimiters
  • memory is allocated dynamically to handle unknown input size
  • efficient in-place storage and referencing of paths is desired

the algorithm

  • input buffering:

    • allocate an initial buffer of fixed size to hold the input

    • read from standard input in chunks, appending to the buffer

    • dynamically resize the buffer as needed using realloc

    • reading stops at end of input (eof) or on read error

  • path pointer array:

    • allocate an initial array of pointers to hold extracted path references

    • resize the array dynamically as more paths are discovered

  • delimiter parsing:

    • iterate through the input buffer to locate delimiter characters

    • for each delimiter found:

      • if the region before the delimiter is non-empty, store a pointer to its start
      • replace the delimiter with a null terminator to isolate the path string
      • advance to the character following the delimiter
    • stop if no more delimiters are found or buffer end is reached

  • memory layout:

    • all extracted path strings are stored null-terminated in a contiguous buffer

    • the array of pointers references each of these strings directly within the buffer

    • the caller is responsible for freeing the returned buffer and pointer array

behavior and constraints

  • all paths must be terminated by the specified delimiter
  • empty entries (zero-length between delimiters) are ignored
  • the function exits silently if no input is received
  • memory errors invoke a termination routine (memory_error)
  • read errors invoke a termination routine (handle_error)

efficiency considerations

  • memory allocations are performed using exponential growth for both buffer and pointer array, reducing realloc overhead
  • only one pass is made through the input buffer to identify all paths
  • delimiter replacement with null bytes avoids copying path data and enables in-place storage
  • minimal dynamic memory usage beyond the input and pointer array

conclusion

this algorithm provides an efficient way to:

  • extract and reference variable-length, delimiter-separated paths from standard input
  • minimize memory copying through in-place modification
  • scale to large inputs through dynamic resizing
  • maintain low overhead and deterministic behavior in a simple, single-pass design

example implementation in c

char* get_paths(char delimiter, char*** paths, size_t* paths_used) {
  // read delimiter separated paths from standard input.
  // all paths must end with the delimiter.
  // data will contain the full string read from standard input with newlines replaced by null bytes,
  // while paths contains pointers to the beginning of each path in data.
  size_t data_size = paths_data_size_min;
  size_t data_used = 0;
  size_t paths_size = paths_size_min;
  ssize_t read_size;
  char* data = malloc(paths_data_size_min);
  char* p;
  char* data_end;
  if (!data) memory_error;
  *paths = malloc(paths_size_min * sizeof(char*));
  if (!*paths) memory_error;
  *paths_used = 0;
  while (0 < (read_size = read(0, data + data_used, paths_data_size_min))) {
    data_used += read_size;
    if (data_size < data_used + paths_data_size_min) {
      data_size *= 2;
      data = realloc(data, data_size);
      if (!data) memory_error;
    }
  }
  handle_error(read_size);
  if (!data_used) exit(0);
  p = data;
  data_end = data + data_used;
  while (p < data_end) {
    char* next_delim = memchr(p, delimiter, data_end - p);
    if (!next_delim) break;
    if (next_delim > p) {
      if (paths_size == *paths_used) {
        paths_size *= 2;
        *paths = realloc(*paths, paths_size * sizeof(char*));
        if (!*paths) memory_error;
      }
      *next_delim = 0;
      (*paths)[*paths_used] = p;
      (*paths_used) += 1;
    }
    p = next_delim + 1;
  }
  return(data);
}