History of the question: In enterprise modernization initiatives, Business Analysts frequently encounter knowledge decay—a phenomenon where critical business logic survives only in unreadable legacy code. This question emerged from mainframe-to-cloud migrations where the original architects retired decades ago, leaving behind COBOL programs that execute perfectly but defy interpretation. The historical context involves the transition from monolithic batch processing to distributed microservices, where implicit state management must become explicit API contracts.
The problem centers on epistemic opacity: the system works, but nobody knows why. The COBOL codebase contains tacit business rules—edge cases, regulatory patches, and manual overrides—that were never documented because the original developers held them in memory. Current operations staff understand inputs and outputs but not the decision logic. Meanwhile, the new cloud-native architecture requires these rules to be decoupled, documented, and exposed through REST endpoints for real-time consumption. The fixed regulatory deadline prevents a multi-year archaeological dig, yet errors in rule extraction could violate GDPR data handling mandates or financial reporting accuracy.
The solution employs a triangulated reverse-engineering approach. First, conduct Event Storming workshops with operations staff to map observable business behaviors and identify "black box" processes. Second, use static code analysis tools to generate control-flow graphs of the COBOL programs, cross-referencing variable mutations with business outcomes. Third, implement a parallel-running shadow mode, in which the new microservices process mirrored transactions alongside the legacy system without production impact, flagging discrepancies for investigation. This creates a feedback loop in which code archaeology validates stakeholder memories, and stakeholder context explains code anomalies.
A regional insurance company needed to replace a 1980s-era COBOL policy rating engine with a Python/FastAPI microservices suite to enable real-time mobile quotes. The original calculation logic included complex territorial risk weightings, seasonal adjustment factors, and reinsurance treaty clauses that had been patched in over forty years. The three remaining COBOL developers had retired, and the current IT staff treated the system as a "magic box" that produced correct premiums but could not explain the mathematical derivations for specific edge cases. The regulatory authority mandated migration completion within eight months to avoid unsupported infrastructure penalties.
Several approaches were evaluated to capture the requirements. The first option proposed a code-to-spec transcription, where developers would manually document every IF statement and MOVE operation in the COBOL source. The pros included theoretical completeness and preservation of exact logic. The cons were severe: the codebase contained over two million lines of spaghetti code with undocumented global variables, making this a multi-year effort that would miss the deadline and likely introduce transcription errors.
The second option suggested black-box requirement derivation, observing inputs (policy attributes) and outputs (premium amounts) to infer rules through statistical regression. The pros were speed and focus on current business value rather than technical debt. The cons included inability to detect dormant code paths for rare claims scenarios and the risk of codifying bugs as features.
The third option, behavioral archaeology with parallel validation, involved extracting sample data from five years of production logs, building decision trees from actual transactions, and validating these against the COBOL source using automated diff tools.
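As a toy illustration of the log-driven rule inference in this third option, per-category rating factors can be recovered by averaging observed premium ratios; the field names (`territory`, `base`, `premium`) are hypothetical, not the insurer's actual schema:

```python
from collections import defaultdict

def infer_territory_factors(observations):
    """Infer an average rating factor per territory from production-log
    observations, each a dict with 'territory', 'base', and 'premium'."""
    sums = defaultdict(lambda: [0.0, 0])  # territory -> [factor sum, count]
    for obs in observations:
        factor = obs["premium"] / obs["base"]
        total, n = sums[obs["territory"]]
        sums[obs["territory"]] = [total + factor, n + 1]
    return {terr: total / n for terr, (total, n) in sums.items()}
```

In practice a proper decision-tree learner would replace this averaging, but the principle is the same: hypotheses about rules come from observed transactions, then get validated against the COBOL source.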
The team selected the third solution because it balanced velocity with accuracy while honoring the agile principle of working software over comprehensive documentation. By focusing on executed code paths rather than dead features, they reduced scope by 60% while verifying that active business rules were correctly captured. They established a data lake containing anonymized historical transactions and ran these through both the legacy JCL jobs and the new FastAPI services, automatically flagging premium calculation mismatches greater than 0.01%. This revealed three critical undocumented conditions: a hurricane deductible override for Florida policies issued before 1992, a special commission calculation for retired agents, and a rounding error in quarterly tax reporting that had been "corrected" by manual spreadsheet adjustments for decades. The microservices were redesigned to explicitly handle these edge cases as configurable business rules rather than hardcoded constants.
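One way such edge cases can become configurable, named business rules rather than hardcoded constants is a small rule registry; the function names, field names, and the 2% deductible figure below are illustrative assumptions, not the company's actual logic:

```python
from datetime import date

RULES = []  # registry of (name, rule_function) pairs

def business_rule(name):
    """Decorator registering a named, individually testable rule."""
    def register(fn):
        RULES.append((name, fn))
        return fn
    return register

@business_rule("fl_hurricane_deductible_pre_1992")
def hurricane_override(policy, quote):
    # Hypothetical rendering of the undocumented Florida override
    if policy["state"] == "FL" and policy["issue_date"] < date(1992, 1, 1):
        quote["deductible"] = max(quote["deductible"],
                                  0.02 * policy["insured_value"])
    return quote

def apply_rules(policy, quote):
    """Apply every registered rule to a quote, in registration order."""
    for _name, rule in RULES:
        quote = rule(policy, quote)
    return quote
```

Because each rule carries a name, it can be toggled, audited, and traced back to the validation finding that surfaced it, which is exactly what a hardcoded constant cannot do.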
When reverse-engineering legacy code, how do you differentiate between a critical business constraint and a technical workaround that can be safely eliminated during migration?
Candidates often assume all existing logic serves a current business purpose, falling into the sunk cost fallacy of legacy preservation. The correct approach involves temporal context analysis: examining the date stamps of code changes to correlate with known regulatory changes, mergers, or technology limitations that no longer exist. For example, a data truncation routine in COBOL might exist solely because the original DB2 schema used fixed-width fields, whereas modern PostgreSQL supports variable-length strings—eliminating the need for the truncation rule entirely. BAs must conduct intent verification sessions with business stakeholders, presenting suspected workarounds as "We can simplify this by removing X; does this affect your compliance?" rather than asking "Should we keep X?" This shifts the burden of proof: logic must demonstrate current necessity rather than being preserved by default.
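A sketch of how the temporal correlation could be automated, assuming a hand-curated timeline of known events; the event entry and the 90-day window are illustrative assumptions:

```python
from datetime import date, timedelta

# Hypothetical curated timeline of regulatory or organizational events
REGULATORY_EVENTS = {
    date(2018, 5, 25): "GDPR enforcement begins",
}

def likely_regulatory(change_date, window_days=90):
    """Return descriptions of events that occurred within `window_days`
    before this code change, suggesting a compliance-driven origin."""
    return [
        desc for event_date, desc in REGULATORY_EVENTS.items()
        if timedelta(0) <= change_date - event_date <= timedelta(days=window_days)
    ]
```

A change flagged by this helper is a candidate constraint to verify with stakeholders; an unflagged change is a candidate workaround, but still needs the intent verification session before removal.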
How do you prevent the "cargo cult" anti-pattern where the new system replicates inefficient batch-processing workflows simply because they exist in the COBOL monolith?
Many candidates focus exclusively on functional parity without process re-engineering. The failure occurs when BAs document the current state (e.g., "The system runs a nightly batch at 2 AM") as a requirement for the future state, ignoring that event-driven architectures using Apache Kafka or RabbitMQ can enable real-time processing. The solution requires capability mapping: separating the "what" (risk calculation must occur) from the "how" (batch vs. streaming). BAs should perform value stream mapping to identify wait times in the batch schedule that served operational convenience rather than business rules. By demonstrating that REST endpoints can provide immediate feedback to underwriters—reducing the quote-to-bind time from 24 hours to 30 seconds—they justify architectural changes that would otherwise be rejected as "too different from the old system."
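The capability-mapping idea—separating the "what" from the "how"—can be sketched by keeping the calculation pure and wrapping it in two invocation styles; the function names and rating formula are illustrative, not the actual engine:

```python
def calculate_risk(policy):
    """The 'what': the business capability, independent of invocation mode."""
    return round(policy["base_rate"] * policy["territory_factor"], 2)

def run_nightly_batch(policies):
    """Legacy 'how': rate a queued file of policies once per day."""
    return [calculate_risk(p) for p in policies]

def handle_quote_request(policy):
    """New 'how': rate one policy per request, e.g. behind a REST endpoint."""
    return {"premium": calculate_risk(policy)}
```

Because `calculate_risk` carries no batch assumptions, the same validated logic serves both the migration-period batch jobs and the real-time quote endpoint, which is what makes the 24-hours-to-30-seconds argument credible.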
What is your methodology for quantifying and communicating the risk of "unknown unknowns"—tacit rules that never triggered during your sample data observation period but could surface catastrophically post-migration?
Candidates frequently present stakeholders with false confidence based on 100% test pass rates against historical data. The sophisticated answer acknowledges sampling bias in legacy data and advocates for stress-testing against synthetic scenarios. This involves generating fuzzed input data that exercises boundary conditions not seen in production logs, then comparing COBOL and new system outputs. Additionally, BAs must establish a circuit-breaker pattern in the new architecture: if the microservice encounters a transaction structure it cannot process (indicating a potential missed rule), it should gracefully degrade to calling the legacy SOAP wrapper (if available) or flagging for human review, rather than failing silently or defaulting to null values. The communication strategy involves probabilistic risk matrices showing that while 95% of rules are validated, a 5% residual uncertainty requires a three-month hypercare period with doubled manual reconciliation checks.
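A minimal sketch of the circuit-breaker fallback described above, with hypothetical names; a production version would add logging, retry limits, and metrics:

```python
class RuleCoverageError(Exception):
    """Raised when the new engine meets a transaction shape it cannot rate."""

def rate_with_fallback(txn, new_engine, legacy_wrapper, review_queue):
    """Try the new engine; fall back to the legacy wrapper if one exists,
    otherwise queue for human review. Never fail silently."""
    try:
        return ("rated", new_engine(txn))
    except RuleCoverageError:
        if legacy_wrapper is not None:
            return ("rated_by_legacy", legacy_wrapper(txn))
        review_queue.append(txn)
        return ("pending_review", None)
```

Returning an explicit status alongside the premium means downstream consumers must acknowledge the degraded path, and the count of `rated_by_legacy` and `pending_review` outcomes becomes the hypercare-period metric for residual unknown unknowns.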