During data processing, resources like storage and processing power are consumed continuously. It is advisable to monitor and manage the amount of resources allocated to each pipeline, especially since this generates costs on AWS.

Here’s how Nekt allows you to control these resources.

Time travel

With a little bit of code, you can check (and use) previous versions of your tables. How far back you can navigate through older versions of a table is defined by the number of days set as the Delta Log Retention Duration.

The second parameter, Delta Deleted File Retention Duration, defines for how many additional days the underlying data files are stored. This period starts counting as soon as the Delta Log Retention Duration ends, meaning you still have the files in case of an emergency, but you can no longer easily navigate through those older versions; you'd need to request them from us.

By default, tables have a 30-day Delta Log Retention Duration and a 7-day Delta Deleted File Retention Duration, so you can explore all versions of your table created in the last 30 days. You can change these values on each table's Settings tab.
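Because these retention settings are Delta table settings, time travel follows the standard Delta Lake API. Below is a minimal sketch with PySpark; the table path is hypothetical, and the way you obtain a SparkSession may differ in your environment.

```python
# Minimal time-travel sketch using the standard Delta Lake API with PySpark.
# The table path below is hypothetical; adjust it to your own storage location.
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("time-travel-example")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

table_path = "s3://your-bucket/your-table"  # hypothetical path

# List the versions still available within the log retention window.
DeltaTable.forPath(spark, table_path).history().select("version", "timestamp").show()

# Read the table as of a specific version number...
df_v3 = spark.read.format("delta").option("versionAsOf", 3).load(table_path)

# ...or as of a timestamp (must fall within the retention window).
df_older = (
    spark.read.format("delta")
    .option("timestampAsOf", "2024-01-01 00:00:00")
    .load(table_path)
)
```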

Retries

If a pipeline fails due to system instability, you might want it to be retried automatically, depending on how important that pipeline is to you. You can set the following parameters in each Source, Transformation, or Destination settings tab to manage retries (a sketch of how they interact follows the list).

  1. Number of retries: The number of retries that should be performed before the run is considered a failure.
  2. Retry delay: Delay between retries (in seconds).
  3. Max Consecutive Failures: The maximum number of consecutive failed runs before the pipeline is inactivated. This prevents unnecessary retries and resource usage, and is ideal for avoiding repeated failures over weekends or off-hours.
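These values are configured in the UI, but the illustrative sketch below (not Nekt's actual implementation; the function and constant names are hypothetical) shows how the three settings typically interact for a single scheduled run.

```python
# Illustrative sketch of how the retry settings interact; not Nekt's implementation.
import time

NUMBER_OF_RETRIES = 3         # retries before the run is considered a failure
RETRY_DELAY_SECONDS = 60      # delay between retries
MAX_CONSECUTIVE_FAILURES = 5  # failed runs before the pipeline is inactivated


def execute_run(run_pipeline, consecutive_failures: int) -> int:
    """Run the pipeline with retries and return the updated failure streak."""
    for attempt in range(1 + NUMBER_OF_RETRIES):
        try:
            run_pipeline()
            return 0  # a successful run resets the consecutive-failure counter
        except Exception as error:
            print(f"Attempt {attempt + 1} failed: {error}")
            if attempt < NUMBER_OF_RETRIES:
                time.sleep(RETRY_DELAY_SECONDS)

    consecutive_failures += 1
    if consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
        print("Max consecutive failures reached; inactivating the pipeline.")
    return consecutive_failures
```

In this sketch, a successful run resets the failure streak, so Max Consecutive Failures only trips when every run, including its retries, keeps failing back to back.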

Execution performance

Sometimes, additional parallel resources are required to run a pipeline faster than a single instance would allow. Edit the following parameters in each transformation's settings tab to find the right balance between performance and resource consumption (a sketch of the corresponding Spark configuration follows the list):

  1. Spark Driver Cores: The number of CPU cores allocated to the Spark Driver, which manages the overall execution of the Spark application.
  2. Spark Driver Memory: The amount of memory allocated to the Spark Driver, which is responsible for managing the execution of the Spark application.
  3. Spark Executor Cores: The number of CPU cores allocated to each Spark Executor, which is responsible for executing individual tasks within the application.
  4. Spark Executor Instances: The number of Spark Executors that will be launched to run the tasks of the application. Adjust based on the data size.
  5. Spark Executor Memory: The amount of memory allocated to each Spark Executor, which processes individual tasks. Adjust based on the task’s memory requirements.
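For reference, these settings correspond to standard Spark configuration properties. The sketch below shows how they would be set when building a SparkSession directly; the values are placeholders, and on Nekt you adjust them through the settings tab rather than in code.

```python
# Sketch of the standard Spark properties behind these settings.
# Values are placeholders; driver cores and executor instances only take
# effect when running against a cluster manager (e.g. YARN or Kubernetes).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("resource-tuning-example")
    .config("spark.driver.cores", "2")        # Spark Driver Cores
    .config("spark.driver.memory", "4g")      # Spark Driver Memory
    .config("spark.executor.cores", "2")      # Spark Executor Cores
    .config("spark.executor.instances", "4")  # Spark Executor Instances
    .config("spark.executor.memory", "8g")    # Spark Executor Memory
    .getOrCreate()
)
```

As a rule of thumb, scale executor instances with data volume and executor memory with the memory footprint of individual tasks, and keep driver resources modest unless the driver needs to collect large results.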