Best practices for data ingestion

Updated on September 21, 2021

Review these best practices for configuring data ingestion processes in your application.

Add data flow run details to the case review screen to help debug issues

Ingestion case review screen with data flow details
The case review screen includes details of the staging and xCAR data flows, such as data flow ID, status, and updated records count.

Provide reports and access for the client's ETL and production support teams

Provide insight into the execution status of the ingestion process, which is normally set up as an agent, by following these best practices:

  • Configure a case progress and monitoring report to identify execution statistics and status.
  • Provide access to the Case Manager portal to both the client's ETL and production support teams.
  • Schedule the report to be sent as an email attachment to the ETL and production support teams.
Case progress report
The report contains data for several ingestion case runs, with details of each run, such as the case ID, case status, the number of files expected and found, the data flow ID, and so on.

For more information, see Creating a report.

Provide data to identify and resolve errors

Include the CaseID, DataFlowID, and the details that are available from the Batch processing landing page in error reports. This information provides key details to identify and resolve processing errors.

Check the S3 file count and correct file location regularly

The processing of the files in the S3 SFTP folders uses pattern matching to identify the files to be processed. The production support team must regularly verify that the files are being sent to the appropriate S3 locations and that the files in those locations match the pattern for the intended processing. The client's ETL teams regularly update their processing as data attributes are added to or removed from the transferred data sets, and mistakes can be made when the processing jobs that support those changes are updated.
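A check like this can be automated. The following sketch shows the pattern-matching audit as plain Python; the function name, pattern, and counts are illustrative, and in practice the key list would come from an S3 listing call (for example, a paginated ListObjectsV2 request) rather than being passed in directly:

```python
import fnmatch

def audit_file_listing(keys, expected_pattern, expected_count):
    """Verify that the files in a landing location match the ingestion
    pattern and that the expected number of files has arrived.

    Returns (matched, unmatched) key lists so the support team can spot
    files that were dropped in the wrong place or misnamed.
    """
    matched = [k for k in keys if fnmatch.fnmatch(k, expected_pattern)]
    unmatched = [k for k in keys if k not in matched]
    if len(matched) != expected_count:
        print(f"WARNING: expected {expected_count} files, "
              f"found {len(matched)} matching {expected_pattern!r}")
    return matched, unmatched
```

For example, auditing a listing that contains a stray readme file would return it in the unmatched list, flagging it for review before the ingestion run starts.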

Use parallel processing

File processing in Pega Platform provides several opportunities to process data in multiple parallel streams, which can substantially reduce processing time. To take advantage of the parallel processing of the data flows, divide large files into multiple smaller files, each with approximately the same number of records.

The number of concurrent partitions is determined by the Number of threads parameter that you set when creating a data flow run, as shown in the following figure:

Changing the number of threads for a data flow run
By clicking New on the Data flows landing page, you can create a new data flow run and set the number of threads.

One thread is associated with each input file. If the input files are not approximately equal in size, processing takes more time as the entire process must wait until the last file is processed. For more information, see Creating a batch run for data flows.

You can also manage the thread count setting on the Services landing page, as shown in the following figure:

Changing the thread count for a batch data flow service
The Edit batch settings window contains the Thread count field; in this example, the value is 5.

For more information, see Configuring the Data Flow service.

For example, in a scenario where there are three nodes, you can process 15 files with 1 million records in each file faster than three files with 5 million records in each file or one file with 15 million records.
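The even-split preparation step can be sketched in Python. This is not a Pega API; the helper name is hypothetical, and it assumes the records are already loaded into a list before being written out as separate files:

```python
def split_records(records, num_parts):
    """Divide records into num_parts chunks of nearly equal size,
    so that each parallel data flow thread receives a similar
    workload and no single file becomes the bottleneck."""
    base, extra = divmod(len(records), num_parts)
    chunks, start = [], 0
    for i in range(num_parts):
        # The first `extra` chunks absorb one leftover record each.
        size = base + (1 if i < extra else 0)
        chunks.append(records[start:start + size])
        start += size
    return chunks
```

Splitting 15 million records into 15 chunks this way yields files of 1 million records each, matching the scenario described above.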

Archive files after processing

Archive the files in your Pega Cloud File Storage repository at the end of an ingestion run. If errors occur during an ingestion run and you have to reprocess the files, you can use the archived files and save time by not having to retransfer the files to the repository.
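A minimal archiving sketch follows. Local directories stand in for the file-storage repository, and the function name and dated-folder layout are assumptions, not a Pega convention:

```python
import shutil
from datetime import date
from pathlib import Path

def archive_processed_files(staging_dir, archive_root):
    """Move processed files into a dated archive folder so that a
    failed run can be reprocessed from the archive instead of
    retransferring the files to the repository."""
    archive_dir = Path(archive_root) / date.today().isoformat()
    archive_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for f in sorted(Path(staging_dir).iterdir()):
        if f.is_file():
            shutil.move(str(f), str(archive_dir / f.name))
            moved.append(f.name)
    return moved
```

After a successful run, the staging folder is left empty and the archive folder holds one dated copy of each input file.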
