Best practices for data ingestion

Updated on September 21, 2021

Review these best practices for configuring data ingestion processes in your application.

Add data flow run details to the case review screen to help debug issues

Ingestion case review screen with data flow details
The case review screen includes details of the staging and xCAR data flows, such as data flow ID, status, and updated records count.

Provide reports and access for the client's ETL and production support teams

Provide insight into the execution status of the ingestion process, which is normally set up as an agent, by following these best practices:

  • Configure a case progress and monitoring report to identify execution statistics and status.
  • Provide access to the Case Manager portal to both the client's ETL and production support teams.
  • Schedule the report to be sent as an email attachment to the ETL and production support teams.
Case progress report
The report contains data for several ingestion case runs, with details of each run, such as the case ID, case status, the number of files expected and found, the data flow ID, and so on.

For more information, see Creating a report.

Provide data to identify and resolve errors

Include the CaseID, DataFlowID, and the details that are available from the Batch processing landing page in error reports. This information provides key details to identify and resolve processing errors.

Check the S3 file count and correct file location regularly

The processing of the files in the S3 SFTP folders uses pattern matching to identify the files to be processed. The production support team must regularly verify that the files are being sent to the appropriate S3 locations and that the files in those locations match the pattern for the intended processing. The client's ETL teams regularly update their processing as data attributes are added to or removed from the transferred data sets, and mistakes can be made when the processing jobs that support those changes are updated.
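A check like this can be automated. The following sketch shows the pattern-matching audit as plain Python; the function name, pattern, and counts are illustrative, and in practice the key list would come from an S3 listing call (for example, a paginated ListObjectsV2 request) rather than being passed in directly:

```python
import fnmatch

def audit_file_listing(keys, expected_pattern, expected_count):
    """Verify that the files in a landing location match the ingestion
    pattern and that the expected number of files has arrived.

    Returns (matched, unmatched) key lists so the support team can spot
    files that were dropped in the wrong place or misnamed.
    """
    matched = [k for k in keys if fnmatch.fnmatch(k, expected_pattern)]
    unmatched = [k for k in keys if k not in matched]
    if len(matched) != expected_count:
        print(f"WARNING: expected {expected_count} files, "
              f"found {len(matched)} matching {expected_pattern!r}")
    return matched, unmatched
```

For example, auditing a listing that contains a stray readme file would return it in the unmatched list, flagging it for review before the ingestion run starts.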

Use parallel processing

File processing in Pega Platform provides several opportunities to process data in multiple parallel streams, which can substantially reduce processing time. To take advantage of the parallel processing of the data flows, divide large files into multiple smaller files, each with approximately the same number of records.

The number of concurrent partitions is determined by the Number of threads parameter that you set when creating a data flow run, as shown in the following figure:

Changing the number of threads for a data flow run
By clicking New on the Data flows landing page, you can create a new data flow run and set the number of threads.

One thread is associated with each input file. If the input files are not approximately equal in size, processing takes more time as the entire process must wait until the last file is processed. For more information, see Creating a batch run for data flows.

You can also manage the thread count setting on the Services landing page, as shown in the following figure:

Changing the thread count for a batch data flow service
The Edit batch settings window contains the Thread count field; in this example, the value is 5.

For more information, see Configuring the Data Flow service.

For example, in a scenario where there are three nodes, you can process 15 files with 1 million records in each file faster than three files with 5 million records in each file or one file with 15 million records.
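The even-split preparation step can be sketched in Python. This is not a Pega API; the helper name is hypothetical, and it assumes the records are already loaded into a list before being written out as separate files:

```python
def split_records(records, num_parts):
    """Divide records into num_parts chunks of nearly equal size,
    so that each parallel data flow thread receives a similar
    workload and no single file becomes the bottleneck."""
    base, extra = divmod(len(records), num_parts)
    chunks, start = [], 0
    for i in range(num_parts):
        # The first `extra` chunks absorb one leftover record each.
        size = base + (1 if i < extra else 0)
        chunks.append(records[start:start + size])
        start += size
    return chunks
```

Splitting 15 million records into 15 chunks this way yields files of 1 million records each, matching the scenario described above.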

Archive files after processing

Archive the files in your Pega Cloud File Storage repository at the end of an ingestion run. If errors occur during an ingestion run and you have to reprocess the files, you can use the archived files and save time by not having to retransfer the files to the repository.
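A minimal archiving sketch follows. Local directories stand in for the file-storage repository, and the function name and dated-folder layout are assumptions, not a Pega convention:

```python
import shutil
from datetime import date
from pathlib import Path

def archive_processed_files(staging_dir, archive_root):
    """Move processed files into a dated archive folder so that a
    failed run can be reprocessed from the archive instead of
    retransferring the files to the repository."""
    archive_dir = Path(archive_root) / date.today().isoformat()
    archive_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for f in sorted(Path(staging_dir).iterdir()):
        if f.is_file():
            shutil.move(str(f), str(archive_dir / f.name))
            moved.append(f.name)
    return moved
```

After a successful run, the staging folder is left empty and the archive folder holds one dated copy of each input file.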
