File types used in data ingestion

Updated on September 21, 2021

Ingesting customer data into Pega Customer Decision Hub on Pega Cloud involves three types of files: data files, manifest files, and token files.

Before you configure the ingestion process, familiarize yourself with these file types, their functions, and formats. Establish the recommended file naming conventions and folder structures.

Data files

These files contain the actual customer data and are generated by the client's extract, transform, load (ETL) team. The files might contain new and updated customer data, as well as data for purging or deleting existing records. One ingestion run can support one or more data files. As a best practice, divide large files into multiple smaller files to allow parallel transfer and faster processing. Files can be in CSV or JSON format and can be compressed; the .gzip and .zip compression formats are supported. File encryption is supported, but requires custom Java coding.
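The recommendation to divide large files can be illustrated with a minimal Python sketch. It assumes a UTF-8 CSV source with a single header row that is repeated in every part. The function name `split_csv`, the `_000`-style numbering, and the `.csv.gz` suffix for compressed parts are assumptions made for this illustration, not part of the product; use the names and extensions agreed for your file data set.

```python
import csv
import gzip
from pathlib import Path


def split_csv(source: Path, out_dir: Path, rows_per_file: int,
              compress: bool = False) -> list[Path]:
    """Split one CSV file into numbered parts, repeating the header in each part."""
    out_dir.mkdir(parents=True, exist_ok=True)
    suffix = ".csv.gz" if compress else ".csv"   # assumed extension
    opener = gzip.open if compress else open
    parts: list[Path] = []
    handle = None
    writer = None
    with source.open(newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)  # first row holds the field names
        for index, row in enumerate(reader):
            if index % rows_per_file == 0:
                # Start a new part: close the previous one and write the header.
                if handle is not None:
                    handle.close()
                part = out_dir / f"{source.stem}_{len(parts):03d}{suffix}"
                handle = opener(part, "wt", newline="", encoding="utf-8")
                writer = csv.writer(handle)
                writer.writerow(header)
                parts.append(part)
            writer.writerow(row)
    if handle is not None:
        handle.close()
    return parts
```

For example, `split_csv(Path("CustomerDataIngest_04182020.csv"), Path("out"), rows_per_file=500_000)` would write `CustomerDataIngest_04182020_000.csv`, `CustomerDataIngest_04182020_001.csv`, and so on, into the `out` folder.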

The following table shows sample content of a CSV file for adding or updating customer data:

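| customer_id | first_name | last_name | middle_initial | clv_value | customer_type | billing_city | billing_state | billing_zip |
|---|---|---|---|---|---|---|---|---|
| 811162399 | Bill | Smith | O | 3 | I | TAYLOR | MI | 48180 |
| 443470562 | John | Kennel | F | 11 | I | PORTLAND | OR | 97266 |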

The following table shows sample content of a CSV file for deleting or purging customer data:

customer_id
811162389
443470534

Manifest files

The manifest file contains metadata about the data files being transferred. There is one manifest record for each ingestion or purge to be transferred, for example, customer data ingestion, customer data purge, account data ingestion, or account data purge. During processing, the file listeners listen for manifest files. Manifest files are in XML format.

The manifest file is backed by a data model that is part of the Data-Ingestion-Manifest class.

The following example shows a typical manifest file:

<?xml version="1.0" ?>
<manifest>
  <processType>CustomerDataIngest</processType>
  <totalRecordCount>1300</totalRecordCount>
  <files>
    <file>
      <name>CustomerDataIngest_MMDDYYYY_000.csv</name>
      <size>149613</size>
      <recordCount>700</recordCount>
    </file>
    <file>
      <name>CustomerDataIngest_MMDDYYYY_001.csv</name>
      <size>125613</size>
      <recordCount>600</recordCount>
    </file>
  </files>
</manifest>

| Field | Description | Example |
|---|---|---|
| processType | Type of data being loaded. This field also identifies the data flow to run. | CustomerDataIngest, AccountDataIngest |
| totalRecordCount | Total number of records across all data files. | 1300 |
| recordCount | Record count for one file. | 700, 600 |
| name | Name of the data file that needs to be agreed with the client and configured in the file data set. The name can have suffix substitution. | CustomerDataIngest_MMDDYYYY_000.csv |
| size | Size for one file in bytes. This field is optional and used if there is a need to do file size validation. Additional process work is required if size validation must be performed. | 149613 |
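On the ETL side, a manifest with the fields described above could be generated with a short script. The following Python sketch is one possible approach, not a Pega utility. The function names are hypothetical, and the sketch assumes that recordCount excludes the CSV header row, which is not stated here; confirm the counting rule with your implementation team.

```python
import csv
import xml.etree.ElementTree as ET
from pathlib import Path


def count_records(path: Path) -> int:
    """Count the data rows in a CSV file (assumption: the header row is not counted)."""
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)


def build_manifest(process_type: str, data_files: list[Path]) -> ET.Element:
    """Build a <manifest> element that describes the given data files."""
    manifest = ET.Element("manifest")
    ET.SubElement(manifest, "processType").text = process_type
    total = ET.SubElement(manifest, "totalRecordCount")
    files = ET.SubElement(manifest, "files")
    total_count = 0
    for path in data_files:
        records = count_records(path)
        total_count += records
        entry = ET.SubElement(files, "file")
        ET.SubElement(entry, "name").text = path.name
        ET.SubElement(entry, "size").text = str(path.stat().st_size)  # size in bytes
        ET.SubElement(entry, "recordCount").text = str(records)
    total.text = str(total_count)  # sum of the per-file record counts
    return manifest


def write_manifest(manifest: ET.Element, target: Path) -> None:
    """Write the manifest as indented XML with an XML declaration."""
    ET.indent(manifest)  # requires Python 3.9 or later
    ET.ElementTree(manifest).write(str(target), encoding="utf-8", xml_declaration=True)
```

Deriving totalRecordCount from the per-file counts keeps the two values consistent by construction.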

Token files

Token files signal that the transfer of a data file is complete. Token files are optional if your SFTP client application can be configured to send the manifest file only after it has successfully transmitted all data files. Otherwise, token files are required. Token files are typically generated by the SFTP client application when it completes the transfer of a file. There is one token file for each data file transferred. A token file does not need to contain any data; only the presence or absence of the file is checked.

The manifest file and the data files are transferred to the Amazon S3 location through SFTP. Manifest files are typically very small (a few kilobytes), while the data files are typically quite large (multiple gigabytes). As a best practice, large files are divided into smaller files that the SFTP client application can transfer in parallel. As a result, the manifest file typically finishes transferring before the data files.

Pega Platform file listeners are configured to listen for and process the manifest files. When a manifest file arrives ahead of its data files, the listener processes it almost immediately, which in turn starts the data flow. The data flow then fails because the data file transmission has not yet completed.

Hence, it is critical that the manifest file is sent last. If this cannot be guaranteed by your SFTP client application, then additional work by the SFTP team is required to create a token file and send it after every data file is successfully transferred. The ingestion case is configured to wait until all the token files have arrived before invoking the data flow to start loading the data.
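If you control the sending side, the "manifest last" rule can be enforced in the transfer script itself. The following Python sketch shows the ordering only. The `upload` callable is a placeholder for whatever your SFTP client library provides, and `send_batch` is a hypothetical name; neither is part of Pega Platform.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


def send_batch(data_files: list[Path], manifest: Path,
               upload: Callable[[Path], None], workers: int = 4) -> None:
    """Upload all data files in parallel, then send the manifest last."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() consumes every result, so an exception raised by any
        # upload is re-raised here and the manifest is never sent.
        list(pool.map(upload, data_files))
    # Reached only when every data file upload returned without error.
    upload(manifest)
```

Because the manifest is sent only after all data file uploads have returned, the file listener cannot start the data flow against a partially transferred batch.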

For example, after DataIngestion_04182020_1.csv is successfully transmitted, the SFTP application creates a token file named DataIngestion_04182020_1.csv.tok or DataIngestion_04182020_1.csv.done.
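The wait-for-tokens logic can be illustrated outside of Pega with a small check that lists the data files named in a manifest for which no token file has arrived yet. This Python sketch is illustrative only: the ingestion case performs the actual wait, and the function name, the local folder layout, and the choice to accept either the .tok or the .done suffix are assumptions.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Token suffixes taken from the example above; use whichever one you agreed on.
TOKEN_SUFFIXES = (".tok", ".done")


def missing_tokens(manifest_path: Path, landing_dir: Path) -> list[str]:
    """Return the data file names from the manifest that have no token file yet."""
    root = ET.parse(manifest_path).getroot()
    names = [el.text.strip() for el in root.findall("./files/file/name") if el.text]
    missing = []
    for name in names:
        # Only the presence of the token file matters, not its content.
        has_token = any((landing_dir / f"{name}{suffix}").exists()
                        for suffix in TOKEN_SUFFIXES)
        if not has_token:
            missing.append(name)
    return missing
```

An empty list means that every data file listed in the manifest has a matching token file, so loading can start.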

File naming convention and folder structure

Both the file listeners and file data sets use pattern matching based on file names. As a result, the following file types, naming conventions, and folder structures must be agreed in advance:

  • Naming convention of manifest files
  • XML as the format of manifest files
  • Naming convention of data files
  • Format of data files: CSV or JSON
    • For CSV:
      • Header row (field names to be mapped to the Spine tables and xCAR data sets)
      • Delimiter
    • For JSON: Ensure that property names match the names of the fields of the Spine tables and xCAR data sets.
  • Folder location for manifest and data files
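
Because processing depends on file name patterns, it can help to validate names on the sending side before transfer. The following Python sketch checks names against one possible convention, modeled on the examples in this article: a process name, an MMDDYYYY date, and a numeric sequence. The pattern and the function name are assumptions; replace them with the convention that you agree on with the client.

```python
import re
from datetime import date, datetime
from typing import Optional

# Assumed convention: <ProcessName>_<MMDDYYYY>_<sequence>.<csv|json>
DATA_FILE_PATTERN = re.compile(
    r"^(?P<process>[A-Za-z]+)_(?P<date>\d{8})_(?P<seq>\d+)\.(?P<ext>csv|json)$"
)


def parse_data_file_name(name: str) -> Optional[tuple[str, date, int]]:
    """Return (process name, run date, sequence), or None if the name does not conform."""
    match = DATA_FILE_PATTERN.match(name)
    if match is None:
        return None
    try:
        # Reject digit strings that are not real calendar dates.
        run_date = datetime.strptime(match["date"], "%m%d%Y").date()
    except ValueError:
        return None
    return match["process"], run_date, int(match["seq"])
```

With this assumed pattern, `CustomerDataIngest_04182020_000.csv` is accepted, while a name with a different extension or an invalid date is rejected.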