USER GUIDE
Data Ingestion process
DS2
• DS2 and DS4 (also referred to as Hive by business users) are two separate datastores. DS2 can be accessed only by employees of PBPL, while DS4 is accessible to employees of OCL and PEPL.
• Both datastores are accessed through the Hue interface.
Datalake Catalog
The data catalog is available at https://datalake.mypaytm.com.
Type the name of the dataset you want to explore in the search field. Part of the name is enough to get
the relevant datasets, and the list below is populated accordingly. Note that datasets are listed under
the names by which they are popularly known at the source, not under their Hive names (i.e. do not search
for terms containing a substring such as “snapshot_v3”; search for “marketplace.shippers” instead).
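The partial-name lookup described above can be pictured with a minimal Python sketch. It assumes a plain case-insensitive substring match over source names; the function `search_catalog` and every catalog entry other than marketplace.shippers are hypothetical, and the real catalog may match differently.

```python
def search_catalog(query, source_names):
    """Return the source names that contain the query (case-insensitive)."""
    needle = query.strip().lower()
    if not needle:
        return []
    return [name for name in source_names if needle in name.lower()]


# Hypothetical catalog contents; only marketplace.shippers appears in this guide.
catalog = ["marketplace.shippers", "marketplace.orders", "payments.refunds"]

print(search_catalog("shippers", catalog))     # ['marketplace.shippers']
print(search_catalog("snapshot_v3", catalog))  # [] because Hive names are not indexed here
```

Searching with the source name finds the dataset, while a Hive-style term such as “snapshot_v3” returns nothing in this sketch.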
Datalake Catalog: Browsing Metadata of Various Datasets (2)
Click the table to explore further
Green means that the dataset is being ingested into S3 within the SLO (service level obligation).
It shows that the table is populated from a MySQL data source.
It shows that the table was last ingested 8 hours ago (ignore the timestamp).
It shows that the table's source data (from which the table is populated) was last updated 3 years ago, i.e. the
data source of the table has not been updated since then.
This shows that the dataset is populated by “INGESTION”, i.e. it is a raw dataset. Other possible values are OLAP and Fact.
The ingest type can be incremental (INCR) or full. INCR is generally used for transactional tables and full for mapping tables.
Active datasets are those for which ingestion happens according to the SLO.
The SLO is the number of minutes within which ingestion of the dataset is expected to happen.
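The relation between the Green status and the SLO can be illustrated with a short Python sketch. It assumes that a dataset counts as within SLO when the time since its last ingestion does not exceed the SLO in minutes; the function `within_slo` and the sample SLO values are hypothetical, and the catalog's actual rule may differ.

```python
from datetime import datetime, timedelta


def within_slo(last_ingested, slo_minutes, now):
    """True when the time since the last ingestion does not exceed the SLO."""
    return now - last_ingested <= timedelta(minutes=slo_minutes)


# Hypothetical values: a table last ingested 8 hours (480 minutes) ago.
now = datetime(2024, 1, 1, 12, 0)
last_ingested = now - timedelta(hours=8)

print(within_slo(last_ingested, slo_minutes=600, now=now))  # True: 480 <= 600
print(within_slo(last_ingested, slo_minutes=240, now=now))  # False: 480 > 240
```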
Datalake Catalog: Knowing Metadata of a Particular Dataset
This is the name of the dataset entity in the dictionary
The information in this snapshot describes the DWH-owned data in the S3 bucket. Refer to the slide for clarity.
Datalake Catalog: Knowing Metadata of a Particular Dataset (Source)
This is the source of the dataset. Here, for example, marketplace.shippers is populated from the SQL source mktplace.shippers.
The source of the dataset is MySQL.
Datalake Catalog: Knowing Metadata of a Particular Dataset (Destinations)
Lists the destinations in which the dataset is present. In this snapshot, for example, it is in DS2.
Name of the dataset at the destination, i.e. this is the table name to use when accessing the dataset in DS2.
This is when the dataset was last synced to DS2. Note that for this dataset the ingestion was of FULL type, i.e. the
whole table is ingested.
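The difference between a FULL ingestion and an incremental (INCR) one can be sketched in Python as follows. This is only an illustration under assumed conventions: rows are dictionaries keyed by a hypothetical primary key `id` and carry a hypothetical `updated_at` value, and neither function reflects how the datalake pipeline is actually implemented.

```python
def full_ingest(source_rows):
    """FULL: rebuild the destination from every row in the source."""
    return {row["id"]: dict(row) for row in source_rows}


def incremental_ingest(destination, source_rows, last_sync):
    """INCR: apply only the rows changed after the last sync."""
    updated = dict(destination)
    for row in source_rows:
        if row["updated_at"] > last_sync:
            updated[row["id"]] = dict(row)
    return updated


# Hypothetical source table at two points in time.
source_v1 = [
    {"id": 1, "updated_at": 10, "name": "a"},
    {"id": 2, "updated_at": 20, "name": "b"},
]
destination = full_ingest(source_v1)

source_v2 = source_v1 + [{"id": 3, "updated_at": 30, "name": "c"}]
destination = incremental_ingest(destination, source_v2, last_sync=20)
print(sorted(destination))  # [1, 2, 3]
```

A full ingest rereads the whole table, which suits small mapping tables; an incremental ingest touches only recent changes, which suits large transactional tables.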
This section lists the fields that are present in the dataset and the type of each field. It also
mentions the primary key of the table.