
DATALAKE CATALOG

USER GUIDE
Data Ingestion process

Source Datastore → DWH S3 Bucket → Destination (DS2, DS4)

Types of sources supported by DWH

1. MySQL database (DB)
2. PostgreSQL (DB)
3. S3 bucket
4. Oracle
5. Salesforce
6. MSSQL
7. SFTP
8. Kafka
9. Zendesk

DWH S3 Bucket: The ownership of DWH for data starts from here. The SLO (defined subsequently) is reported for this datastore. The Data Catalog reports from here and considers the source as the source datastore in the previous step.

• DS2 and DS4 (also referred to as Hive by business users) are two separate datastores. DS2 can be accessed only by employees of PBPL, while DS4 is accessible to employees of OCL and PEPL.
• Access to these datasets happens through the Hue interface.
Datalake Catalog
The URL: https://datalake.mypaytm.com gives access to the data catalog.

Click Data Catalog to open it.

The catalog provides


• a complete list of all the datasets ingested by the DWH team (i.e. those datasets that can be viewed in DS2 and DS4)
• a summary view of where each dataset can be accessed (i.e. in DS2 and/or DS4) and, where accessible, the name given to the dataset there
• details on the source from which the datasets of DS2 and DS4 are populated, the frequency of population, the last update date at source, the last date of ingestion, etc.
Datalake Catalog: Browsing Metadata of Various Datasets (1)

Type the name of the dataset to explore here. Part of the name is enough to find relevant datasets; the list below is populated accordingly. Note that datasets are listed under the names by which they are known at source, not their names in Hive (i.e. search for “marketplace.shippers” rather than for a term containing the substring “snapshot_v3”).
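The search behaviour described above can be pictured with a minimal sketch. The guide does not document how the catalog actually matches names, so the case-insensitive substring match below is an assumption, and every dataset name other than marketplace.shippers is hypothetical.

```python
# Hypothetical stand-in for the catalog's dataset list (source-side names).
CATALOG_NAMES = [
    "marketplace.shippers",   # name used in this guide
    "marketplace.orders",     # hypothetical
    "wallet.transactions",    # hypothetical
]


def search_datasets(query, names=CATALOG_NAMES):
    """Return the source-side names that contain the query (assumed case-insensitive)."""
    needle = query.strip().lower()
    if not needle:
        return list(names)  # an empty query lists everything
    return [name for name in names if needle in name.lower()]


print(search_datasets("shippers"))     # ['marketplace.shippers']
print(search_datasets("snapshot_v3"))  # [] because Hive-style names are not searchable
```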
Datalake Catalog: Browsing Metadata of Various Datasets (2)
Click the table to explore further

Green means the table is being ingested into S3 within the SLO (service level obligation).
It shows that the table is being populated from a MySQL data source.

It shows that the table was last ingested 8 hours ago (ignore the timestamp).

It shows that the source data from which the table is populated was last updated 3 years ago, i.e. the data source of the table has not been updated since then.

This shows that the dataset is populated by “INGESTION”, i.e. it is a raw dataset. Other possible values are OLAP and Fact.

Ingest type can be incremental (INCR) or full. INCR is generally for transactional tables and full for mapping tables.
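To make the difference between the two ingest types concrete, here is a minimal sketch. The guide does not describe how DWH implements either mode, so the watermark approach and the column names `id` and `updated_at` are assumptions chosen purely for illustration.

```python
def ingest_full(source_rows):
    """Full ingest: the destination becomes a fresh copy of the whole source table."""
    return list(source_rows)


def ingest_incremental(destination_rows, source_rows, key="id", ts="updated_at"):
    """Incremental ingest (assumed watermark style): copy only rows newer than
    the latest timestamp already present at the destination, upserting by key."""
    watermark = max((row[ts] for row in destination_rows), default=None)
    merged = {row[key]: row for row in destination_rows}
    for row in source_rows:
        if watermark is None or row[ts] > watermark:
            merged[row[key]] = row
    return list(merged.values())


# Hypothetical rows; integers stand in for timestamps.
source_v1 = [{"id": 1, "updated_at": 10}, {"id": 2, "updated_at": 20}]
destination = ingest_full(source_v1)

source_v2 = [
    {"id": 1, "updated_at": 10},
    {"id": 2, "updated_at": 25},  # changed since the last run
    {"id": 3, "updated_at": 30},  # new row
]
destination = ingest_incremental(destination, source_v2)
print(destination)
```

In this sketch the incremental run touches only rows 2 and 3, which is why the mode suits large transactional tables, while a small mapping table can simply be recopied in full. Rows deleted at source are not detected by this incremental sketch.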

Active datasets are those for which ingestion happens according to the SLO.

SLO means the number of minutes within which ingestion of the dataset is expected to happen.
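The SLO check can be sketched as a comparison between the time since the last ingestion and the SLO in minutes. The catalog's exact rule is not given in this guide, so the function below and the SLO values used with it are hypothetical; only the "8 hours ago" figure comes from the example above.

```python
from datetime import datetime, timedelta


def is_within_slo(last_ingested_at, slo_minutes, now):
    """True when the last ingestion is no older than the SLO window (assumed rule)."""
    return now - last_ingested_at <= timedelta(minutes=slo_minutes)


now = datetime(2024, 1, 1, 12, 0)         # fixed "current time" for the example
last_ingested = now - timedelta(hours=8)  # "last ingested 8 hours ago"

print(is_within_slo(last_ingested, 600, now))  # True: 480 minutes <= 600
print(is_within_slo(last_ingested, 60, now))   # False: 480 minutes > 60
```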
Datalake Catalog: Knowing Metadata of a Particular Dataset
This is the name of the dataset entity in the dictionary.

Specifies that the dataset is being ingested.


Ingestion is happening in full.
When the data was last ingested.
The maximum timestamp observed at source in the ingested dataset.
When the definition (e.g. SLO) of the data source was last updated by DWH.

How many minutes normally pass between ingestion runs.

The information in the snapshot is for the DWH-owned data in the S3 bucket. Refer to the slide for clarity.
Datalake Catalog: Knowing Metadata of a Particular Dataset (Source)

When the dataset was created in the DWH system.

This is the source of the dataset. Here, for example, marketplace.shippers is populated from the SQL source mktplace.shippers.
The source of the dataset is MySQL.
Datalake Catalog: Knowing Metadata of a Particular Dataset (Destinations)

Lists the destinations where the dataset is present. In this snapshot, for example, it is in DS2.

ID of the dataset at destination


The destination name where the dataset is present
ID of the dataset at source

Name of the dataset at the destination, i.e. the table name to use when accessing it in DS2.

This is when the dataset was last synced in DS2. Note that for this dataset the ingestion was of FULL type, i.e. the whole table is ingested.

Similarly, there are other destinations such as S3 and DS4.
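The destination details above amount to a lookup from a source-side dataset name to the table name used at each destination. The sketch below illustrates that idea only; the record layout, the destination table name, and the helper function are all hypothetical and are not taken from the catalog.

```python
from dataclasses import dataclass


@dataclass
class DestinationEntry:
    destination: str       # e.g. "DS2", "DS4" or "S3"
    destination_name: str  # table name to use at that destination
    ingest_type: str       # "FULL" or "INCR"


# Hypothetical catalog content; the destination table name is invented.
DESTINATIONS = {
    "marketplace.shippers": [
        DestinationEntry("DS2", "hypothetical_db.shippers_snapshot_v3", "FULL"),
    ],
}


def table_name_at(dataset, destination, entries=DESTINATIONS):
    """Return the table name for a dataset at a destination, or None if absent."""
    for entry in entries.get(dataset, []):
        if entry.destination == destination:
            return entry.destination_name
    return None


print(table_name_at("marketplace.shippers", "DS2"))  # hypothetical_db.shippers_snapshot_v3
print(table_name_at("marketplace.shippers", "DS4"))  # None: not present there in this example
```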


Datalake Catalog: Knowing Metadata of a Particular Dataset (Schema)

This section lists the fields present in the dataset and their types. It also mentions the primary key of the table.
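As an illustration of what the schema section conveys, the sketch below models a list of fields with types and a primary-key flag. The field names and types are hypothetical and are not taken from any real dataset in the catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    type: str
    is_primary_key: bool = False


# Hypothetical schema for illustration only.
SCHEMA = [
    Field("id", "bigint", is_primary_key=True),
    Field("name", "varchar"),
    Field("updated_at", "timestamp"),
]


def primary_key(fields):
    """Return the names of the fields flagged as primary key."""
    return [field.name for field in fields if field.is_primary_key]


print(primary_key(SCHEMA))  # ['id']
```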
