Version: 2.3

5.1 Data Pump development

General intended use of Data Pumps

Data Pumps are designed to load data into the Data Context Hub. The data source can be a database, a file system, or any other source accessible from the Data Context Hub environment. Loading data with a Data Pump is a long-running process, and the job queue scheduler executes only one Data Pump job at a time. While a Data Pump job is running on the job queue scheduler, no other job can be processed in parallel.

While Data Pump jobs are processed, the environment imposes no resource restrictions (e.g. memory, disk space …). In other words, it is up to you to make sure that a Data Pump does not exceed the resources of the worker environment.

To optimize the workload, set up several containers instead of running one single import, where possible.

Requirements

Development environment

  • Language: C#
  • Application output type: Class Library (.dll)
  • Target Framework: .NET 8.0
  • No target operating system

Dependencies

The following dependencies are required for developing a Data Pump and are provided as NuGet packages through GitLab. In order to access the packages an Access Token is required.

  • Explore.DataPumps: Base implementation of DataPumps
  • Explore.Common.SharedResources: Common shared resources (e.g. DataPumpSourceMap, DataPumpArtifact etc.)
Info: These two NuGet packages are also part of GBS and don't need to be uploaded.

Packages can be downloaded with the following commands:

# Get available versions
curl 'https://gitlab.c64.ai/api/v4/projects/6/packages/nuget/download/<package-name>/index' \
--header 'PRIVATE-TOKEN: <access-token>'

# Download package
curl 'https://gitlab.c64.ai/api/v4/projects/6/packages/nuget/download/<package-name>/<version>/<filename.version.nupkg>' \
--header 'PRIVATE-TOKEN: <access-token>'

The filename of a package is its lower-case variant, e.g. the filename for Explore.DataPumps is explore.datapumps.
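The URL pattern and the lower-case filename rule can be combined in a small helper. This is only a sketch: the class and method names are hypothetical, and it assumes that the <filename.version.nupkg> placeholder expands to the lower-case package name followed by the version and ".nupkg", which the text implies but does not state outright.

```csharp
using System;

// Hypothetical helper, not part of the Explore packages.
public static class PackageUrls
{
    // Base path taken from the curl commands above.
    private const string BaseUrl =
        "https://gitlab.c64.ai/api/v4/projects/6/packages/nuget/download";

    // URL that lists the available versions of a package.
    public static string VersionIndex(string packageName) =>
        $"{BaseUrl}/{packageName}/index";

    // URL of a specific .nupkg file. Assumption: the file name is the
    // lower-case package name, then the version, then ".nupkg".
    public static string Download(string packageName, string version)
    {
        var fileName = $"{packageName.ToLowerInvariant()}.{version}.nupkg";
        return $"{BaseUrl}/{packageName}/{version}/{fileName}";
    }
}
```

A request to either URL still needs the PRIVATE-TOKEN header with a valid access token, as in the curl commands.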

Check the following compatibility chart to select the correct package version:

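| Package                        | Version | GBS Compatibility |
|--------------------------------|---------|-------------------|
| Explore.DataPumps              | 2.0.3   | >= 2.0.x          |
| Explore.DataPumps              | 3.0.1   | 2.1.x - 2.2.x     |
| Explore.DataPumps              | 4.x     | >= 2.3.x          |
| Explore.Common.SharedResources | 2.0.4   | >= 2.x            |
| Explore.Common.SharedResources | 3.0.0   | 2.1.x - 2.2.x     |
| Explore.Common.SharedResources | 4.x     | >= 2.3.x          |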

Data Pump Interface

The following sections explain the structure of the Data Pump interface that every Data Pump is based on.

Parameters

public override string Parameters { get; }

A serialized list of the parameters needed to connect a Data Pump to its data source. It must be set in the constructor of the Data Pump:

// Requires: using System; using System.Collections.Generic; using System.Text.Json;

// Stays null for customer-written Data Pumps (see Module key).
private readonly string MODULE_KEY = null;

public override Uri ApiUrl { get; }
public override string Ident { get; } = Guid.Parse("00000000-0000-0000-0000-000000000000").ToString();
public override string Parameters { get; }

public DataPumpExample()
{
    this.ApiUrl = new Uri("https://localhost:4040");

    var parameters = new List<DataPumpParameter>
    {
        new () { Id = "xpl_dp_param_parameter_1", Name = "Parameter 1", Value = "", ForceReInitialize = true, Description = "Description", DisplayInContext = true },
        new () { Id = "xpl_dp_param_parameter_2", Name = "Parameter 2", Value = "", IsMasked = true, Description = "Description", DisplayInContext = true },
        new () { Id = "xpl_dp_param_parameter_3", Name = "Parameter 3", Value = "", IsMultiline = true, Description = "Description" }
    };

    // The parameter list is exposed as a JSON string.
    this.Parameters = JsonSerializer.Serialize(parameters);
}
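The other interface methods receive the same parameters as a List<DataPumpParameter>, filled with the values entered in the GBS UI. The following sketch shows one way to read a value by its Id and to round-trip the list through JSON. The DataPumpParameter class here is a reduced stand-in with an assumed shape, and ParameterReader is a hypothetical helper; neither is taken from the Explore packages.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Stand-in for Explore.DataPumps.Entities.DataPumpParameter, reduced to a
// few of the properties used in the constructor example. The real class may differ.
public class DataPumpParameter
{
    public string Id { get; set; } = "";
    public string Name { get; set; } = "";
    public string Value { get; set; } = "";
    public bool IsMasked { get; set; }
}

// Hypothetical helper for working with parameter lists.
public static class ParameterReader
{
    // Returns the trimmed value of the parameter with the given Id and
    // throws if it is missing or empty, so a misconfiguration fails early.
    public static string Require(IEnumerable<DataPumpParameter> parameters, string id)
    {
        var value = parameters.FirstOrDefault(p => p.Id == id)?.Value?.Trim();
        if (string.IsNullOrEmpty(value))
            throw new ArgumentException($"Parameter '{id}' is missing or empty.");
        return value;
    }

    // Reads a list back from the JSON format produced by JsonSerializer.Serialize.
    public static List<DataPumpParameter> FromJson(string json) =>
        JsonSerializer.Deserialize<List<DataPumpParameter>>(json) ?? new List<DataPumpParameter>();
}
```

Failing early on a missing value keeps connection errors close to their cause, which matters for long-running jobs.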

Version

A unique version number that must be increased with every change to the Data Pump, following Semantic Versioning.

The version number could be represented in a text file VERSION linked to the project:

VERSION
1.0.0
.csproj
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>net8.0</TargetFramework>
    <Version>$(PackageVersion)</Version>
  </PropertyGroup>

  <ItemGroup>
    <!-- Copy version file to the package -->
    <None Include="../../VERSION" Pack="true" CopyToOutputDirectory="Always" PackagePath="/" />
  </ItemGroup>
</Project>

GetVersion() implementation

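// Requires: using System.Reflection;
public override string GetVersion()
{
    // Version.Build is the third component, i.e. the patch number.
    var version = Assembly.GetExecutingAssembly().GetName().Version;
    return version == null ? "undefined" : $"{version.Major}.{version.Minor}.{version.Build}";
}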
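If the version is kept in a VERSION text file, it can help to check that the content really has the MAJOR.MINOR.PATCH form before it is used. The following is a minimal sketch with a hypothetical helper name; it accepts plain three-part versions only and does not handle Semantic Versioning pre-release or build suffixes.

```csharp
using System;

// Hypothetical helper for validating the content of a VERSION file.
public static class VersionText
{
    // Returns "MAJOR.MINOR.PATCH" if the text has exactly three numeric
    // components, otherwise "undefined" (the same fallback GetVersion() uses).
    public static string Normalize(string text)
    {
        // System.Version reports missing components as -1.
        if (Version.TryParse(text?.Trim(), out var v) && v.Build >= 0 && v.Revision < 0)
            return $"{v.Major}.{v.Minor}.{v.Build}";
        return "undefined";
    }
}
```

A typical call would pass the file content, for example VersionText.Normalize(File.ReadAllText(path)), where path points to the VERSION file.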

Module key

Returns the license module key. It must be null for customer-written Data Pumps.

Ident

The Ident is a unique identifier represented as a GUID with the following syntax: 00000000-0000-0000-0000-000000000000.
Every Data Pump with a new version number must get a different Ident. You can’t upload two Data Pumps with the same Ident to GBS.
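A new Ident can be produced with the standard .NET Guid type. The sketch below uses hypothetical helper names; treating the all-zero GUID as "not yet assigned" is an assumption made for this example, not a documented GBS rule.

```csharp
using System;

// Hypothetical helper for creating and checking Ident values.
public static class IdentHelper
{
    // Generates a fresh GUID in the hyphenated 8-4-4-4-12 format.
    public static string NewIdent() => Guid.NewGuid().ToString("D");

    // True if the text is a well-formed hyphenated GUID other than the
    // all-zero placeholder (assumption: the placeholder is not a valid Ident).
    public static bool IsAssigned(string ident) =>
        Guid.TryParseExact(ident, "D", out var guid) && guid != Guid.Empty;
}
```

The generated value is meant to be created once per Data Pump version and pasted into the Ident property as a literal. Generating it at runtime would produce a different Ident on every load.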

GetEntityListAsync()

public abstract Task<List<string>> GetEntityListAsync(List<Explore.DataPumps.Entities.DataPumpParameter> parameters)

Returns the entity names that the loading process would provide for the given parameter set. It should be implemented as a shortcut that retrieves only the entity names. For a database, for example, this function would return the names of all tables found as entities.

  • parameters: a List<DataPumpParameter> provided by the GBS UI that is needed to connect to the data source (see Parameters)
  • entityNames: the list of entity names retrieved by GetEntityListAsync; it is passed to the methods described below
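As an illustration of the "names only" idea, the following sketch treats every *.csv file in a folder as one entity. Everything in it is hypothetical: the file-based source, the parameter Id xpl_dp_param_folder, and the reduced DataPumpParameter stand-in. In a real Data Pump the method would be declared as an override of the abstract method shown above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Stand-in for Explore.DataPumps.Entities.DataPumpParameter (assumed shape).
public class DataPumpParameter
{
    public string Id { get; set; } = "";
    public string Value { get; set; } = "";
}

// Hypothetical file-based source: each *.csv file in a folder is one entity.
public class CsvFolderEntities
{
    // Invented parameter Id for this sketch.
    private const string FolderParameterId = "xpl_dp_param_folder";

    public Task<List<string>> GetEntityListAsync(List<DataPumpParameter> parameters)
    {
        var folder = parameters.FirstOrDefault(p => p.Id == FolderParameterId)?.Value;
        if (string.IsNullOrWhiteSpace(folder) || !Directory.Exists(folder))
            return Task.FromResult(new List<string>());

        // Only file names are read; no file content is loaded at this stage.
        var names = Directory.EnumerateFiles(folder, "*.csv")
            .Select(file => Path.GetFileNameWithoutExtension(file) ?? "")
            .Where(name => name.Length > 0)
            .OrderBy(name => name, StringComparer.OrdinalIgnoreCase)
            .ToList();
        return Task.FromResult(names);
    }
}
```

Returning an empty list for an unknown folder is one possible choice; throwing an exception instead would surface configuration errors earlier.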

PreLoadAsync()

public abstract Task<List<Explore.DataPumps.Entities.DataPumpSourceMap>> PreLoadAsync(List<Explore.DataPumps.Entities.DataPumpParameter> parameters, List<string> entityNames)

This function is used to initialize containers. It is a shortcut that retrieves a small dataset, which is later mapped to the data entities and their columns, e.g. 15 rows from each database table.
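How a DataPumpSourceMap is populated is not covered in this section, so the following sketch only shows the sampling step on a System.Data.DataTable. The helper name is hypothetical, and the default of 15 rows is taken from the example above.

```csharp
using System;
using System.Data;

// Hypothetical helper that cuts a table down to a small preview.
public static class PreviewRows
{
    // Returns a new table with the same columns and at most maxRows rows.
    public static DataTable Take(DataTable source, int maxRows = 15)
    {
        var preview = source.Clone();              // copies the schema only
        var count = Math.Min(maxRows, source.Rows.Count);
        for (var i = 0; i < count; i++)
            preview.ImportRow(source.Rows[i]);     // copies the row values
        return preview;
    }
}
```

For a database source it is usually cheaper to limit the rows in the query itself, so that the full table never has to be transferred just to build the preview.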

LoadAsync()

public abstract Task<List<Explore.DataPumps.Entities.DataPumpSourceMap>> LoadAsync(List<Explore.DataPumps.Entities.DataPumpParameter> parameters, List<string> entityNames)

The general entry point to retrieve data from the data source. Within this function the loading of data and any additional data enrichment can be done.

GetArtifactsAsync()

public abstract Task<List<Explore.DataPumps.Entities.DataPumpArtifact>> GetArtifactsAsync(List<Explore.DataPumps.Entities.DataPumpParameter> parameters, List<string> entityNames, System.Data.DataTable currentRowsInRepository = null)

Returns all found file artifacts from the source. E.g. all attachments from Jira issues.

  • currentRowsInRepository is the actual data of the target entity in GBS.

DataPumpArtifact

The DataPumpArtifact class represents an artifact and has the following structure:

| Variable Name      | Description                                                                                                         |
|--------------------|---------------------------------------------------------------------------------------------------------------------|
| Entity             | Found entity name from GetEntityListAsync (e.g. Jira Issue Key)                                                     |
| ArtifactTypeID     | Type of Artifact (see list below)                                                                                   |
| DataMapSourceIndex | Index of row in currentRowsInRepository that contains the entity name (e.g. Jira Issue Key) => if -1 not found      |
| Value              | Found artifact value (e.g. whole Url to file attachment)                                                            |
| Title              | Title of the artifact (e.g. file.png)                                                                               |

The following list contains all artifact types supported by GBS. Data Context Hub Explorer offers different operations depending on the artifact type.

  • Unknown = 1
  • Video = 2
  • Text = 3
  • Image = 4
  • GeometryFile = 5
  • DataSet = 8
  • Number = 9
  • PDFDocument = 10
  • Link = 11
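The artifact fields and type values above could be put together as in the following sketch. The DataPumpArtifact class is a stand-in with assumed property types, the enum name ArtifactType and the extension mapping are invented for illustration, and the key column name is passed in by the caller because the layout of currentRowsInRepository is not described here.

```csharp
using System;
using System.Data;
using System.IO;

// Values copied from the artifact type list above; the enum name is assumed.
public enum ArtifactType
{
    Unknown = 1, Video = 2, Text = 3, Image = 4, GeometryFile = 5,
    DataSet = 8, Number = 9, PDFDocument = 10, Link = 11
}

// Stand-in for DataPumpArtifact; property types are assumptions.
public class DataPumpArtifact
{
    public string Entity { get; set; } = "";
    public int ArtifactTypeID { get; set; }
    public int DataMapSourceIndex { get; set; } = -1;
    public string Value { get; set; } = "";
    public string Title { get; set; } = "";
}

public static class ArtifactBuilder
{
    // Index of the first row whose keyColumn equals key, or -1 if none matches.
    public static int FindRowIndex(DataTable rows, string keyColumn, string key)
    {
        if (rows == null || !rows.Columns.Contains(keyColumn)) return -1;
        for (var i = 0; i < rows.Rows.Count; i++)
            if (string.Equals(rows.Rows[i][keyColumn]?.ToString(), key, StringComparison.Ordinal))
                return i;
        return -1;
    }

    // Illustrative mapping from file extension to artifact type.
    public static ArtifactType TypeFromTitle(string title) =>
        (Path.GetExtension(title) ?? "").ToLowerInvariant() switch
        {
            ".png" or ".jpg" or ".jpeg" => ArtifactType.Image,
            ".pdf" => ArtifactType.PDFDocument,
            ".mp4" => ArtifactType.Video,
            ".txt" => ArtifactType.Text,
            _ => ArtifactType.Unknown
        };

    public static DataPumpArtifact Create(DataTable currentRows, string keyColumn,
                                          string key, string url, string title) =>
        new DataPumpArtifact
        {
            Entity = key,
            ArtifactTypeID = (int)TypeFromTitle(title),
            DataMapSourceIndex = FindRowIndex(currentRows, keyColumn, key),
            Value = url,
            Title = title
        };
}
```

A DataMapSourceIndex of -1 signals that the key was not found in the rows currently stored in GBS, matching the "if -1 not found" convention from the table.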

Logging

Several logging levels are available in the DataPumpBase class. The logs are shown in the GBS Log.

  • protected void LogDebug<T>(string message, T source, [CallerMemberName] string callerMethod = "")
  • protected void LogInfo<T>(string message, T source, [CallerMemberName] string callerMethod = "")
  • protected void LogWarn<T>(string message, T source, [CallerMemberName] string callerMethod = "")
  • protected void LogError<T>(string message, Exception ex, T source, [CallerMemberName] string callerMethod = "")
  • protected void LogException<T>(string message, Exception ex, T source, [CallerMemberName] string callerMethod = "")
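The sketch below illustrates how these methods are typically called and how [CallerMemberName] fills in the calling method automatically. DataPumpBaseStandIn is a stand-in that copies two of the signatures and collects lines in memory; the real DataPumpBase writes to the GBS Log, and its output format is not known here.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;

// Stand-in for DataPumpBase with two of the signatures listed above.
// The line format is invented for this sketch.
public abstract class DataPumpBaseStandIn
{
    public List<string> Lines { get; } = new List<string>();

    protected void LogInfo<T>(string message, T source, [CallerMemberName] string callerMethod = "") =>
        Lines.Add($"INFO {typeof(T).Name}.{callerMethod}: {message}");

    protected void LogError<T>(string message, Exception ex, T source, [CallerMemberName] string callerMethod = "") =>
        Lines.Add($"ERROR {typeof(T).Name}.{callerMethod}: {message} ({ex.Message})");
}

public class LoggingPump : DataPumpBaseStandIn
{
    // readRows stands for any operation that loads data and returns a row count.
    public void Load(Func<int> readRows)
    {
        // callerMethod is not passed; the compiler inserts "Load".
        LogInfo("Load started", this);
        try
        {
            var rows = readRows();
            LogInfo($"Loaded {rows} rows", this);
        }
        catch (Exception ex)
        {
            LogError("Load failed", ex, this);
        }
    }
}
```

Passing this as source lets the generic parameter carry the type of the Data Pump. Whether a failed load should rethrow after logging depends on how the job is expected to be reported, which this sketch leaves open.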