CWL code-to-code compiler

Created by Mattia Mancini, last modified by Mattia Mancini on 14 May 2024, 06:55

Introduction

ASTRON is the Netherlands Institute for Radio Astronomy. We investigate the signals that the Universe emits in the form of radio waves. Our mission is to make discoveries in radio astronomy happen. We therefore not only carry out fundamental astronomical research, but also design, build and manage some of the world’s leading radio telescopes, and push the boundaries of technology to make increasingly better and more sensitive instruments. To increase the sensitivity of our telescopes, we also strive to develop new, more accurate pipelines to calibrate the data and reduce the noise.

The Low-Frequency Array (LOFAR) is a large radio telescope with an antenna network located mainly in the Netherlands and spreading across seven other European countries. It is currently the largest low-frequency radio telescope worldwide, operating at the lowest frequencies observable from Earth. LOFAR is a multipurpose sensor network with an innovative computer and network infrastructure capable of handling extremely large data volumes. It consists of many receiver stations spread across the Netherlands and Europe that stream high volumes of data over private and public wide-area networks to a central data processing facility in the Netherlands. The massive data volumes generated by LOFAR have to be processed offline using pipelines that run in heterogeneous high-performance computing (HPC) environments.

The Common Workflow Language (hereafter CWL) has been increasingly adopted for radio astronomy workflows for various reasons, such as portability, clearly defined typing and easily documented workflows. However, given the data volumes and performance requirements specific to radio astronomy use cases, standard pipeline executors such as Toil struggle to optimize at run time for a specific system. Other frameworks, such as DASK, provide a very good abstraction for the execution of workflows, but they lack a well-documented structure such as that of CWL. Translating a CWL workflow into a DASK workflow on the fly, and optimizing it for the target execution system, is a promising way to obtain the best of both worlds.

Technologies used in this project

CWL (https://www.commonwl.org/)
Python
Singularity
DASK (https://www.dask.org/)

Goals of the project

This project aims to answer three main questions:

Can we automatically translate CWL into DASK?
Is it possible to optimize a pipeline before it is executed, given the characteristics of the system?
Can this approach be generalized to any CWL pipeline?

What we develop

In this project we will develop a lightweight application that

takes a CWL workflow and a description of a system, and generates a DASK workflow
can, given suggestions or hints, optimize the workflow for coalesced data access (e.g. by minimizing the copying of data between nodes)
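
To make the first point more concrete, the translation step could be sketched roughly as follows. This is only an illustration under several assumptions, not the design of the application: the workflow is given as an already-parsed Python dictionary that mirrors the `steps`, `in`, `out` and `run` fields of a CWL Workflow, the step and tool names (`flag.cwl`, `calibrate.cwl`, `image.cwl`) are hypothetical, and `run_step` is a placeholder that only records what it would run instead of launching a tool. The result is a plain dictionary that follows the DASK task-graph convention (key mapped to a tuple of a callable and its arguments). The small `evaluate` helper exists only so that the sketch runs without DASK installed.

```python
from functools import partial
from operator import itemgetter

# Hypothetical, already-parsed CWL workflow (normally loaded from a YAML file).
WORKFLOW = {
    "class": "Workflow",
    "inputs": {"msin": "Directory"},
    "steps": {
        "flag": {"run": "flag.cwl", "in": {"msin": "msin"}, "out": ["msout"]},
        "calibrate": {"run": "calibrate.cwl", "in": {"msin": "flag/msout"}, "out": ["msout"]},
        "image": {"run": "image.cwl", "in": {"msin": "calibrate/msout"}, "out": ["image"]},
    },
}


def run_step(tool, param_names, output_names, *values):
    """Placeholder for running one CWL step (assumption: a real version
    would launch the tool, for example inside a Singularity container)."""
    inputs = dict(zip(param_names, values))
    # Every declared output simply records which tool produced it and from what.
    return {name: {"tool": tool, "inputs": inputs} for name in output_names}


def cwl_to_graph(workflow, input_values):
    """Build a DASK-style task graph: key -> value, or key -> (callable, *args)."""
    graph = dict(input_values)  # workflow inputs become literal graph entries
    for step_name, step in workflow["steps"].items():
        # Bind tool and parameter names with partial so that they are never
        # mistaken for graph keys; only the sources remain as dependencies.
        runner = partial(run_step, step["run"], tuple(step["in"]), tuple(step["out"]))
        graph[step_name] = (runner, *step["in"].values())
        for out in step["out"]:
            # "step/output" selects one output, matching CWL source references.
            graph[f"{step_name}/{out}"] = (itemgetter(out), step_name)
    return graph


def evaluate(graph, key):
    """Minimal recursive evaluator, standing in for a DASK scheduler."""
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        resolved = [evaluate(graph, a) if isinstance(a, str) and a in graph else a
                    for a in args]
        return func(*resolved)
    return task


graph = cwl_to_graph(WORKFLOW, {"msin": "L123.MS"})
result = evaluate(graph, "image/image")
print(result["tool"])
```

With DASK installed, the same dictionary should be accepted by one of its schedulers, for example `dask.threaded.get(graph, "image/image")`. Keeping the graph as plain data is a deliberate choice in this sketch: an optimization pass that uses the system description could rewrite or annotate the dictionary before anything is executed.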