opensource.google.com

Posts from July 2024

DAGify: Accelerate Your Journey from Control-M to Apache Airflow

Friday, July 26, 2024


In the dynamic world of data engineering and workflow orchestration, organizations are increasingly migrating from legacy enterprise schedulers like Control-M to the open-source powerhouse, Apache Airflow. However, this transition often involves a complex and time-consuming process of converting existing job definitions. DAGify is an open-source tool that addresses this by automating the conversion of Control-M XML files into Airflow's native DAG format.

DAGify isn't just a simple conversion tool; it's a migration accelerator, designed to significantly reduce the manual effort and potential errors associated with transitioning to Airflow. While it might not provide a perfect 1:1 migration in every case, its primary goal is to expedite the process, allowing developers to focus on optimizing their workflows in the new environment.


Introduction

Control-M has served as a reliable workhorse for many organizations, but its proprietary nature and limitations can become roadblocks in today's cloud-centric and agile data landscape. Apache Airflow, with its flexibility, scalability, and thriving community, presents a compelling alternative. However, the migration journey can be daunting, especially when dealing with intricate Control-M job definitions.

DAGify steps in to bridge this gap, offering an intuitive and extensible solution. By automating the conversion process, it empowers organizations to embrace Airflow's capabilities without the burden of manual translation. This translates to faster migrations, reduced errors, and a smoother transition overall.


Technical Details

Under the hood, DAGify employs a template-driven approach, making it adaptable to various Control-M configurations and Airflow requirements. It parses Control-M XML files, extracting information about jobs, dependencies, and schedules. This data is then mapped to Airflow operators, tasks, and dependencies so that the structure of the original workflow is preserved. While still under active development, DAGify already supports key Control-M features like job and dependency mapping. The project roadmap includes further enhancements, such as handling custom calendars and expanding support for other enterprise schedulers.
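As an illustration of the parsing and mapping step, here is a minimal sketch that uses only Python's standard library. It is not DAGify's actual implementation. The DEFTABLE and FOLDER wrapper elements, the INCOND element, the sample jobs, and the helper names parse_jobs and dependency_edges are assumptions made for this example. It treats a job that waits on a condition (INCOND) as downstream of the job that emits that condition (OUTCOND).

```python
import xml.etree.ElementTree as ET

# Hypothetical Control-M export, written for this example only.
CONTROL_M_XML = """
<DEFTABLE>
  <FOLDER FOLDER_NAME="my_folder">
    <JOB JOBNAME="job_1" TASKTYPE="Command" CMDLINE="./extract.sh">
      <OUTCOND NAME="job_1_completed" ODATE="ODAT" SIGN="+" />
    </JOB>
    <JOB JOBNAME="job_2" TASKTYPE="Command" CMDLINE="./load.sh">
      <INCOND NAME="job_1_completed" ODATE="ODAT" />
    </JOB>
  </FOLDER>
</DEFTABLE>
"""


def parse_jobs(xml_text):
    """Return a dict of job name -> attributes plus in/out condition names."""
    root = ET.fromstring(xml_text)
    jobs = {}
    for job in root.iter("JOB"):
        jobs[job.get("JOBNAME")] = {
            "attrs": dict(job.attrib),
            "in": [c.get("NAME") for c in job.findall("INCOND")],
            "out": [c.get("NAME") for c in job.findall("OUTCOND")],
        }
    return jobs


def dependency_edges(jobs):
    """Pair each job that emits a condition with every job that waits on it."""
    producers = {}
    for name, job in jobs.items():
        for cond in job["out"]:
            producers.setdefault(cond, []).append(name)
    edges = []
    for name, job in jobs.items():
        for cond in job["in"]:
            for upstream in producers.get(cond, []):
                edges.append((upstream, name))
    return edges


jobs = parse_jobs(CONTROL_M_XML)
edges = dependency_edges(jobs)
print(edges)  # [('job_1', 'job_2')]
```

Each (upstream, downstream) pair could then be written out as an Airflow dependency such as job_1 >> job_2.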


Template-driven conversion

DAGify employs a flexible template system that empowers you to define the mapping between Control-M jobs and Airflow operators. These user-defined YAML templates specify how Control-M attributes translate into Airflow operator parameters. For instance, the control-m-command-to-airflow-ssh template maps Control-M's "Command" task type to Airflow's SSHOperator, outlining how attributes like JOBNAME and CMDLINE are incorporated into the generated DAG.

The template's structure field utilizes Jinja2 templating to dynamically construct the Airflow operator code, seamlessly integrating Control-M job attributes.

Example:

A Control-M task like:

<JOB 
  APPLICATION="my_application" 
  SUB_APPLICATION="my_sub_application" 
  JOBNAME="job_1" 
  DESCRIPTION="job_1_reports"  
  TASKTYPE="Command" 
  CMDLINE="./hello_world.sh" 
  PARENT_FOLDER="my_folder">
  <OUTCOND NAME="job_1_completed" ODATE="ODAT" SIGN="+" />
</JOB>

is converted to an Airflow operator using the control-m-command-to-airflow-ssh-gce template:

job_1 = SSHOperator(
    task_id="x_job_1",
    command="./hello_world.sh",
    dag=dag,
)
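The rendering step in this example can be sketched roughly as follows. This is an illustrative approximation, not DAGify's code: it uses Python's string.Template as a stand-in for Jinja2 so that it runs with only the standard library, and the names SSH_STRUCTURE and render_operator are hypothetical.

```python
from string import Template
import xml.etree.ElementTree as ET

# Stand-in for a template's "structure" field. DAGify itself uses Jinja2;
# string.Template is used here only to avoid a third-party dependency.
SSH_STRUCTURE = Template(
    '$var = SSHOperator(\n'
    '    task_id="x_$job_name",\n'
    '    command="$command",\n'
    '    dag=dag,\n'
    ')\n'
)

# The Control-M job from the example above.
JOB_XML = """<JOB
  APPLICATION="my_application"
  SUB_APPLICATION="my_sub_application"
  JOBNAME="job_1"
  DESCRIPTION="job_1_reports"
  TASKTYPE="Command"
  CMDLINE="./hello_world.sh"
  PARENT_FOLDER="my_folder">
  <OUTCOND NAME="job_1_completed" ODATE="ODAT" SIGN="+" />
</JOB>"""


def render_operator(job_xml):
    """Fill the operator template with attributes read from one <JOB> element."""
    job = ET.fromstring(job_xml)
    name = job.get("JOBNAME")
    return SSH_STRUCTURE.substitute(
        var=name,
        job_name=name,
        command=job.get("CMDLINE"),
    )


rendered = render_operator(JOB_XML)
print(rendered)
```

The sketch does no escaping, so a command line containing double quotes would produce invalid Python. A real converter has to handle that case.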

The repository includes several pre-defined templates for common Control-M task types. The config.yaml file at the project's root allows you to customize which templates are applied during the conversion process.


Leveraging Google Cloud Composer

For organizations seeking a fully managed Airflow experience, Google Cloud Composer provides a compelling solution. It eliminates the complexities of managing Airflow infrastructure, allowing you to focus on building and orchestrating your data pipelines. DAGify seamlessly integrates with Google Cloud Composer, making it even easier to migrate your Control-M workflows to a cloud-native environment.


Try it yourself

Eager to experience the power of DAGify? It's readily available as an open-source project on GitHub: https://github.com/GoogleCloudPlatform/dagify. The repository provides detailed instructions on setting up and running DAGify locally or within a Docker container.

Key steps to get started:
  1. Clone the repository: git clone https://github.com/GoogleCloudPlatform/dagify.git
  2. Install dependencies: make clean (This sets up a virtual environment and installs required packages)
  3. Run DAGify: python3 DAGify.py --source-path=[YOUR-SOURCE-XML-FILE]

Remember, DAGify is an ongoing project, and community contributions are welcome! If you encounter any issues or have feature requests, feel free to open an issue on GitHub.


Conclusion

DAGify represents a significant leap forward in simplifying enterprise scheduler migrations to Apache Airflow. By automating the conversion process and seamlessly integrating with Google Cloud Composer, it empowers organizations to embrace the benefits of Airflow more rapidly and efficiently. Whether you're a seasoned Airflow developer or just starting your migration journey, DAGify is a valuable tool to explore.

Remember:

  • Thorough testing is crucial: Always test your converted DAGs in a staging environment before deploying them to production.
  • Leverage Airflow's ecosystem: Explore the vast array of Airflow plugins and integrations to further enhance your workflows.
  • Stay engaged with the community: Keep an eye on DAGify's development and contribute to its growth if you can!

Happy migrating!

By Konrad Schieban and Tim Hiatt – Google Cloud


Acknowledgments

Thank you to the following team members who made this solution possible: Shreya Prabhu, Harish S, Slava Guzanov and Joanna Rajaseharan from Google Cloud.

Google Blocks is now Open Source

Tuesday, July 16, 2024

In 2017, we shared Google Blocks with the world as a simple, easy and fun way to create 3D objects and scenes, using the new wave of VR headsets of the day.

We were thrilled to see the surprising, inventive and beautiful assets you all put together with Google Blocks, and continue to be impressed by the enthusiasm of the community.



We now wish to share the code behind Google Blocks, allowing for novel and rich experiences to emerge from the creativity and passion of open source contributors such as the Icosa Foundation, who have already been doing wonderful work with Tilt Brush, which we open-sourced in 2021.


"We're thrilled to see Blocks join Tilt Brush in being released to the community, allowing another fantastic tool to grow and evolve. We can't wait to take the app to the next level as we have done with Open Brush." 
– Mike Nisbet, Icosa Foundation

What’s Included

The open source archive of the Blocks code can be found at: https://github.com/googlevr/blocks

Please note that Google Blocks is not an actively developed product, and no pull requests will be accepted. You can use, distribute, and modify the Blocks code in accordance with the Apache 2.0 License under which it is released.

The currently published version of Google Blocks will remain available in digital stores for users with supported VR headsets. If you're interested in creating your own Blocks experience, please review the build guide and visit our GitHub repo to access the source code.

Thank you all for coming on this journey with us so far. We can’t wait to see where you take Blocks from here.

By Ian MacGillivray – Software Engineer, on behalf of the Google Blocks team.

Bounds Checking Flexible Array Members

Tuesday, July 9, 2024

Buffer overflows are the cause of many security issues, and are a persistent thorn in programmers' sides. C is particularly susceptible to them. The advent of sanitizers mitigates some security issues by automatically inserting bounds checking, but they're not able to do so in all situations—in particular for flexible array members, because their size is known only at runtime.

The size of a flexible array member is typically opaque to the compiler. The alloc_size attribute on malloc() may be used for bounds checking flexible array members within the same function as the allocation. But the attribute's information isn't carried with the allocated object, making it impossible to perform bounds checking elsewhere.

To mitigate this drawback, Clang and GCC are introducing[1] the counted_by attribute for flexible array members.


Specifying a flexible array member's element count

The number of elements allocated for a flexible array member is frequently stored in another field within the same structure. The counted_by attribute, applied to the flexible array member, names that field explicitly. The attribute ties the flexible array member to its count field, which lets the array bounds sanitizer (enabled by -fsanitize=array-bounds) verify flexible array operations.

There are some rules to follow when using this feature. For this structure:

struct foo {
	/* ... */
	size_t count; /* Number of elements in array */
	int array[] __attribute__((counted_by(count)));
};

  • The count field must be within the same non-anonymous, enclosing struct as the flexible array member.
  • The count field must be set before any array access.
  • The array field must have at least count number of elements available at all times.
  • The count field may change, but must never be larger than the number of elements originally allocated.

An example allocation of the above structure:

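#include <stddef.h> /* offsetof, size_t */
#include <stdlib.h> /* calloc */

#define MAX(a, b) ((a) > (b) ? (a) : (b))

struct foo *foo_alloc(size_t count) {
  struct foo *ptr = NULL;
  /* Trailing padding can make sizeof(struct foo) the larger value. */
  size_t size = MAX(sizeof(struct foo),
                    offsetof(struct foo, array[0]) +
                        count * sizeof(ptr->array[0]));

  ptr = calloc(1, size);
  if (ptr == NULL)
    return NULL;
  ptr->count = count; /* Set the count before any array access. */
  return ptr;
}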

Uses for fortification

Fortification (enabled by the _FORTIFY_SOURCE macro) is an ongoing project to make the Linux kernel more secure. Its main focus is preventing buffer overflows on memory and string operations.

Fortification uses the __builtin_object_size() and __builtin_dynamic_object_size() builtins to try to determine if input passed into a function is valid (i.e. "safe"). A call to __builtin_dynamic_object_size() generally isn't able to take the size of a flexible array member into account. But with the counted_by attribute, we're able to calculate the size and improve safety.


Uses in the Linux kernel

The counted_by attribute is already in use in the Linux kernel, and will be instrumental in catching issues such as integer overflows that lead to heap buffer overflows. We want to expand its use to more flexible array members, and enforce its use in the future.


Conclusion

The counted_by attribute helps address a long-standing fortification roadblock: the compiler could not determine the memory bounds of a flexible array member. Removing that roadblock makes Linux, and other hardened applications, less exploitable.

[1] In Clang v18.0.0 and GCC v15.0.0.

By Bill Wendling, Staff Software Engineer
