Migration of Hadoop (On-Premises/HDInsight) to Azure Databricks

The problem statement for Hadoop:

Hadoop comes with ongoing support and maintenance challenges that include setting up servers, networking, and storage, installing software, and configuring best practices for the deployed technologies. An operations engineering team is required for ongoing upgrades, patches, and maintenance. Hadoop also has technical challenges with items like small-file performance.
Oozie, the packaged service for scheduling and workflow automation, was too complex and difficult to use, forcing customers to fall back on their own enterprise schedulers.

What Databricks brings to the picture:

  • A unified platform for data and AI: Databricks is a cloud platform for massive-scale data engineering and collaborative data science.
  • Shared Notebooks: Collaborate in different languages from Python to Scala to SQL, and share code via notebooks with revision history and GitHub integration.
  • Data Integration: Integrating with source systems is made easy with a wide range of connectors.
  • Reliable and Scalable Clusters: Databricks provides automated cluster management, spinning up clusters, determining their optimal size for the job, and scaling them down when the job is done.
  • Job Scheduling: Databricks allows for job scheduling via Notebooks.
  • No need to rewrite your code for production: your working Notebook can be put right into production, and you can chain Notebooks to create a workflow that enables a component architecture.
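
To make the notebook chaining idea concrete, here is a minimal sketch. On Databricks the usual way to call one notebook from another is dbutils.notebook.run(path, timeout_seconds, arguments). In this sketch the runner is passed in as a parameter so the chaining logic can be tried outside Databricks; the notebook paths, the previous_result argument name, and fake_runner are all hypothetical and only for illustration.

```python
from typing import Callable, Dict, List, Tuple

# (notebook_path, arguments) pairs; the paths used below are hypothetical.
Step = Tuple[str, Dict[str, str]]


def run_chain(run_notebook: Callable[[str, int, Dict[str, str]], str],
              steps: List[Step],
              timeout_seconds: int = 3600) -> List[str]:
    """Run notebooks in order, handing each result to the next step.

    `run_notebook` is expected to behave like dbutils.notebook.run: it takes
    (path, timeout_seconds, arguments) and returns the string the child
    notebook passed to dbutils.notebook.exit.
    """
    results: List[str] = []
    previous = ""
    for path, arguments in steps:
        # Build a new dict so the caller's arguments are not mutated.
        args = dict(arguments, previous_result=previous)
        previous = run_notebook(path, timeout_seconds, args)
        results.append(previous)
    return results


def fake_runner(path: str, timeout_seconds: int, arguments: Dict[str, str]) -> str:
    """Stand-in for dbutils.notebook.run so the sketch runs anywhere."""
    return f"{path} ok (after: {arguments['previous_result'] or 'start'})"


PIPELINE: List[Step] = [
    ("/Shared/etl/ingest", {"date": "2020-01-01"}),
    ("/Shared/etl/transform", {"date": "2020-01-01"}),
]

if __name__ == "__main__":
    for line in run_chain(fake_runner, PIPELINE):
        print(line)
```

Inside a Databricks notebook, the same function could be called as run_chain(dbutils.notebook.run, PIPELINE), assuming the notebooks exist at those paths.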

Delta Lake augments traditional Parquet files with a transaction log that defines which data and files are most recent, so that when a job queries a dataset, users are presented with accurate, consistent data. Other key features include ACID transactions, scalable metadata handling, time travel, schema enforcement, schema evolution, audit history, and full DML support offering UPDATE, DELETE, and MERGE INTO capabilities.
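
The role of the transaction log can be illustrated with a toy model. The sketch below is not Delta Lake's real implementation: it only models "add" and "remove" file actions stored as newline-delimited JSON, while a real Delta log also carries metadata, protocol, and checkpoint information. The file names and the live_files helper are made up for this example.

```python
import json
from typing import Iterable, Set


def live_files(commits: Iterable[str]) -> Set[str]:
    """Replay commits in order and return the set of current data files.

    Each commit is a string of newline-delimited JSON actions. Only the
    "add" and "remove" actions are modelled in this toy version.
    """
    files: Set[str] = set()
    for commit in commits:
        for line in commit.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


# Commit 0 writes two files; commit 1 rewrites one of them (e.g. an UPDATE).
COMMITS = [
    '{"add": {"path": "part-0000.parquet"}}\n'
    '{"add": {"path": "part-0001.parquet"}}',
    '{"remove": {"path": "part-0000.parquet"}}\n'
    '{"add": {"path": "part-0002.parquet"}}',
]

if __name__ == "__main__":
    print("latest version:", sorted(live_files(COMMITS)))
    # Replaying only the first commit gives the older snapshot ("time travel").
    print("version 0:", sorted(live_files(COMMITS[:1])))
```

Because readers only look at the files the log marks as live, a half-finished write never shows up in query results, and older versions stay reachable by replaying fewer commits.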

The migration plan (a platform migration can see cost savings of ~30%):

Current State Evaluation: Gather information about the current Hadoop environment including tools and technologies, data sources, use cases, resources, integrations, and service level agreements. Review HDInsight sizing.

Create a complete inventory:

  • The tools and technologies used in the current environment, native as well as third-party.
  • This helps us identify the tools that we could get rid of during migration because Databricks has an equivalent built in. Evaluate whether the remaining tools can be integrated seamlessly into a cloud-native environment.

Identify data sources and applications in the Hadoop environment:

  • Data sources could be external to the Hadoop environment (databases, ERP, CRM, streaming, etc.) or internal data sources used for integrations feeding data out of the platform.
  • Fully understand the applications running in the current Hadoop environment. This will better define the amount of effort in migrating the on-premises environment to the cloud.
  • Understanding tools and processes in existing staff’s daily workflow and making alternatives available will make sure that people feel comfortable using Databricks.
  • For external applications accessing the Hadoop environment, ensure that the cloud-native Databricks environment can serve data to these applications.
  • Access patterns for authentication and authorization will need to be evaluated. However, external applications connecting using JDBC or ODBC should work out of the box.
  • Security and governance configurations must be collected from tools like Sentry, Ranger, and HDFS ACLs. Use of Kerberos, Active Directory, encryption-at-rest, and encryption-in-transit must be taken into account.

Implementation of the migration:

  • Prioritize the data sources, apps, and tools selected for migration based on the evaluation.
  • Use two pipelines (medium and large) to assess cost and challenges.
  • Lift and shift with iterative optimizations for timely completion.
  • Storage Migration: Using the data source inventory, move data stored in HDFS to the cloud vendor’s storage layer (Blob Storage or S3).
  • Apply the same information architecture to the new cloud storage file system; the resulting folder and file structure should be a one-to-one match of the HDFS file system.
  • You may face challenges migrating role-based access controls to the new cloud storage system.
  • Sometimes a tool will need to be developed to migrate these policies for each cloud vendor.
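
The one-to-one folder mapping can be expressed as a small path translation. This is a sketch assuming the target is Azure Data Lake Storage Gen2, whose URIs have the form abfss://<container>@<account>.dfs.core.windows.net/<path>; the container name, account name, and namenode address below are placeholders, and a Blob Storage or S3 target would need a different prefix.

```python
from urllib.parse import urlparse


def hdfs_to_abfss(hdfs_uri: str, container: str, account: str) -> str:
    """Map an HDFS URI to an ADLS Gen2 URI, keeping the path unchanged."""
    parsed = urlparse(hdfs_uri)
    if parsed.scheme != "hdfs":
        raise ValueError(f"not an HDFS URI: {hdfs_uri}")
    # Only the scheme and authority change; the folder structure is preserved.
    return f"abfss://{container}@{account}.dfs.core.windows.net{parsed.path}"


if __name__ == "__main__":
    # "data" and "myaccount" are hypothetical names used for illustration.
    print(hdfs_to_abfss(
        "hdfs://namenode:8020/warehouse/sales/2020/part-0.parquet",
        container="data",
        account="myaccount",
    ))
```

Keeping the path untouched means downstream jobs only need a new prefix, which makes the later metastore and code migration steps simpler.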

Hive Metastore Migration:

  • The next step is to migrate the Hive Metastore from Hadoop to Databricks.
  • The Hive Metastore contains the location and structure of all the data assets in the Hadoop environment.
  • Migrating the Hive Metastore is required for users to query tables in Databricks notebooks using SQL statements.
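
One possible way to carry table definitions over is to export each table's DDL (for example with SHOW CREATE TABLE) and point its LOCATION clause at the new storage before replaying it in Databricks. The sketch below shows only the rewrite step, under that assumption; the table, the namenode address, and the storage account are hypothetical, and DDL that uses Hive-specific clauses may need further editing.

```python
import re
from typing import Callable

# Matches LOCATION '<path>' even when the path sits on the following line.
_LOCATION = re.compile(r"(LOCATION\s+)'([^']*)'", re.IGNORECASE)


def relocate_ddl(ddl: str, map_path: Callable[[str], str]) -> str:
    """Return the DDL with every LOCATION path passed through map_path."""
    return _LOCATION.sub(
        lambda m: f"{m.group(1)}'{map_path(m.group(2))}'", ddl
    )


# Hypothetical table definition exported from the Hadoop cluster.
HIVE_DDL = """CREATE EXTERNAL TABLE sales (id INT, amount DOUBLE)
STORED AS PARQUET
LOCATION 'hdfs://namenode:8020/warehouse/sales'"""


def to_cloud(path: str) -> str:
    """Swap the HDFS authority for a placeholder cloud storage prefix."""
    return path.replace(
        "hdfs://namenode:8020",
        "abfss://data@myaccount.dfs.core.windows.net",
    )


if __name__ == "__main__":
    print(relocate_ddl(HIVE_DDL, to_cloud))
```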

HiveQL/Impala Migration:

  • Look at how Hive workloads could be moved to Spark SQL. Customers often use Hive SQL or Impala files to execute pieces of a data pipeline or workflow.
  • After the migration of both the storage assets and the Hive Metastore, these types of workflow items can use Spark SQL within a notebook.
  • MapR: convert heavy OLTP application data pipelines into Spark Streaming applications on Databricks, landing data in Azure Cosmos DB or DynamoDB.
  • MapR Streams (MapR Event Store): these types of integrations will need to be migrated to a Spark Streaming application.
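
For the Hive SQL and Impala script files mentioned above, a simple starting point is to replay the statements one at a time through spark.sql, which runs a single statement per call. The sketch below assumes the statements are already valid Spark SQL and uses a deliberately naive splitter (it would break on semicolons inside string literals). RecordingSession is a stand-in so the example runs without Spark; in a notebook you would pass the real spark session instead. The script content is invented for illustration.

```python
from typing import Any, List


def split_statements(script: str) -> List[str]:
    """Drop '--' comment lines, then split the script on ';'."""
    lines = [ln for ln in script.splitlines()
             if not ln.strip().startswith("--")]
    return [s.strip() for s in "\n".join(lines).split(";") if s.strip()]


def run_hql(spark: Any, script: str) -> int:
    """Run each statement through spark.sql; return how many were run."""
    statements = split_statements(script)
    for statement in statements:
        spark.sql(statement)
    return len(statements)


class RecordingSession:
    """Minimal stand-in for a SparkSession that just records statements."""

    def __init__(self) -> None:
        self.seen: List[str] = []

    def sql(self, statement: str) -> None:
        self.seen.append(statement)


# Hypothetical script that used to be run with the Hive CLI.
SCRIPT = """-- nightly load
USE sales;
INSERT OVERWRITE TABLE daily SELECT * FROM raw WHERE dt = '2020-01-01';
"""

if __name__ == "__main__":
    session = RecordingSession()
    print(run_hql(session, SCRIPT), "statements")
    for stmt in session.seen:
        print(stmt)
```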

Apache Kudu:

Apache Kudu use cases will primarily be migrated to Delta Lake, as it offers comparable capabilities. Impala scripts and Spark applications utilizing Kudu will be migrated appropriately.

Apache HBase:

  • Use cases that are primarily OLTP driven will be converted to Spark Streaming applications utilizing Azure Cosmos DB or DynamoDB for storage.
  • Analytics applications will be migrated to Spark Streaming and Delta Lake.
  • Apache Solr: most use cases that utilize Solr will migrate to Amazon Elasticsearch Service or Azure Cognitive Search.

Apache Spark Application Migration:

  • Any Apache Spark version 1 application will need to be refactored to an Apache Spark version 2 application, as Databricks does not support Apache Spark version 1.
  • Lift and shift model: as long as the source data is available to Databricks and the application can be packaged as a JAR, these applications should run on Databricks.
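
To estimate the refactoring effort, it can help to scan the existing code for typical Spark 1 idioms. The sketch below is a rough helper, not a complete checker: it flags only three patterns (SQLContext and HiveContext, which Spark 2 supersedes with SparkSession, and registerTempTable, which Spark 2.0 deprecated in favor of createOrReplaceTempView). The sample source is made up.

```python
import re
from typing import Dict, List, Tuple

# Pattern -> suggested Spark 2 replacement. Intentionally not exhaustive.
SPARK1_PATTERNS: Dict[str, str] = {
    r"\bSQLContext\b": "use SparkSession",
    r"\bHiveContext\b": "use SparkSession.builder.enableHiveSupport()",
    r"\.registerTempTable\(": "use createOrReplaceTempView",
}


def find_spark1_usages(source: str) -> List[Tuple[int, str]]:
    """Return (line_number, hint) for each line that matches a pattern."""
    findings: List[Tuple[int, str]] = []
    for number, line in enumerate(source.splitlines(), start=1):
        for pattern, hint in SPARK1_PATTERNS.items():
            if re.search(pattern, line):
                findings.append((number, hint))
    return findings


# Hypothetical Spark 1 style snippet to scan.
SAMPLE = """from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df.registerTempTable("events")
"""

if __name__ == "__main__":
    for number, hint in find_spark1_usages(SAMPLE):
        print(f"line {number}: {hint}")
```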

Apache Spark to Delta Lake Migration:

  • Applications that are doing a small number of updates or deletes will be prime candidates for this refactoring.
  • Other considerations will be around performance, ACID transactions, schema enforcement, and data consistency.
  • Converting flat files to the Delta file type removes some of the small-file challenges and gives better compression, better performance, and SQL access.

Look at VM types to see if further optimizations to performance and cost can be made:

  • Where possible, move more workloads from Interactive to Automated clusters (lower cost).
  • Review pipeline efficiency and code; this is an ongoing optimization likely to continue after migration.

Validation Phase:

  • The final phase will be to validate the outcome of the migration from Hadoop to Databricks. This step should be performed using traditional A -> B testing.
  • The customer will need to be running the existing Hadoop implementation alongside the cloud-native Databricks offering.
  • Scripts and processes should be developed to ensure the delivered results in Hadoop match with those in Databricks.
  • As applications and use cases are cleared and all checks have been performed, they can be shut down in Hadoop.
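
The comparison scripts can start out very simple. The sketch below compares two result sets without caring about row order by hashing each row and counting duplicates; it assumes both outputs have been exported to plain Python rows (for example from CSV extracts) with matching column order and value types. The function names and sample rows are illustrative only.

```python
import hashlib
from collections import Counter
from typing import Dict, Iterable, Sequence


def fingerprint(rows: Iterable[Sequence[object]]) -> Counter:
    """Build an order-insensitive multiset of row hashes."""
    counts: Counter = Counter()
    for row in rows:
        # repr keeps types distinct, so 1 and "1" hash differently.
        encoded = "\x1f".join(repr(value) for value in row).encode("utf-8")
        counts[hashlib.sha256(encoded).hexdigest()] += 1
    return counts


def compare_outputs(hadoop_rows: Iterable[Sequence[object]],
                    databricks_rows: Iterable[Sequence[object]]) -> Dict[str, int]:
    """Summarize row counts and rows present on only one side."""
    old, new = fingerprint(hadoop_rows), fingerprint(databricks_rows)
    return {
        "hadoop_rows": sum(old.values()),
        "databricks_rows": sum(new.values()),
        # Counter subtraction keeps only positive differences.
        "only_in_hadoop": sum((old - new).values()),
        "only_in_databricks": sum((new - old).values()),
    }


if __name__ == "__main__":
    hadoop = [(1, "a"), (2, "b"), (2, "b")]
    databricks = [(2, "b"), (1, "a"), (3, "c")]
    print(compare_outputs(hadoop, databricks))
```

A pipeline can be considered cleared when both "only_in" counts are zero and the row counts match; for very large tables the same idea would typically be pushed down into SQL aggregates rather than run row by row.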
