Collate Blog

OpenMetadata Release 1.0 — The Journey So Far

Just over 1.5 years ago, we announced OpenMetadata, an open-source project to solve metadata problems once and for all so the data ecosystem can move on to innovating around metadata rather than building metadata platforms over and over again. To do this, we built a metadata platform from the ground up in open source, employing years of experience and the best practices of taking schema-first and API-first approaches.

With the OpenMetadata 1.0 release, we are happy to announce that we have accomplished much of our audacious vision: a central metadata platform based on metadata standards, with Discovery, Collaboration, Governance, Data Quality, and Data Insights as applications on top of metadata.

Let’s take a look back at our journey so far to get to this significant milestone of Release 1.0, where we are, and where we plan to go from here.

Community in Numbers

In open source, we feel strongly that to really do something well, you have to get a lot of people involved. — Linus Torvalds

Thanks to our community, OpenMetadata has become one of the fastest-growing projects in the data space. First and foremost, we want to thank more than 160 contributors for making significant code contributions to the project.

  • Over 2800 users in our Slack community
  • An average of 500 Slack threads monthly
  • An awesome median response time of 9 minutes for open-source support
  • 100s of deployments in small, medium, and large organizations and growing
  • In use across several industry verticals

The constant feedback on bugs, suggestions for improvements, and requests for features from our users have made the incredible progress of OpenMetadata possible. Come join our vibrant community of users and developers in this journey to redefine data.

Release 1.0 Milestone in Numbers

Frequent releases keep the momentum going, turning ideas into reality with each iteration. — Anonymous

In a short period of 1.5 years, we have made more than 40 releases. That is, a major feature release every 4–6 weeks and a minor bug fix release every two weeks! And every feature release is packed with a number of significant features.

The development velocity to get to Release 1.0 has been astounding. It took more than 6200 Pull Requests at over 300 per month and over 10 per day! Our pace of development is 2x to 10x compared to other communities. This translates into 100s of features added and at Release 1.0, we are happy to say, OpenMetadata is the most comprehensive and feature-rich platform.

Our Accomplishments So Far

We have achieved most of the work we laid out in our vision.

  • Single source of Truth for all the metadata in an organization
  • Centralized metadata repository stored as metadata graph
  • Metadata based on Standards instead of proprietary formats
  • Metadata APIs to foster innovation
  • Metadata applications instead of stand-alone tools that fragment metadata
  • Seamless collaboration to bridge the disconnect between people

Here are some highlights:

Metadata Schema Specifications

Metadata is the most important data in an organization. It should not be trapped in proprietary formats and poorly modeled key-value pairs without clear schema and documentation.

That is why we take the schema-first approach to model metadata using JSON schemas. JSON schemas help us define controlled vocabulary for metadata, structured metadata instead of key-value blobs, constraints in the schema at design time for validation, and most importantly, human and machine consumable documentation eliminating guesswork to understand metadata.
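To make the schema-first idea concrete, here is a minimal sketch in Python. The schema below is a hypothetical, heavily simplified "table" entity, not the actual OpenMetadata schema, and the `validate` helper is an illustrative stand-in that handles only a small subset of JSON Schema keywords (`type`, `required`, `properties`, `enum`). It shows how a controlled vocabulary and design-time constraints turn guesswork into checkable rules.

```python
from typing import Any

# Hypothetical, simplified schema for a "table" entity.
# It is NOT the real OpenMetadata schema; names are illustrative only.
TABLE_SCHEMA: dict[str, Any] = {
    "type": "object",
    "required": ["name", "columns"],
    "properties": {
        "name": {"type": "string"},
        "description": {"type": "string"},
        # Controlled vocabulary: only these values are accepted.
        "tier": {"type": "string", "enum": ["Tier1", "Tier2", "Tier3"]},
        "columns": {"type": "array"},
    },
}

# Maps JSON Schema type names to Python types (subset only).
_PY_TYPES = {"object": dict, "array": list, "string": str}


def validate(instance: Any, schema: dict[str, Any]) -> list[str]:
    """Return a list of human-readable violations (empty if valid).

    Supports only `type`, `required`, `properties`, and `enum`.
    """
    errors: list[str] = []
    expected = schema.get("type")
    if expected and not isinstance(instance, _PY_TYPES[expected]):
        return [f"expected {expected}, got {type(instance).__name__}"]
    if "enum" in schema and instance not in schema["enum"]:
        errors.append(f"{instance!r} is not one of {schema['enum']}")
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                errors.append(f"missing required property '{key}'")
        for key, subschema in schema.get("properties", {}).items():
            if key in instance:
                errors.extend(f"{key}: {e}" for e in validate(instance[key], subschema))
    return errors
```

Calling `validate({"name": "orders"}, TABLE_SCHEMA)` reports the missing `columns` property, whereas a free-form key-value blob would have accepted the same input silently. A production setup would use a full JSON Schema validator rather than this hand-rolled subset.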

The OpenMetadata specification has 100+ types as a controlled vocabulary to model metadata. These are used to define 40+ types of entities that model data assets (Tables, Dashboards, Pipelines, ML Models, Lineage, Glossaries, and Data Quality tests, to name a few) and 100s of types of relationships between them to build a metadata graph. We are working with other communities to drive the adoption of these standards and eliminate poorly modeled proprietary metadata.

Every organization must have the right to export its metadata and migrate it to a tool they choose without the hassles of arduous custom integration work.

Metadata API Specifications

Metadata should be shareable across tools. Without this, every tool today has to build its own metadata, leading to duplication, fragmentation, and inconsistency.

We have comprehensive APIs that are built on top of OpenMetadata specifications. Every entity has strongly typed and comprehensive OpenAPI-based APIs.

We believe APIs will unlock the innovation around OpenMetadata for observability, data management, and other automation to reduce a lot of manual mundane tasks in data.

Metadata Repository

The metadata repository stores and indexes the metadata as a centralized metadata graph using OpenMetadata schema specification. It provides rich APIs for creating, modifying, and deleting metadata. The metadata of an entity is versioned, maintaining the history of all the changes to understand how your data has evolved over time.
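The idea of versioned entity metadata can be sketched in a few lines of Python. This is a hypothetical in-memory model, not how the repository is implemented: the class name, the integer version numbers, and the shape of the change description are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class VersionedEntity:
    """Keeps every past state of an entity's metadata (hypothetical model)."""

    current: dict[str, Any]
    history: list[dict[str, Any]] = field(default_factory=list)

    @property
    def version(self) -> int:
        # Version 1 is the initial state; each effective update adds one.
        return len(self.history) + 1

    def update(self, changes: dict[str, Any]) -> dict[str, list[str]]:
        """Apply changes, archive the previous state, and describe the change."""
        added = [k for k in changes if k not in self.current]
        updated = [
            k for k in changes
            if k in self.current and self.current[k] != changes[k]
        ]
        if not added and not updated:
            # Nothing actually changed, so no new version is created.
            return {"added": [], "updated": []}
        self.history.append(dict(self.current))
        self.current = {**self.current, **changes}
        return {"added": added, "updated": updated}
```

Each effective update archives the prior state, so the full evolution of an entity can be replayed from `history`, while no-op updates leave the version number untouched.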

OpenMetadata, being a central metadata repository, serves as a single event hub for all the changes to your data across different systems and tools. The Events API provides these change events over webhooks so that other systems can track the changes and trigger workflows.
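As a rough sketch of how a downstream system might react to change events delivered over a webhook, the Python below routes a parsed event body to a handler. The payload field names and values (`entityType`, `eventType`, `entityName`, `entityUpdated`) are assumptions made for this example; the Events API schema is the authority on the actual payload.

```python
import json
from typing import Callable, Optional

# Handlers keyed by (entity type, event type). The field names and values
# used here are assumptions for this sketch, not a documented contract.
Handler = Callable[[dict], str]
HANDLERS: dict[tuple[str, str], Handler] = {}


def on(entity_type: str, event_type: str) -> Callable[[Handler], Handler]:
    """Register a handler for one kind of change event."""
    def register(func: Handler) -> Handler:
        HANDLERS[(entity_type, event_type)] = func
        return func
    return register


@on("table", "entityUpdated")
def table_updated(event: dict) -> str:
    # Example workflow trigger: re-validate a table after its metadata changes.
    return f"re-run quality checks for {event['entityName']}"


def dispatch(raw_body: str) -> Optional[str]:
    """Parse a webhook body and route it; ignore events nobody subscribed to."""
    event = json.loads(raw_body)
    handler = HANDLERS.get((event.get("entityType"), event.get("eventType")))
    return handler(event) if handler else None
```

In a real deployment, `dispatch` would be called from the HTTP endpoint that receives the webhook POST. Keeping it a pure function makes the routing easy to test without a server.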

The metadata repository has a Pluggable Connector model and an Ingestion Framework to ingest metadata from diverse sources. We have over 60 connectors to connect to various metadata sources to collect and organize metadata.

Metadata Applications

As described in our vision, many current stand-alone tools can be built as simple applications on top of metadata.

We have built five core applications:

  1. Discovery
  2. Collaboration
  3. Governance
  4. Data Quality
  5. Data Insights

There is no need to deploy separate stand-alone tools for this core metadata functionality. Multiple tools create silos of fragmented metadata, obfuscating the true picture of data and leading to user frustration.

With the all-in-one OpenMetadata, simplify your data architecture, streamline the user experience, reduce the cost of buying multiple tools, and lessen the operational burden of running them.

Architectural Simplicity

Keeping the architecture simple was a key goal that we started with. We use open-source frameworks that are widely used and are in active development, such as Jetty, JSON schema, SQLAlchemy, etc., instead of other solutions that are dependent on proprietary technologies. We have kept our dependencies small — MySQL/Postgres for storing metadata, Elasticsearch/OpenSearch for indexing, and a workflow system for running automation jobs.

You can run multiple instances of OpenMetadata servers for high availability and horizontal scalability. Running OpenMetadata is ridiculously simple in the cloud. Bringing up an OpenMetadata instance for a POC can be done in under 5 minutes. You need a database and a search service provided by the cloud and two VMs for running OpenMetadata. Many other metadata solutions need a larger number of VMs, Kafka, Cassandra, a graph database, etc., making them hard to install and manage, with complex failure modes.

The cost of operating other solutions is also much higher due to too many dependencies and the architectural complexity. OpenMetadata is as simple as it gets without compromising on scalability and features.

Core Features

The features in OpenMetadata relate to core applications like Discovery, Collaboration, Governance, Data Quality, Lineage, and Data Insights.

Data Discovery

OpenMetadata has rich discovery functionality to discover all your data assets in a single place with:

  • Keyword Search
  • Faceted Search — filter by owner, tags, columns, and other metadata attributes
  • Search ordering by relevance, usage, freshness
  • Data asset preview in search
  • Advanced Search using syntax editor with and/or conditions

Data Documentation

Documenting data is key to understanding the data and getting the data outcomes correct.

  • Rich markdown-based documentation
  • Clear ownership
  • Label the data from Classifications to describe the type of data
  • Label the data from Glossaries to add semantics to the data
  • Tiers for describing the importance of data
  • Automated documentation of data volume, usage, queries, and related assets

Data Collaboration

OpenMetadata is a catalyst for collaboration: it brings data teams together to start a conversation, break information silos, and share organizational knowledge.

  • Activity feeds to see all data evolution in a single place
  • Conversation threads around data assets to foster collaboration & discussions
  • Request/suggest descriptions and tags for Crowd Sourcing information
  • Collaborative Tasks and resolution workflows
  • Approval workflows

Metadata Events

Every change to the metadata of a data asset, whether it comes from the source or is user-generated, is captured as a version. Each version records which data asset changed, who changed it, when it changed, and how.

  • Events API for getting change events
  • Notifications via Webhook, Slack, Microsoft Teams, GChat, and emails
  • Rich event filtering by type of entity, type of change, and content of change
  • Get alerts on critical events
  • View all the data evolution in one place
  • Easy debugging by identifying the backward incompatible change events

Data Governance

OpenMetadata has many features that help you govern the data:

  • Classifications to define types of data (sensitive, non-sensitive, PII, etc.)
  • Glossaries to define the semantics of data
  • RBAC + ABAC based access control
  • Auto classification based on NLP
  • Review and approval workflows
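To illustrate how role-based and attribute-based checks can be combined, here is a minimal Python sketch. The role names, the operations, and the two attribute rules (owner access and a PII restriction) are hypothetical and chosen only for the example; they are not OpenMetadata's policy model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class User:
    name: str
    roles: frozenset[str] = field(default_factory=frozenset)


@dataclass(frozen=True)
class Asset:
    name: str
    owner: str
    tags: frozenset[str] = field(default_factory=frozenset)


# Role-based part: which operations each (hypothetical) role grants.
ROLE_PERMISSIONS = {
    "DataSteward": {"view", "edit_tags", "edit_description"},
    "DataConsumer": {"view"},
}


def is_allowed(user: User, operation: str, asset: Asset) -> bool:
    """Combine a role check (RBAC) with attribute rules (ABAC)."""
    # ABAC deny rule: PII-tagged assets are accessible only to their owner.
    if "PII" in asset.tags and user.name != asset.owner:
        return False
    # ABAC allow rule: owners may perform any operation on their own assets.
    if user.name == asset.owner:
        return True
    # RBAC: otherwise fall back to the permissions granted by the user's roles.
    return any(operation in ROLE_PERMISSIONS.get(r, set()) for r in user.roles)
```

The deny rule is evaluated first so that a broad role grant can never override an attribute-based restriction. That ordering is a design choice of this sketch.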

Data Quality and Profiler

We are democratizing data quality to deliver trust in your data. You can easily create tests, run and schedule them, and send notifications to Slack, MS Teams, or Email with No-Code. The tests in OpenMetadata can also be used in your ETL jobs or dbt to validate the data during transformation.

  • Data Profiler captures table, column, and usage statistics
  • Create Table level tests and Column level tests & deploy them right within the UI
  • Test suites to group tests
  • Visualize the test results right from OpenMetadata
  • Get alerted on test failures
  • Data quality dashboards

You can also integrate the test results from other tools like Great Expectations or dbt.
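The distinction between table-level tests, column-level tests, and test suites can be sketched as follows. This is plain Python over in-memory rows and uses made-up function names; it is meant to show the shape of the idea, not the test definitions that ship with OpenMetadata.

```python
from typing import Any, Callable

Row = dict[str, Any]
# A check takes the rows of a table and returns (passed, message).
Check = Callable[[list[Row]], tuple[bool, str]]


def row_count_between(low: int, high: int) -> Check:
    """Table-level check: the number of rows falls within [low, high]."""
    def run(rows: list[Row]) -> tuple[bool, str]:
        n = len(rows)
        return low <= n <= high, f"row count {n}, expected {low}..{high}"
    return run


def column_not_null(column: str) -> Check:
    """Column-level check: no row has a missing or None value in `column`."""
    def run(rows: list[Row]) -> tuple[bool, str]:
        nulls = sum(1 for r in rows if r.get(column) is None)
        return nulls == 0, f"{nulls} null value(s) in '{column}'"
    return run


def run_suite(rows: list[Row], suite: dict[str, Check]) -> dict[str, bool]:
    """Run a named group of checks and report pass/fail per check."""
    return {name: check(rows)[0] for name, check in suite.items()}
```

A suite is just a named collection of checks, so the same checks can run on a schedule or inside a transformation job before the data is published.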

Data Lineage

OpenMetadata supports end-to-end lineage for data assets to help you understand the data flow and do impact analysis.

  • End-to-end lineage traceability
  • Lineage from pipelines
  • Lineage from dbt
  • Lineage through SQL query analysis including Column-level lineage
  • Manual lineage with easy-to-use drag-and-drop UI to capture user knowledge
  • Lineage API: users can access lineage data or add new lineage through the APIs
  • Significant contributions to other open-source projects, SQLLineage and SQLFluff, to improve SQL dialect parsing
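Impact analysis over lineage boils down to a graph traversal. The sketch below assumes lineage is available as a simple adjacency map from each asset to the assets derived from it; the asset names are invented, and the way lineage is actually fetched (for example, through the lineage API) is out of scope here.

```python
from collections import deque

# Directed lineage edges: upstream asset -> assets derived from it.
# Asset names are made up for illustration.
LINEAGE: dict[str, list[str]] = {
    "raw.orders": ["staging.orders"],
    "raw.customers": ["staging.customers"],
    "staging.orders": ["marts.revenue"],
    "staging.customers": ["marts.revenue"],
    "marts.revenue": ["dashboard.sales"],
}


def downstream(asset: str, edges: dict[str, list[str]]) -> list[str]:
    """Return every asset reachable from `asset`, in breadth-first order."""
    seen = {asset}
    queue = deque([asset])
    impacted: list[str] = []
    while queue:
        for child in edges.get(queue.popleft(), []):
            if child not in seen:  # guards against cycles and duplicates
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted
```

Here `downstream("raw.orders", LINEAGE)` lists the staging table, the revenue mart, and the sales dashboard, which are the assets to review before changing the raw table.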

Data Insights

OpenMetadata was never intended to be yet another passive data cataloging tool. With Data Insights you can track how your organization is doing with data, what improvements are needed, and set goals for data teams to get to the next level. Using these KPIs and setting goals for improving the data is key to establishing a strong data culture.

  • Total data assets and the growth over time
  • Ownership coverage and improvements over time
  • Documentation coverage and improvements over time
  • Tiering improvements over time
  • Define KPI goals and track organizational metrics
  • Weekly data reports to inform and engage teams to continuously improve data
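Coverage metrics such as ownership and documentation coverage are simple ratios. The following Python sketch computes them from a list of asset summaries and compares the result against a KPI target; the record shape and function names are assumptions for illustration only.

```python
from typing import Optional, TypedDict


class AssetSummary(TypedDict):
    name: str
    owner: Optional[str]
    description: Optional[str]


def coverage(assets: list[AssetSummary], attribute: str) -> float:
    """Percentage of assets whose `attribute` is set to a non-empty value."""
    if not assets:
        return 0.0
    covered = sum(1 for a in assets if a.get(attribute))
    return round(100.0 * covered / len(assets), 1)


def kpi_met(current: float, target: float) -> bool:
    """A KPI is met once the current coverage reaches the target."""
    return current >= target
```

Tracking these percentages over time, rather than as a single snapshot, is what turns them into the improvement trends described above.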

Security

As a metadata platform, we know how important it is to keep the data secure. That’s why we invested in security from Day 1.

  • SSO integration: Auth0, Azure, Google, Okta, OneLogin, Amazon Cognito SSO, Keycloak, LDAP, Custom OIDC, Basic Authentication, and many more.
  • Secrets Manager to store all your credentials in a Secret Store so OpenMetadata does not need to store any sensitive information.
  • All API accesses are authenticated, and access control is enforced.
  • Wire Encryption support through HTTPS/TLS 1.2.
  • All releases go through Snyk and GitHub Dependabot scans to ensure zero known vulnerabilities.

Some firsts for the project

Here are some firsts for the project that we are proud of:

  • The first project to start with Metadata Specification with Schema-First and API-First approach.
  • Centered around people collaboration, with Activity Feeds and Conversation Threads in the context of data, all in a single place.
  • One-click and No-Code ingestion from UI to simplify the deployment of the connectors to extract metadata.
  • Native Data Quality and Profiler built as part of OpenMetadata with no need for any external tooling.
  • Metadata versioning and visualization from the UI to show all the changes for a data asset, who made the change, and when.
  • First to support Manual Lineage Editor.
  • Webhook-based Events APIs and Websocket support for real-time notifications
  • Support for hierarchical teams to capture organizational structure from small to very large and the ability to govern by attaching policies at any level.
  • Localization: support for multiple languages (Chinese, English, French, Japanese, Portuguese, and Spanish).

What is Next?

We have a strong foundation for what follows next after Release 1.0. These are some areas that we are excited about.

  1. Automation: The goal of OpenMetadata is to automate much of the time-consuming manual work done today. Understanding the cost of data and queries, deleting unused data, identifying similar data assets, and anomaly detection are some examples of automation. Users can also build such automation workflows using APIs as per their organization’s specific needs.
  2. Data Insights: OpenMetadata provides insights about how you are doing with your data. Currently, you can see how your data is growing over time, how much of it is owned, and what the description coverage is, along with KPIs. We plan on adding more Data Insights metrics, charts, and KPIs to get an even deeper understanding of an organization’s data. Some examples are unused data assets, cost analysis, data quality coverage, and audit reporting. Data Insights will also evolve to support many of the Data Observability use cases.
  3. User Experience: UX is something the community deeply cares about. We have made many improvements in this area since the beginning of the project. We plan to add features to help organizations customize the UI per their needs. We will also add support for user personas, such as Data Engineers, Data Scientists, and Data Citizens, where the UI can be customized based on the needs of that persona.
  4. Ubiquitous Metadata: High-quality metadata stored in OpenMetadata should be available to users in the tools that they use. We have already created a Chrome Extension to facilitate this. Going forward, we will propagate the descriptions, tags, and other metadata back to the data sources (termed Reverse Metadata!). This should help provide a consistent and up-to-date view of data across the tools.
  5. Datamesh: We are in the process of supporting Domains and Data Products. Incorporating Datamesh principles into OpenMetadata will be an iterative process, as Datamesh concepts themselves are evolving both in theory and practice.

Our roadmap is typically planned for two releases to help us continuously incorporate community needs and feedback. You can find it here.

We are grateful to the OpenMetadata community for being a part of this eventful journey so far. Your participation, feature recommendations, code contributions, and awesome feedback have made a significant difference to the progress of OpenMetadata.

Please reach out to us on Slack if you have any questions about code, installation, and docs. For feature requests, please file a GitHub issue or reach out to us on Slack. Interested in contributing code? Here are some good starting issues to get you going.


OpenMetadata Release 1.0 — The Journey So Far was originally published in OpenMetadata on Medium, where people are continuing the conversation by highlighting and responding to this story.