Visual Data Shaping Workspace
Running Natively on Hadoop
BigDataMonkey C4 is an intuitive, browser-based visual workspace for shaping data to meet new analytics requirements. Data is ingested into Hadoop directly from the source in raw form and profiled to analyze its content, structure and semantics. Data relationships and inconsistencies are detected, enabling the platform to automatically clean, integrate and restructure information to match the desired format. Shapes are then shared, published and distributed to deliver useful information to your business intelligence, analytics and modeling tools or your data warehouse. BigDataMonkey C4 runs directly on Hadoop, keeping HCatalog up to date and giving other tools and applications easy access to clean, shaped data.
Navigate And Automate Your Data
- Define, interrogate and connect diverse data sources
- Ingest data in raw form from databases, files or cloud services
- Profile data to assess content, structure and semantics
- Identify data relationships based on schema, semantics, similarity and usage
- Cleanse data to conform to target specifications
- Visually graph data relationships, shapes, processes, filters and expressions
- Define data shapes with intuitive drag-and-drop tools
- Generate, visualize and sample target shapes
- Operationalize processes and queries for scheduling and production use
- Improve shaping performance through machine learning
- Operate natively on Hadoop using HDFS, MapReduce, Pig, Hive, Sqoop and Flume
- Track and maintain metadata with HCatalog
- Collaborate easily over a browser
- Run in the cloud or on premises
- Handle inconsistent, poly-structured data
- Track execution history & data lineage
- and much more …
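The profiling step in the list above can be pictured with a toy sketch. This plain-Python example (illustrative only, not BigDataMonkey's actual profiling engine) infers each column's dominant type and null rate from raw string values:

```python
from collections import Counter

def infer_type(value):
    """Classify a raw string value as int, float, null or text."""
    if value == "":
        return "null"
    for cast, name in ((int, "int"), (float, "float")):
        try:
            cast(value)
            return name
        except ValueError:
            pass
    return "text"

def profile(rows):
    """Summarize per-column type distribution and null rate for raw records."""
    columns = {}
    for row in rows:
        for col, val in row.items():
            columns.setdefault(col, Counter())[infer_type(val)] += 1
    report = {}
    for col, counts in columns.items():
        total = sum(counts.values())
        nulls = counts.pop("null", 0)
        # The dominant non-null type is taken as the inferred column type.
        inferred = counts.most_common(1)[0][0] if counts else "null"
        report[col] = {"type": inferred, "null_rate": nulls / total}
    return report

raw = [
    {"id": "1", "amount": "19.99", "city": "Austin"},
    {"id": "2", "amount": "", "city": "Boston"},
    {"id": "3", "amount": "7.50", "city": "Chicago"},
]
print(profile(raw))
```

A real profiler would also assess semantics (dates, currencies, identifiers) and run in parallel across the cluster; this sketch only shows the shape of the idea.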
Analytics Requires Rectangular Data
The desired format for practically all analytics uses is clean data organized into rows and columns, like a spreadsheet. This is called rectangular data, and in big data use cases it can get extremely wide, long and sparse. The challenge is to take inconsistent data in various structures from disparate sources and turn it into consistent, combined rectangular data. We call this process data shaping, since it shapes data into useful rectangles that match the required dimensions and structure.
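One way to picture shaping into rectangles: heterogeneous records from different sources are forced into a single fixed column set, dropping unknown fields and filling gaps. A minimal sketch (plain Python, not the product's algorithm):

```python
def to_rectangle(records, columns, default=""):
    """Force heterogeneous records into fixed-width rows (one per record),
    dropping unknown fields and filling missing ones with a default."""
    return [[rec.get(col, default) for col in columns] for rec in records]

sources = [
    {"name": "Ada", "age": 36},                   # source A
    {"name": "Grace", "city": "NYC", "age": 45},  # source B: extra field
    {"name": "Alan"},                             # source C: missing field
]
table = to_rectangle(sources, columns=["name", "age"])
print(table)  # [['Ada', 36], ['Grace', 45], ['Alan', '']]
```

Real shaping also has to reconcile types, units and keys across sources, but the target is always the same: one consistent rectangle.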
People Shape Data Today
Analysts and other data specialists shape data manually today, profiling, filtering, querying, visualizing and inspecting it with various analysis and quality tools. It is an iterative, tedious and time-consuming process. While people are best suited to find insight and meaning in data, machines are much better suited to detect structures, inconsistencies, semantics and joinable keys. BigDataMonkey C4 presents analysts with the patterns and relationships revealed in the data and recommends the best approach to generate the requested shape.
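To make "machines detect joinable keys" concrete: one simple heuristic is value overlap between columns of two tables. This is an illustrative sketch, not BigDataMonkey's patent-pending method:

```python
def overlap(a, b):
    """Jaccard overlap between two columns' distinct values."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def candidate_keys(left, right, threshold=0.5):
    """Rank column pairs across two tables by value overlap;
    high-overlap pairs are likely join keys."""
    pairs = []
    for lcol, lvals in left.items():
        for rcol, rvals in right.items():
            score = overlap(lvals, rvals)
            if score >= threshold:
                pairs.append((lcol, rcol, score))
    return sorted(pairs, key=lambda p: -p[2])

orders = {"cust_id": [101, 102, 103, 104], "total": [5, 9, 7, 3]}
customers = {"id": [101, 102, 103, 105], "name": ["a", "b", "c", "d"]}
print(candidate_keys(orders, customers))  # [('cust_id', 'id', 0.6)]
```

A production system would combine overlap with schema names, data types and usage patterns, as the feature list above suggests, but the principle is the same: the machine proposes, the analyst decides.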
Big Data Requires Automation
The current manual shaping process is untenable in a data-driven world. With the volume, variety and velocity of big data, and ever more use cases and applications, there are simply not enough people to shape all this data by hand. With advances in parallel computing and the falling cost of memory and storage, it has become economically and technically feasible to automate the iterative profiling, querying, filtering and matching needed to shape data.
Automation You Can See
Based on statistics, semantics, similarity, usage and other learned data characteristics, analysts can visually navigate through data and transformations to define the shape they want. Patent-pending shaping algorithms generate the processes and queries required to create the requested shape. With repeated profiling and shaping, the automated processes are validated and reinforced using machine learning to improve future shaping. Analytics demands transparency, so BigDataMonkey uses innovative visualizations to trace data and processes from source to target.
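The validate-and-reinforce loop can be pictured as keeping acceptance counts per shaping strategy and preferring the one with the best track record. This is a deliberately simplified sketch under that assumption; the product's actual learning method is not described here:

```python
class StrategySelector:
    """Pick the shaping strategy with the best acceptance rate so far."""
    def __init__(self, strategies):
        self.stats = {s: {"tried": 0, "accepted": 0} for s in strategies}

    def recommend(self):
        # Laplace-smoothed acceptance rate, so untried strategies still get a shot.
        def rate(s):
            st = self.stats[s]
            return (st["accepted"] + 1) / (st["tried"] + 2)
        return max(self.stats, key=rate)

    def feedback(self, strategy, accepted):
        """Record whether the analyst accepted the generated shape."""
        self.stats[strategy]["tried"] += 1
        self.stats[strategy]["accepted"] += int(accepted)

sel = StrategySelector(["pivot", "unnest", "split_column"])
for strategy, accepted in [("pivot", False), ("unnest", True), ("unnest", True)]:
    sel.feedback(strategy, accepted)
print(sel.recommend())  # 'unnest' now has the best acceptance rate
```

Each accepted or rejected shape becomes a training signal, which is the essence of "validated and reinforced" above.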
Hadoop Makes It Possible
BigDataMonkey C4 runs natively on Hadoop, an open-source data platform used by Yahoo, Facebook, Twitter and other big data leaders. Hadoop runs in parallel on commodity hardware, so it is highly scalable and low cost. Unlike a database, which requires data to be in the right format up front, Hadoop can store and process data in any format, even unstructured data. Before Hadoop, data had to be prepared before it could be stored and queried; now, with Hadoop, BigDataMonkey C4 can easily ingest, profile, cleanse and shape any amount of data in any format.
Technology Stacked on Hadoop
BigDataMonkey C4 provides a browser-based HTML5 workspace that executes all processing directly on the Hadoop cluster. HDFS commands, MapReduce jobs, Pig scripts, Hive queries, Sqoop and Flume jobs, and other scripts and configuration files are automatically generated and deployed on the cluster. Processes are initiated through generated Oozie workflows and tracked through ongoing process monitoring. Metadata is stored and managed directly in HCatalog. Data is loaded directly to and from the Hadoop cluster, enabling parallel loading across multiple nodes. When it comes to Hadoop, BigDataMonkey supports a wide range of capabilities and libraries.
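To make "automatically generated and deployed" concrete, here is a toy generator that builds a Sqoop import command from a table spec. The flags are standard Sqoop options, but the generator itself is illustrative, not the product's code:

```python
def sqoop_import_cmd(jdbc_url, table, target_dir, mappers=4):
    """Build a Sqoop import command that lands a source table in HDFS
    as parallel file splits (one split per mapper)."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(mappers),
    ]

cmd = sqoop_import_cmd("jdbc:mysql://db/sales", "orders", "/raw/orders")
print(" ".join(cmd))
```

The platform generates many such artifacts (Pig scripts, Hive queries, Oozie workflow definitions) from the visually defined shape, so the analyst never writes them by hand.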
Use Your Analytics Tools
BigDataMonkey does not develop analytics and visualization tools. There are already many great technologies, and you have your preferred platforms. BigDataMonkey gets the data, shapes it and loads it where your own analytics tools can access it. So you can pick the tools you want and keep using the tools you know.
Data Lineage: Know Where Your Data Came From
BigDataMonkey C4 traces the lineage of both data and processing. As data is shaped, sources, targets and transformations are tracked back to the original raw source, down to the attribute level. Analysts can step from data to the process that produced it, and from that process to its input data, all the way back, to understand where the data came from and why it came out the way it did.
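Lineage tracing amounts to walking a dependency graph backwards. A minimal sketch, assuming lineage is recorded as output-to-inputs edges (the attribute names here are hypothetical):

```python
def trace_lineage(edges, target):
    """Walk transformation edges (output -> inputs) backwards from a target
    attribute to its raw sources, depth-first."""
    sources, seen, stack = [], set(), [target]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        parents = edges.get(node, [])
        if not parents:          # no recorded inputs: a raw source
            sources.append(node)
        stack.extend(parents)
    return sorted(sources)

# Each key was produced by a transform over its listed inputs.
edges = {
    "report.revenue": ["clean.orders.total", "clean.fx.rate"],
    "clean.orders.total": ["raw.orders.csv"],
    "clean.fx.rate": ["raw.fx_feed.json"],
}
print(trace_lineage(edges, "report.revenue"))
# ['raw.fx_feed.json', 'raw.orders.csv']
```

The same walk, run forward, answers the complementary question: which downstream shapes are affected if a raw source changes.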