Filling the Knowledge Gap
The biggest challenge in working with Big Data on Hadoop is the scarcity of knowledgeable people: Hadoop requires specialized skills and experience that are in short supply. There are training programs on Hadoop as a technology, but how does your team apply Hadoop to:
- Ingest raw source files and still ensure quality, consistency and completeness
- Integrate new data with traditional enterprise data into a manageable Data Lake
- Enhance unstructured log, messaging and sensor data with sufficient structure to enable analysis
- Shape data to support advanced analytics disciplines like predictive modeling and machine learning
- Derive more value from Big Data by satisfying real-world, forward-thinking needs
These questions require personnel that not only know Big Data and Hadoop but also understand enterprise processes and how to shape data to support innovative use cases.
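Even a basic ingest-time check can catch quality gaps early. A minimal sketch of the first bullet above, validating a raw delimited extract for completeness and consistency before it lands in the lake (the file layout and column names here are hypothetical):

```python
import csv
import io

def validate_delimited(text, required_columns, delimiter=","):
    """Check a raw delimited extract for completeness and consistency.

    Returns a list of human-readable issues; an empty list means the
    file passed. Required columns are whatever the source promises.
    """
    issues = []
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = list(reader)
    if not rows:
        return ["file is empty"]
    header = rows[0]
    missing = [c for c in required_columns if c not in header]
    if missing:
        issues.append(f"missing required columns: {missing}")
    width = len(header)
    for lineno, row in enumerate(rows[1:], start=2):
        if len(row) != width:  # ragged row -> inconsistent record
            issues.append(f"line {lineno}: expected {width} fields, got {len(row)}")
    return issues

# Example: the second record is short one field.
sample = "id,name,amount\n1,widget,9.99\n2,gadget\n"
print(validate_delimited(sample, ["id", "amount"]))
# -> ["line 3: expected 3 fields, got 2"]
```

A real ingest pipeline would add type, range and referential checks, but even this gate keeps malformed records out of downstream shaping.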
Capability and Conviction Deliver Big Ideas
The BigDataMonkey team has years of experience in ETL, data warehousing, business intelligence, analytics and enterprise software, along with extensive training and expertise in Hadoop. We are fully skilled in the primary Hadoop libraries, including Pig, Hive, MapReduce, Oozie, Sqoop, Flume and others. We are thought leaders in semantic analysis, schema matching, machine learning and how to apply these approaches to data shaping. We believe data shaping can and should be automated to allow our customers to try new things with more data more quickly. Our capability and conviction combined with truly amazing technology make big ideas possible.
Process to Ensure Success
Poorly executed data processes can wreak havoc on your analysis and operations. The risks are greater with new technologies, so it is critical to follow an effective methodology with continuous delivery and review. BigDataMonkey follows a disciplined agile delivery methodology both in product development and customer delivery. We discover and develop data shaping processes through iterative sprints that continuously validate each increment. Each data shaping project moves quickly through the phases outlined below, with defined success criteria for each phase and gate.
Working in Partnership
BigDataMonkey works in partnership with our customers to ensure the products we deliver support the solutions our customers need both now and in the future. Our team works jointly with your technical and business analysts to develop solutions that deliver demonstrable business value. We transition our know-how so your team becomes self-sufficient. There’s no better way to learn than by doing, especially with people who know what they are doing.
Having the Right Tool Helps
The BigDataMonkey platform quickly delivers usable data to an analytical project by automating the ingestion, profiling, cleansing and shaping of data. This is typically the largest task in any data-driven initiative. The web-based visual workbench makes data and process configuration intuitive and allows easy sharing and coordination across teams and locations. By automating data shaping, BigDataMonkey accelerates delivery of innovative analytics solutions and allows teams to use a more agile approach, trying different techniques with different data.
Source Thought Delivery Phases
- Discovery – Identify target use cases, data sources and shaping characteristics to determine how Hadoop and BigDataMonkey can best be leveraged. Conduct discussions with business and technical stakeholders to understand the sources and targets and assess the requirements and potential benefits. Prepare a plan for a POC or pilot that brings high-value data and shapes into a data pond.
- POC (Proof of Concept) – Define the scope and configuration for the POC Hadoop cluster to create the data pond, either on Amazon EC2, on premises, or with another cloud provider. Conduct a one-time POC data load from key source systems. Use the BigDataMonkey platform to profile POC data and identify semantics, inconsistencies and relationships. Use the browser-based visual workbench to review and define essential data shapes and generate the associated processes and queries. Test shaped data in analytics, modeling and/or warehousing platforms. Review the process to assess project feasibility and prepare a plan for pilot deployment.
- Pilot – Define the scope for production deployment of one or more complete data flows onto a pilot production Hadoop environment. Use BigDataMonkey or an ETL tool to configure and test the required data ingestion processes from the different sources into the Hadoop cluster. Use BigDataMonkey to build and test the required data shaping processes. Migrate from build to test to pilot environments. Integrate shaped data with the analytics, modeling and/or warehousing platform. Review, benchmark and plan for ongoing production deployment.
- Production – Tune and transition the pilot to ongoing production on the final Hadoop, cloud, analytics, modeling and/or warehouse environments. Operations, monitoring, scheduling, maintenance and support are migrated to production support. Procedures, open issues and documentation are handed over.
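The shaping step in the POC phase, adding structure to raw log lines so they can be queried, might look like this in miniature (the log layout and field names are assumed for illustration):

```python
import re

# Assumed log layout: "2024-05-01T12:00:00 INFO user=42 action=login"
LOG_PATTERN = re.compile(
    r"(?P<ts>\S+)\s+(?P<level>\w+)\s+user=(?P<user>\d+)\s+action=(?P<action>\w+)"
)

def shape_log_lines(lines):
    """Turn raw log text into tabular records; skip lines that don't parse."""
    records = []
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m:
            rec = m.groupdict()
            rec["user"] = int(rec["user"])  # cast to the target schema type
            records.append(rec)
    return records

raw = [
    "2024-05-01T12:00:00 INFO user=42 action=login",
    "corrupted line",
    "2024-05-01T12:00:05 WARN user=7 action=retry",
]
print(shape_log_lines(raw))
```

In practice the generated shaping processes run as Pig or Hive jobs across the cluster, but the principle is the same: impose enough structure on semi-structured sources that analytics and modeling tools can consume them.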