Big Data Resources
Some of the key resources and projects on the web to guide your Big Data strategy going into 2023. Below are a number of interesting articles, tutorials, links, and resources on various Big Data technologies.
Awesome Big Data list on GitHub.
Apache Big Data Stack, from Spark to Hadoop to NiFi, Kafka, … Everything is under the Apache umbrella.
- Apache NiFi, Hortonworks is backing this awesome GUI-driven big data project.
- Apache Zeppelin, a very cool Big Data notebook.
- Apache Geode, again from my friends at Pivotal. This is an awesome in-memory data grid, commercially known as GemFire.
- Apache Airavata, multitasking supertool.
- Apache DataFu, best named Apache Big Data Project in my mind.
- Apache Crunch, stays crispy in milk, MapReduce and Spark. You had me at crunch.
- Apache Falcon, an interesting data management project.
- Apache Flink, the superfast squirrel that came out of nowhere and exploded.
- Apache Tajo, a distributed relational data warehouse on Hadoop. Used in CDAP, not sure how this isn’t huge yet.
- Apache Phoenix, fast relational layer over HBase.
- Apache HAWQ, fast MPP SQL on Hadoop, open sourced from Pivotal.
- Apache Giraph, a highly scalable graph processing system, adds to the huge list of graph processing solutions out there.
- Apache Hama, a BSP (Bulk Synchronous Parallel) framework for Big Data analytics. This one is still being baked, but could be insanely useful. I am waiting and watching this one.
- Apache Helix, a clustering and partitioning solution that works with ZooKeeper.
- Apache MetaModel, a common interface to a ton of different data sources including HBase, RDBMS and NoSQL stores.
- Apache ORC, yet another file format. Also, Apache Parquet and Apache Avro.
- Apache MADlib, machine learning in SQL on PostgreSQL, Greenplum and HAWQ.
- Apache Gora, an in-memory data model.
- Apache Twill, a layer over YARN.
- Apache Accumulo, key-value store on HDFS with cell level security.
- Apache Drill, SQL on top of NoSQL, Hadoop and RDBMS.
- Apache Chukwa, analysis and monitoring for Hadoop.
- Apache Ambari, the slick install, configuration and administration tool for Hadoop.
- Apache Slider, not a small hamburger but a framework on top of YARN for better clustering.
- Apache Storm, distributed real-time computation framework that is widely used with Hadoop.
- Apache Pig, a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.
- RHadoop Installation Guide for Red Hat Enterprise Linux
Graph and Network Analysis
- Presentations and talks on Apache Giraph, an iterative graph processing system built for high scalability
- Serious network analysis using Hadoop and Neo4j
A list of resources that will help you learn more about these tools, and more.
- 35 Free Data Sources
- 19 Free Public Data Sets for Your First Data Science Project
- A Case for Database Development
- Data Analyst vs Data Scientist – What are the Differences?
- The Importance of Data Science Careers
- Apache Geode (incubating) | Home
- Stream Processing with Apache Flink | Coding
- Data Exploration Using Spark
- Apache Eagle – Secure Hadoop in Real Time
- The Hadoop Ecosystem Table
- Stock Inference by Pivotal-Open-Source-Hub
- Data Architectures for Robust Decision Making
- A Scalable Implementation of Deep Learning on Spark - Alexander Ulanov
- Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of S…
- Building a REST Job Server for interactive Spark as a service by Roma…
- Slides | Databricks
- Spark Summit EU 2015: Lessons from 300 production users
- Spark – The Ultimate Scala Collections by Martin Odersky
- Spark Usage · mongodb/mongo-hadoop Wiki
- Scala School – Java Scala
- Spark SQL and DataFrames – Spark 1.5.1 Documentation
- Configuration – Spark 1.5.1 Documentation
- Quick Start – Spark 1.5.1 Documentation
- SystemML – developerWorks Open
- Getting Started with Apache Spark