Big Data Resources
Here are some of the key resources and projects on the web to guide your Big Data strategy going into 2016. Below you will find a large number of interesting articles, tutorials, links, and resources on various Big Data technologies.
Top Big Data and Data Science books you should read
Maptive has assembled a must-follow list of 100 Big Data people on Twitter.
Big Data Awesome list on GitHub.
A list of must-read blogs that intersect Big Data, Cloud, Microservices, Containers, and IoT:
- VMware, they have software and solutions for all the things.
- IoT Central for all things…
- Databricks posts are heavy, but when they cover Apache Spark, they are a must-read.
- Elephants never forget and publish great Big Data content.
- Intel is inside lots of things, but they have awesome Big Data knowledge.
- Spring is now the programming glue for everything, without the XML configuration nightmare of yore.
Also read:
Top 12 Hadoop Technology Companies
The Biggest Challenge of Hadoop Analytics: It’s all about Query Performance
Relation Between Big Data Hadoop and Cloud Computing
What Is Hadoop, And How Does It Relate To Cloud?
A Guide to Become a Successful Hadoop Developer in 2022
How To Kick Start Your Career With Hadoop And Big Data Training?
13 Reasons Why System/Data Administrators should do Hadoop Training
Top 10 Tips for Hadoop Administration for Starters
Apache Big Data Stack, from Spark to Hadoop to NiFi, Kafka, … Everything is under the Apache umbrella.
What’s new?
- Apache NiFi, an awesome GUI-driven Big Data project backed by Hortonworks.
- Apache Zeppelin, a very cool Big Data notebook.
- Apache Geode, again from my friends at Pivotal. This is an awesome in-memory data grid, commercially known as GemFire.
- Apache Airavata, multitasking supertool.
- Apache DataFu, best named Apache Big Data Project in my mind.
- Apache Crunch, stays crispy in milk, MapReduce and Spark. You had me at crunch.
- Apache Falcon, an interesting data management project.
- Apache Flink, the superfast squirrel that came out of nowhere and exploded.
- Apache Tajo, a distributed relational data warehouse on Hadoop. Used in CDAP, not sure how this isn’t huge yet.
- Apache Phoenix, fast relational layer over HBase.
- Apache HAWQ, fast MPP SQL on Hadoop, open sourced from Pivotal.
- Apache Giraph, a highly scalable graph processing system that adds to the huge list of graph processing solutions out there.
- Apache Hama, a BSP (Bulk Synchronous Parallel) framework for Big Data analytics. This one is still being baked, but could be insanely useful. I am waiting and watching this one.
- Apache Helix, a clustering and partitioning solution that works with ZooKeeper.
- Apache MetaModel, a common interface to a ton of different data sources including HBase, RDBMS and NoSQL stores.
- Apache ORC, yet another file format. Also, Apache Parquet and Apache Avro.
- Apache MADlib, Machine Learning in SQL on PostgreSQL, Greenplum and HAWQ.
- Apache Gora, an in-memory data model.
- Apache Twill, a layer over YARN.
- Apache Accumulo, key-value store on HDFS with cell-level security.
- Apache Drill, SQL on top of NoSQL, Hadoop and RDBMS.
- Apache Chukwa, analysis and monitoring for Hadoop.
- Apache Ambari, the slick install, configuration and administration tool for Hadoop.
- Apache Slider, not a small hamburger but a framework on top of YARN for better clustering.
- Apache Storm, distributed real-time computation framework that is widely used with Hadoop.
Apache Pig
RHadoop
Graph and Network Analysis
- Presentations and talks on Apache Giraph, an iterative graph processing system built for high scalability
- Serious network analysis using Hadoop and Neo4j
- I Mapreduced a Neo store: Creating large Neo4j Databases with Hadoop
Open Data Platform’s Sandbox for Hadoop.
Pivotal’s Spring Cloud DataFlow
Dell Big Data Resource Library
Huge list of resources that will help you learn more about these tools, and more.
- 35 Free Data Sources
- 19 Free Public Data Sets for Your First Data Science Project
- A Case for Database Development
- Data Analyst vs Data Scientist – What are the Differences?
- GDPR Information Portal
- The Importance of Data Science Careers
- Responding Rapidly When You Have 100GB Data Sets in Java
- Hands-on Tour of Apache Spark in 5 Minutes – Hortonworks
- LucidWorks/banana
- Hortonworks Community Gallery
- http://systemml.apache.org/
- Download Flo for Spring XD — Pivotal Network
- This week in #Scala (16/11/2015)
- Apache Geode (incubating) | Home
- Stream Processing with Apache Flink | Coding
- Introduction to Apache Spark – Ippon USA – Big Data, Digital and Cloud Solu
- Get started with Spark, HDFS and Cassandra – Ippon USA – Big Data, Digital and
- Presto | Distributed SQL Query Engine for Big Data
- http://www.slideshare.net/SparkSummit/intro-to-spark-development
- http://www.slideshare.net/DataFactZ/introduction-to-spark-datafactz
- http://www.slideshare.net/hkarau/effective-testing-for-spark-programs-strata-ny-2015
- http://www.slideshare.net/databricks/spark-dataframes-simple-and-fast-analytics-on-structured-data-at-spark-summit-2015
- http://www.slideshare.net/SparkSummit/using-spark-with-tachyon-by-gene-pang
- http://www.slideshare.net/databricks/spark-summit-east-2015-advdevopsstudentslides
- http://www.slideshare.net/pwendell/tuning-and-debugging-in-apache-spark
- http://www.slideshare.net/databricks/strata-nyc-2015-whats-new-in-spark-streaming
- http://www.slideshare.net/databricks/spark-summit-eu-2015-lessons-from-300-production-users
- http://www.slideshare.net/SparkSummit/productionizing-spark-and-the-rest-job-server-evan-chan
- http://www.slideshare.net/cloudera/spark-devwebinarslides-final?from_m_app=android
- http://www.slideshare.net/pacoid/crash-introduction-to-apache-spark?from_m_app=android
- Build a CEP App on Apache Spark and Drools
- New in Cloudera Labs: SparkOnHBase – Cloudera Engineering Blog
- Exactly-once Spark Streaming from Apache Kafka – Cloudera Engineering Blog
- Data Exploration Using Spark
- Data Science Cheat Sheet – Data Science Central
- Apache Spark Plugin | Apache Phoenix
- Apache Eagle – Secure Hadoop in Real Time
- Twitter Analytics Example — Tigon 0.2.1 Documentation
- Iterative Data Processing with Apache Spark — Cask Data Application Platfor
- Spark Programs — Cask Data Application Platform 3.2.1 Documentation
- The Hadoop Ecosystem Table
- pachyderm/pachyderm
- Apache HBase – Apache HBase™ Home
- Apache Hive TM
- How we built it: designing a globally consistent transaction engine | Cask
- Zab vs. Paxos – Apache ZooKeeper – Apache Software Foundation
- Scala Crash Course, Part 1 | Coding
- Data Science for Losers, Part 3 – Scala & Apache Spark | Coding
- Scala is not a mythical monster! Using scala with Spring boot
- Install Hadoop on Windows in 3 Easy Steps for Hortonworks Sandbox Tutorial
- A quick tour of JSON libraries in Scala | In translation
- Guides | Cask
- Development Environment Setup — Cask Data Application Platform 3.2.1 Docume
- CDAP Software Development Kit (SDK) — Cask Data Application Platform 3.2.1
- cdap-guides/cdap-twitter-ingest-guide
- Running Legacy MapReduce Jobs in CDAP | Cask Blog
- cdn.oreillystatic.com/en/assets/1/event/129/Apache Spark Tutorial, with dee
- Spark Summit 2014 Training Archive | Spark Summit 2014
- Under the Hood — Coopr 0.9.9 Documentation
- CDAP
- Zeppelin
- Stock Inference by Pivotal-Open-Source-Hub
- Data Architectures for Robust Decision Making
- Trying out Tachyon on Hadoop 2.2 | Sparklandia
- A Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
- Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of S…
- Building a REST Job Server for interactive Spark as a service by Roma…
- Slides | Databricks
- Spark Summit Europe
- Spark Summit EU 2015: Lessons from 300 production users
- Spark – The Ultimate Scala Collections by Martin Odersky
- Analytics using Apache Spark and MongoDB
- Spark Usage · mongodb/mongo-hadoop Wiki
- Using MongoDB with Spark | Databricks Blog
- Scala School – Java Scala
- Setting up Hadoop (v2) with Spark (v1) on OSX using Homebrew | Andy Judson’
- giorgioinf/twitter-stream-ml
- Home · fluxcapacitor/pipeline Wiki
- Spark SQL and DataFrames – Spark 1.5.1 Documentation
- Configuration – Spark 1.5.1 Documentation
- Introduction to Machine Learning with Spark (Clustering) | Knoldus
- Using MongoDB with Hadoop & Spark: Part 3 – Spark Example & Key Takeaways |
- Testing Scala Applications with In-memory mongoDB | Knoldus
- A River of Bytes: Efficient Spark SQL Queries to MongoDB
- fogus: The 100:10:1 method: my approach to open source
- spark-redis
- Quick Start – Spark 1.5.1 Documentation
- The second-best feature of Java 8 – TripAdvisor Engineering Blog
- Using NLP to Find “Interesting” Collections of Hotels – TripAdvisor Engineerin
- Using Apache Spark for Massively Parallel NLP at TripAdvisor – Cloudera Eng
- SystemML – developerWorks Open
- www.cs.berkeley.edu/~haoyuan/talks/Tachyon_2014-10-16-Strata.pdf
- Tachyon FAQ
- Getting Started with Apache Spark
- spark-mongodb/First_Steps.rst at master · Stratio/spark-mongodb
- The Bleeding Edge: Spark, Parquet and S3 – AppsFlyer
- How-to: Do Data Quality Checks using Apache Spark DataFrames – Cloudera Eng
- Feature Engineering at Scale With Spark – Eugene Zhulenev
- Optimizing Spark Machine Learning for Small Data – Eugene Zhulenev
- ezhulenev: Interactive Audience Analytics with Spark
- databricks: Audience Modeling With @ApacheSpark ML Pipelines
- Building Twitter Live Stream Analytics With Spark and Cassandra
- https://github.com/palantir/atlasdb/wiki