Mastering Apache Spark
Format: PDF / Kindle (mobi) / ePub
Gain expertise in processing and storing data by using advanced techniques with Apache Spark
About This Book
- Explore the integration of Apache Spark with third-party applications such as H2O, Databricks, and Titan
- Evaluate how Cassandra and HBase can be used for storage
- An advanced guide that combines instructions and practical examples to extend the most up-to-date Spark functionalities
Who This Book Is For
If you are a developer with some experience with Spark and want to strengthen your knowledge of how to get around in the world of Spark, then this book is ideal for you. Basic knowledge of Linux, Hadoop and Spark is assumed. Reasonable knowledge of Scala is expected.
What You Will Learn
- Extend the tools available for processing and storage
- Examine clustering and classification using MLlib
- Discover Spark stream processing via Flume and HDFS
- Create a schema in Spark SQL, and learn how a Spark schema can be populated with data
- Study Spark-based graph processing using Spark GraphX
- Combine Spark with H2O and deep learning and learn why it is useful
- Evaluate how graph storage works with Apache Spark, Titan, HBase and Cassandra
- Use Apache Spark in the cloud with Databricks and AWS
Apache Spark is an in-memory, cluster-based parallel processing system that provides a wide range of functionality, such as graph processing, machine learning, stream processing, and SQL. It operates at unprecedented speeds, is easy to use, and offers a rich set of data transformations.
This book aims to take your limited knowledge of Spark to the next level by teaching you how to expand Spark functionality. The book commences with an overview of the Spark ecosystem. You will learn how to use MLlib to create a fully working neural net for handwriting recognition. You will then discover how stream processing can be tuned for optimal performance and to ensure parallel processing. The book then shows how to incorporate H2O for machine learning, Titan for graph-based storage, and Databricks for cloud-based Spark. Intermediate Scala-based code examples are provided for Apache Spark module processing in a CentOS Linux and Databricks cloud environment.
Style and approach
This book is an extensive guide to Apache Spark modules and tools and shows how Spark's functionality can be extended for real-time processing and storage with worked examples.
Hernan Amiune at http://hernan.amiune.com/. The following example considers the classification of emails as spam. Suppose you have 100 emails with the following breakdown:
- 60% of emails are spam
  - 80% of spam emails contain the word buy
  - 20% of spam emails don't contain the word buy
- 40% of emails are not spam
  - 10% of non-spam emails contain the word buy
  - 90% of non-spam emails don't contain the word buy
Thus, convert this example into probabilities, so that a Naïve Bayes equation can be created.
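As a rough sketch of that conversion, the percentages above can be plugged into Bayes' rule to estimate how likely an email is to be spam given that it contains the word buy. The symbols S and B are notation assumed here for illustration and are not taken from the text.

```latex
% Assumed notation: S = "email is spam", B = "email contains the word buy"
\begin{align*}
P(S) &= 0.6, \qquad P(\neg S) = 0.4 \\
P(B \mid S) &= 0.8, \qquad P(B \mid \neg S) = 0.1 \\
P(B) &= P(B \mid S)\,P(S) + P(B \mid \neg S)\,P(\neg S) = 0.48 + 0.04 = 0.52 \\
P(S \mid B) &= \frac{P(B \mid S)\,P(S)}{P(B)} = \frac{0.48}{0.52} \approx 0.923
\end{align*}
```

In terms of the 100 emails, 48 are spam containing buy and 4 are non-spam containing buy, so 48 of the 52 emails that contain the word are spam.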
neuron j (if it is not an input layer neuron) is given by F(Net), the squashing function, which will be described next. A simulated neuron needs a firing mechanism, which decides whether the inputs to the neuron have reached a threshold. If they have, the neuron fires to create its activation value. This firing or squashing function can be described by the generalized sigmoid function shown in the following figure: This function has two constants: A and B; B affects the shape of the
environment setup of the Gremlin shell, but you still want to be able to quickly re-execute a script. This code snippet has been executed from a Bash shell script, as can be seen in the next example. The following script uses the titan.sh script to manage the Gremlin server:

    #!/bin/bash
    TITAN_HOME=/usr/local/titan/
    cd $TITAN_HOME
    bin/titan.sh start
    bin/gremlin.sh << EOF
    t = TitanFactory.open('cassandra.properties')
    GraphOfTheGodsFactory.load(t)
    t.close()
    EOF
    bin/titan.sh stop

A third method
password. This will allow you to access the Databricks web-based user interface, as shown in the following screenshot: This is the welcome screen. It shows the menu bar at the top of the image, which, from left to right, contains the menu, search, help, and account icons. While using the system, there may also be a clock-faced icon that shows the recent activity. From this single interface, you may search through help screens and usage examples before creating your own clusters and code.
function, and then showing mount directories: s3data and s3data1. These were the directories created during the previous Scala S3 mount example.
Dbutils fsutils
The fsutils group of functions, within the dbutils package, covers functions such as cp, head, mkdirs, mv, put, and rm. The help calls, shown previously, can provide more information about them. You can create a directory on DBFS using the mkdirs call, as shown next. Note that I have created a number of directories under dbfs:/,