Practical Data Science Cookbook - Real-World Data Science Projects to Help You Get Your Hands On Your Data
Tony Ojeda, Sean Patrick Murphy, Benjamin Bengfort, Abhijit Dasgupta
Format: PDF / Kindle (mobi) / ePub
- Learn how to tackle every step in the data science pipeline and use it to acquire, clean, analyze, and visualize data
- Get beyond the theory with real-world projects
- Expand your numerical programming skills through step-by-step code examples and learn more about the robust features of R and Python
Data's value has grown exponentially in the past decade, with 'Big Data' today being one of the biggest buzzwords in business and IT, and data scientist hailed as 'the sexiest job of the 21st century'. Practical Data Science Cookbook helps you see beyond the hype and get past the theory by providing you with a hands-on exploration of data science. With a comprehensive range of recipes designed to help you learn fundamental data science tasks, you'll uncover practical steps to help you produce powerful insights into Big Data using R and Python.
Use this valuable data science book to discover tricks and techniques to get to grips with your data. Learn effective data visualization with an automobile fuel efficiency data project, analyze football statistics, learn how to create data simulations, and work with stock market data to learn data modelling. Find out how to produce sharp insights into social media data by following data science tutorials that demonstrate the best ways to tackle Twitter data, and uncover recipes that will help you dive in and explore Big Data through movie recommendation databases.
Practical Data Science Cookbook is your essential companion to the real-world challenges of working with data, created to give you a deeper insight into a world of Big Data that promises to keep growing.
What you will learn
- Follow the recipes in this essential data science cookbook to learn the fundamentals of data science and data analysis
- Put theory into practice with a selection of real-world Big Data projects
- Learn the data science pipeline and successfully structure your data science project
- Find out how to clean, munge, and manipulate data
- Learn different approaches to data modelling and how to determine the most appropriate for your data
- Learn numerical computing with NumPy and SciPy
About the Authors
Tony Ojeda is the founder of District Data Labs, a cofounder of Data Community DC, and is actively involved in promoting data science education through both organizations.
Sean Patrick Murphy spent 15 years as a senior scientist at The Johns Hopkins University Applied Physics Laboratory, where he focused on machine learning, modeling and simulation, signal processing, and high performance computing in the Cloud. Now, he acts as an advisor and data consultant for companies in SF, NY, and DC.
Benjamin Bengfort has worked in the military, industry, and academia for the past 8 years. He is currently pursuing his PhD in Computer Science at the University of Maryland, College Park, researching Metacognition and Natural Language Processing.
Abhijit Dasgupta is a data consultant working in the greater DC-Maryland-Virginia area, with several years of experience in biomedical consulting, business analytics, bioinformatics, and bioengineering consulting.
Table of Contents
- Preparing Your Data Science Environment
- Driving Visual Analysis with Automobile Data (R)
- Simulating American Football Data (R)
- Modeling Stock Market Data (R)
- Visually Exploring Employment Data (R)
- Creating Application-oriented Analyses Using Tax Data (Python)
- Driving Visual Analyses with Automobile Data (Python)
- Working with Social Graphs (Python)
- Recommending Movies at Scale (Python)
- Harvesting and Geolocating Twitter Data (Python)
- Optimizing Numerical Code w
the standard Python Read-Eval-Print Loop (REPL), among many other tools.
- IPython Notebook: This offers a browser-based tool to perform and record work done in Python with support for code, formatted text, markdown, graphs, images, sounds, movies, and mathematical expressions.
- pandas: This provides a robust data frame object and many additional tools to make traditional data and statistical analysis fast and easy.
- nose: This is a test harness that extends the unit testing framework
http://cran.r-project.org/doc/manuals/r-release/R-data.html
- Explore the datatypes in R at http://www.statmethods.net/input/datatypes.html
Exploring and describing fuel efficiency data
Now that we have imported the automobile fuel efficiency dataset into R and learned a little about the nuances of importing, the next step is to do some preliminary analysis of the dataset. The purpose of this analysis is to explore what the data looks like and get your feet wet with some of R's most basic
offense$ORushStrength <- (1-(offense$ORushStrength/max(offense$ORushStrength)))*100
3. Let's calculate index values for a couple more fields before aggregating them into a single offensive strength value. For example, let's choose points and yards per game:
offense$OPPGStrength <- max(offense[,3])-offense[,3]
offense$OPPGStrength <- (1-(offense$OPPGStrength/max(offense$OPPGStrength)))*100
offense$OYPGStrength <- max(offense[,4])-offense[,4]
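The two-step index calculation in this excerpt (take each team's gap to the best value, then rescale so the best team scores 100 and the weakest scores 0) can be sketched in Python as follows. This is a minimal illustration, not the book's code: the function name `strength_index` and the sample figures are hypothetical, and the guard for identical values is an added assumption.

```python
def strength_index(values):
    """Scale values to a 0-100 index where 100 marks the largest value."""
    best = max(values)
    # Step 1: gap between each value and the best one,
    # mirroring max(offense[,3]) - offense[,3] in the R excerpt.
    gaps = [best - v for v in values]
    worst_gap = max(gaps)
    if worst_gap == 0:
        # Assumption: if every value is identical, treat all as equally strong.
        return [100.0 for _ in values]
    # Step 2: invert and rescale, mirroring (1 - (gap / max(gap))) * 100.
    return [(1 - gap / worst_gap) * 100 for gap in gaps]


# Hypothetical points-per-game figures for three teams.
points_per_game = [30.0, 24.0, 18.0]
print(strength_index(points_per_game))
```

The same function could be applied to each field in turn before combining the resulting indices into a single strength value.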
phases of the data pipeline.
There's more...
There are several different Python template languages, each with a different approach to combining a predefined template with data to form human-readable output. Many of these template languages are intended as the backbone of web application frameworks, such as Django and Flask, that are used to construct dynamic web pages from a database. Since these languages are well suited to
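The general idea of combining a predefined template with data can be sketched with `string.Template` from the Python standard library. This is only a stand-in to show the pattern, not the template language used in the book's recipe; the template text, the `render_report` helper, and the sample record are all hypothetical.

```python
from string import Template

# A predefined template with named placeholders (hypothetical layout).
REPORT = Template("State: $state\nReturns filed: $returns")


def render_report(record):
    """Fill the template's placeholders from a dict of values."""
    # substitute() raises KeyError if a placeholder has no matching key.
    return REPORT.substitute(record)


# Made-up record for illustration only.
sample = {"state": "MD", "returns": 1200}
print(render_report(sample))
```

Full template languages add loops, conditionals, and template inheritance on top of this basic substitution step, which is what makes them suitable for generating whole reports or web pages.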
will miss out on the excellent Integrated Development Environment (IDE) built for R, called RStudio. Visit http://www.rstudio.com/ide/download/ to download RStudio, and follow the online installation instructions.
5. Once installed, go ahead and run RStudio. The following screenshot shows one of our authors' customized RStudio configurations with the Console panel in the upper-left corner, the editor in the upper-right corner, the current variable list in the lower-left corner, and the current