Peng, R. and Matsui, E. (2016) The Art of Data Science, lulu.com
Download eBook PDF (PDF 6,365KB)
Data analysis is a difficult process, largely because few people can describe exactly how to do it. It’s not that no one does data analysis on a regular basis. It’s that the process by which we state a question, explore data, conduct formal modeling, interpret results, and communicate findings is difficult to generalize and abstract. Fundamentally, data analysis is an art. It is not yet something that we can easily automate. Data analysts have many tools at their disposal, from linear regression to classification trees to random forests, and these tools have all been carefully implemented on computers. But ultimately it takes a data analyst, a person, to find a way to assemble all of the tools and apply them to data to answer a question of interest to people.
This book sets down the process of data analysis with a minimum of technical detail. What we describe is not a specific “formula” for data analysis, but a general process that can be applied in a variety of situations. Through our extensive experience, both managing data analysts and conducting our own data analyses, we have carefully observed what produces coherent results and what fails to produce useful insights into data. This book is a distillation of that experience, in a format applicable to both practitioners and managers in data science.
Cielen, D., Meysman, A. and Ali, M. (2016) Introducing Data Science: Big Data, Machine Learning, and more, using Python tools, Manning Publications
Download eBook PDF (PDF 14,945KB)
Download Source Code Part 1 (ZIP KB)
Download Source Code Part 2 (ZIP KB)
Download PDF Overview of the Data Science Process (PDF 279KB)
Many companies need developers with data science skills to work on projects ranging from social media marketing to machine learning. Discovering what you need to learn to begin a career as a data scientist can seem bewildering. This book is designed to help you get started.
Introducing Data Science explains vital data science concepts and teaches you how to accomplish the fundamental tasks that occupy data scientists. You’ll explore data visualization, graph databases, the use of NoSQL, and the data science process. You’ll use the Python language and common Python libraries as you experience firsthand the challenges of dealing with data at scale. Discover how Python allows you to gain insights from data sets so big that they need to be stored on multiple machines, or from data moving so quickly that no single machine can handle it. This book gives you hands-on experience with the most popular Python data science libraries, Scikit-learn and StatsModels. After reading this book, you’ll have the solid foundation you need to start a career in data science.
What’s inside
- Handling large data
- Introduction to machine learning
- Using Python to work with data
- Writing data science algorithms
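The “handling large data” theme (developed in chapter 4) comes down to processing records one at a time instead of loading a whole dataset into memory. As an illustrative sketch in plain Python, not an example from the book, a generator-based one-pass aggregation looks like this:

```python
def read_records(n):
    """Stand-in for a data source too large to hold in memory
    (e.g. a log file or a database cursor); yields one record at a time."""
    for i in range(n):
        yield {"user": i % 100, "amount": float(i)}

def streaming_mean(records, key):
    """One-pass mean: memory use is constant regardless of input size."""
    count, total = 0, 0.0
    for rec in records:
        count += 1
        total += rec[key]
    return total / count if count else 0.0

# A million records are consumed lazily; only the running totals are kept.
mean_amount = streaming_mean(read_records(1_000_000), "amount")
print(mean_amount)  # mean of 0..999999 -> 499999.5
```

The same pattern generalizes to any aggregate that can be updated incrementally (counts, sums, histograms), which is why it recurs throughout the book’s large-data case studies.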
Table of Contents
1. Data science in a Big Data world
1.1. Benefits and uses of data science and Big Data
1.2. Facets of data
1.2.1. Structured data
1.2.2. Unstructured data
1.2.3. Natural language
1.2.4. Machine-generated data
1.2.5. Graph-based or network data
1.2.6. Audio, image, and video
1.3. The data science process
1.3.1. Setting the research goal
1.3.2. Retrieving data
1.3.3. Data cleansing
1.3.4. Data exploration
1.3.5. Data modeling or model building
1.3.6. Presentation and automation
1.4. The Big Data ecosystem and data science
1.4.1. Distributed file systems
1.4.2. Distributed programming framework
1.4.3. Data integration framework
1.4.4. Machine learning frameworks
1.4.5. NoSQL databases
1.4.6. Scheduling tools
1.4.7. Benchmarking tools
1.4.8. System deployment
1.4.9. Service programming
1.4.10. Security
1.5. An introductory working example of Hadoop
1.6. Summary
2. The data science process
2.1. Overview of the data science process
2.1.1. Don’t be a slave to the process
2.2. Step 1: defining research goals and creating a project charter
2.2.1. Spend time understanding the goals and context of your research
2.2.2. Create a project charter
2.3. Step 2: retrieving data
2.3.1. Start with data stored within the company
2.3.2. Don’t be afraid to shop around
2.3.3. Do data quality checks now to prevent problems later
2.4. Step 3: cleansing, integrating, and transforming data
2.4.1. Cleansing data
2.4.2. Correct errors as early as possible
2.4.3. Combining data from different data sources
2.4.4. Transforming data
2.5. Step 4: exploratory data analysis
2.6. Step 5: Build the models
2.6.1. Model and variable selection
2.6.2. Model execution
2.6.3. Model diagnostics and model comparison
2.7. Step 6: Presenting findings and building applications on top of them
2.8. Summary
3. Machine learning
3.1. What is machine learning and why should you care about it?
3.1.1. Applications for machine learning in data science
3.1.2. Where machine learning is used in the data science process
3.1.3. Python tools used in machine learning
3.2. The modeling process
3.2.1. Engineering features and selecting a model
3.2.2. Training your model
3.2.3. Validating a model
3.2.4. Predicting new observations
3.3. Types of machine learning
3.3.1. Supervised learning
3.3.2. Unsupervised learning
3.4. Semi-supervised learning
3.5. Summary
4. Handling large data on a single computer
4.1. The problems you face when handling large data
4.2. General techniques for handling large volumes of data
4.2.1. Choosing the right algorithm
4.2.2. Choosing the right data structure
4.2.3. Selecting the right tools
4.3. General programming tips for dealing with large datasets
4.3.1. Don’t reinvent the wheel
4.3.2. Get the most out of your hardware
4.3.3. Reduce your computing needs
4.4. Case study 1: predicting malicious URLs
4.4.1. Step 1: defining the research goal
4.4.2. Step 2: acquiring the URL data
4.4.3. Step 4 of the data science process: data exploration
4.4.4. Step 5 of the data science process: model building
4.5. Case study 2: building a recommender system inside a database
4.5.1. Tools and techniques needed
4.5.2. Step 1 of the data science process: research question
4.5.3. Step 3 of the data science process: data preparation
4.5.4. Step 5 of the data science process: model building
4.5.5. Step 6 of the data science process: presentation and automation
4.6. Summary
5. First steps in Big Data
5.1. Distributing data storage and processing with frameworks
5.1.1. Hadoop: a framework for storing and processing large datasets
5.1.2. Spark: replacing MapReduce for better performance
5.2. Case study: assessing risk when loaning money
5.2.1. Part 1 of the data science process: the research goal
5.2.2. Part 2 of the data science process: data retrieval
5.2.3. Part 3 of the data science process: data preparation
5.2.4. Step 4: data exploration & step 6: report building
5.3. Summary
6. Join the NoSQL movement
6.1. Introduction to NoSQL
6.1.1. ACID: the core principle of relational databases
6.1.2. CAP Theorem: the problem with DBs on many nodes
6.1.3. The BASE principles of NoSQL databases
6.1.4. NoSQL database types
6.2. Case study: what disease is that?
6.2.1. Step 1: setting the research goal
6.2.2. Steps 2 and 3: data retrieval and preparation
6.2.3. Step 4: data exploration
6.2.4. Step 3 revisited: data preparation for disease profiling
6.2.5. Step 4 revisited: data exploration for disease profiling
6.2.6. Step 6: presentation and automation
6.3. Summary
7. The rise of graph databases
7.1. Introducing connected data and graph databases
7.1.1. Why and when should I use a graph database?
7.2. Introducing Neo4j: a graph database
7.2.1. Cypher: a graph query language
7.3. Connected data example: a recipe recommendation engine
7.3.1. Step 1: setting the research goal
7.3.2. Step 2: data retrieval
7.3.3. Step 3: data preparation
7.3.4. Step 4: data exploration
7.3.5. Step 5: data modeling
7.3.6. Step 6: presentation
7.4. Summary
8. Text mining and text analytics
8.1. Text mining in the real world
8.2. Text mining techniques
8.2.1. Bag of words
8.2.2. Stemming and lemmatization
8.2.3. Decision tree classifier
8.3. Case study: classifying Reddit posts
8.3.1. Meet the Natural Language Toolkit
8.3.2. Data science process overview and step 1: the research goal
8.3.3. Step 2: data retrieval
8.3.4. Step 3: data preparation
8.3.5. Step 4: data exploration
8.3.6. Step 3 revisited: data preparation adapted
8.3.7. Step 5: data analysis
8.3.8. Step 6: presentation and automation
8.4. Summary
9. Data visualization to the end user
9.1. Data visualization options
9.2. Crossfilter, the JavaScript MapReduce library
9.2.1. Setting everything up
9.2.2. Unleashing Crossfilter to filter the medicine dataset
9.3. Create an interactive dashboard with dc.js
9.4. Dashboard development tools
9.5. Summary
Appendixes
Appendix A: Setting up Elasticsearch
A.1. Linux Installation
A.2. Windows Installation
Appendix B: Setting up Neo4j
B.1. Linux Installation
B.2. Windows Installation
Appendix C: Installing MySQL server
C.1. Windows Installation
C.2. Linux Installation
Appendix D: Setting up anaconda with virtual environment
D.1. Linux Installation
D.2. Windows Installation
D.3. Setting up the Environment
The big data ecosystem and data science
(http://freecontent.manning.com/the-big-data-ecosystem-and-data-science/)
The big data ecosystem can be grouped into technologies that have similar goals and functionalities. In this article, we’ll explore those technologies.
This article is excerpted from Introducing Data Science.
Sikos, L. (2015) Mastering Structured Data on the Semantic Web: From HTML5 Microdata to Linked Open Data, Apress
Download eBook PDF (PDF 9,457KB)
Download Source Code (ZIP 30KB)
Companion Web Site: http://www.lesliesikos.com/mastering-structured-data-on-the-semantic-web/
A major limitation of conventional web sites is their unorganized and isolated content, which is created mainly for human consumption. This limitation can be addressed by organizing and publishing data using powerful formats that add structure and meaning to the content of web pages and link related data to one another. Computers can “understand” such data better, which can be useful for task automation. The web sites that provide semantics (meaning) to software agents form the Semantic Web, the Artificial Intelligence extension of the World Wide Web. In contrast to the conventional Web (the “Web of Documents”), the Semantic Web includes the “Web of Data”, which connects “things” (representing real-world humans and objects) rather than documents meaningless to computers. Mastering Structured Data on the Semantic Web explains the practical aspects of, and the theory behind, the Semantic Web, and how structured data such as HTML5 Microdata and JSON-LD can be used to improve your site’s performance on next-generation Search Engine Result Pages and be displayed on Google Knowledge Panels. You will learn how to represent arbitrary fields of human knowledge in a machine-interpretable form using the Resource Description Framework (RDF), the cornerstone of the Semantic Web. You will see how to store and manipulate RDF data in purpose-built graph databases such as triplestores and quadstores, which are used in Internet marketing, social media, and data mining in Big Data applications such as the Google Knowledge Graph, Wikidata, and Facebook’s Social Graph.
With user expectations of web services and applications constantly increasing, Semantic Web standards are gaining popularity. This book will familiarize you with the leading controlled vocabularies and ontologies and explain how to represent your own concepts. After learning the principles of Linked Data, the five-star deployment scheme, and the Open Data concept, you will be able to create and interlink five-star Linked Open Data and merge your RDF graphs into the LOD Cloud. The book also covers the most important tools for generating, storing, extracting, and visualizing RDF data, including, but not limited to, Protégé, TopBraid Composer, Sindice, Apache Marmotta, Callimachus, and Tabulator. You will learn to implement Apache Jena and Sesame in popular IDEs such as Eclipse and NetBeans, and to use these APIs for rapid Semantic Web application development. Mastering Structured Data on the Semantic Web demonstrates how to represent and connect structured data to reach a wider audience, encourage data reuse, and provide content that can be processed automatically. As a result, your web content will be an integral part of the next revolution of the Web.
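The JSON-LD markup the book discusses is just structured JSON with a couple of reserved keys, so it can be generated with Python’s standard json module. A minimal, hypothetical sketch (the person, job title, and URL are invented for illustration; they are not from the book):

```python
import json

# A hypothetical schema.org Person description. Search engines read
# blocks like this to populate knowledge panels and rich results.
person = {
    "@context": "https://schema.org",   # reserved JSON-LD key: the vocabulary
    "@type": "Person",                  # reserved JSON-LD key: the entity type
    "name": "Jane Example",             # invented example values
    "jobTitle": "Web Developer",
    "url": "https://example.com/jane",
}

# Serialized, this would be embedded in an HTML page inside
# <script type="application/ld+json"> ... </script>
jsonld = json.dumps(person, indent=2)
print(jsonld)
```

The same statements could equally be expressed as HTML5 Microdata attributes or as RDF triples in a triplestore; JSON-LD is simply the serialization that embeds most naturally in existing pages.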
What you’ll learn
- Extend your markup with machine-readable annotations and get your data to the Google Knowledge Graph
- Represent real-world objects and persons with machine-interpretable code
- Develop Semantic Web applications in Java
- Reuse and interlink structured data and create LOD datasets
Who this book is for
The book is intended for web developers and SEO experts who want to learn state-of-the-art Search Engine Optimization methods using machine-readable annotations and machine-interpretable Linked Data definitions. The book will also benefit researchers interested in automatic knowledge discovery. As a textbook on Semantic Web standards powered by graph theory and mathematical logic, the book could also be used as a reference work for computer science graduates and Semantic Web researchers.
Table of Contents
1. Introduction to the Semantic Web
2. Knowledge Representation
3. Linked Open Data
4. Semantic Web Development Tools
5. Semantic Web Services
6. Graph Databases
7. Querying
8. Big Data Applications
9. Use Cases