Introduction to Spark MLlib
Apache Spark MLlib (Machine Learning Library) is a scalable machine learning library provided by Apache Spark, which is an open-source cluster computing system. MLlib provides various machine learning algorithms and utilities that simplify the process of building machine learning pipelines.
Key Features of Spark MLlib:
- Scalability: MLlib is designed to scale out to handle large-scale data processing. It leverages the distributed computing capabilities of Apache Spark, making it suitable for big data analytics.
- Rich Set of Algorithms: MLlib includes a wide range of algorithms and tools for:
  - Classification
  - Regression
  - Clustering
  - Collaborative filtering
  - Dimensionality reduction
  - Feature extraction, transformation, and selection
- Integration with Spark: MLlib seamlessly integrates with other components of Apache Spark, such as Spark SQL for data manipulation and Spark Streaming for real-time data processing. This integration allows for building end-to-end data pipelines.
- Ease of Use: It provides high-level APIs in Python, Scala, Java, and R, making it accessible to a broad audience of developers and data scientists. These APIs abstract away the complexities of distributed computing, allowing users to focus on building and deploying machine learning models.
- Pipeline API: MLlib includes a Pipeline API that facilitates the construction, tuning, and evaluation of machine learning workflows. This API supports feature extraction, transformation, model training, and evaluation in a sequential manner.
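The Pipeline idea, stages fit and then applied in sequence so the whole workflow behaves like a single model, can be sketched without Spark. `Scaler` and `Pipeline` below are toy stand-ins for MLlib's transformers and its Pipeline class, not the actual MLlib API:

```python
class Scaler:
    """Toy transformer: fit() learns the mean, transform() centers the data."""
    def fit(self, data):
        self.mean = sum(data) / len(data)
        return self

    def transform(self, data):
        return [x - self.mean for x in data]


class Pipeline:
    """Runs each stage's fit + transform in order, mirroring how a
    pipeline chains feature transformers before an estimator."""
    def __init__(self, stages):
        self.stages = stages

    def fit_transform(self, data):
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return data


pipe = Pipeline([Scaler()])
print(pipe.fit_transform([1.0, 2.0, 3.0]))  # -> [-1.0, 0.0, 1.0]
```

In the real library, the stages would be MLlib feature transformers and an estimator, and fitting the pipeline yields a reusable model.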
Components of Spark MLlib:
- Core Algorithms: Includes fundamental algorithms such as linear regression, logistic regression, decision trees, random forests, k-means clustering, and more.
- Feature Transformers: Tools for feature extraction, transformation (scaling, normalization), and selection (PCA, feature hashing).
- Pipeline: Provides a framework for constructing machine learning pipelines that orchestrate multiple stages of data processing, feature engineering, and model training.
- Persistence: Allows models to be saved and loaded for reuse or deployment in production systems.
Supported Algorithms:
MLlib supports a variety of algorithms across different categories:
- Supervised Learning: Regression, classification, ranking.
- Unsupervised Learning: Clustering, topic modeling.
- Collaborative Filtering: Recommendation systems.
- Dimensionality Reduction: Principal Component Analysis (PCA).
In summary, Apache Spark MLlib is a powerful library for scalable machine learning on big data. It provides a rich set of algorithms, integration with Apache Spark, and ease of use through high-level APIs, making it suitable for a wide range of machine learning applications.
Spark MLlib Cognitive Class Certification Answers
Module 1 – Spark MLlib Data Types Quiz Answers
Question 1: Sparse Data generally contains many non-zero values, and few zero values.
- True
- False
Question 2: Local matrices are generally stored in distributed systems and rarely on single machines.
- True
- False
Question 3: Which of the following are distributed matrices?
- Row Matrix
- Column Matrix
- Coordinate Matrix
- Spherical Matrix
- Row Matrix and Coordinate Matrix
- All of the Above
Module 2 – Review of Algorithms Quiz Answers
Question 1: Logistic Regression is an algorithm used for predicting numerical values.
- True
- False
Question 2: The SVM algorithm maximizes the margins between the generated hyperplane and two clusters of data.
- True
- False
Question 3: Which of the following is true about Gaussian Mixture Clustering?
- The closer a data point is to a particular centroid, the more likely that data point is to be clustered with that centroid.
- The Gaussian of a centroid determines the probability that a data point is clustered with that centroid.
- The probability of a data point being clustered with a centroid is a function of distance from the point to the centroid.
- Gaussian Mixture Clustering uses multiple centroids to cluster data points.
- All of the Above
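The statements above all describe soft assignment: each component's weight and its Gaussian density jointly determine a point's membership probabilities. A minimal 1-D sketch, where the function names and the `(weight, mean, sd)` tuples are illustrative rather than the MLlib API:

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def responsibilities(x, components):
    """Soft membership: weight * density per component, normalized to sum to 1.
    `components` is a list of (weight, mean, sd) tuples."""
    scores = [w * gaussian_pdf(x, m, s) for w, m, s in components]
    total = sum(scores)
    return [s / total for s in scores]

# A point near the first centroid receives almost all of the membership mass.
probs = responsibilities(0.2, [(0.5, 0.0, 1.0), (0.5, 5.0, 1.0)])
```

Because the scores are normalized, the memberships for any point always sum to 1, and distance from a centroid enters through the Gaussian density.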
Module 3 – Spark MLlib Decision Trees and Random Forests Quiz Answers
Question 1: Which of the following is a stopping parameter in a Decision Tree?
- The number of nodes in the tree reaches a specific value.
- The depth of the tree reaches a specific value.
- The breadth of the tree reaches a specific value.
- All of the Above
Question 2: When using a regression type of Decision Tree or Random Forest, the value for impurity can be measured as either ‘entropy’ or ‘variance’.
- True
- False
Question 3: In a Random Forest, featureSubsetStrategy is considered a stopping parameter, but not a tunable parameter.
- True
- False
Module 4 – Spark MLlib Clustering Quiz Answers
Question 1: In Spark MLlib, the initialization mode for the K-Means training method is called
- k-means–
- k-means++
- k-means||
- k-means
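Spark's k-means|| initializer is a parallelized variant of the k-means++ seeding rule: after the first random centroid, each subsequent centroid is drawn with probability proportional to its squared distance from the nearest centroid chosen so far. A pure-Python sketch of that rule on 1-D points (the function name and data are illustrative, not Spark code):

```python
import random

def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centroids with the k-means++ rule: first at random,
    then each next point weighted by squared distance to the nearest
    centroid already chosen."""
    rng = random.Random(seed)
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        d2 = [min((p - c) ** 2 for c in centroids) for p in points]
        r = rng.uniform(0, sum(d2))
        cum = 0.0
        for p, w in zip(points, d2):
            cum += w
            if cum >= r:
                centroids.append(p)
                break
    return centroids

# Two tight 1-D clusters: the distance weighting makes a centroid from the
# far cluster overwhelmingly likely to be picked second.
cents = kmeans_pp_init([0.0, 0.1, 10.0, 10.1], 2)
```

k-means|| obtains the same effect in a distributed setting by oversampling candidates in parallel over several passes instead of one point at a time.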
Question 2: In K-Means, the “runs” parameter determines the number of data points allowed in each cluster.
- True
- False
Question 3: In Gaussian Mixture Clustering, the sum of all values outputted from the “weights” function must equal 1.
- True
- False
Spark MLlib Final Exam Answers
Question 1: In Gaussian Mixture Clustering, the predictSoft function provides membership values from the top three Gaussians only.
- True
- False
Question 2: In Decision Trees, what is true about the size of a dataset?
- Large datasets create “bins” on splits, which can be specified with the maxBins parameter.
- Large datasets sort feature values, then use the ordered values as split calculations.
- Small datasets create split candidates based on quantile calculations on a sample of the data.
- Small datasets split on random values for the feature.
Question 3: A Logistic Regression algorithm is ineffective as a binary response predictor.
- True
- False
Question 4: What is the Row Pointer for a Matrix with the following Row Indices: [5, 1 | 6 | 2, 8, 10]
- [1, 6]
- [0, 2, 3, 6]
- [0, 2, 3, 5]
- [2, 3]
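In compressed sparse row (CSR) storage, the row pointer records the cumulative count of stored entries at each row boundary, so it has one more entry than there are rows. Reading the grouped indices in the question as three rows holding 2, 1, and 3 entries, the pointer can be built with a short sketch (the function name is illustrative):

```python
def row_pointer(rows):
    """Build a CSR-style row pointer: ptr[i] is the number of stored
    entries that appear before row i."""
    ptr = [0]
    for row in rows:
        ptr.append(ptr[-1] + len(row))
    return ptr

# Rows holding entries [5, 1], [6], and [2, 8, 10]:
print(row_pointer([[5, 1], [6], [2, 8, 10]]))  # -> [0, 2, 3, 6]
```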
Question 5: For multiclass classification, try to use (M-1) Decision Tree split candidates whenever possible.
- True
- False
Question 6: In a Decision Tree, choosing a very large maxDepth value can:
- Increase accuracy
- Increase the risk of overfitting to the training set
- Increase the cost of training
- All of the Above
- Increase the risk of overfitting and increase the cost of training
Question 7: In Gaussian Mixture Clustering, a large value returned from the weights function represents a large precedence of that Gaussian.
- True
- False
Question 8: Increasing the value of epsilon when creating the K-Means Clustering model can:
- Decrease training cost and decrease the number of iterations that the model undergoes
- Decrease training cost and increase the number of iterations that the model undergoes
- Increase training cost and decrease the number of iterations that the model undergoes
- Increase training cost and increase the number of iterations that the model undergoes
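Epsilon acts as a convergence threshold: training stops once the centroids move by less than epsilon between iterations, so loosening it halts the loop sooner. A toy 1-D illustration of that stopping rule (not Spark code; the function and step size are made up for the example):

```python
def iterations_until_converged(start, target, step, epsilon):
    """Move a centroid a fraction `step` of the way toward `target` each
    iteration; stop when the movement drops below epsilon, as in the
    k-means convergence check. Returns the iteration count."""
    centroid, iters = start, 0
    while True:
        new = centroid + step * (target - centroid)
        iters += 1
        if abs(new - centroid) < epsilon:
            return iters
        centroid = new

loose = iterations_until_converged(0.0, 1.0, 0.5, 0.1)    # stops after 4 iterations
tight = iterations_until_converged(0.0, 1.0, 0.5, 0.001)  # stops after 10 iterations
```

The larger epsilon finishes in fewer iterations, which is why increasing it decreases training cost.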
Question 9: In order to train a machine learning model in Spark MLlib, the dataset must be in the form of a(n)
- Python List
- Textfile
- CSV file
- RDD
Question 10: What is true about Dense and Sparse Vectors?
- A Dense Vector can be created using a csc_matrix, and a Sparse Vector can be created using a Python List.
- A Dense Vector can be created using a SciPy csc_matrix, and a Sparse Vector can be created using a SciPy NumPy Array.
- A Dense Vector can be created using a Python List, and a Sparse Vector can be created using a SciPy csc_matrix.
- A Dense Vector can be created using a SciPy NumPy Array, and a Sparse Vector can be created using a Python List.
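In MLlib's RDD-based API, `Vectors.dense` accepts a plain Python list, while a sparse vector is described by a `(size, indices, values)` triple via `Vectors.sparse`; SciPy sparse column matrices such as `csc_matrix` are also accepted where vectors are expected. The sparse triple can be expanded by hand to see the correspondence (the helper name is illustrative, not an MLlib function):

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse description, the same shape
    Vectors.sparse(...) takes, into a dense list of floats."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

print(sparse_to_dense(4, [0, 3], [1.0, 5.0]))  # -> [1.0, 0.0, 0.0, 5.0]
```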
Question 11: In a Decision Tree, increasing the maxBins parameter allows for more splitting candidates.
- True
- False
Question 12: In classification models, the value for the numClasses parameter does not depend on the data, and can change to increase model accuracy.
- True
- False
Question 13: What is true about Labeled Points?
- A – A labeled point is used with supervised machine learning, and can be made using a dense local vector.
- B – A labeled point is used with unsupervised machine learning, and can be made using a dense local vector.
- C – A labeled point is used with supervised machine learning, and can be made using a sparse local vector.
- D – A labeled point is used with unsupervised machine learning, and can be made using a sparse local vector.
- All of the Above
- A and C only
Question 14: In the Gaussian Mixture Clustering model, the convergenceTol value is a stopping parameter that can be tuned, similar to epsilon in k-means clustering.
- True
- False
Question 15: In Gaussian Mixture Clustering, the “Gaussians” function outputs the coordinates of the largest Gaussian, as well as the standard deviation for each Gaussian in the mixture.
- True
- False
Question 16: What is true about the maxDepth parameter for Random Forests?
- A large maxDepth value is preferred since tree averaging yields a decrease in overall bias.
- A large maxDepth value is preferred since tree averaging yields a decrease in overall variance.
- A large maxDepth value is preferred since tree averaging yields an increase in overall bias.
- A large maxDepth value is preferred since tree averaging yields an increase in overall variance.