Spark MLlib Cognitive Class Exam Answers

Introduction to Spark MLlib

Apache Spark MLlib (Machine Learning Library) is a scalable machine learning library provided by Apache Spark, which is an open-source cluster computing system. MLlib provides various machine learning algorithms and utilities that simplify the process of building machine learning pipelines.

Key Features of Spark MLlib:

Scalability: MLlib is designed to scale out to handle large-scale data processing. It leverages the distributed computing capabilities of Apache Spark, making it suitable for big data analytics.
Rich set of Algorithms: MLlib includes a wide range of algorithms and tools for:
- Classification
- Regression
- Clustering
- Collaborative filtering
- Dimensionality reduction
- Feature extraction, transformation, and selection
Integration with Spark: MLlib seamlessly integrates with other components of Apache Spark, such as Spark SQL for data manipulation and Spark Streaming for real-time data processing. This integration allows for building end-to-end data pipelines.
Ease of Use: It provides high-level APIs in Python, Scala, Java, and R, making it accessible to a broad audience of developers and data scientists. These APIs abstract away the complexities of distributed computing, allowing users to focus on building and deploying machine learning models.
Pipeline API: MLlib includes a Pipeline API that facilitates the construction, tuning, and evaluation of machine learning workflows. This API supports feature extraction, transformation, model training, and evaluation in a sequential manner.
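
The sequential idea behind a pipeline can be illustrated without Spark. The following is a minimal pure-Python sketch, not MLlib's Pipeline API: the class names `MeanCenterer`, `MaxScaler`, and `SimplePipeline` are hypothetical, and each stage is assumed to expose `fit` and `transform`.

```python
class MeanCenterer:
    """Hypothetical stage: subtracts the mean learned during fit."""

    def fit(self, values):
        self.mean = sum(values) / len(values)
        return self

    def transform(self, values):
        return [v - self.mean for v in values]


class MaxScaler:
    """Hypothetical stage: divides by the largest absolute value seen in fit."""

    def fit(self, values):
        # Fall back to 1.0 so an all-zero input does not divide by zero.
        self.scale = max(abs(v) for v in values) or 1.0
        return self

    def transform(self, values):
        return [v / self.scale for v in values]


class SimplePipeline:
    """Hypothetical pipeline: fits and applies stages one after another."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, values):
        # Each stage is fitted on the output of the previous stage.
        for stage in self.stages:
            values = stage.fit(values).transform(values)
        return self

    def transform(self, values):
        for stage in self.stages:
            values = stage.transform(values)
        return values


pipeline = SimplePipeline([MeanCenterer(), MaxScaler()]).fit([2.0, 4.0, 6.0])
scaled = pipeline.transform([2.0, 4.0, 6.0])
```

Fitting each stage on the previous stage's output is the design choice that lets later stages depend on earlier transformations.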

Components of Spark MLlib:

Core Algorithms: Includes fundamental algorithms such as linear regression, logistic regression, decision trees, random forests, k-means clustering, and more.
Feature Transformers: Tools for feature extraction (feature hashing, PCA), transformation (scaling, normalization), and selection (for example, chi-squared feature selection).
Pipeline: Provides a framework for constructing machine learning pipelines that orchestrate multiple stages of data processing, feature engineering, and model training.
Persistence: Allows models to be saved and loaded for reuse or deployment in production systems.

Supported Algorithms:

MLlib supports a variety of algorithms across different categories:

Supervised Learning: Regression, classification, ranking.
Unsupervised Learning: Clustering, topic modeling.
Collaborative Filtering: Recommendation systems.
Dimensionality Reduction: Principal Component Analysis (PCA).

In summary, Apache Spark MLlib is a powerful library for scalable machine learning on big data. It provides a rich set of algorithms, integration with Apache Spark, and ease of use through high-level APIs, making it suitable for a wide range of machine learning applications.

Spark MLlib Cognitive Class Certification Answers

Module 1 – Spark MLlib Data Types Quiz Answers

Question 1: Sparse Data generally contains many non-zero values, and few zero values.

True
False
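
To illustrate why sparse storage pays off when most entries are zero, here is a minimal pure-Python sketch. The helper `to_sparse` is hypothetical; it mirrors the common (size, indices, values) layout and does not call any MLlib class.

```python
def to_sparse(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


# Hypothetical example data: eight entries, only two of them non-zero.
dense = [0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 7.5, 0.0]
size, indices, values = to_sparse(dense)
```

Here the sparse form stores two indices and two values instead of eight numbers, and the saving grows with the share of zeros.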

Question 2: Local matrices are generally stored in distributed systems and rarely on single machines.

True
False

Question 3: Which of the following are distributed matrices?

Row Matrix
Column Matrix
Coordinate Matrix
Spherical Matrix
Row Matrix and Coordinate Matrix
All of the Above

Module 2 – Review Algorithms Quiz Answers

Question 1: Logistic Regression is an algorithm used for predicting numerical values.

True
False
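
For context, logistic regression passes a linear score through the sigmoid function to obtain a probability, which is then thresholded into a class label. A minimal sketch, assuming hypothetical weights, bias, and a 0.5 threshold (none of these come from the course):

```python
import math


def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(features, weights, bias, threshold=0.5):
    """Return (class label, probability) for one feature vector."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(score)
    return (1 if probability >= threshold else 0), probability


# Hypothetical model parameters, chosen only for illustration.
label, probability = predict_label([2.0, 1.0], weights=[1.5, -0.5], bias=-1.0)
```

The output of interest is a class label, with the probability as an intermediate value.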

Question 2: The SVM algorithm maximizes the margins between the generated hyperplane and two clusters of data.

True
False

Question 3: Which of the following is true about Gaussian Mixture Clustering?

The closer a data point is to a particular centroid, the more likely that data point is to be clustered with that centroid.
The Gaussian of a centroid determines the probability that a data point is clustered with that centroid.
The probability of a data point being clustered with a centroid is a function of distance from the point to the centroid.
Gaussian Mixture Clustering uses multiple centroids to cluster data points.
All of the Above

Module 3 – Spark MLlib Decision Trees and Random Forests Quiz Answers

Question 1: Which of the following is a stopping parameter in a Decision Tree?

The number of nodes in the tree reaches a specific value.
The depth of the tree reaches a specific value.
The breadth of the tree reaches a specific value.
All of the Above

Question 2: When using a regression type of Decision Tree or Random Forest, the value for impurity can be measured as either ‘entropy’ or ‘variance’.

True
False
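
The impurity measures can be computed by hand. MLlib's documentation lists Gini and entropy for classification trees and variance for regression trees. The sketch below uses plain Python with hypothetical function names and example data:

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def variance(targets):
    """Population variance of a list of numeric regression targets."""
    mean = sum(targets) / len(targets)
    return sum((t - mean) ** 2 for t in targets) / len(targets)


class_labels = [0, 0, 1, 1]       # hypothetical classification labels
targets = [1.0, 2.0, 3.0, 4.0]    # hypothetical regression targets

label_entropy = entropy(class_labels)
label_gini = gini(class_labels)
target_variance = variance(targets)
```

Entropy and Gini both need discrete class counts, which is why they do not apply to continuous regression targets.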

Question 3: In a Random Forest, featureSubsetStrategy is considered a stopping parameter, but not a tunable parameter.

True
False

Module 4 – Spark MLlib Clustering Quiz Answers

Question 1: In Spark MLlib, the initialization mode for the K-Means training method is called

k-means--
k-means++
k-means||
k-means

Question 2: In K-Means, the “runs” parameter determines the number of data points allowed in each cluster.

True
False

Question 3: In Gaussian Mixture Clustering, the sum of all values outputted from the “weights” function must equal 1.

True
False

Spark MLlib Final Exam Answers

Question 1: In Gaussian Mixture Clustering, the predictSoft function provides membership values from the top three Gaussians only.

True
False
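
Soft assignment in a Gaussian mixture yields a membership value for every component in the mixture. A minimal 1-D sketch, with hypothetical weights, means, and standard deviations, and not using MLlib itself:

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def soft_membership(x, weights, means, stds):
    """Return one membership probability per Gaussian; the values sum to 1."""
    weighted = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(weighted)
    return [v / total for v in weighted]


# Hypothetical four-component mixture; the mixing weights sum to 1.
weights = [0.5, 0.3, 0.1, 0.1]
means = [0.0, 5.0, 10.0, 15.0]
stds = [1.0, 1.0, 2.0, 2.0]

memberships = soft_membership(1.0, weights, means, stds)
```

The result has four entries, one per Gaussian, and the point at 1.0 gets its largest membership from the component centered at 0.0.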

Question 2: In Decision Trees, what is true about the size of a dataset?

Large datasets create “bins” on splits, which can be specified with the maxBins parameter.
Large datasets sort feature values, then use the ordered values as split calculations.
Small datasets create split candidates based on quantile calculations on a sample of the data.
Small datasets split on random values for the feature.

Question 3: A Logistic Regression algorithm is ineffective as a binary response predictor.

True
False

Question 4: What is the Row Pointer for a Matrix with the following Row Indices: [5, 1 | 6 | 2, 8, 10]

[1, 6]
[0, 2, 3, 6]
[0, 2, 3, 5]
[2, 3]
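
The row-pointer array follows from the row grouping in the question: it starts at 0 and then adds the number of stored entries in each row. A minimal sketch, assuming the `|` separators mark row boundaries and using a hypothetical helper name:

```python
def row_pointer(rows):
    """Build the cumulative entry count that marks where each row starts."""
    pointer = [0]
    for row in rows:
        pointer.append(pointer[-1] + len(row))
    return pointer


# The three rows from the question: [5, 1 | 6 | 2, 8, 10]
rows = [[5, 1], [6], [2, 8, 10]]
pointer = row_pointer(rows)  # [0, 2, 3, 6]
```

The last entry always equals the total number of stored values, here 6.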

Question 5: For multiclass classification, try to use (M-1) Decision Tree split candidates whenever possible.

True
False

Question 6: In a Decision Tree, choosing a very large maxDepth value can:

Increase accuracy
Increase the risk of overfitting to the training set
Increase the cost of training
All of the Above
Increase the risk of overfitting and increase the cost of training

Question 7: In Gaussian Mixture Clustering, a large value returned from the weights function represents a large precedence of that Gaussian.

True
False

Question 8: Increasing the value of epsilon when creating the K-Means Clustering model can:

Decrease training cost and decrease the number of iterations that the model undergoes
Decrease training cost and increase the number of iterations that the model undergoes
Increase training cost and decrease the number of iterations that the model undergoes
Increase training cost and increase the number of iterations that the model undergoes
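
The role of a convergence tolerance can be seen in a small 1-D k-means loop. This is a pure-Python sketch with hypothetical data and starting centers, not MLlib's implementation; it assumes training stops once no center moves by more than `epsilon`.

```python
def kmeans_1d(points, centers, epsilon, max_iterations=100):
    """Run 1-D k-means and return (final centers, iterations used)."""
    centers = list(centers)
    for iteration in range(1, max_iterations + 1):
        # Assign every point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Recompute centers; an empty cluster keeps its old center.
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        shift = max(abs(a - b) for a, b in zip(new_centers, centers))
        centers = new_centers
        if shift <= epsilon:
            return centers, iteration
    return centers, max_iterations


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 20.0]
start = [0.0, 1.0]

_, tight_iterations = kmeans_1d(points, start, epsilon=1e-6)
_, loose_iterations = kmeans_1d(points, start, epsilon=5.0)
```

With the same data and starting centers, the looser tolerance can never need more iterations than the tighter one, because any center shift small enough for the tight threshold also satisfies the loose one.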

Question 9: In order to train a machine learning model in Spark MLlib, the dataset must be in the form of a(n)

Python List
Textfile
CSV file
RDD

Question 10: What is true about Dense and Sparse Vectors?

A Dense Vector can be created using a csc_matrix, and a Sparse Vector can be created using a Python List.
A Dense Vector can be created using a SciPy csc_matrix, and a Sparse Vector can be created using a SciPy NumPy Array.
A Dense Vector can be created using a Python List, and a Sparse Vector can be created using a SciPy csc_matrix.
A Dense Vector can be created using a SciPy NumPy Array, and a Sparse Vector can be created using a Python List.

Question 11: In a Decision Tree, increasing the maxBins parameter allows for more splitting candidates.

True
False

Question 12: In classification models, the value for the numClasses parameter does not depend on the data, and can change to increase model accuracy.

True
False

Question 13: What is true about Labeled Points?

A – A labeled point is used with supervised machine learning, and can be made using a dense local vector.
B – A labeled point is used with unsupervised machine learning, and can be made using a dense local vector.
C – A labeled point is used with supervised machine learning, and can be made using a sparse local vector.
D – A labeled point is used with unsupervised machine learning, and can be made using a sparse local vector.
All of the Above
A and C only

Question 14: In the Gaussian Mixture Clustering model, the convergenceTol value is a stopping parameter that can be tuned, similar to epsilon in k-means clustering.

True
False

Question 15: In Gaussian Mixture Clustering, the “Gaussians” function outputs the coordinates of the largest Gaussian, as well as the standard deviation for each Gaussian in the mixture.

True
False

Question 16: What is true about the maxDepth parameter for Random Forests?

A large maxDepth value is preferred since tree averaging yields a decrease in overall bias.
A large maxDepth value is preferred since tree averaging yields a decrease in overall variance.
A large maxDepth value is preferred since tree averaging yields an increase in overall bias.
A large maxDepth value is preferred since tree averaging yields an increase in overall variance.