Spark MLlib Cognitive Class Exam Answers

by IndiaSuccessStories

Introduction to Spark MLlib

Apache Spark MLlib (Machine Learning Library) is the scalable machine learning library of Apache Spark, an open-source cluster computing system. MLlib provides machine learning algorithms and utilities that simplify building machine learning pipelines.

Key Features of Spark MLlib:

  1. Scalability: MLlib is designed to scale out across a cluster as datasets grow. It builds on the distributed computing capabilities of Apache Spark, making it suitable for big data analytics.
  2. Rich set of Algorithms: MLlib includes a wide range of algorithms and tools for:
    • Classification
    • Regression
    • Clustering
    • Collaborative filtering
    • Dimensionality reduction
    • Feature extraction, transformation, and selection
  3. Integration with Spark: MLlib seamlessly integrates with other components of Apache Spark, such as Spark SQL for data manipulation and Spark Streaming for real-time data processing. This integration allows for building end-to-end data pipelines.
  4. Ease of Use: It provides high-level APIs in Python, Scala, Java, and R, making it accessible to a broad audience of developers and data scientists. These APIs abstract away the complexities of distributed computing, allowing users to focus on building and deploying machine learning models.
  5. Pipeline API: MLlib includes a Pipeline API that facilitates the construction, tuning, and evaluation of machine learning workflows. This API supports feature extraction, transformation, model training, and evaluation in a sequential manner.
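
The sequential idea behind a pipeline (fit each stage, pass its output to the next stage) can be sketched in plain Python. This is a conceptual illustration only and not MLlib's API: the class names `MinMaxScalerStage`, `ThresholdClassifierStage`, and `SimplePipeline` are hypothetical, and the data is made up.

```python
class MinMaxScalerStage:
    """Hypothetical feature transformer: rescales the feature to [0, 1]."""

    def fit(self, rows):
        values = [x for x, _ in rows]
        self.lo, self.hi = min(values), max(values)
        return self

    def transform(self, rows):
        span = (self.hi - self.lo) or 1.0  # avoid division by zero
        return [((x - self.lo) / span, y) for x, y in rows]


class ThresholdClassifierStage:
    """Hypothetical model: predicts 1 above the midpoint of the class means."""

    def fit(self, rows):
        pos = [x for x, y in rows if y == 1]
        neg = [x for x, y in rows if y == 0]
        self.threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def transform(self, rows):
        # Append the prediction to each (feature, label) row.
        return [(x, y, 1 if x > self.threshold else 0) for x, y in rows]


class SimplePipeline:
    """Runs the stages in order, both when fitting and when transforming."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


# Made-up (feature, label) rows.
train_rows = [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)]
new_rows = [(3.0, 0), (7.0, 1)]

pipeline = SimplePipeline([MinMaxScalerStage(), ThresholdClassifierStage()])
pipeline.fit(train_rows)
predictions = [p for _, _, p in pipeline.transform(new_rows)]
print(predictions)
```

The point of the design is that the scaling learned on the training rows is reused unchanged on new rows, so training and prediction go through exactly the same steps.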

Components of Spark MLlib:

  • Core Algorithms: Includes fundamental algorithms such as linear regression, logistic regression, decision trees, random forests, k-means clustering, and more.
  • Feature Transformers: Tools for feature extraction (feature hashing, PCA), transformation (scaling, normalization), and selection (for example, chi-squared feature selection).
  • Pipeline: Provides a framework for constructing machine learning pipelines that orchestrate multiple stages of data processing, feature engineering, and model training.
  • Persistence: Allows models to be saved and loaded for reuse or deployment in production systems.

Supported Algorithms:

MLlib supports a variety of algorithms across different categories:

  • Supervised Learning: Regression, classification, ranking.
  • Unsupervised Learning: Clustering, topic modeling.
  • Collaborative Filtering: Recommendation systems.
  • Dimensionality Reduction: Principal Component Analysis (PCA).

In summary, Apache Spark MLlib is a powerful library for scalable machine learning on big data. It provides a rich set of algorithms, integration with Apache Spark, and ease of use through high-level APIs, making it suitable for a wide range of machine learning applications.

Spark MLlib Cognitive Class Certification Answers

Question 1: Sparse Data generally contains many non-zero values, and few zero values.

  • True
  • False
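
As a rough illustration of why sparse storage suits data that is mostly zeros, the sketch below converts a dense list into the size, indices, and values triple that a sparse vector keeps. The helper names `to_sparse` and `to_dense` are hypothetical and the data is made up; this is plain Python, not MLlib code.

```python
def to_sparse(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Rebuild the full list from the sparse triple."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


dense = [0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 1.5, 0.0]
size, indices, values = to_sparse(dense)
print(size, indices, values)  # 8 [2, 6] [3.0, 1.5]
```

Only two of the eight entries are stored here, which is where the saving comes from when most values are zero.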

Question 2: Local matrices are generally stored in distributed systems and rarely on single machines.

  • True
  • False

Question 3: Which of the following are distributed matrices?

  • Row Matrix
  • Column Matrix
  • Coordinate Matrix
  • Spherical Matrix
  • Row Matrix and Coordinate Matrix
  • All of the Above
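
A coordinate-style matrix is described by (row index, column index, value) entries. The sketch below groups such entries into rows on a single machine to show the relationship between the two layouts; in a distributed setting the entries would be spread across the cluster. The function name `entries_to_rows` is hypothetical and the entries are made up.

```python
from collections import defaultdict


def entries_to_rows(entries, num_cols):
    """Group (i, j, value) entries into one dense list per non-empty row."""
    rows = defaultdict(lambda: [0.0] * num_cols)
    for i, j, value in entries:
        rows[i][j] = value
    return dict(sorted(rows.items()))


entries = [(0, 1, 2.0), (2, 0, 5.0), (2, 2, 1.0)]
rows = entries_to_rows(entries, num_cols=3)
print(rows)
```

Row 1 has no entries, so it does not appear in the result at all, which is why the coordinate layout is a good fit for very sparse matrices.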

Question 1: Logistic Regression is an algorithm used for predicting numerical values.

  • True
  • False

Question 2: The SVM algorithm maximizes the margins between the generated hyperplane and two clusters of data.

  • True
  • False

Question 3: Which of the following is true about Gaussian Mixture Clustering?

  • The closer a data point is to a particular centroid, the more likely that data point is to be clustered with that centroid.
  • The Gaussian of a centroid determines the probability that a data point is clustered with that centroid.
  • The probability of a data point being clustered with a centroid is a function of distance from the point to the centroid.
  • Gaussian Mixture Clustering uses multiple centroids to cluster data points.
  • All of the Above

Question 1: Which of the following is a stopping parameter in a Decision Tree?

  • The number of nodes in the tree reaches a specific value.
  • The depth of the tree reaches a specific value.
  • The breadth of the tree reaches a specific value.
  • All of the Above

Question 2: When using a regression type of Decision Tree or Random Forest, the value for impurity can be measured as either ‘entropy’ or ‘variance’.

  • True
  • False
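
For reference, the three impurity measures involved can be sketched in plain Python. Gini and entropy are computed from class labels, so they apply to classification, while variance is computed from numeric targets, so it applies to regression. The function names are illustrative, and the base-2 logarithm for entropy is an assumption of this sketch.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def variance(targets):
    """Population variance of a list of numeric regression targets."""
    mean = sum(targets) / len(targets)
    return sum((t - mean) ** 2 for t in targets) / len(targets)


print(gini([0, 0, 1, 1]), entropy([0, 0, 1, 1]), variance([1.0, 3.0]))
```

A node that contains a single class has zero Gini impurity and zero entropy, and a node whose targets are all equal has zero variance.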

Question 3: In a Random Forest, featureSubsetStrategy is considered a stopping parameter, but not a tunable parameter.

  • True
  • False

Question 1: In Spark MLlib, the initialization mode for the K-Means training method is called

  • k-means--
  • k-means++
  • k-means||
  • k-means

Question 2: In K-Means, the “runs” parameter determines the number of data points allowed in each cluster.

  • True
  • False

Question 3: In Gaussian Mixture Clustering, the sum of all values outputted from the “weights” function must equal 1.

  • True
  • False

Question 1: In Gaussian Mixture Clustering, the predictSoft function provides membership values from the top three Gaussians only.

  • True
  • False
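
Soft membership in a Gaussian mixture can be sketched for the one-dimensional case as follows. Each Gaussian's weighted density at the point is normalized by the total, so every Gaussian in the mixture receives a membership value and the values sum to 1. The function names and the mixture parameters below are made up for illustration; this is not MLlib's implementation.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def soft_membership(x, weights, means, stds):
    """Membership of x in every Gaussian of the mixture."""
    scores = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical mixture of four Gaussians; the weights sum to 1.
weights = [0.5, 0.3, 0.1, 0.1]
means = [0.0, 4.0, 8.0, 12.0]
stds = [1.0, 1.0, 1.0, 1.0]

memberships = soft_membership(1.0, weights, means, stds)
print(memberships)
```

The returned list has one entry per Gaussian (four here), and the entry for the Gaussian centered nearest the point dominates.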

Question 2: In Decision Trees, what is true about the size of a dataset?

  • Large datasets create “bins” on splits, which can be specified with the maxBins parameter.
  • Large datasets sort feature values, then use the ordered values as split calculations.
  • Small datasets create split candidates based on quantile calculations on a sample of the data.
  • Small datasets split on random values for the feature.

Question 3: A Logistic Regression algorithm is ineffective as a binary response predictor.

  • True
  • False

Question 4: What is the Row Pointer for a Matrix with the following Row Indices: [5, 1 | 6 | 2, 8, 10]

  • [1, 6]
  • [0, 2, 3, 6]
  • [0, 2, 3, 5]
  • [2, 3]
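
Assuming the vertical bars in the question separate the row indices stored for each column, as in a compressed sparse column (CSC) layout, the pointer array is the running count of entries before each group, starting at 0 and ending at the total number of entries. A minimal sketch, with a hypothetical helper name:

```python
from itertools import accumulate


def pointers_from_groups(groups):
    """Offsets at which each group starts, plus the total count at the end."""
    return [0] + list(accumulate(len(g) for g in groups))


row_indices_by_column = [[5, 1], [6], [2, 8, 10]]
pointers = pointers_from_groups(row_indices_by_column)
print(pointers)  # [0, 2, 3, 6]
```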

Question 5: For multiclass classification, try to use (M-1) Decision Tree split candidates whenever possible.

  • True
  • False

Question 6: In a Decision Tree, choosing a very large maxDepth value can:

  • Increase accuracy
  • Increase the risk of overfitting to the training set
  • Increase the cost of training
  • All of the Above
  • Increase the risk of overfitting and increase the cost of training

Question 7: In Gaussian Mixture Clustering, a large value returned from the weights function represents a large precedence of that Gaussian.

  • True
  • False

Question 8: Increasing the value of epsilon when creating the K-Means Clustering model can:

  • Decrease training cost and decrease the number of iterations that the model undergoes
  • Decrease training cost and increase the number of iterations that the model undergoes
  • Increase training cost and decrease the number of iterations that the model undergoes
  • Increase training cost and increase the number of iterations that the model undergoes
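
The effect of epsilon can be seen in a small one-dimensional k-means sketch, where epsilon is treated as the distance threshold below which the centers are considered converged. The function `kmeans_1d`, the points, and the starting centers are made up for illustration and do not reproduce MLlib's implementation.

```python
def kmeans_1d(points, centers, epsilon, max_iterations=100):
    """Return (centers, iterations_used) for a simple 1-D k-means."""
    centers = list(centers)
    for iteration in range(1, max_iterations + 1):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: abs(p - centers[k]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            sum(c) / len(c) if c else centers[k] for k, c in enumerate(clusters)
        ]
        shift = max(abs(a - b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < epsilon:  # converged: no center moved by epsilon or more
            return centers, iteration
    return centers, max_iterations


points = [1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0]
start = [1.0, 2.0]

_, tight_iterations = kmeans_1d(points, start, epsilon=1e-4)
_, loose_iterations = kmeans_1d(points, start, epsilon=4.0)
print(tight_iterations, loose_iterations)
```

With the larger epsilon the loop stops earlier, so fewer iterations run and training is cheaper, at the price of centers that may not have fully settled.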

Question 9: In order to train a machine learning model in Spark MLlib, the dataset must be in the form of a(n)

  • Python List
  • Textfile
  • CSV file
  • RDD

Question 10: What is true about Dense and Sparse Vectors?

  • A Dense Vector can be created using a csc_matrix, and a Sparse Vector can be created using a Python List.
  • A Dense Vector can be created using a SciPy csc_matrix, and a Sparse Vector can be created using a SciPy NumPy Array.
  • A Dense Vector can be created using a Python List, and a Sparse Vector can be created using a SciPy csc_matrix.
  • A Dense Vector can be created using a SciPy NumPy Array, and a Sparse Vector can be created using a Python List.

Question 11: In a Decision Tree, increasing the maxBins parameter allows for more splitting candidates.

  • True
  • False

Question 12: In classification models, the value for the numClasses parameter does not depend on the data, and can change to increase model accuracy.

  • True
  • False

Question 13: What is true about Labeled Points?

  • A – A labeled point is used with supervised machine learning, and can be made using a dense local vector.
  • B – A labeled point is used with unsupervised machine learning, and can be made using a dense local vector.
  • C – A labeled point is used with supervised machine learning, and can be made using a sparse local vector.
  • D – A labeled point is used with unsupervised machine learning, and can be made using a sparse local vector.
  • All of the Above
  • A and C only

Question 14: In the Gaussian Mixture Clustering model, the convergenceTol value is a stopping parameter that can be tuned, similar to epsilon in k-means clustering.

  • True
  • False

Question 15: In Gaussian Mixture Clustering, the “Gaussians” function outputs the coordinates of the largest Gaussian, as well as the standard deviation for each Gaussian in the mixture.

  • True
  • False

Question 16: What is true about the maxDepth parameter for Random Forests?

  • A large maxDepth value is preferred since tree averaging yields a decrease in overall bias.
  • A large maxDepth value is preferred since tree averaging yields a decrease in overall variance.
  • A large maxDepth value is preferred since tree averaging yields an increase in overall bias.
  • A large maxDepth value is preferred since tree averaging yields an increase in overall variance.
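
The effect of averaging on variance can be illustrated with a small simulation. It is a simplified sketch: each "tree" is modeled as an independent noisy estimate of the same value, whereas the trees in a real forest are correlated, so the reduction there is smaller. All names and numbers are made up.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def noisy_estimate(true_value=10.0, noise=2.0):
    """One high-variance estimator, standing in for a single deep tree."""
    return true_value + random.gauss(0.0, noise)


def averaged_estimate(num_estimators):
    """Average of several estimators, standing in for a forest."""
    return statistics.fmean([noisy_estimate() for _ in range(num_estimators)])


single = [noisy_estimate() for _ in range(2000)]
averaged = [averaged_estimate(25) for _ in range(2000)]

var_single = statistics.pvariance(single)
var_averaged = statistics.pvariance(averaged)
print(var_single, var_averaged)
```

The averaged estimates scatter far less than the individual ones, which is the reason deep trees with high variance are tolerable inside a forest.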

Copyright © 2024 Indian Success Stories. All rights reserved.