Introduction to Data Science with Scala
Scala runs on the Java Virtual Machine (JVM) and seamlessly interoperates with Java libraries, making it a versatile choice for data science applications. It offers concise syntax, immutability by default, and strong static typing, which are beneficial for writing robust and scalable data processing code.
Tools and Libraries
- Apache Spark: Scala is the primary language of Apache Spark, a fast, general-purpose cluster computing system. Spark’s RDD (Resilient Distributed Dataset) abstraction makes it well suited to large-scale data processing tasks such as ETL, machine learning, and real-time analytics (see the RDD sketch after this list).
- Breeze: Breeze is a numerical processing library for Scala that provides support for linear algebra, numerical computing, and signal processing. It’s particularly useful for implementing algorithms in data science workflows.
- ScalaNLP: ScalaNLP is a library for natural language processing tasks in Scala. It includes tools for tokenization, stemming, part-of-speech tagging, and other NLP tasks, making it valuable for text mining and sentiment analysis.
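As a quick illustration of the RDD abstraction mentioned above, here is a minimal, self-contained sketch; the log lines and app name are hypothetical:
import org.apache.spark.{SparkConf, SparkContext}
object RddSketch {
  def main(args: Array[String]): Unit = {
    // Local SparkContext for illustration only.
    val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // An RDD is an immutable, partitioned collection distributed across the cluster.
    val lines = sc.parallelize(Seq("INFO start", "ERROR disk", "INFO done", "ERROR net"))
    // A typical ETL-style transformation: filter, map to key/value pairs, aggregate.
    val errorCounts = lines
      .filter(_.startsWith("ERROR"))
      .map(line => (line.split(" ")(1), 1))
      .reduceByKey(_ + _)
    errorCounts.collect().foreach(println) // e.g. (disk,1), (net,1)
    sc.stop()
  }
}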
Functional Programming Paradigm
Scala promotes a functional programming style, which emphasizes immutability, higher-order functions, and declarative code. This paradigm is well-suited for data transformation tasks and enhances code readability and maintainability.
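For instance, here is a minimal sketch of this style using plain Scala collections; the Reading type and values are hypothetical:
case class Reading(sensor: String, value: Double)
val readings = List(Reading("a", 1.0), Reading("a", 3.0), Reading("b", 2.0))
// Immutable data plus higher-order functions: group, aggregate, and filter
// declaratively, without mutating any state.
val meansAbove: Map[String, Double] = readings
  .groupBy(_.sensor)
  .map { case (s, rs) => s -> rs.map(_.value).sum / rs.size }
  .filter { case (_, mean) => mean > 1.5 }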
Data Science Workflow in Scala
- Data Collection and Cleaning: Use libraries like Apache Spark for data ingestion, cleaning, and preprocessing. Scala’s strong type system helps catch errors at compile-time, ensuring cleaner data pipelines.
- Exploratory Data Analysis (EDA): Leverage Scala libraries such as Breeze and data visualization tools like Apache Zeppelin or Jupyter notebooks (using Scala kernels) for exploratory data analysis.
- Machine Learning: Implement machine learning algorithms using Spark MLlib, which provides scalable implementations of popular techniques such as classification, regression, clustering, and collaborative filtering (see the sketch after this list).
- Model Deployment: Scala can be used to deploy models into production, leveraging frameworks like Akka for building reactive, distributed systems.
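A minimal sketch of the machine learning step above; the column names and toy data are hypothetical:
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.appName("mllib-sketch").master("local[*]").getOrCreate()
import spark.implicits._
// Toy training data: two feature columns and a binary label.
val df = Seq((1.0, 0.5, 1.0), (0.0, 2.0, 0.0), (1.5, 0.3, 1.0)).toDF("f1", "f2", "label")
// Assemble raw columns into the single vector column MLlib expects.
val assembler = new VectorAssembler().setInputCols(Array("f1", "f2")).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val model = lr.fit(assembler.transform(df))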
By leveraging Scala’s strengths in functional programming and its ecosystem of powerful libraries like Spark, Breeze, and ScalaNLP, you can effectively tackle various challenges in data science—from data preprocessing to building and deploying machine learning models.
Data Science with Scala Cognitive Class Certification Answers
Module 1 – Basic Statistics and Data Types Quiz Answers
Question 1: You import MLlib’s vectors from?
- org.apache.spark.mllib.TF
- org.apache.spark.mllib.numpy
- org.apache.spark.mllib.linalg
- org.apache.spark.mllib.pandas
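For context, the import in question is used like this (a minimal sketch):
import org.apache.spark.mllib.linalg.{Vector, Vectors}
// A dense vector stores every entry; a sparse vector stores only indices and values.
val dense: Vector = Vectors.dense(1.0, 0.0, 3.0)
val sparse: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))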
Question 2: Select the types of distributed Matrices:
- Row Matrix
- Indexed Row Matrix
- Coordinate Matrix
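A brief sketch of two of these types; it assumes an existing SparkContext sc:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry, RowMatrix}
// RowMatrix: rows without meaningful indices, the most basic distributed matrix.
val rowMat = new RowMatrix(sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0))))
// CoordinateMatrix: (row, column, value) entries, useful for very sparse data.
val coordMat = new CoordinateMatrix(sc.parallelize(Seq(MatrixEntry(0L, 1L, 2.0))))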
Question 3: How would you calculate the mean of the following?
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.rdd.RDD
val observations: RDD[Vector] = sc.parallelize(Array(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(4.0, 5.0),
  Vectors.dense(7.0, 8.0)))
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
- summary.normL1
- summary.numNonzeros
- summary.mean
- summary.normL2
Question 4: what task does the following lines of code?
import org.apache.spark.mllib.random.RandomRDDs._
val million = poissonRDD(sc, mean=1.0, size=1000000L, numPartitions=10)
- calculate the variance
- calculate the mean
- generate random samples
Question 5: MLlib uses the compressed sparse column (CSC) format for sparse matrices; as such, it only keeps the non-zero entries?
- True
- False
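A short sketch of the CSC layout, using the values from the standard Spark documentation example:
import org.apache.spark.mllib.linalg.Matrices
// CSC stores column pointers, row indices, and only the non-zero values.
// This builds the 3x2 matrix [[9.0, 0.0], [0.0, 8.0], [0.0, 6.0]].
val sm = Matrices.sparse(3, 2, Array(0, 1, 3), Array(0, 2, 1), Array(9.0, 6.0, 8.0))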
Module 2 – Preparing Data Quiz Answers
Question 1: For a DataFrame object, the method describe() calculates the:
- count
- mean
- standard deviation
- max
- min
- all of the above
Question 2: What line of code drops the rows that contain null values? Select the best answer.
- val dfnan = df.withColumn("nanUniform", halfToNaN(df("uniform")))
- dfnan.na.replace("uniform", Map(Double.NaN -> 0.0))
- dfnan.na.drop(minNonNulls = 3)
- dfnan.na.fill(0.0)
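For context, a hedged sketch of these DataFrame NA functions, where dfnan is the DataFrame from the question:
// Keep only the rows with at least 3 non-null (and non-NaN) values.
val dropped = dfnan.na.drop(minNonNulls = 3)
// The other options replace values rather than drop rows:
val filled = dfnan.na.fill(0.0) // replace nulls/NaNs with 0.0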
Question 3: What task do the following lines of code perform?
import org.apache.spark.ml.classification.LogisticRegression
val lr = new LogisticRegression()
lr.setMaxIter(10).setRegParam(0.01)
val model1 = lr.fit(training)
- perform one-hot encoding
- Train a linear regression model
- Train a logistic regression model
- Perform PCA on the data
Question 4: The StandardScalerModel transforms the data such that:
- each feature has a max value of 1
- each feature is Orthogonal
- each feature has unit standard deviation and zero mean
- each feature has a min value of -1
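A minimal sketch of the scaler; it assumes a DataFrame df with a vector column "features":
import org.apache.spark.ml.feature.StandardScaler
// withMean = true centers each feature to zero mean;
// withStd = true scales each feature to unit standard deviation.
val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaled")
  .setWithMean(true)
  .setWithStd(true)
val scaled = scaler.fit(df).transform(df)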
Module 3 – Feature Engineering Quiz Answers
Question 1: Spark ML works with?
- tensors
- vectors
- dataframes
- lists
Question 2: The function IndexToString() performs one-hot encoding?
- True
- False
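A short sketch contrasting the two transformers; the column names are hypothetical:
import org.apache.spark.ml.feature.{IndexToString, OneHotEncoder}
// IndexToString maps label indices back to their original string labels;
// it does not one-hot encode anything.
val converter = new IndexToString().setInputCol("labelIndex").setOutputCol("labelStr")
// One-hot encoding is done by OneHotEncoder, which emits sparse indicator vectors.
val encoder = new OneHotEncoder().setInputCol("labelIndex").setOutputCol("labelVec")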
Question 3: Principal Component Analysis is primarily used for:
- to convert categorical variables to integers
- to predict discrete values
- dimensionality reduction
Question 4: One important step prior to using PCA is:
- normalizing your data
- making sure every feature is not correlated
- taking the log for your data
- subtracting the mean
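A minimal sketch: center the data first, then fit PCA; it assumes a DataFrame scaledDf with a centered vector column "scaled":
import org.apache.spark.ml.feature.PCA
// PCA assumes zero-mean features; subtract the mean first
// (e.g., StandardScaler with setWithMean(true)).
val pca = new PCA().setInputCol("scaled").setOutputCol("pcaFeatures").setK(2)
val pcaModel = pca.fit(scaledDf)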
Module 4 – Fitting a Model Quiz Answers
Question 1: You can use decision trees for?
- regression
- classification
- classification and regression
- data normalization
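Both variants exist in Spark ML, as this minimal sketch shows; the column names are hypothetical:
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.regression.DecisionTreeRegressor
// Decision trees handle both tasks: a classifier for discrete labels
// and a regressor for continuous targets.
val dtc = new DecisionTreeClassifier().setLabelCol("label").setFeaturesCol("features")
val dtr = new DecisionTreeRegressor().setLabelCol("target").setFeaturesCol("features")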
Question 2: The following line of code: val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
- split the data into training and testing data
- train the model
- use 70% of the data for testing
- use 30% of the data for training
- make a prediction
Question 3: In the Random Forest Classifier constructor, .setNumTrees():
- sets the max depth of trees
- sets the minimum number of classes before a split
- sets the number of trees
Question 4: Elastic net regularization uses?
- L0-norm
- L1-norm
- L2-norm
- a convex combination of the L1 norm and L2 norm
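For context, a minimal sketch of how the blend is set in Spark ML:
import org.apache.spark.ml.regression.LinearRegression
// elasticNetParam blends the penalties: 0.0 is pure L2 (ridge),
// 1.0 is pure L1 (lasso), and values in between are a convex combination.
val enet = new LinearRegression().setRegParam(0.1).setElasticNetParam(0.5)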
Module 5 – Pipeline and Grid Search Quiz Answers
Question 1: What task does the following code perform: withColumn("paperscore", data("A2") * 4 + data("A") * 3)?
- add 4 columns to A2
- add 3 columns to A1
- add 4 to each element in column A2
- assign a higher weight to A2 and A journals
Question 2: In an estimator:
- there is no need to call the method fit
- the fit function is called
- only the transform function is called
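A brief sketch of the estimator/transformer distinction; trainingDf and testDf are hypothetical DataFrames:
import org.apache.spark.ml.classification.LogisticRegression
// An estimator implements fit() and produces a model, which is a transformer.
val lrModel = new LogisticRegression().fit(trainingDf)
// The resulting model implements transform().
val scored = lrModel.transform(testDf)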
Question 3: Which is not a valid type of Evaluator in MLlib?
- Regression Evaluator
- Multi-Class Classification Evaluator
- Multi-Label Classification Evaluator
- Binary Classification Evaluator
- All are valid
Question 4: In the following lines of code, the last transform in the pipeline is a:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
val rf = new RandomForestClassifier().setFeaturesCol("assembled").setLabelCol("status").setSeed(42)
val pipeline = new Pipeline().setStages(Array(value_band_indexer, category_indexer, label_indexer, assembler, rf))
- principal component analysis
- Vector Assembler
- String Indexer
- Random Forest Classifier
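For context, fitting the pipeline from the question would look like this; trainingDf and testDf are hypothetical:
// The stages run in order, so the random forest classifier is the last transform.
val model = pipeline.fit(trainingDf)
val predictions = model.transform(testDf)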
Data Science with Scala Final Exam Answers
Question 1: What is not true about labeled points?
- They associate dense vectors with a corresponding label/response
- They associate sparse vectors with a corresponding label/response
- They are used in unsupervised machine learning algorithms
- All are true
- None are true
Question 2: Which is true about column pointers in sparse matrices?
- They have the same number of values as the number of columns
- They never repeat values
- By themselves, they do not represent the specific physical location of a value in the matrix
- All are true
- None are true
Question 3: What is the name of the most basic type of distributed matrix?
- Coordinate Matrix
- Indexed Row Matrix
- Simple Matrix
- Row Matrix
- Sparse Matrix
Question 4: A perfect correlation is represented by what value?
- 100
- 3
- 1
- 0
- -1
Question 5: A MinMaxScaler is a transformer which:
- Rescales each feature to a specific range
- Takes no parameters
- Makes zero values remain untransformed
- All are true
- None are true
Question 6: Which is not a supported Random Data Generation distribution?
- Exponential
- Uniform
- Delta
- Normal
- Poisson
Question 7: Sampling without replacement means:
- The expected size of the sample is the same as the RDD's size
- The expected number of times each element is chosen is randomized
- The expected size of the sample is unknown
- The expected size of the sample is a fraction of the RDD's size
- The expected number of times each element is chosen
Question 8: What are the supported types of hypothesis testing?
- Kolmogorov-Smirnov test for equality of distribution
- Pearson’s Chi-Squared Test for goodness of fit
- Pearson’s Chi-Squared Test for independence
- All are supported
- None are supported
Question 9: For Kernel Density Estimation, which kernel is supported by Spark?
- KDEMultivariate
- KDEUnivariate
- KernelDensity
- Gaussian
- All are supported
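A minimal sketch of Spark's KernelDensity, which uses a Gaussian kernel; it assumes an existing SparkContext sc:
import org.apache.spark.mllib.stat.KernelDensity
val samples = sc.parallelize(Seq(1.0, 2.0, 2.5, 4.0))
// Estimate the density at a few query points using a Gaussian kernel.
val kd = new KernelDensity().setSample(samples).setBandwidth(1.0)
val densities = kd.estimate(Array(0.0, 2.0, 4.0))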
Question 10: Which DataFrame statistics method computes the pairwise frequency table of the given columns?
- freqItems()
- crosstab()
- cov()
- pairwiseFreq()
- corr()
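For context, a one-line sketch; df, colA, and colB are hypothetical:
// Computes a pairwise frequency (contingency) table of the two columns.
val ct = df.stat.crosstab("colA", "colB")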
Question 11: Which is not true about the fill method for DataFrame NA functions?
- It is used for replacing null values
- It is used for replacing nil values
- It is used for replacing NaN values
- All are true
- None are true
Question 12: Which transformer listed below is used for Natural Language Processing?
- OneHotEncoder
- ElementwiseProduct
- Normalizer
- StandardScaler
- None are used for Natural Language Processing
Question 13: Which is true about the Mahalanobis Distance?
- It is a scale-variant distance
- It is a multi-dimensional generalization of measuring how many standard deviations a point is away from the median
- It is measured along each Principal Component axis
- It has units of distance
- It does not take into account the correlations of the dataset
Question 14: Which is true about OneHotEncoder?
- It creates a Sparse Vector
- It must be told which column to create for its output
- It must be told which column is its input
- All are true
- None are true
Question 15: Principal Component Analysis is:
- A dimension reduction technique
- Is never used for feature engineering
- Used for supervised machine learning
- All are true
- None are true
Question 16: MLlib’s implementation of decision trees:
- Partitions data by rows, allowing distributed training
- Supports only multiclass classification
- Does not support regressions
- Supports only continuous features
- None are true
Question 17: Which is not a tunable of SparkML decision trees?
- maxMemoryInMB
- minInfoGain
- minDepth
- maxBins
- minInstancesPerNode
Question 18: Which is true about Random Forests?
- They support non-categorical features
- They combine many decision trees in order to reduce the risk of overfitting
- They do not support regression
- They only support binary classification
- None are true
Question 19: When comparing Random Forest versus Gradient-Based Trees, what must you consider?
- Parallelization abilities
- Depth of Trees
- How the number of trees affects the outcome
- All of these
- None of these
Question 20: Which is not a valid type of Evaluator in MLlib?
- Multi-Class Classification Evaluator
- Binary Classification Evaluator
- Regression Evaluator
- Multi-Label Classification Evaluator
- All are valid