Oct-2022 Databricks Databricks-Certified-Professional-Data-Scientist Actual Questions and Braindumps
NEW QUESTION 84
In which of the following scenarios can you use regression to predict values?
- A. All 1 ,2 and 3
- B. Samsung can use it for mobile sales forecast
- C. Probability of the celebrity divorce
- D. Mobile companies can use it to forecast manufacturing defects
- E. Only 1 and 2
Answer: A
Explanation:
Regression is a tool that companies may use for tasks such as sales forecasting or forecasting manufacturing defects. Another creative example is predicting the probability of a celebrity divorce.
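As a rough illustration of the sales-forecast case, the sketch below fits a straight line by ordinary least squares and extrapolates one period ahead. The quarterly figures and the helper name `fit_line` are made up for the example and are not part of the question.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx

# Hypothetical data: units sold in four consecutive quarters.
quarters = [1, 2, 3, 4]
units = [100.0, 120.0, 140.0, 160.0]

slope, intercept = fit_line(quarters, units)
forecast = intercept + slope * 5  # predicted units for quarter 5
print(forecast)
```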
NEW QUESTION 85
Refer to the exhibit.
You are building a decision tree. In this exhibit, four variables are listed with their respective values of info-gain.
Based on this information, on which attribute would you expect the next split to be in the decision tree?
- A. Gender
- B. Income
- C. Age
- D. Credit Score
Answer: D
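The exhibit is not reproduced here, but the rule behind the question is that the tree splits next on the attribute with the highest information gain. A minimal sketch of that computation follows; the toy rows, attribute names and the `buys` target are assumptions for illustration only.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def info_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Hypothetical training rows.
rows = [
    {"gender": "F", "credit": "high", "buys": "yes"},
    {"gender": "M", "credit": "high", "buys": "yes"},
    {"gender": "F", "credit": "low", "buys": "no"},
    {"gender": "M", "credit": "low", "buys": "no"},
]

gains = {a: info_gain(rows, a, "buys") for a in ("gender", "credit")}
best = max(gains, key=gains.get)  # attribute to split on next
print(gains, best)
```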
NEW QUESTION 86
Which of the following questions fall under the data science category?
- A. How many products have been sold in the last month?
- B. Where is the problem for sales?
- C. Which is the optimal scenario for selling this product?
- D. What happens if this scenario continues?
- E. What happened in the last six months?
Answer: C,D
Explanation:
This question checks your understanding of BI versus Data Science. BI already existed and the analytics team was already using it; the team needs to learn data science techniques to solve some additional problems. The options may look confusing, but if you have worked in BI or as a Data Scientist the question is easy to answer. The questions about how many products were sold last month, where the sales problem is, and what happened in the last six months can all be answered with a reporting solution.
The questions about the optimal scenario for selling a product and what happens if a scenario continues require data science techniques: you need to collect the data and apply various techniques to it. Hence these two options can only be answered using Data Science, with techniques such as optimization, predictive modeling, and statistical analysis on structured and unstructured data.
NEW QUESTION 87
Which activity is performed in the Operationalize phase of the Data Analytics Lifecycle?
- A. Try different variables
- B. Define the process to maintain the model
- C. Transform existing variables
- D. Try different analytical techniques
Answer: B
Explanation:
Operationalize: In this final phase, the team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening the work to a full enterprise or ecosystem of users. In Phase 4, the team scored the model in the analytics sandbox.
NEW QUESTION 88
Which of the following steps will you use in the discovery phase?
- A. What Unix server capacity is required?
- B. What network capacity is required?
- C. What are all the data sources for the project?
- D. What tools are required in the project?
- E. Analyze the raw data and its format and structure.
Answer: A,B,C,D,E
Explanation:
During the discovery phase you need to find out, as early as possible, how many resources are required. For that you can involve various stakeholders such as the software engineering team, DBAs, network engineers, and system administrators, and determine whether these resources are already available or need to be procured. You also need to ask what the source of the data will be, and what tools and software are required to execute the project.
NEW QUESTION 89
A researcher is interested in how variables such as GRE (Graduate Record Exam scores), GPA (grade point average) and prestige of the undergraduate institution affect admission into graduate school. The response variable, admit/don't admit, is a binary variable.
The above is an example of:
- A. Linear Regression
- B. Logistic Regression
- C. Maximum likelihood estimation
- D. Hierarchical linear models
- E. Recommendation system
Answer: B
Explanation:
Logistic regression
Pros: Computationally inexpensive, easy to implement, knowledge representation easy to interpret
Cons: Prone to underfitting, may have low accuracy
Works with: Numeric values, nominal values
NEW QUESTION 90
A data scientist is asked to implement an article recommendation feature for an on-line magazine.
The magazine does not want to use client tracking technologies such as cookies or reading history. Therefore, only the style and subject matter of the current article is available for making recommendations. All of the magazine's articles are stored in a database in a format suitable for analytics.
Which method should the data scientist try first?
- A. K Means Clustering
- B. Naive Bayesian
- C. Association Rules
- D. Logistic Regression
Answer: A
Explanation:
kmeans uses an iterative algorithm that minimizes the sum of distances from each object to its cluster centroid, over all clusters. This algorithm moves objects between clusters until the sum cannot be decreased further. The result is a set of clusters that are as compact and well-separated as possible. You can control the details of the minimization using several optional input parameters to kmeans, including ones for the initial values of the cluster centroids and for the maximum number of iterations.
Clustering is primarily an exploratory technique to discover hidden structures in the data, possibly as a prelude to more focused analysis or decision processes. Some specific applications of k-means are image processing, medical analysis, and customer segmentation. Clustering is often used as a lead-in to classification: once the clusters are identified, labels can be applied to each cluster to classify each group based on its characteristics. Marketing and sales groups use k-means to better identify customers who have similar behaviors and spending patterns.
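The iterative procedure described above can be sketched in a few lines. This is a simplified illustration, not the `kmeans` function the explanation refers to: the function name, the two-number "style/subject" feature vectors for articles, and the hand-picked initial centroids are all assumptions.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Assign points to the nearest centroid, recompute centroids, repeat until stable."""
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of the cluster; keep the old one if the cluster is empty.
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments can no longer improve
        centroids = new_centroids
    return centroids, clusters

# Hypothetical article features: (style score, subject score).
articles = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(articles, [(1.0, 1.0), (8.0, 8.0)])
print(centroids)
```

Articles that land in the same cluster as the current article would then be candidates for recommendation.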
NEW QUESTION 91
Select the correct statement which applies to Supervised learning
- A. It reduces the machine's task to only divining some pattern from the input data to get the target variable.
- B. We ask the machine to learn from our data when we specify a target variable.
- C. Instead of telling the machine Predict Y for our data X, we're asking What can you tell me about X?
Answer: A,B
Explanation:
Supervised learning asks the machine to learn from our data when we specify a target variable. This reduces the machine's task to only divining some pattern from the input data to get the target variable.
In unsupervised learning we don't have a target variable as we did in classification and regression. Instead of telling the machine "Predict Y for our data X", we're asking "What can you tell me about X?" Things we ask the machine to tell us about X may be "What are the six best groups we can make out of X?" or "What three features occur together most frequently in X?"
NEW QUESTION 92
Which of the following is not a classification algorithm?
- A. Logistic Regression
- B. Support Vector Machine
- C. Neural Network
- D. None of the above
- E. Hidden Markov Models
Answer: D
Explanation:
Logistic regression
Logistic regression is a model used for prediction of the probability of occurrence of an event. It makes use of several predictor variables that may be either numerical or categorical.
Support Vector Machines
As with naive Bayes, Support Vector Machines (or SVMs) can be used to solve the task of assigning objects to classes. But the way this task is solved is completely different to the setting in naive Bayes.
Neural Network
Neural Networks are a means for classifying multidimensional objects.
Hidden Markov Models
Hidden Markov Models are used in multiple areas of machine learning, such as speech recognition, handwritten letter recognition, or natural language processing.
NEW QUESTION 93
If E1 and E2 are two events, how do you represent the conditional probability that E2 occurs given that E1 has occurred?
- A. P(E1)/P(E2)
- B. P(E1+E2)/P(E1)
- C. P(E2)/P(E1)
- D. P(E2)/P(E1+E2)
Answer: B
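By definition, the conditional probability is P(E2 | E1) = P(E1 and E2) / P(E1); it reduces to P(E2)/P(E1) only in the special case where E2 can happen only when E1 happens. The sketch below checks the definition by counting outcomes; the two-dice events are an assumed example, not part of the question.

```python
from fractions import Fraction

# Sample space: all ordered rolls of two fair dice.
outcomes = [(a, b) for a in range(1, 7) for b in range(1, 7)]

e1 = {o for o in outcomes if o[0] % 2 == 0}  # E1: first die is even
e2 = {o for o in outcomes if sum(o) >= 10}   # E2: total is at least 10

def prob(event):
    """Probability of an event as an exact fraction of equally likely outcomes."""
    return Fraction(len(event), len(outcomes))

# P(E2 | E1) = P(E1 and E2) / P(E1)
p_e2_given_e1 = prob(e1 & e2) / prob(e1)
print(p_e2_given_e1)
```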
NEW QUESTION 94
Which technique would you use to solve the following problem statement? "What is the probability that an individual customer will not repay the loan amount?"
- A. Linear Regression
- B. Logistic Regression
- C. Classification
- D. Clustering
- E. Hypothesis testing
Answer: B
NEW QUESTION 95
You are doing advanced analytics for a medical application using regression. You have two input variables, weight and height, which are very important and cannot be ignored, and which are also highly correlated. What is the best solution?
- A. You will take the square of the height.
- B. You will take the square root of the weight.
- C. You will take the cube root of the height.
- D. You would consider using BMI (Body Mass Index)
Answer: D
Explanation:
If multiple variables are highly correlated, it is better either to use only the variable that correlates more (which is not among the given options) or to create a new variable that is a function of both. In this case that could be BMI (Body Mass Index), because it is a function of both weight and height: BMI = Weight / (Height * Height)
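A minimal sketch of replacing the two correlated inputs with the single derived feature, assuming weight in kilograms and height in meters (the usual BMI units; the patient values are invented):

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight divided by height squared."""
    return weight_kg / (height_m * height_m)

# Hypothetical (weight, height) pairs for two patients.
patients = [(70.0, 1.75), (90.0, 1.80)]

# One BMI feature per patient replaces the weight and height columns.
features = [bmi(w, h) for w, h in patients]
print(features)
```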
NEW QUESTION 96
In which phase of the data analytics lifecycle do Data Scientists spend the most time in a project?
- A. Communicate Results
- B. Model Building
- C. Data Preparation
- D. Discovery
Answer: C
NEW QUESTION 97
You have modeled a dataset with 5 independent variables called A, B, C, D and E, which do not depend on each other. Variables A, B and C are continuous, and variables D and E are discrete (mixed mode).
Now you have to compute the expected value of one variable, say A. Which of the following computations would you prefer?
- A. Differentiation
- B. Generalization
- C. Transformation
- D. Integration
Answer: D
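For a continuous variable, the expected value is the integral of a * f(a) over the range of A, where f is the density of A; because A is independent of the other variables, its own density is enough. The sketch below approximates that integral numerically with the midpoint rule. The uniform density on [0, 10] and the function name are assumptions chosen only to make the example checkable.

```python
def expected_value(pdf, lower, upper, steps=10_000):
    """Approximate the integral of x * pdf(x) over [lower, upper] (midpoint rule)."""
    width = (upper - lower) / steps
    total = 0.0
    for i in range(steps):
        x = lower + (i + 0.5) * width  # midpoint of the i-th slice
        total += x * pdf(x) * width
    return total

def uniform_pdf(x):
    """Assumed density: A is uniform on [0, 10]."""
    return 1.0 / 10.0

ev = expected_value(uniform_pdf, 0.0, 10.0)
print(ev)
```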
NEW QUESTION 98
What describes a true limitation of the Logistic Regression method?
- A. It does not have explanatory values.
- B. It does not handle missing values well.
- C. It does not handle redundant variables well.
- D. It does not handle correlated variables well.
Answer: B
NEW QUESTION 99
Select the correct objectives of principal component analysis
- A. To identify new meaningful underlying variables
- B. To discover the dimensionality of the data set
- C. All 1, 2 and 3
- D. To reduce the dimensionality of the data set
- E. Only 1 and 2
Answer: C
Explanation:
Principal component analysis (PCA) involves a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.
Objectives of principal component analysis
1. To discover or to reduce the dimensionality of the data set.
2. To identify new meaningful underlying variables.
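To make "discovering the dimensionality" concrete, the sketch below computes the variance captured by each principal component for two variables, using the closed-form eigenvalues of the 2x2 covariance matrix. The data is an assumed toy case in which the second variable is exactly twice the first, so a single component explains all of the variance.

```python
import math
from statistics import mean

def pca_2d_variances(xs, ys):
    """Eigenvalues of the 2x2 sample covariance matrix, largest first."""
    mx, my = mean(xs), mean(ys)
    n = len(xs)
    var_x = sum((x - mx) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - my) ** 2 for y in ys) / (n - 1)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    half_trace = (var_x + var_y) / 2
    spread = math.sqrt(((var_x - var_y) / 2) ** 2 + cov ** 2)
    return half_trace + spread, half_trace - spread

# Hypothetical, perfectly correlated variables.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]

first, second = pca_2d_variances(xs, ys)
share = first / (first + second)  # fraction of variance on the first component
print(first, second, share)
```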
NEW QUESTION 100
You are creating a regression model with the income, education and current debt of a customer as inputs. What could be a possible output from this model?
- A. Customer fit as acceptable or average category
- B. Customer fit as good
- C. The probability, expressed as a percent, that the customer will default on a loan
- D. 2 and 3 are correct
- E. 1 and 3 are correct
Answer: C
Explanation:
Regression is the process of using several inputs to produce one or more outputs. For example, the input might be the income, education and current debt of a customer, and the output might be the probability, expressed as a percent, that the customer will default on a loan. Contrast this with classification, where the output is not a number but a class.
NEW QUESTION 101
You are using k-means clustering to classify heart patients for a hospital. You have chosen Patient Sex, Height, Weight, Age and Income as measures and have used 3 clusters. When you create a pair-wise plot of the clusters, you notice that there is significant overlap between the clusters. What should you do?
- A. Decrease the number of clusters
- B. Identify additional measures to add to the analysis
- C. Remove one of the measures
- D. Increase the number of clusters
Answer: A
NEW QUESTION 102
You have data for 1000 patients with height and age, where age is in years and height is in meters. You want to create clusters using these two attributes, and you want age and height to have a nearly equal effect when the clusters are created. What can you do?
- A. You will be taking square root of height
- B. You will be adding height with the numeric value 100
- C. You will be dividing both age and height with their respective standard deviation
- D. You will be converting each height value to centimeters
Answer: C,D
Explanation:
Age in years takes values such as 50, 60, 70, or 90. When calculating the distance from a centroid, the maximum possible difference is 90 - 0 = 90, and its square is 8100.
Height in meters might differ by at most 2 - 0.5 = 1.5, and its square is only 2.25. So age has far more effect than height. To bring height to a comparable level you can convert it to centimeters, which gives values up to 200, so its squared differences become comparably large.
Another approach is to divide each value by its standard deviation (e.g. age / sd of age), which produces a unitless value. This also helps reduce the effect of the units.
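The standard-deviation approach can be sketched as follows. The sample ages and heights are invented, and the population standard deviation (`statistics.pstdev`) is used here as one reasonable choice; the sample standard deviation would serve equally well.

```python
from statistics import pstdev

# Hypothetical patient data.
ages_years = [30.0, 50.0, 70.0, 90.0]
heights_m = [1.5, 1.6, 1.7, 1.8]

def scale_by_sd(values):
    """Divide every value by the standard deviation of its column (unitless result)."""
    sd = pstdev(values)
    return [v / sd for v in values]

scaled_ages = scale_by_sd(ages_years)
scaled_heights = scale_by_sd(heights_m)

# After scaling, both columns have a standard deviation of 1,
# so neither dominates the distance computation.
print(pstdev(scaled_ages), pstdev(scaled_heights))
```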
NEW QUESTION 103
Select the algorithm that is an unsupervised learning algorithm
- A. Naive Bayes
- B. Support Vector Machines
- C. K-Nearest Neighbors
- D. K-Means
Answer: D
Explanation:
Supervised learning tasks
Classification: k-Nearest Neighbors, Naive Bayes, Support vector machines, Decision trees
Regression: Linear, Locally weighted linear, Ridge, Lasso
Unsupervised learning tasks
Clustering: k-Means, DBSCAN
Density estimation: Expectation maximization, Parzen window
NEW QUESTION 104
A data scientist wants to predict the probability of death from heart disease based on three risk factors: age, gender, and blood cholesterol level. What is the most appropriate method for this project?
- A. Linear regression
- B. Logistic regression
- C. Apriori algorithm
- D. K-means clustering
Answer: B
Explanation:
Logistic regression is used widely in many fields, including the medical and social sciences. For example, the Trauma and Injury Severity Score (TRISS), which is widely used to predict mortality in injured patients, was originally developed by Boyd et al. using logistic regression. Many other medical scales used to assess the severity of a patient's condition have been developed using logistic regression. Logistic regression may be used to predict whether a patient has a given disease (e.g. diabetes or coronary heart disease) based on observed characteristics of the patient (for example age, sex, body mass index, and results of various blood tests; or age, blood cholesterol level, systolic blood pressure, relative weight, blood hemoglobin level, smoking (at 3 levels), and abnormal electrocardiogram).
Another example might be to predict whether an American voter will vote Democratic or Republican, based on age, income, sex, race, state of residence, votes in previous elections, etc. The technique can also be used in engineering, especially for predicting the probability of failure of a given process, system or product. It is also used in marketing applications such as predicting a customer's propensity to purchase a product or halt a subscription. In economics it can be used to predict the likelihood of a person choosing to be in the labor force, and a business application would be to predict the likelihood of a homeowner defaulting on a mortgage. Conditional random fields, an extension of logistic regression to sequential data, are used in natural language processing.
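To show how a fitted logistic regression turns the three risk factors into a probability, the sketch below applies the logistic (sigmoid) function to a linear score. The coefficient values are hypothetical placeholders chosen for illustration, not estimates from any real study.

```python
import math

# Hypothetical coefficients; a real model would estimate these from data.
COEF = {"intercept": -8.0, "age": 0.06, "male": 0.5, "cholesterol": 0.01}

def predict_probability(age, is_male, cholesterol, coef=COEF):
    """Logistic model: probability = 1 / (1 + exp(-linear score))."""
    score = (
        coef["intercept"]
        + coef["age"] * age
        + coef["male"] * is_male
        + coef["cholesterol"] * cholesterol
    )
    return 1.0 / (1.0 + math.exp(-score))

p_60 = predict_probability(age=60, is_male=1, cholesterol=240)
p_70 = predict_probability(age=70, is_male=1, cholesterol=240)
print(p_60, p_70)
```

Unlike a linear regression output, the result always lies strictly between 0 and 1, which is why the method suits probability questions like this one.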
NEW QUESTION 105
Refer to the Exhibit.
In the Exhibit, the table shows the values for the input Boolean attributes "A", "B", and "C". It also shows the values for the output attribute "class". Which decision tree is valid for the data?
- A. Tree D
- B. Tree B
- C. Tree C
- D. Tree A
Answer: B