100 Job-Ready Data Scientist Interview Questions and Answers for Beginners 2024
1. What is Data Science?
Data Science is an interdisciplinary field that employs scientific methods, algorithms, and systems to extract insights from structured and unstructured data. It includes data collection, preprocessing, exploratory data analysis (EDA), model building using machine learning algorithms, and interpreting and communicating findings.
2. What is a Data Scientist Roadmap?
A Data Scientist roadmap typically outlines the skills and knowledge areas needed to become proficient in data science. This includes mastering programming languages (e.g., Python, R), statistics, machine learning, data visualization, and tools like SQL, along with gaining experience in real-world data projects.
3. What is the difference between supervised and unsupervised learning?
Supervised learning trains a model on labeled data, allowing it to learn the mapping from inputs to outputs. In contrast, unsupervised learning works with unlabeled data, focusing on finding patterns or relationships without predefined labels or outputs.
4. What is the difference between Data Science and Data Analytics?
Data Science encompasses the entire data lifecycle, from collection to interpretation, and often involves machine learning. Data Analytics focuses primarily on analyzing historical data to identify trends, make predictions, and support business decision-making.
5. What is the difference between variance and bias?
Variance measures a model’s sensitivity to changes in the training data, while bias refers to the model’s assumptions that may lead to errors. A high variance may result in overfitting, while high bias can lead to underfitting. Finding the right balance is essential for a well-performing model.
6. What is overfitting, and how can you avoid it?
Overfitting happens when a model captures noise in the training data, reducing its ability to generalize to new data. To avoid overfitting, techniques such as regularization, cross-validation, and choosing simpler models can be applied.
7. What is the curse of dimensionality?
The curse of dimensionality refers to the difficulty and computational challenges that arise as the number of features (dimensions) increases in a dataset. High-dimensional data often leads to sparse data and performance issues. Dimensionality reduction techniques like PCA can help mitigate these challenges.
8. What is regularization, and why is it useful?
Regularization introduces a penalty to the model’s loss function to discourage overly complex models, which helps prevent overfitting and improves the model’s generalization to unseen data.
9. What is the difference between L1 and L2 regularization?
L1 regularization (Lasso) adds the absolute value of the coefficients to the loss function, encouraging sparsity by setting some coefficients to zero. L2 regularization (Ridge) adds the square of the coefficients, shrinking them but not necessarily to zero. Both methods help prevent overfitting.
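As a quick illustration, here is a minimal scikit-learn sketch comparing the two penalties on synthetic data (the dataset and alpha values are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression data where only a few features are informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, noise=10.0, random_state=0)

# L1 (Lasso): drives some coefficients exactly to zero (sparsity)
lasso = Lasso(alpha=1.0).fit(X, y)

# L2 (Ridge): shrinks coefficients toward zero without eliminating them
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients set to zero:", np.sum(lasso.coef_ == 0))
print("Ridge coefficients set to zero:", np.sum(ridge.coef_ == 0))
```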
10. What is the difference between a generative and discriminative model?
A generative model learns the joint probability distribution of the input features and labels, allowing it to generate new samples. A discriminative model, on the other hand, focuses on learning the decision boundary between classes to classify data.
11. What is cross-validation, and why is it important?
Cross-validation is a technique for evaluating a model by splitting the dataset into training and testing sets multiple times. This provides a more robust assessment of a model’s performance and reduces the risk of overfitting to a specific dataset.
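A minimal sketch of k-fold cross-validation with scikit-learn (the dataset and model are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: each fold serves once as the held-out test set
scores = cross_val_score(model, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```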
12. Explain the bias-variance tradeoff and its implications for model selection?
The bias-variance tradeoff involves balancing the simplicity of a model (bias) and its sensitivity to training data (variance). A model with high bias may underfit, while high variance may overfit. The goal is to minimize both to achieve optimal model performance.
13. Discuss the challenges and potential solutions for handling imbalanced datasets?
Imbalanced datasets often lead to models favoring the majority class. Solutions include resampling (oversampling the minority class or undersampling the majority class), using evaluation metrics like precision and recall, or employing techniques like ensemble methods to address the imbalance.
14. Describe the process of building and evaluating a recommender system?
Building a recommender system involves gathering data, preprocessing, choosing an algorithm (collaborative filtering or content-based filtering), training the model, and evaluating it using metrics such as precision, recall, and Mean Average Precision (MAP).
15. Explain the concept of dimensionality reduction and its applications in data analysis?
Dimensionality reduction techniques, such as PCA or t-SNE, aim to reduce the number of features while preserving important information. This helps improve computational efficiency, model performance, and visualization of high-dimensional data.
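For example, a short PCA sketch using scikit-learn (the digits dataset and the 95% variance threshold are illustrative choices):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)  # 64-dimensional image features

# Keep enough principal components to explain about 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print("Original dimensions:", X.shape[1])
print("Reduced dimensions:", X_reduced.shape[1])
```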
16. Discuss the ethical considerations involved in using machine learning models?
Ethical considerations include ensuring fairness, transparency, and privacy. Bias in training data, lack of interpretability, and potential negative social impacts must be addressed by using fairness-aware algorithms, diverse datasets, and clear documentation of model decisions.
17. Explain the concept of streaming analytics and its applications in real-time data processing?
Streaming analytics involves processing data in real-time as it is generated. Applications include fraud detection, IoT device monitoring, and dynamic pricing. Technologies like Apache Kafka and Apache Flink are commonly used for real-time data processing.
18. Describe the challenges and potential solutions for deploying machine learning models in production?
Deploying machine learning models in production presents challenges such as scalability, monitoring, and version control. Solutions include using containers (e.g., Docker), implementing monitoring systems, and managing model versions effectively.
19. Explain the concept of deep learning and its applications in various domains?
Deep learning, a subset of machine learning, involves neural networks with many layers to learn complex patterns. It is applied in areas such as image recognition, natural language processing (NLP), and speech recognition. Popular architectures include CNNs and RNNs.
20. Describe the concept of transfer learning and its advantages in machine learning?
Transfer learning involves using a pre-trained model for a related task, reducing the need for large datasets and training time. It is especially useful when labeled data is limited, and it enhances model performance in various applications.
21. How would you approach a data science problem with limited data or resources?
To address a data science problem with limited data or resources, you can use transfer learning, data augmentation, leverage pre-trained models, and focus on feature engineering. Additionally, applying simpler models and rigorous cross-validation can help optimize performance.
22. What is Data Analytics?
Data Analytics involves the process of examining raw data to draw conclusions and insights, using tools and techniques like statistical analysis, machine learning, and visualization. It helps in making informed decisions by identifying patterns, trends, and relationships within the data.
23. What is the difference between Data Analytics and Business Analytics?
Data Analytics focuses on the technical process of analyzing raw data, while Business Analytics applies this data-driven analysis to improve business decision-making. Business Analytics is more strategic, using data insights to drive business outcomes and performance improvement.
24. What is the difference between Data Analytics, Big Data, and Analytics?
– Data Analytics refers to analyzing data to extract useful insights.
– Big Data refers to large, complex datasets that require advanced tools for processing.
– Analytics is a broad term that refers to applying various methods to analyze data for decision-making, which can encompass both Data Analytics and Big Data Analytics.
25. What is the difference between Data Mining and Data Analytics?
Data Mining focuses on discovering hidden patterns and relationships in large datasets using algorithms. Data Analytics is broader and involves analyzing, interpreting, and visualizing data to extract actionable insights, often using the results from data mining.
26. What is the difference between Data Warehousing and Data Analytics?
Data Warehousing is about storing and managing large amounts of structured data in a centralized system. Data Analytics involves analyzing this data to uncover patterns, trends, and insights to support decision-making.
27. What is the difference between Data Science and Data Analytics?
Data Science is an interdisciplinary field that encompasses data analytics, machine learning, and other techniques to extract insights from data. Data Analytics is a subset of Data Science that focuses specifically on analyzing data to inform decisions.
28. What is the difference between Data Analytics and Data Analysis?
Data Analytics refers to the entire process of cleaning, transforming, and analyzing data using advanced techniques like statistical modeling and machine learning. Data Analysis is more about interpreting data and drawing conclusions, often a component within Data Analytics.
29. What is the difference between Data Analytics and Data Visualization?
Data Analytics involves analyzing data to find insights, whereas Data Visualization is the process of presenting these insights visually using charts, graphs, or dashboards to make them easier to understand.
30. What is the difference between Data Analytics and Big Data Analytics?
Data Analytics refers to the analysis of any dataset, regardless of size, to extract insights. Big Data Analytics specifically focuses on analyzing massive datasets that require specialized tools and techniques for handling and processing due to their complexity.
31. What is the difference between Data Analytics and Data Engineering?
Data Engineering involves designing, building, and maintaining data infrastructures, such as databases and pipelines. Data Analytics, in contrast, focuses on analyzing the data processed by these systems to extract insights for decision-making.
32. What is the difference between Data Analytics and Data Reporting?
Data Analytics is the process of exploring and analyzing data to extract insights, while Data Reporting is the presentation of these insights, typically in the form of reports or dashboards, often summarizing key findings for stakeholders.
Statistics in Data Science Interview Questions
33. Explain the Central Limit Theorem and its implications for data analysis?
The Central Limit Theorem states that as the sample size increases, the sampling distribution of the sample mean approaches a normal distribution, regardless of the population’s original distribution. This allows analysts to make inferences about the population even if the population distribution is unknown, using normal distribution-based techniques.
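A quick NumPy simulation makes this concrete: sample means drawn from a skewed population still look approximately normal (the sample sizes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Heavily skewed population: exponential distribution with mean 2.0
population = rng.exponential(scale=2.0, size=100_000)

# Draw many samples of size 50 and record their means
sample_means = np.array([rng.choice(population, size=50).mean() for _ in range(5_000)])

# The sample means cluster around the population mean with an
# approximately normal, much narrower distribution
print("Mean of sample means:", sample_means.mean())
print("Std of sample means:", sample_means.std())
```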
34. Describe the difference between hypothesis testing and statistical significance?
Hypothesis testing is the process of making decisions about a population based on sample data, often determining whether there’s enough evidence to support a claim. Statistical significance indicates whether the results of a hypothesis test are likely due to a real effect rather than random chance, typically assessed using p-values.
35. Explain the concept of Type I and Type II errors in hypothesis testing?
– Type I Error (False Positive): Rejecting a true null hypothesis (concluding there is an effect when there isn’t one).
– Type II Error (False Negative): Failing to reject a false null hypothesis (concluding there is no effect when there actually is one).
36. Discuss the importance of data visualization in exploratory data analysis (EDA)?
Data visualization is crucial in EDA because it helps in identifying patterns, trends, and anomalies that might not be easily detectable through numerical data alone. Visual tools like histograms, scatter plots, and heatmaps make it easier to understand the structure of the data and inform further analysis.
37. Explain the concept of bias and its potential impact on statistical analysis?
Bias refers to systematic error that causes the results to deviate from the true value. It can skew findings and lead to incorrect conclusions. In data analysis, mitigating bias is essential to ensure the accuracy and validity of the results.
38. Describe the difference between parametric and non-parametric statistical tests?
– Parametric Tests: Assume the data follows a specific distribution (e.g., normal distribution). These tests tend to be more powerful but require strict assumptions.
– Non-parametric Tests: Do not assume any specific data distribution, making them more flexible but generally less powerful than parametric tests.
39. Explain the concept of confidence intervals and their interpretation?
A confidence interval provides a range of values within which the true population parameter is likely to fall. For example, a 95% confidence interval means that 95 out of 100 such intervals drawn from the population would contain the true parameter value, giving an idea of the estimate’s precision.
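A minimal sketch of a 95% confidence interval for a sample mean using SciPy (the sample itself is synthetic and illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=50, scale=10, size=40)  # hypothetical measurements

mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean

# 95% confidence interval for the mean based on the t-distribution
low, high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```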
40. Discuss the importance of variable selection in regression analysis?
Selecting relevant variables in regression analysis is essential for improving model performance. Including too many irrelevant variables can lead to overfitting, while excluding important ones can result in underfitting. Proper selection techniques help improve model accuracy and interpretability.
41. Explain the concept of collinearity and its potential problems in regression analysis?
Collinearity occurs when two or more independent variables are highly correlated, making it difficult to determine their individual effects on the dependent variable. This can inflate the standard errors of the coefficients, leading to unreliable estimates. Detecting and addressing collinearity (e.g., using VIF) is important for model reliability.
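As an illustration, variance inflation factors (VIF) can be computed with statsmodels; the predictors below are synthetic, with x3 deliberately constructed as a near-linear combination of x1 and x2:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + rng.normal(scale=0.01, size=200)  # nearly collinear with x1 and x2
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# A VIF well above roughly 5-10 is a common warning sign of collinearity
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))
```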
42. Describe the importance of model validation in data science projects?
Model validation ensures that a model generalizes well to unseen data and avoids overfitting. Techniques like cross-validation and splitting data into training and test sets help assess model performance and reliability, ensuring that the model performs well in real-world applications.
Machine Learning in Data Science Interview Questions for Freshers
43. Explain the bias-variance tradeoff and its implications for model selection?
The bias-variance tradeoff refers to the balance between the error due to bias (underfitting) and variance (overfitting). Models with high bias are too simple and miss patterns, while those with high variance are too complex and fit the noise in the training data. Finding the right balance is key to creating models that generalize well.
44. Discuss the challenges and potential solutions for handling imbalanced datasets?
Imbalanced datasets occur when one class is underrepresented. This can cause models to be biased towards the majority class. Solutions include oversampling the minority class, undersampling the majority class, using different evaluation metrics (e.g., precision, recall), or employing algorithms designed for imbalanced data (e.g., SMOTE or balanced random forests).
45. How would you approach a data science problem with limited data or resources?
When dealing with limited data, techniques like data augmentation, transfer learning, or using pre-trained models can help. Focusing on feature engineering and using simpler models can also improve performance. Cross-validation ensures that the model is robust, even with small datasets.
46. Compare and contrast supervised and unsupervised learning, providing real-world examples?
– Supervised Learning: Involves labeled data, where the model learns from input-output pairs (e.g., email spam classification).
– Unsupervised Learning: Involves unlabeled data, where the model identifies patterns or structures (e.g., customer segmentation using clustering).
47. Explain the concept of dimensionality reduction and its applications in data analysis?
Dimensionality reduction is the process of reducing the number of input features while retaining as much information as possible. Techniques like Principal Component Analysis (PCA) are used to remove redundant features, reducing computation and improving model performance. It’s often applied in high-dimensional datasets, like image processing or text analysis.
48. Discuss the ethical considerations involved in using machine learning models?
Ethical considerations include data privacy, algorithmic fairness, and bias in machine learning models. Ensuring transparency in decision-making, mitigating biases in training data, and respecting user privacy are crucial to creating responsible AI systems. Ethical AI development also involves considering the social impacts of automated decisions.
49. Describe the concept of overfitting and underfitting in machine learning models?
– Overfitting: When a model performs well on training data but poorly on unseen data because it has learned the noise and specific details of the training set.
– Underfitting: When a model is too simple to capture the underlying patterns in the data, leading to poor performance both on training and unseen data. Balancing model complexity is key to avoiding both.
53. What are the most common data science interview questions for freshers?
As a fresher, expect questions on key concepts, such as:
– Types of machine learning algorithms: Supervised, unsupervised, and reinforcement learning.
– Model evaluation metrics: Accuracy, precision, recall, F1-score.
– Overfitting and underfitting: How to balance model complexity and accuracy.
– Real-world applications: Examples like recommender systems, image recognition, and fraud detection.
54. How can I prepare for technical coding questions in a data science interview?
Practice coding in languages like Python or R. Focus on tasks like data manipulation, analysis, and visualization. Use platforms like HackerRank or LeetCode to solve problems. Writing efficient code and clearly explaining your process are essential.
55. What are some commonly used statistical tests in data science interviews?
Be prepared to answer questions on:
– Central Limit Theorem: Sampling distributions and inferences.
– Type I and Type II errors: The trade-offs between false positives and false negatives.
– T-tests, ANOVA, and Chi-square tests: Used to analyze differences and relationships between variables.
56. How can I showcase my domain knowledge in a data science interview?
Research the company’s industry and be ready to discuss specific challenges they face. Relate your data science skills to solving those challenges, and mention any relevant projects or work experience.
57. What are some red flags to watch out for in a data science interview?
Be cautious if the company seems overly focused on buzzwords or specific tools instead of problem-solving skills and core concepts. The best interviews emphasize critical thinking and an understanding of fundamental data science practices.
58. How can I stay calm and confident during a data science interview?
Prepare thoroughly, practice common questions, and focus on your strengths. Mock interviews can help, as well as visualization exercises. Keep in mind that enthusiasm and communication are as important as technical expertise.
59. Should I get a data science certification before interviews?
Certifications can demonstrate commitment, but real-world experience, projects, and problem-solving skills are more valuable to employers. Focus on building a portfolio of relevant work to showcase your capabilities.
60. What salary range can I expect as a data science fresher?
Salary varies based on factors like location, experience, and specific skills. Research market rates for your area and level of experience. During negotiations, emphasize your potential and future value to the company.
61. What’s the most important tip for acing a data science interview?
Be passionate, curious, and eager to learn. Highlight your understanding of core concepts, your problem-solving abilities, and your enthusiasm for contributing to the field. A genuine interest in data science will make you stand out.
62. What is feature engineering, and why is it important in machine learning?
Feature engineering involves creating new input variables (features) based on existing data to improve model performance. It is crucial because the quality of features often has a greater impact on the model’s accuracy than the model choice itself.
63. Explain the concept of cross-validation and its importance in model evaluation?
Cross-validation is a technique used to assess a model’s performance by dividing the data into training and testing subsets multiple times. This ensures that the model generalizes well to unseen data and avoids overfitting.
64. What is a confusion matrix, and how do you interpret it?
A confusion matrix is a table that shows the performance of a classification model by displaying the number of true positives, true negatives, false positives, and false negatives. It helps in calculating metrics like accuracy, precision, recall, and F1-score.
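A short scikit-learn sketch (the labels below are made up for illustration):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted labels from a binary classifier
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Rows correspond to actual classes, columns to predicted classes
print(confusion_matrix(y_true, y_pred))

# Precision, recall, and F1-score are derived from the same counts
print(classification_report(y_true, y_pred))
```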
65. What is overfitting, and how can you prevent it in machine learning models?
Overfitting occurs when a model performs well on training data but poorly on new, unseen data. Techniques to prevent overfitting include using simpler models, applying regularization (L1/L2), and employing cross-validation or dropout in neural networks.
66. What is regularization, and how does it help in machine learning models?
Regularization is a technique used to penalize model complexity by adding a term to the loss function, discouraging overfitting. L1 regularization (Lasso) adds absolute value penalties, while L2 regularization (Ridge) adds squared penalties to model coefficients.
67. Explain the difference between bagging and boosting in ensemble learning?
Bagging (Bootstrap Aggregating) involves training multiple models independently and averaging their predictions to reduce variance. Boosting trains models sequentially, with each new model focusing on correcting the errors of the previous ones, thus improving accuracy.
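A minimal comparison using scikit-learn (the synthetic dataset and ensemble sizes are illustrative; BaggingClassifier uses a decision tree as its default base estimator):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Bagging: independent trees on bootstrap samples, predictions averaged
bagging = BaggingClassifier(n_estimators=50, random_state=0)

# Boosting: trees trained sequentially, each correcting the previous ensemble's errors
boosting = GradientBoostingClassifier(n_estimators=50, random_state=0)

print("Bagging CV accuracy:", cross_val_score(bagging, X, y, cv=5).mean())
print("Boosting CV accuracy:", cross_val_score(boosting, X, y, cv=5).mean())
```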
68. What are decision trees, and what are their advantages and disadvantages?
Decision trees are a type of algorithm that splits data into subsets based on feature values, creating a tree-like structure for decision-making. Advantages include interpretability and flexibility, while disadvantages include overfitting and instability with small data changes.
69. Explain the concept of gradient boosting and its advantages?
Gradient boosting is an ensemble technique where new models are trained to correct the errors of previous models, optimizing the overall prediction. It improves accuracy and reduces bias, making it effective for many machine learning tasks, though it can be prone to overfitting if not tuned properly.
70. What are random forests, and how do they differ from decision trees?
Random forests are an ensemble of decision trees where each tree is trained on a random subset of data and features. Unlike decision trees, which can overfit, random forests reduce variance and improve generalization by averaging multiple trees’ predictions.
71. What is a kernel in SVM, and why is it used?
A kernel in Support Vector Machines (SVM) is a mathematical function that transforms the input data into a higher-dimensional space, allowing SVMs to separate data that is not linearly separable in the original space. Common kernels include linear, polynomial, and radial basis function (RBF).
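The effect of the kernel is easy to see on data that is not linearly separable; the two-moons dataset below is an illustrative choice:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two interleaving half-moons: not separable by a straight line
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

linear_svm = SVC(kernel="linear")
rbf_svm = SVC(kernel="rbf", gamma="scale")  # implicitly maps data to a higher-dimensional space

print("Linear kernel accuracy:", cross_val_score(linear_svm, X, y, cv=5).mean())
print("RBF kernel accuracy:", cross_val_score(rbf_svm, X, y, cv=5).mean())
```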
72. What is a hyperparameter in machine learning, and how do you tune them?
Hyperparameters are settings that control the learning process of a machine learning model (e.g., learning rate, number of trees). Tuning hyperparameters involves finding the best values, often done through methods like grid search, random search, or automated tools like Bayesian optimization.
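A minimal grid-search sketch with scikit-learn (the dataset and the small parameter grid are illustrative; real searches usually cover more values):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Candidate hyperparameter values to try in combination
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```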
73. Explain the concept of dimensionality reduction and its significance in data science?
Dimensionality reduction is the process of reducing the number of input variables in a dataset while preserving important information. This helps in simplifying models, speeding up computations, and avoiding overfitting. Techniques include Principal Component Analysis (PCA) and t-SNE.
74. What is A/B testing, and how is it used in data science?
A/B testing is an experiment where two versions (A and B) are compared to measure the effect of a change. It is widely used in marketing and product development to test user responses to different variations of a feature or interface, helping to make data-driven decisions.
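A typical analysis compares conversion rates with a two-proportion z-test; the counts below are hypothetical, and this test is one common choice rather than the only one:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results: conversions and visitors for variants A and B
conversions = [120, 150]
visitors = [2400, 2500]

# Two-sided z-test for a difference in conversion rates
stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z-statistic: {stat:.3f}, p-value: {p_value:.3f}")
# A small p-value (e.g., below 0.05) suggests the variants convert at different rates
```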
75. What are outliers, and how do you handle them in a dataset?
Outliers are data points that deviate significantly from other observations in a dataset. They can be handled by removing them, transforming the data, or using robust algorithms that are less sensitive to outliers, depending on whether the outliers represent noise or valuable information.
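A common rule of thumb flags points beyond 1.5 times the interquartile range (IQR); the values below are made up for illustration:

```python
import pandas as pd

# Hypothetical measurements containing two extreme values
values = pd.Series([10, 12, 11, 13, 12, 95, 11, 10, 14, -40])

q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1

# Flag anything beyond 1.5 * IQR from the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print("Detected outliers:", outliers.tolist())
```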
76. Explain the concept of the bias-variance tradeoff?
The bias-variance tradeoff is a fundamental concept in machine learning where increasing model complexity reduces bias but increases variance, and simplifying the model reduces variance but increases bias. The goal is to find the right balance to achieve optimal model performance.
77. What is the difference between a parametric and a non-parametric model?
Parametric models assume a specific form for the data distribution (e.g., linear regression), while non-parametric models make no assumptions about the data distribution (e.g., k-nearest neighbors). Parametric models are simpler but may be less flexible, whereas non-parametric models are more adaptable but may require more data.
78. What is transfer learning, and how is it used in deep learning?
Transfer learning is a technique where a pre-trained model is fine-tuned on a new, related task. It is commonly used in deep learning to leverage models like convolutional neural networks (CNNs) trained on large datasets (e.g., ImageNet) and adapt them to specific problems with less data.
79. What is a neural network, and how does it work?
A neural network is a series of layers of interconnected nodes (neurons) that mimic the human brain’s structure. It works by processing inputs through layers of weighted connections and activation functions, learning patterns through backpropagation, and adjusting weights to minimize error.
80. What are convolutional neural networks (CNNs), and what are they used for?
CNNs are a class of deep neural networks designed for processing structured grid data like images. They use convolutional layers to automatically extract features from the input, making them highly effective for tasks like image classification, object detection, and medical image analysis.
81. Explain recurrent neural networks (RNNs) and their applications?
RNNs are a type of neural network designed for sequential data. Unlike traditional networks, they have connections that loop back, allowing them to retain information from previous steps. They are commonly used in natural language processing (NLP), time series prediction, and speech recognition.
82. What is reinforcement learning, and how is it different from supervised learning?
Reinforcement learning is a type of machine learning where an agent learns by interacting with an environment and receiving feedback in the form of rewards or penalties. Unlike supervised learning, which requires labeled data, reinforcement learning focuses on learning optimal actions to maximize cumulative rewards.
83. What is backpropagation in neural networks?
Backpropagation is the algorithm used to train neural networks by adjusting the weights of connections. It calculates the gradient of the loss function with respect to each weight and uses these gradients to update the weights through gradient descent, minimizing the error of the network.
84. Explain the concept of word embeddings in NLP?
Word embeddings are vector representations of words where words with similar meanings have similar vectors. Techniques like Word2Vec or GloVe are used to create embeddings, capturing semantic relationships between words, making them useful for tasks like text classification, machine translation, and sentiment analysis.
85. What is the difference between batch gradient descent and stochastic gradient descent (SGD)?
Batch gradient descent updates the model’s parameters using the entire dataset, which can be slow for large datasets. Stochastic gradient descent updates parameters using one or a few samples at a time, which is faster but noisier, making it suitable for large-scale data.
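A small NumPy sketch of both update rules on synthetic linear-regression data (the learning rates and iteration counts are illustrative):

```python
import numpy as np

# Synthetic data: y = X @ true_w + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

# Batch gradient descent: one update per pass over the full dataset
w = np.zeros(3)
for _ in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y)
    w -= 0.1 * grad
print("Batch GD weights:", w)

# Stochastic gradient descent: one noisy update per sample
w = np.zeros(3)
for _ in range(5):  # a few passes (epochs) over the shuffled data
    for i in rng.permutation(len(y)):
        grad = 2 * X[i] * (X[i] @ w - y[i])
        w -= 0.01 * grad
print("SGD weights:", w)
```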
86. What is a recommendation system, and how does it work?
A recommendation system predicts a user’s preferences and suggests items (e.g., movies, products) based on past behavior. It uses techniques like collaborative filtering (finding similar users or items) and content-based filtering (matching user profiles to item attributes).
87. What is unsupervised learning, and how does it differ from supervised learning?
Unsupervised learning involves training models on unlabeled data, aiming to find patterns or structures in the data (e.g., clustering). In contrast, supervised learning requires labeled data to train a model to make predictions based on known inputs and outputs.
88. What is the purpose of a loss function in machine learning?
A loss function measures the difference between the model’s predicted values and the actual values. It guides the model during training by providing feedback on how well it is performing, allowing the optimization algorithm (e.g., gradient descent) to adjust parameters to minimize this error.
89. Explain the concept of hyperparameter tuning and why it is important?
Hyperparameter tuning involves finding the best set of hyperparameters (e.g., learning rate, number of trees) for a model. Proper tuning can significantly improve a model’s performance by balancing complexity and accuracy. It is typically done using techniques like grid search, random search, or automated methods.
90. What is an imbalanced dataset, and how do you handle it?
An imbalanced dataset occurs when one class significantly outweighs the others (e.g., fraud detection with few fraud cases). Techniques to handle it include resampling (oversampling the minority class, undersampling the majority class), using different evaluation metrics (precision, recall), or employing algorithms that handle imbalance, like SMOTE.
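A minimal SMOTE sketch, assuming the imbalanced-learn package is installed (the class proportions are illustrative):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE  # from the imbalanced-learn package
from sklearn.datasets import make_classification

# Synthetic dataset where only about 5% of samples are positive
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
print("Before SMOTE:", Counter(y))

# SMOTE creates new minority-class samples by interpolating between neighbors
X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X, y)
print("After SMOTE:", Counter(y_resampled))
```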
91. What is the ROC curve and what does it represent?
The ROC (Receiver Operating Characteristic) curve is a graphical plot that shows the performance of a binary classifier as the threshold varies. It plots the true positive rate against the false positive rate, helping to evaluate a model’s ability to distinguish between classes.
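A short sketch computing the ROC curve and AUC with scikit-learn (the dataset and classifier are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# False positive rate and true positive rate at every decision threshold
fpr, tpr, thresholds = roc_curve(y_test, probs)
print("AUC:", roc_auc_score(y_test, probs))
```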
92. What is the significance of a p-value in hypothesis testing?
A p-value represents the probability of observing the data, or something more extreme, given that the null hypothesis is true. If the p-value is below a certain threshold (e.g., 0.05), it indicates strong evidence against the null hypothesis, suggesting that the observed effect is statistically significant.
93. Explain the difference between Type I and Type II errors in hypothesis testing?
A Type I error occurs when a true null hypothesis is incorrectly rejected (false positive), while a Type II error occurs when a false null hypothesis is not rejected (false negative). Reducing one type of error often increases the other, requiring a balance based on the context of the problem.
94. What is time series analysis, and what are its key components?
Time series analysis involves analyzing data points collected or recorded at specific time intervals. Its key components include trend (long-term increase or decrease), seasonality (regular pattern that repeats at specific intervals), and noise (random variation). Time series forecasting is used for predicting future data points.
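A small decomposition sketch using statsmodels on a synthetic monthly series (the trend, seasonality, and noise below are generated purely for illustration):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series: upward trend + yearly seasonality + noise
rng = np.random.default_rng(0)
index = pd.date_range("2018-01-01", periods=60, freq="MS")
values = (np.linspace(100, 160, 60)
          + 10 * np.sin(2 * np.pi * np.arange(60) / 12)
          + rng.normal(scale=2, size=60))
series = pd.Series(values, index=index)

# Split the series into trend, seasonal, and residual components
result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())
print(result.seasonal.head(12))
```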
95. What is the difference between clustering and classification?
Clustering is an unsupervised learning technique used to group similar data points together without predefined labels, while classification is a supervised learning technique where the model assigns predefined labels to data points based on input features. Clustering identifies patterns, and classification categorizes data.
96. What is the curse of dimensionality in machine learning?
The curse of dimensionality refers to the challenges that arise when working with high-dimensional data. As the number of features increases, the volume of the data space grows exponentially, making it harder to analyze and requiring more data to ensure reliable model performance. It can lead to overfitting and poor generalization.
97. What is a silhouette score, and how is it used in clustering?
A silhouette score measures how similar a data point is to its own cluster compared to other clusters. It ranges from -1 to 1, where a high score indicates that the data point is well-clustered and far from neighboring clusters. It’s used to evaluate the quality of clustering models.
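A minimal sketch with scikit-learn, comparing silhouette scores for different numbers of k-means clusters on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four well-separated clusters
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# The best k should give the highest average silhouette score
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: silhouette={silhouette_score(X, labels):.3f}")
```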
98. What is multicollinearity, and why is it a problem in regression models?
Multicollinearity occurs when two or more independent variables in a regression model are highly correlated, leading to unreliable coefficient estimates. It can cause issues with model interpretation and lead to overfitting, as the model becomes sensitive to minor changes in the data.
99. What is a decision boundary in classification tasks?
A decision boundary is a surface that separates different classes in a classification task. It represents the threshold where the model decides between different class labels. For linear classifiers, the boundary is a straight line or plane, while for non-linear classifiers, it can take more complex shapes.
100. What is the difference between a generative and discriminative model?
Generative models (e.g., Naive Bayes) model the joint probability of the input data and labels, allowing them to generate new data. Discriminative models (e.g., logistic regression) directly model the decision boundary between classes by learning the conditional probability of the label given the input. Generative models are more flexible but often less accurate than discriminative ones for classification tasks.