Graphical Exploratory Data Analysis (EDA) Methods
EDA is about exploring the data and judging whether it is suitable for the experiment or study. Raw data can be difficult to read for patterns and fitness; graphs and plots make those patterns much easier to see.
Common graphical methods are:
1. Scatter Plots
2. Histograms
3. Box Plots
4. Normal Probability Plots
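To make two of the plots above more concrete, here is a minimal sketch of the numbers that sit behind a box plot and a histogram. It only uses the Python standard library and prints a rough text histogram instead of drawing a real chart. The ages list, the function names and the 1.5 x IQR outlier rule are assumptions chosen for illustration, not something taken from the reference.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum), the numbers a box plot draws."""
    ordered = sorted(values)
    # "inclusive" interpolates between the smallest and largest observed values
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

def histogram_counts(values, bin_width):
    """Count how many values fall into each bin of width bin_width."""
    counts = {}
    for v in values:
        bin_start = (v // bin_width) * bin_width
        counts[bin_start] = counts.get(bin_start, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical sample of ages, made up for illustration
ages = [23, 25, 31, 35, 38, 41, 44, 47, 52, 58, 90]

summary = five_number_summary(ages)
counts = histogram_counts(ages, bin_width=10)

# Common box plot convention (an assumption here): points further than
# 1.5 * IQR from the quartiles are drawn as outliers
_, q1, _, q3, _ = summary
iqr = q3 - q1
outliers = [v for v in ages if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print("five-number summary:", summary)
print("outliers:", outliers)
for bin_start, count in counts.items():
    print(f"{bin_start:>3}-{bin_start + 9:<3} {'#' * count}")
```

A plotting library would normally draw these for you; the point of the sketch is only to show what each plot summarizes.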
Quantitative Exploratory Data Analysis Techniques:
1. Interval Estimation (Ranges)
2. Hypothesis testing (Null Hypothesis, Alternate Hypothesis)
1. Interval Estimation (Ranges):
Create a range of values within which a variable is likely to fall. A confidence interval is an interval estimate: for example, a range that is expected to contain the population mean with a stated level of confidence (such as 95%).
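As a rough sketch of interval estimation, the code below builds a confidence interval for a mean using the normal approximation (sample mean plus or minus z times the standard error). The sample values and the function name are hypothetical. For a small sample like this one, a critical value from the t distribution would usually be preferred; the normal value is used here only because it is available in the standard library.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Standard error of the mean, using the sample standard deviation
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err

# Hypothetical sample of ages, made up for illustration
sample = [52, 48, 55, 61, 47, 50, 58, 53, 49, 57]

low, high = mean_confidence_interval(sample)
print(f"95% interval for the mean: ({low:.2f}, {high:.2f})")
```

Asking for more confidence (say 99%) gives a wider interval, which is the usual trade-off between confidence and precision.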
2. Hypothesis testing:
Test various propositions about the data.
Example: test whether the mean age of the Canadian population is 53.
(The null hypothesis (often denoted H0) is the claim in scientific research that the effect being studied does not exist. Wikipedia)
It is a multi-step process. The steps are as follows:
1. Null Hypothesis: the claim being tested, assumed to be true unless the data give strong evidence against it
2. Alternate Hypothesis: the hypothesis that will be accepted if the null hypothesis is rejected
3. Significance Level: the probability of rejecting the null hypothesis when it is actually true, chosen before the test is run (e.g. a 5% significance level, which corresponds to 95% confidence, when testing the claim that the average return of index investing is 6% over a 10-year period)
4. Test Statistic: a numerical measure of how consistent the sample data are with the null hypothesis
5. Critical Value: if the test statistic is more extreme than the critical value, the null hypothesis is rejected
6. Decision: reject or fail to reject the null hypothesis by comparing the test statistic with the critical value
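The steps above can be sketched as a two-sided one-sample z-test on the mean-age example. This is only an illustration: the ages are made up, and the population standard deviation of 8 years is an assumption needed for a z-test (when it is not known, a t-test is the usual choice).

```python
import math
import statistics

def one_sample_z_test(sample, null_mean, population_sd, alpha=0.05):
    """Two-sided z-test of H0: population mean == null_mean."""
    n = len(sample)
    sample_mean = statistics.fmean(sample)
    # Test statistic: distance of the sample mean from the null mean,
    # measured in standard errors
    z = (sample_mean - null_mean) / (population_sd / math.sqrt(n))
    # Critical value for the chosen significance level (two-sided)
    critical = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    # Decision: reject H0 if the statistic is more extreme than the critical value
    reject = abs(z) > critical
    return z, critical, reject

# Hypothetical sample of 16 ages, made up for illustration
ages = [48, 55, 61, 39, 52, 57, 44, 50, 63, 47, 58, 41, 54, 49, 60, 46]

# H0: mean age is 53; H1: mean age is not 53
z, critical, reject = one_sample_z_test(ages, null_mean=53, population_sd=8)
print(f"z = {z:.2f}, critical value = {critical:.2f}")
print("reject H0" if reject else "fail to reject H0")
```

With this made-up sample the statistic is well inside the critical value, so the data give no reason to reject the claim that the mean age is 53.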
Some Basic Probability Distributions:
Binomial Distribution: Counts the number of successes in a fixed number of independent trials, where each trial can have only one of two values (e.g. heads or tails)
Poisson Distribution: Describes the likelihood of a given number of events occurring during a time interval (e.g. customers arriving at your shop in an hour)
Normal Distribution: Symmetrical, bell-shaped data. The probability that a variable falls a given distance below the mean is equal to the probability that it falls the same distance above the mean.
t distribution: Similar to the Normal Distribution but with heavier tails, so extremely large or extremely small values are more likely than under the Normal Distribution. Useful when the sample size is small and the population variance (standard deviation) is not known.
Chi-Square Test: A test, based on the chi-square distribution, to see if a population follows a particular distribution, such as a normal distribution.
The F distribution: Used to test whether two datasets come from populations with the same variance, by comparing the ratio of their sample variances.
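A few of the distributions above are easy to try out with the standard library alone. The sketch below computes a binomial probability, a Poisson probability and checks the symmetry of the normal distribution. All the numbers (10 coin flips, 4 customers per hour, heights with mean 170 and standard deviation 10) are hypothetical.

```python
import math
from statistics import NormalDist

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent two-outcome trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, rate):
    """Probability of exactly k events in an interval with the given average rate."""
    return rate**k * math.exp(-rate) / math.factorial(k)

# Binomial: exactly 5 heads in 10 fair coin flips
p_five_heads = binomial_pmf(5, 10, 0.5)

# Poisson: no customers in an hour when the average is 4 per hour
p_no_customers = poisson_pmf(0, 4)

# Normal: being 10 below the mean is as likely as being 10 above it
heights = NormalDist(mu=170, sigma=10)
below = heights.cdf(160)
above = 1 - heights.cdf(180)

print(f"P(5 heads in 10 flips) = {p_five_heads:.4f}")
print(f"P(0 customers in an hour) = {p_no_customers:.4f}")
print(f"P(height < 160) = {below:.4f}, P(height > 180) = {above:.4f}")
```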
Related Concepts:
What is Z-score?
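The number of standard deviations a particular score lies above or below the mean of its normal distribution: z = (score - mean) / standard deviation. The probability of a score at or below that value can then be looked up from the standard normal distribution.
Helps to compare two values that are from two different normal distributions.
Another definition: it is a measure of how a value is related to the mean.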
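A small sketch of comparing two values from two different normal distributions, using the usual formula z = (value - mean) / standard deviation. The two exams and their means and standard deviations are hypothetical.

```python
from statistics import NormalDist

def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

# Hypothetical exams: the raw scores are not directly comparable
z_a = z_score(85, mean=70, sd=10)  # exam A: 1.5 standard deviations above the mean
z_b = z_score(80, mean=60, sd=8)   # exam B: 2.5 standard deviations above the mean

# Share of scores at or below each value, from the standard normal distribution
p_a = NormalDist().cdf(z_a)
p_b = NormalDist().cdf(z_b)

print(f"exam A: z = {z_a}, share at or below = {p_a:.4f}")
print(f"exam B: z = {z_b}, share at or below = {p_b:.4f}")
```

Even though 85 is the higher raw score, the 80 on exam B is the more unusual result relative to its own distribution.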
Chi-Square test for Normal Distribution:
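Null Hypothesis: In a chi-square test of independence, the null hypothesis is that no relation exists between the categorical variables, i.e. they are independent. In a chi-square goodness-of-fit test against the normal distribution, the null hypothesis is that the data follow a normal distribution. If that hypothesis is not rejected, the data are consistent with a normal distribution.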
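What is the p-value in the Chi-Square test?
The p-value is the probability of getting a result at least as extreme as the one observed, assuming the null hypothesis is true. It helps to understand the significance of the result. A small p-value (typically below the chosen significance level) means strong evidence against the null hypothesis.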
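To see the statistic and the p-value together, here is a minimal sketch of a chi-square test of independence on a 2x2 table of counts. The table is made up, no continuity correction is applied, and the p-value shortcut used here (the complementary error function) is only valid for 1 degree of freedom, which is what a 2x2 table has. Larger tables need the general chi-square distribution from a statistics library.

```python
import math

def chi_square_2x2(table):
    """Chi-square test of independence for a 2x2 table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts: rows are two groups, columns are yes / no answers
observed = [[30, 10],
            [20, 40]]

statistic, p_value = chi_square_2x2(observed)
print(f"chi-square = {statistic:.2f}, p-value = {p_value:.6f}")
if p_value < 0.05:
    print("reject H0: the variables do not look independent")
else:
    print("fail to reject H0: no evidence of a relation")
```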
Reference: Anderson A., Semmelroth D., Statistics for Big Data