Statistics and probability

Nov 24, 2020

Categorical Data

Categorical data is a collection of information that is divided into groups. For example, if an organization or agency collects the biodata of its employees, the resulting data is categorical. It is called categorical because it can be grouped according to the variables present in the biodata, such as sex, state of residence, etc.

Categorical data can take on numerical values (such as 1 indicating Yes and 2 indicating No), but those numbers don’t have mathematical meaning. One can neither add them together nor subtract them from each other.

Two main types of Categorical Data

Nominal Data: Data that is categorical and does not have any order. E.g., Male, Female.
Ordinal Data: Data that is categorical and has an order. E.g., the education level of a person (high school, bachelor's, master's).

Qualitativeness

Categorical data is qualitative. That is, it describes a characteristic using labels, usually words, rather than measured quantities.

Analysis

Categorical data is summarized using the mode and the median: nominal data is analyzed with the mode only, while ordinal data can use both, because its categories can be ranked. In some cases, ordinal data may also be analyzed using univariate statistics, bivariate statistics, regression applications, linear trends, and classification methods.
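As a rough illustration of the difference, here is a small Python sketch using only the standard library. The state names, the education levels, and their ranking are made-up sample values (assumptions for the example), not data from a real survey.

```python
from statistics import mode, median_low

# Nominal data: categories with no order, so only the mode is meaningful.
states = ["Texas", "Ohio", "Texas", "Utah", "Texas", "Ohio"]
most_common_state = mode(states)

# Ordinal data: categories with an order. This ranking is an assumed
# example scale, listed from lowest to highest.
EDUCATION_ORDER = ["High school", "Bachelor's", "Master's", "Doctorate"]
education = ["Bachelor's", "High school", "Master's", "Bachelor's", "Doctorate"]

# Map each label to its rank, take the median rank, then map back to a label.
# median_low always returns a value that occurs in the data, so it is a valid index.
ranks = [EDUCATION_ORDER.index(level) for level in education]
median_education = EDUCATION_ORDER[median_low(ranks)]
most_common_education = mode(education)

print(most_common_state)      # Texas
print(median_education)       # Bachelor's
print(most_common_education)  # Bachelor's
```

Note that the median only works here because the education labels can be ranked. There is no sensible way to rank the states, so asking for their median would not mean anything.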

Displaying and comparing quantitative data

Quantitative data is information about quantities; that is, information that can be measured and written down with numbers. Some other aspects to consider about quantitative data:

Focuses on numbers
Can be displayed through graphs, charts, tables, and maps
Data can be displayed over time (such as a line chart)

Quantitative data can be visualized through bar graphs, histograms, pie charts, etc.

Exploring bivariate numerical data

The correlation coefficient ‘r’ measures the direction and strength of a linear relationship. Calculating ‘r’ is pretty complex, so we usually rely on technology for the computations. We focus on understanding what r says about a scatterplot.

It always has a value between −1 and 1, where strong positive linear relationships have values of r closer to 1 and strong negative linear relationships have values of r closer to −1. Values of r near 0 indicate a weak linear relationship or none at all.
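To see what the technology is doing behind the scenes, here is a minimal sketch of the standard Pearson formula in plain Python. The function name `pearson_r` and the hours/scores numbers are made up for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: hours studied vs. test score.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]

print(round(pearson_r(hours, scores), 3))      # close to 1: strong positive
print(pearson_r([1, 2, 3], [6, 4, 2]))         # perfectly negative line
```

The sketch assumes both lists have the same length and that neither is constant (a constant list would make the denominator zero).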

Probability Distribution

A probability distribution is a function that describes all the possible values a random variable can take within a given range, and how likely each of them is. For a continuous random variable, the probability distribution is described by the probability density function. For a discrete random variable, it is a probability mass function that defines the probability distribution.

Probability distributions come in different families, such as the binomial distribution, chi-square distribution, normal distribution, Poisson distribution, etc. Different probability distributions represent different data generation processes and cater to different purposes. For instance, the binomial distribution gives the probability of an event occurring a given number of times over a fixed number of independent trials, given the probability of the event in each trial. The normal distribution is symmetric about the mean, meaning that values closer to the mean occur more often than values far from the mean.
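As a small worked example of a probability mass function, the binomial case can be sketched in a few lines of Python. The choice of 10 coin flips with probability 0.5 is just an illustrative assumption.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical setup: 10 flips of a fair coin.
n, p = 10, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

for k, prob in enumerate(pmf):
    print(f"P(heads = {k:2d}) = {prob:.4f}")

# A probability mass function must sum to 1 over all possible values.
print(round(sum(pmf), 10))
```

With p = 0.5 the printed probabilities are symmetric around 5 heads, which is the most likely outcome.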

Hypothesis testing

Hypothesis testing in statistics is a way for you to test the results of a survey or experiment to see if you have meaningful results. You’re basically testing whether your results are valid by figuring out the odds that your results have happened by chance. If your results may have happened by chance, the experiment won’t be repeatable and so has little use.
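Here is one way this idea can look in practice: a sketch of an exact one-sided binomial test in Python. The experiment (60 heads in 100 flips) and the 0.05 significance level are assumptions picked for the example.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical experiment: 60 heads observed in 100 coin flips.
n_flips = 100
heads_observed = 60
p_fair = 0.5   # null hypothesis: the coin is fair
alpha = 0.05   # assumed significance level

# One-sided p-value: the chance of seeing at least this many heads
# if the coin really is fair.
p_value = sum(
    binomial_pmf(k, n_flips, p_fair)
    for k in range(heads_observed, n_flips + 1)
)
reject_null = p_value < alpha

print(f"p-value = {p_value:.4f}, reject fair-coin hypothesis: {reject_null}")
```

A p-value below the chosen level says that a result this extreme would be unusual under pure chance. It does not prove the coin is biased; it only makes the fair-coin explanation less plausible.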

Chi-Square test

There are two types of chi-square tests. Both use the chi-square statistic and distribution for different purposes:

A chi-square goodness of fit test determines whether sample data matches a hypothesized population distribution.

A chi-square test for independence compares two variables in a contingency table to see if they are related. In a more general sense, it tests whether distributions of categorical variables differ from one another.

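A very small chi-square test statistic means that your observed data fits your expected data extremely well. In a test for independence, the expected data is what you would see if the variables were unrelated, so a small statistic means there is no evidence of a relationship.
A very large chi-square test statistic means that the data does not fit the expected data very well. In a test for independence, that is evidence that the variables are related.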
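The statistic itself is the sum of (observed − expected)² / expected over every cell of the table. Below is a minimal Python sketch for the test of independence, where the expected counts are the ones you would get if the two variables were unrelated. The function name and both tables are made up for illustration.

```python
def chi_square_statistic(table):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical 2x2 tables.
# Both rows have the same 2:3 split across the columns.
proportional = [[20, 30], [40, 60]]
# The rows split very differently across the columns.
lopsided = [[45, 5], [10, 90]]

print(chi_square_statistic(proportional))
print(chi_square_statistic(lopsided))
```

For a 2x2 table (1 degree of freedom), the usual critical value at the 0.05 level is 3.841. The proportional table gives a statistic of 0, far below that value, while the lopsided table lands far above it, which points to the two variables being related.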


Written by Siva