I am a data scientist and an avid reader of books ranging from biographies to statistics. The most important lesson from these books is to tell your story using stories, not facts. Keeping the audience from zoning out is essential to helping them learn. The statistical musings series is a place for me to jot down the things I come across and find amusing as a data scientist, and I am trying my best to make it accessible to everyone. Because every decision you make requires you to go through a complex optimization problem.
TL;DR: Improbable isn’t…
The ongoing BLM protests have shifted news coverage from the coronavirus to racism, which made me curious about how the headlines changed between the period before George Floyd’s death and the period after.
In this article, we will use content from the New York Times, collected from their archives through their API, to form word clouds based on word frequency, and we will use topic modeling to recognize and visualize the dramatic change. To accomplish this, we will use a mix of spaCy and scikit-learn.
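The word-frequency step behind a word cloud can be sketched with the standard library alone. This is a minimal illustration, not the article's code: the headlines are made-up placeholders rather than New York Times data, the stop-word list is a tiny hypothetical stand-in for the one spaCy provides, and `word_frequencies` is a name chosen here for the example.

```python
import re
from collections import Counter

# Hypothetical placeholder headlines (not real NYT data).
headlines_before = [
    "Virus cases rise as cities extend lockdown",
    "Lockdown rules ease while virus testing expands",
]
headlines_after = [
    "Protests spread across cities over police violence",
    "Police reform debated as protests continue",
]

# Tiny illustrative stop-word list (an assumption; spaCy ships a fuller one).
STOP_WORDS = {"as", "the", "a", "over", "while", "across", "of", "and"}


def word_frequencies(headlines):
    """Count lowercase word tokens across headlines, skipping stop words."""
    counts = Counter()
    for headline in headlines:
        tokens = re.findall(r"[a-z']+", headline.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts


before = word_frequencies(headlines_before)
after = word_frequencies(headlines_after)

# The most frequent words are the ones a word cloud would draw largest.
print(before.most_common(3))
print(after.most_common(3))
```

A word cloud library would take counts like these as input; comparing the two counters side by side already shows which terms dominate each period.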
“Feature Importance” and “Feature Selection” are different phrases meaning the same thing: finding the features that contribute the most information towards the learning task. In this post, we will learn how to use scikit-learn to retrieve the best features.
There are multiple ways of extracting features that create new features in a different vector space altogether, such as Principal Component Analysis (PCA), but there are times when we want to know the importance of the existing features before generating any secondary features. …
“Tesla shares tank after Elon Musk tweets the stock price is ‘too high’” was a recent headline, even though an earlier court order requires him to get a company lawyer’s approval before issuing any written communications regarding Tesla’s finances. In this article, we look into scraping Elon Musk’s tweets and pulling Tesla’s stock prices from Yahoo Finance, followed by sentiment analysis and an analysis of the relationship between tweet sentiment and the variation in Tesla’s stock price.
All the code and *.csv files can be found in my GitHub repository, https://github.com/ApurvaMisra/tweet_analysis.git; only snippets of the code are provided here.
GetOldTweets3 library was used…
We will start with a brief overview of the idea, then move on to the variety of tests, and try to include an example to work with in Python.
Hypothesis testing is a way to form statistical conclusions about a population from data collected from a sample that is smaller than the population. A hypothesis is a statement about a parameter that we want to prove or disprove, hence the names:
Null Hypothesis = Ho = Status quo [For example: treating humans with a particular sunscreen does not change the rate of getting burnt]
Alternate Hypothesis = Ha = Reason why data is being collected [For…
Let’s start with the use of statistical tests with the help of an example:
A company called ‘pineapple’ wants to know whether sales increase over the next three weeks if they give out a 10% discount on their app, compared to a control group that is not given any discount. It’s necessary to account for a time duration because some users are very active and others are not.
In the above case, there are two conditions
Our null hypothesis is always the one in which we…
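For the discount experiment described above, one way to compare the two groups is a one-sided permutation test on the difference in mean purchases. This is only a sketch under stated assumptions: the purchase counts are made-up numbers, the permutation test is a choice made here for illustration and not necessarily the test the article goes on to use, and `permutation_p_value` is a hypothetical helper name.

```python
import random

# Hypothetical purchases per user over the three-week window (made-up numbers).
control = [2, 3, 1, 4, 2, 3, 2, 1]   # users given no discount
discount = [4, 5, 3, 6, 4, 5, 3, 4]  # users given the 10% discount


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(treated, untreated, n_permutations=10_000, seed=0):
    """One-sided permutation test for mean(treated) > mean(untreated).

    Under the null hypothesis the group labels are interchangeable, so we
    shuffle the pooled data and count how often a random split produces a
    difference at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = mean(treated) - mean(untreated)
    pooled = list(treated) + list(untreated)
    n_treated = len(treated)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_treated]) - mean(pooled[n_treated:])
        if diff >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return observed, (at_least_as_large + 1) / (n_permutations + 1)


observed_diff, p_value = permutation_p_value(discount, control)
print(f"observed difference in means: {observed_diff:.2f}")
print(f"one-sided p-value: {p_value:.4f}")
```

A small p-value here would count as evidence against the null hypothesis that the discount makes no difference to sales.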
Data Scientist at Pronavigator | Data Science | ML | Statistics