There are plenty of challenges out there that data science can solve, but I have picked a very basic one. Why? Simply because I do not want to waste my energy thinking about big problems, or waste yours reading something you will forget in a couple of hours :).
PROBLEM: We will classify the sentiment of news titles (positive, negative, or neutral).
Problem Analysis:
For this part, we will answer the following questions:
- What will be the input for your program?
- What output are you expecting? Is it a continuous value (like the output of a regression model) or a label (like spam or ham)?
- What sort of problem is this? Is it supervised or unsupervised learning? Is it a classification or a clustering problem?
- What are the features you should take into account?
The input is a news title, and the expected output is a label rather than a continuous value. Since the sentiment of any text falls into one of three major categories (positive, negative, or neutral), this is clearly a classification problem. And because we will be using labelled data, it is supervised learning.
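To make that framing concrete, here is a minimal sketch of what labelled data for this problem could look like. The names `LABELS`, `LabelledTitle`, and `check_labelled_data` are made up for illustration; they are not from any library.

```python
from typing import List, Tuple

# The three sentiment classes named in the text.
LABELS = ("positive", "negative", "neutral")

# Hypothetical type alias: one labelled example is a (news title, label) pair.
LabelledTitle = Tuple[str, str]


def check_labelled_data(data: List[LabelledTitle]) -> List[LabelledTitle]:
    """Return the data unchanged if every label is one of LABELS; raise otherwise."""
    for title, label in data:
        if label not in LABELS:
            raise ValueError(f"unknown label {label!r} for title {title!r}")
    return data


# Supervised learning means every input title arrives with its label attached.
samples = check_labelled_data([
    ("Government offers funding opportunities for the startups", "positive"),
])
```

The point is only the shape of the data: text in, one of three labels out, and a label attached to every training example.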
Your ML model will not be able to predict correctly if you don't have enough training data.
Suppose you feed the following two texts into your algorithm:
'New government rule against people will' --> Negative
'Government offers funding opportunities for the startups' --> Positive
What would you expect if the new input text is:
'Tech Companies in Nepal are having problem to raise funds. Nevertheless, they are doing great with customer acquisition'
Your model has never been taught this kind of complex structure, so it has no basis for a reliable prediction.
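One rough way to see the gap is to check how many words of the new text the model has ever seen. The sketch below is only an illustration, not a real classifier: the helper names `tokenize` and `coverage` are my own, and simple word overlap is an assumed stand-in for "what the model was taught".

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


# The two labelled training texts from the example above.
training_data = [
    ("New government rule against people will", "negative"),
    ("Government offers funding opportunities for the startups", "positive"),
]

# Every distinct word the "model" has seen during training.
vocabulary = {word for text, _ in training_data for word in tokenize(text)}


def coverage(text, vocab):
    """Fraction of the tokens in `text` that also appear in `vocab`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    seen = sum(1 for token in tokens if token in vocab)
    return seen / len(tokens)


new_title = (
    "Tech Companies in Nepal are having problem to raise funds. "
    "Nevertheless, they are doing great with customer acquisition"
)

print(len(vocabulary))                  # 12 distinct training words
print(coverage(new_title, vocabulary))  # 0.0, no word was seen in training
```

With only two training texts, not a single word of the new title is familiar, and that is before we even consider that the title mixes a negative statement with a positive one.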