What are other metrics that can help validate the F1 score? I'm asking because I believe having a small sample size can skew the number or hide a flaw in the classification algorithm it's scoring.
In other words, what else should I ask to validate that a 70% F1-score is better than a 90% F1-score but on a smaller data set?
That is a good point. It helps to think of the precision and recall (and, by extension, the F1-score) from your test data as random variables sampled from a distribution that models the probability of observing each value in your sample given a "true" precision/recall value. I won't go too deep into the math, but this was part of the approach in the confidence calculations towards the end of the paper: factoring the uncertainty of your classification metrics into the confidence calculations.
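For intuition, here is a minimal sketch of that idea; this is my own illustration, not necessarily the paper's exact method, and the TP/FP counts are made up. It treats the "true" precision as a Beta-distributed random variable given the observed counts:

```python
from scipy.stats import beta

# Hypothetical counts from a small test set
tp, fp = 18, 2  # observed precision = 0.90, but on only 20 predicted positives

# With a uniform Beta(1, 1) prior, the posterior over the "true" precision
# is Beta(tp + 1, fp + 1); the 95% credible interval shows the uncertainty.
lo, hi = beta.ppf([0.025, 0.975], tp + 1, fp + 1)
print(lo, hi)  # roughly (0.70, 0.97): wide, despite the 0.90 point estimate
```

A 0.90 precision on 20 examples is consistent with a true precision anywhere in that interval, which is exactly why the sample size matters.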
To answer your question more formally, the main factors that determine how stable the F1-score from your test set is are (a sketch follows the list):
- The size of the test set
- The percentage of the test set that has the label (in our case, the feedback tag)
- The precision and recall values themselves
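One practical way to put numbers on that stability is to bootstrap the test set and look at the spread of the resampled F1-scores. This is a sketch of a common technique, not necessarily what the paper does; `y_true` and `y_pred` stand for your hypothetical test labels and predictions:

```python
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_f1(y_true, y_pred, n_boot=2000, seed=0):
    """Resample the test set with replacement, recomputing F1 each time,
    to get a distribution for the score instead of a single point estimate."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # bootstrap sample of indices
        scores.append(f1_score(y_true[idx], y_pred[idx], zero_division=0))
    # 95% percentile interval: the wider it is, the less the point estimate means
    return np.percentile(scores, [2.5, 97.5])

# lo, hi = bootstrap_f1(y_true, y_pred)  # y_true/y_pred: your test labels/predictions
```

Run this on both test sets: a 90% F1 on a small set will typically come back with a much wider interval than a 70% F1 on a large one, which directly answers the "which number should I trust more" question.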
If you have a large dataset and you train until accuracy reaches 100%, what about the noisy data? Will this classification system reduce the noisy values in the system? Machines pick up some amount of noise during training or runtime prediction. Can we reduce that noise through this classification?
That depends on what you mean by noise. In the case of text classification, a lot of the noise in training comes from disagreement between human labelers. Unfortunately, classification systems will only be as good as the labels they are trained on.
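If you want to quantify that disagreement, a common check is Cohen's kappa on a subset labeled by two annotators; the labels below are made up for illustration:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators on the same 8 examples
annotator_a = [1, 0, 1, 1, 0, 0, 1, 0]
annotator_b = [1, 0, 0, 1, 0, 1, 1, 0]

# Kappa corrects raw agreement for agreement expected by chance;
# values well below 1.0 indicate label noise the model will inherit.
print(cohen_kappa_score(annotator_a, annotator_b))
```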
However, if you are referring to noise in the sense of typos and misspellings, then yes: depending on the training and/or preprocessing steps, classification systems can potentially reduce the noise in the input data and still achieve good results.
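For example, a minimal normalization pass might look like the toy sketch below; real pipelines often add spell correction or character n-gram features so that leftover typos like "grreat" still map near the right tokens:

```python
import re

def normalize(text: str) -> str:
    """Very simple normalization: lowercase, drop punctuation/symbols,
    and collapse runs of whitespace before tokenization."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # strip punctuation and emoji
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text

print(normalize("Grreat  product!!  LOVE it :)"))  # -> "grreat product love it"
```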
By noise I mean overfitting: the model starts taking the wrong input data into account as well, fitting its curve to those points too.
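Something like this toy sketch on synthetic data (my own illustration) shows what I mean: a model that reaches 100% training accuracy by memorizing flipped labels scores noticeably worse on held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data with 20% of labels deliberately flipped ("noisy")
X, y = make_classification(n_samples=500, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unconstrained tree memorizes the noisy points too
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(tree.score(X_tr, y_tr))  # ~1.0 on training data
print(tree.score(X_te, y_te))  # noticeably lower on held-out data
```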
Anyway, I got your point. Thank you!