OK, I got it. Each URL is still a 60,000-dimensional vector; it's just that 59,990 of the entries happen to be zero, and the 10 that are nonzero record how many times that tag was chosen. The problem is that it's not the same ten that are nonzero for each URL. Did I get it right?
So what I was suggesting was to run some off-the-shelf projection method on this data, for example principal component analysis (PCA), which will project your data from 60,000 dimensions down to, say, 20 (you choose how many). Then for each URL you have just 20 numbers that are directly comparable across URLs, and you can more easily do k-NN on this data.
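To make it concrete, here is a rough sketch of what I mean, in Python with scikit-learn. I use TruncatedSVD, which is essentially PCA without the mean-centering step and works directly on sparse input; the matrix here is random stand-in data, and all the names are illustrative:

```python
# Sketch of the suggested pipeline: project sparse URL-tag counts
# down to 20 components, then run k-NN in the reduced space.
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD  # PCA-like, accepts sparse input
from sklearn.neighbors import NearestNeighbors

# Stand-in for the real data: n_urls x 60000 matrix of tag counts,
# with roughly 10 nonzero entries per row.
X = sp.random(1000, 60000, density=0.0002, format="csr")

svd = TruncatedSVD(n_components=20)   # 20 is a tunable choice
X_reduced = svd.fit_transform(X)      # each URL is now 20 dense numbers

# k-NN in the 20-dimensional space; cosine works well for count data.
knn = NearestNeighbors(n_neighbors=10, metric="cosine").fit(X_reduced)
distances, indices = knn.kneighbors(X_reduced[:1])  # neighbors of URL 0
```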
The advantages are that:
1) you reduce noise and the data becomes easier to handle, and
2) the new "components" can be interpreted as more subtle descriptors of the URLs than the original tags.
For example, one component could become the "techiness" of a URL, another its "personalness", another its "comediness", and so on. All of this comes out of the data automatically, and you can interpret each component by looking at what mixture of the original features (the tags) it contains.
This is just a suggestion; the algorithm seems great as it is. Still, the technique seems a bit crude: if one page is labeled "tech" and another "technology", I think it will miss the connection. PCA, however, could discover that those features are related.
> OK, I got it. Each URL is still a 60,000-dimensional vector; it's just that 59,990 of the entries happen to be zero, and the 10 that are nonzero record how many times that tag was chosen. The problem is that it's not the same ten that are nonzero for each URL. Did I get it right?
Correct.
Interesting suggestion with the PCA. One of the strengths of having the vectors be mostly zeros is that it makes computation much, much easier, even though the vectors are so large: I can very quickly dismiss any pair of vectors that has no nonzero dimensions in common. If I condensed the 60,000 dimensions down to something like 20, I'd have to do far more calculations to find the top matches, since it would be harder to rule out vectors that are clearly not a match. I'd have to switch to an algorithm better suited to "more data points in fewer dimensions", which is something I'd love to do but can't afford to spend time on right now.
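Concretely, the quick-dismiss trick looks something like this (an illustrative sketch only; the data structures and names are made up):

```python
# An inverted index from tag -> URLs means we only ever score candidate
# pairs that share at least one tag; everything else has zero overlap
# and is dismissed without any computation.
from collections import defaultdict

# Assumed shape of the data: each URL maps to a dict of {tag: count}.
url_tags = {
    "example.com": {"tech": 3, "programming": 2},
    "example.org": {"tech": 1, "jokes": 5},
    "example.net": {"cooking": 4},
}

index = defaultdict(set)
for url, tags in url_tags.items():
    for tag in tags:
        index[tag].add(url)

def candidates(url):
    """URLs sharing at least one tag with `url`; all other vectors
    have no nonzero dimensions in common and can be skipped."""
    out = set()
    for tag in url_tags[url]:
        out |= index[tag]
    out.discard(url)
    return out

print(candidates("example.com"))  # {'example.org'}; example.net is never scored
```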
If this type of analysis is as flexible as you say, it's possible I could get some gains from condensing the dimensions down to something like 1,000 or 500, which would take care of superficially similar tags like "tech" and "technology" while still maintaining the diversity that is there. Looking at the most common tags, I can say with pretty good certainty that many of them are not correlated enough to merge without losing some inherent "match" capability -- but then again, I know nothing about PCA.
I'm not at all familiar with PCA, so I'm going to go research it as soon as I finish writing this and other responses. If you could, though, perhaps you could explain briefly how it works? Your explanations resonate well with me. As I understand it, it's a method for projecting a high-dimensional space into a lower-dimensional one, probably by figuring out correlations between the dimensions and then "combining" them in some way -- something akin to creating the desired number of projected dimensions such that they are as close to orthogonal as possible. Am I on the right track?
> One of the strengths of having the vectors be mostly zeros is that it makes computation much, much easier, even though the vectors are so large: I can very quickly dismiss any pair of vectors that has no nonzero dimensions in common.
I agree, and depending on the data it might be worth sticking with this for simplicity. One question is whether it is good to dismiss pairs that have no dimensions in common. Say one URL has "people, tech, jokes" and the other has "ppl, technology, fun"; maybe that pair should not be missed. On the other hand, because of the tag suggestions you mentioned earlier, the tags might be quite homogeneous.
Your intuition about PCA is essentially correct. The way it works is that it splits your matrix into a product of two matrices. Say you have a large matrix with n rows (URLs) and 60,000 columns (tags); call it X.
PCA will give you X = A*S (approximately)
Matrix A will have n rows and, say, 20 columns; matrix S will have 20 rows and 60,000 columns. (Notice that there are far fewer entries in total -- this is a compression of the data.)
Matrix A tells you how the original URLs are represented in the new features; matrix S tells you how the new features are expressed in the old features (a linear mixture of the old features, a.k.a. the tags). Depending on how you do this factorization into two matrices you get different methods, but PCA is optimal in certain important ways.
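In code, the factorization looks roughly like this (a sketch using scikit-learn's TruncatedSVD in place of PCA -- it skips the mean-centering step, which is convenient for sparse tag counts; the matrix below is random stand-in data):

```python
# Sketch of the X ~= A*S factorization described above.
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# Toy stand-in for the URL-tag count matrix: mostly zeros.
X = rng.poisson(0.001, size=(100, 60000)).astype(float)

svd = TruncatedSVD(n_components=20)
A = svd.fit_transform(X)   # n x 20: the URLs expressed in the new features
S = svd.components_        # 20 x 60000: new features as mixtures of the tags

print(A.shape, S.shape)    # (100, 20) (20, 60000)
err = np.linalg.norm(X - A @ S) / np.linalg.norm(X)
print(f"relative reconstruction error: {err:.2f}")  # A*S approximates X

# To interpret a component, look at which original tags it weights most:
top_tags = np.argsort(S[0])[::-1][:5]  # indices of component 0's 5 strongest tags
```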
There are lots of nice intros if you search for "pca tutorial", and there are ready-made implementations in most languages. The basic ideas are simple, but then the rabbit hole goes deeper and deeper -- just like this thread, so I'll stop here :)