Here’s a quick (and extremely simplified) explanation of PageRank from Google’s site:
PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page’s value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves “important” weigh more heavily and help to make other pages “important.”
You can see why this makes sense on the web, right? Every page should link to another, and so on. The genius of PageRank is that it establishes relevance by checking whether good content is linked to by other good content. If it is, then you’ve got a nice PageRank. This is a radical departure from the early days of search, when webmasters would cram keywords into their webpages in order to rank higher. That method tended to work when the earliest engines looked at how frequently keywords appeared in a given document. PageRank is brilliant because it tries to establish context.
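To make the "links as votes" idea concrete, here is a minimal sketch of the textbook power-iteration version of PageRank. It is nowhere near what Google actually runs. The function name, the tiny three-page graph, and the choice to spread a dead-end page's rank evenly over all pages are assumptions for illustration; 0.85 is just the commonly cited damping value.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. `links` maps each page to the pages it links to."""
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with the rank spread evenly over all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to,
                # so a vote from a high-ranked page is worth more.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its
                # rank evenly over every page.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: A and C both link to B, and B links back to A.
example_links = {"A": ["B"], "C": ["B"], "B": ["A"]}
ranks = pagerank(example_links)
```

In this toy graph B collects the most votes, A comes second because its single vote comes from the well-ranked B, and C, which nobody links to, ends up last.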
Now that I’ve bored you to death, I’ll get back to the point: 9 out of 10 of the documents on your network have no links between them.
So what you really need to build a half-decent search tool / semantic desktop is something that helps you add metadata to documents, the way the MusicBrainz Tagger does for MP3s.
A further thought:
If you can find links in Word documents (in homework, for instance), and you have access to a huge range of information about those links (from, say, Google), you should be able to hazily guess that Word document X, with 3 links to cancer journals, is a source of information on cancer.
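A rough sketch of that hazy guess might look like the following. It assumes the document's text has already been pulled out of the Word file, and it stands in a small hand-made table (DOMAIN_TOPICS, with made-up example domains) for the "huge range of information" a search engine would have about each link. The names here are all hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Loose pattern for http(s) links in plain text.
URL_PATTERN = re.compile(r'https?://[^\s<>"]+')

# Hypothetical lookup: what we "know" about each linked site.
DOMAIN_TOPICS = {
    "cancer-journal.example": "cancer",
    "oncology-review.example": "cancer",
    "tumour-studies.example": "cancer",
    "guitar-tabs.example": "music",
}


def guess_topic(text, domain_topics=DOMAIN_TOPICS):
    """Guess a document's topic from the sites it links to."""
    votes = Counter()
    for url in URL_PATTERN.findall(text):
        host = urlparse(url).hostname
        topic = domain_topics.get(host)
        if topic is not None:
            # Each recognised link counts as one vote for its topic.
            votes[topic] += 1
    if not votes:
        return None  # no recognised links, so no guess
    return votes.most_common(1)[0][0]


homework = (
    "Sources: http://cancer-journal.example/vol1 and "
    "http://oncology-review.example/a12 plus "
    "http://tumour-studies.example/intro, "
    "also http://guitar-tabs.example/song for the break."
)
topic = guess_topic(homework)
```

With three links to cancer sites and one to a music site, the guess for this homework comes out as "cancer". A document with no recognised links gets no guess at all, which is the situation most documents on your network are in.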