No smarts, no bubble, no signals decided by overfitting to a biased engineer's preference.
I’m tired of searching for “generic keyword” and getting a page with an extremely low signal to noise ratio written like this:
“Many people search for generic keyword. That is why you can find all about generic keyword here. In fact we specialize in generic keyword and slight alterations of generic keyword.”
It’s like Google stopped caring that people were gaming it.
But I don't get good results for "rug pull".
Now hoping for a search engine that favors text-heavy sites and punishes paywalls.
It's not always taking me to totally relevant sites but the results contain my favourite type of content.
Full of writing and pure html - usually the hallmark of someone who knows what they are doing, wants to communicate but doesn't want to waste their time.
Coincidentally, the other day I was daydreaming about a search engine that favors sites that are updated less frequently. The thought being, the kinds of labors of love that characterized the 1990s Web that I still sometimes miss are still out there, it's just harder to find them amidst the flood of SEO dreck. So perhaps they could be made discoverable again with the help of a contrarian search engine that specifically looks for the kinds of things that Google and Bing don't like to see.
This was a little different, extremely few results, but a couple of them really made me grin, and all(?) made me curious or raise an eyebrow or reflect on who/what might have been the source of the link, or remember some obscure connection from grad-school. So, if anything a crawled list of results worthy of ponder, thanks for this!
> If you search for "Plato", you might for example end up at the Canterbury Tales. Go looking for the Canterbury Tales, and you may stumble upon Neil Gaiman's blog.
I know it is just a suggestion, but had to try searching both, with no luck in getting the expected unexpected.
btw, interesting how many http (as opposed to https) sites show up...
I love this.
I feel like I'm stuck with Wordpress.com because it brings me some traffic (whereas something hand-rolled on nsfspeech or digital ocean or whatever would literally be off the edge of the web), but the structure of that is so cool.
What would be cool would be for people who host old stuff to "archive" it at some point so it doesn't appear in normal results, only if you tick "include archives".
It’s rudimentary but no IT-related result.
I think some grouping by site (and capping to only the few most relevant links there) would improve the engine.
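A minimal sketch of what that grouping could look like, assuming the engine already has a ranked list of result URLs. The function name `group_by_site` and the cap of two links per site are made up for illustration and say nothing about how Marginalia actually works:

```python
from urllib.parse import urlparse


def group_by_site(ranked_urls, max_per_site=2):
    """Group ranked URLs by host, keeping only the top few per host.

    Hosts appear in the order of their best-ranked result, because
    Python dicts preserve insertion order.
    """
    groups = {}
    for url in ranked_urls:
        host = urlparse(url).netloc.lower()
        bucket = groups.setdefault(host, [])
        if len(bucket) < max_per_site:
            bucket.append(url)  # lower-ranked extras from this host are dropped
    return groups


# Hypothetical ranked results, best first.
results = [
    "https://example.org/a",
    "https://example.org/b",
    "https://example.org/c",
    "https://example.com/x",
]
grouped = group_by_site(results)
```

Each site then shows up once, with its most relevant couple of links underneath it.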
Wikipedia is perfectly usable without JavaScript and it's one of the nicest sites out there typography-wise, so I'd reconsider this redirection.
and it worked surprisingly well.
Does anyone else have good examples?
I never knew it was fear that was preventing me from scrolling
So how on earth do you take an idea like this and scale it for both broad web coverage and high traffic? For that matter, just how much 'useful' text is there on the net?
I’ve often thought that Google could be turned back into a good search engine by simply eliminating the crap and letting the useful sites float to the top of the results.
marginalia.nu seems to like my sites, so it must be good!
Some results are prefixed with ! or an arrow dingbat. What does that mean?
You're doing God's work here. Thanks and good luck.
For a simple test, I searched "fall of the roman empire". In your search engine, I got wikipedia, followed by academic talks, chapters of books, and long-form blogs. All extremely useful resources.
When I search on google, I get wikipedia, followed by a listicle "8 Reasons Why Rome Fell", then the imdb page for a movie by the same name, and then two Amazon book links, which are totally useless.
I really think I want an internet console, not an animated magazine.
Although a search for "Fractional Reserve Banking" shows that some further ranking improvements can be made to exclude unrelated results, and potentially penalize old conspiracy sites.
https://search.marginalia.nu/search?query=fractional+reserve...
Cool concept though!
Oh man, I love subtle jabs and tongue in cheek writing like this. Very Robin Williams-esque.
Perhaps more importantly, it delivers the most correct result when searching for my username: the first result is not any of my social media accounts, or even my own blog, but the text of the obscure science fiction story that I took my username from! Well done.
I've immediately added this as a search keyword in Firefox, and I'll be using it more in the future.
Could meta search engines like DuckDuckGo include this as a source? Should they?
Please do announce it here if you ever decide to solicit help or contributors. My stab at this problem was to have a search index of only ad-free pages, on the hypothesis it would turn up self-hosted blogs, university personal pages, that sort of thing. But the results were too thin, your approach is much better.
I searched python make a bar chart and it returned a live coding video with an AI generated text transcript and two articles which mentioned a different kind of bar.
I then narrowed it down to just python bar chart, and got a blog post about scripting with a bar chart in it, this http://www.nitcentral.com/voyager4/hellyear.htm with monty python, bars, and charts from 1996 and among some other things I found this https://python-course.eu/naive_bayes_classifier_introduction..., which had an example of a python bar chart even though the title of the page made me think it wasn't what I wanted.
So for what I imagine to be a difficult search because of all the different meanings of the words, I found my result on the second query pretty quickly, and found some cool unrelated stuff too.
I like mostly that I get what I type in, and not exactly what I want, but what I want is there too.
The third search result for "dog" is this page on how to remove AOL Instant Messenger, published in 2002.
https://sillydog.org/netscape/kb/removeaim.html
No one wants to see newsletter signup popovers, but "modern web design" includes good performance and relevant content. (The search engine itself takes about 2 seconds to first contentful paint, not great.)
There do seem to be some text encoding issues though. For example: https://search.marginalia.nu/search?query=tim+visee
I hate pinterest with a passion. I may need to get a "his" laptop separate from my wife, since she needs that darn pinterest extension for pinning photos.
Wonder how 'text-only first' prioritization is being implemented, algorithmically speaking?
I'm sure I'm not the only one, either. Content-rich sites need more love.
The first results on Google are an Australian site for buying an LC70 (news-flash, I can't buy one in the USA). There is also a MotorTrend article about the LC70...also irrelevant since it's only sold in Australia.
No comment.
NFL one had 'some' decently related results, but the websites were all strangely disreputable.
It's... no... it can't be... a search engine that finds actual information instead of 5 megabyte blobs of tracking code and SEO crap!
- Search results are not overrun with commercial, SEO stuffing, "content" farms.
I don't know what to say. This is such a refreshing sight. Well done.
Ivermectin (Google): https://www.google.com/search?q=ivermectin
The difference in the overall _thrust_ of the results is remarkable.
Very interesting! Thanks for building it.
This is a very polite way of saying "this engine isn't very good"
Overall impressed with the project but I thought the word play there was funny
- Search results are often very old, from the early 2000s (I guess because back then more websites were text oriented). Are you taking into account the age of the page when showing results? It would be great to see more up-to-date results at the top.
- I noticed a few results which directed me to websites with security risks, Firefox didn't even let me open them. Is it possible to filter these out from the results?
It does find a handful of references to me from over twenty years ago, though, which I thought was fascinating.
I miss finding blog posts and scholarly articles in long form. I hate the SEO sites with unreadable UI because the information in them is often a lot lower quality as well.
However, it seems that it currently does not support non-Latin alphabets, which I understand in an early version. Still, its handling of such "exception cases" could be improved:
when I search for a Russian word, say "Аквариум", I get <<Search "Аквариум" needs to be a word>>, which is rather rude...
The Google results are all blogspam or sales pages for cheap shipped saunas. Lots of "IR" results. Phony health benefit pages. Stock photos solely of beautiful new hotel gyms.
I've noticed this problem with Google results for quite some time. Sadly, the new content being created of the top variety is mostly being done within private Facebook groups that can't be easily searched, linked, or archived.
<<javascript pipe syntax>>: none of the search results appeared to have anything to do with Javascript pipe syntax. (Which doesn't exist yet, but it's under discussion.) Google gives a bunch of highly-relevant results.
<<hans reichenbach relativity>>: first result is a list of books about relativity, one of which is Reichenbach's "Philosophy of space and time"; good, but there's no real information there. Second is about Reichenbach but nothing to do with relativity or even, really, philosophy of science. Third is about philosophy of science and mentions some of Reichenbach's work but not related to relativity. Fourth mentions Reichenbach's "Philosophy of space and time" as part of a list of books relevant to a seminar on "time and eternity". None of this is bad, but it's not great either. Google gives a couple of online philosophy encyclopaedia entries, then a journal article on "Hans Reichenbach's relativity of geometry", then the Wikipedia article on Reichenbach ... much more informative.
<<luna lovegood actress>>: I thought this would be an easy one. It was easy for Google, which gave me her name in large friendly letters at the top, then her IMDB entry, and a bunch of other relevant things. Literally nothing in the Marginalia results was relevant to the query.
I guess maybe popular culture is just too monetizable, so no one is going to write about it on the sites that Marginalia crawls? Let's try some slightly less popular culture.
<<wilde "a handbag">>: First result is kinda-relevant but weird: it's about a musical adaptation of The Importance of Being Earnest. It doesn't mention that famous line from the play, but one of the numbers in the musical has the words "a handbag" in the title. Second result is a review of a CD of musicals, including the same work. Third is a bunch of short reviews of theatrical items from the Buxton Festival Fringe, one of which is a three-man adaptation of TIOBE. Next four are 100% irrelevant. Next is a list of names of plays. Last one is actually relevant; it's an article about "Lady Bracknell through the decades". Google puts that one first (after, sigh, a bunch of YouTube videos which look as if they might actually be relevant).
I really like the idea of this, and many of the things it turns up look like they might be interesting, but it isn't doing very well at producing results that are actually relevant to the thing being searched for.
I did search for "Daria Bilodid" and the results were a bit troublesome. First the Wikipedia result did not work: https://en.wikipedia.org/wiki/Daria_Bilodid vs https://encyclopedia.marginalia.nu/wiki/Daria_Bilodid
Secondly the results matched a few judoinside.com results which is ok, including sites to her competitors, but seemed to miss the judoinside website for her: https://www.judoinside.com/judoka/92660/Daria_Bilodid.
The design is hard on my eyes. I have an average-size screen and it's using less than half of the width. The line-height is enormous and seems to break up the flow, making it uncomfortable for me to read. The spacing around each result is the same as between titles and paragraph items, which again was unpleasant to read.
If so, would you ever tweak the parameters to surface sites that aren't served with "HTTPS"?
It also does not work for subjects. For example, if you search "discrete math" it links to academic webpages, but most of them do not have any notes posted. It is just a plain text website with the syllabus of a class.
Is that accidental or is this website promoted because it's text heavy and will surface for any search without many results?
millionshort.com
I'm looking at "punishes modern web design"... This thing IS modern web design. I think it's called "marginalia" in reference to the huge margins they chose!
I'm using a browser on a Linux desktop and side-by-side, HN's page design is old-fashioned tasteful making pretty good use of space, and Marginalia has a font that's more than twice the 2D pointsize and is so spread out with whitespace that the "Tips" on the home page are off the bottom of my window.
Suggestion - Use system fonts (the site downloads almost 300k of fonts)
For example, a search term for 'Scotichronicon' returns some fascinating results, but the search term itself doesn't appear in the title or excerpts of most of the results.
This makes it harder to judge how relevant they are.
I thought that we've reached the time to embrace all cultures in the world, but this retrogressive engine proves that most modern tech designers are myopic about other civilizations in the globe.
I currently use google as it's set as the default search when I type in the address bar, but would love to switch and move google/ddg to an added character like "<search terms> @g"
https://search.marginalia.nu/search?query=rxjava+2+api+docs
https://www.google.com/search?q=rxjava+2+api+docs&oq=rxjava+...
Some reactions:
- The font is really big and the columns really narrow, so I get 3 - 4 entries per page, something like 8 words per line, and huge spacings between lines, which makes it a frustrating experience. I've been using the recommendations in https://practicaltypography.com/, which recommends 60 - 90 characters in a line I think, and line spacing of 120% - 140% (I like 125%). The line lengths here might technically fall within the lower bound, but it's really short, and for search results I'm going to try scanning the text to see if there's something relevant, so I think going on the long side is better here. At least make the width somewhat variable so that I can shrink the rather large font and fit more on the line.
- The results are eclectic, but I'm not sure it's usable at the moment. "scala append list" did not get me much that's helpful, while Google will usually at least put up some click-farming tutorial that although minimal effort does tend to answer the question. Both "mapo doufu recipe" and "ma po do fu recipe" had very few recipes, although the latter did have one. Unfortunately, recipe websites are some of the worst, with about 10 pages of description, ads, pictures, what-have-you until the recipe at the very bottom. "collection unmitigated pedantry" did return the acoup.blog entry at the top, though.
Good luck on the project!
> Convenience functions have been added, and the search engine can now perform simple calculations and unit conversions. Try 1 pint in cubic centimeters, or 50+sqrt(pi). This functionality is still under development, be patient if it doesn't work.
Why would you make any ever so small effort to implement calculations? I don't get it.
If your search engine enabled me to find more useful search results to my queries than google or yacy or whatever, I wouldn't care one tiny bit about being able to do calculations with it.
Why not focus on the search functionality?
http://demont.myds.me/leerecipes/mainmeals/mainmeals1/chicke...
compared to Google's endless drivel of "This chicken stir fry recipe will become a staple in your home. It’s so quick to make and you can use whatever vegetables you have on hand. It tastes wonderful regardless of how you alter the ingredients. ... " JUST SHUT UP AND GIVE ME THE RECIPE!
I always search myself on new search engines to compare the results. Most engines return my personal blog/website, books/stories I've written, news stories, my github projects/contributions, social links, etc.
This search engine surfaces just three obscure IRC logs that contain my nick in join/part messages (nothing said from me!) from 2009. And nothing else.
There's probably some things this approach is really good at but I'm not sure what they'd be for me off hand. Always cool to see new approaches to search, though.
If you figure out some sort of funding model (maybe even just Patreon) I could totally see this as a viable side project
Already discovered this recipe site: https://based.cooking/
I love how adding recipes is through pull requests: https://github.com/LukeSmithxyz/based.cooking/pulls
Granted, google's web results above are perhaps what people are looking for 75% of the time, but how limiting and boring.
I'm also a sucker for the simplistic text-centric, information-laden pages from the pre-facebook era.
For 'global warming', however - since Marginalia excludes modern web-design pages - the results are of dubious relevance and interest, since they are, well, 'old'.
I see myself using this engine a lot.
Such a relief to not wade through oceans of worthless crap any more.
I didn't realize how much I missed this stuff.
The popular web has become so bad nowadays.
For example, if you search "jehovahs witnesses", all pages from jw.org are missing.
Exactly the same thing happened when I searched "mormons" - the official website is missing and it only brings up sects/hate/conspiracies against mormons.
I searched for high pressure air (HPA) regulator trying to find a description of how one works. I didn't find that, but did find some interesting links on how they're used in scuba, and one guy's homemade gas laser.
The results were all either barely relevant, outdated (sites that covered the game back in the 90s/2000s before it was re-released), at best tangentially relevant or complete garbage noise. Some of the most highly relevant pages (such as the Steam store listing, the fandom wiki, the publisher/developer's forums for the re-releases, the Baldur's Gate 3 website and the subreddit) were not included at all. Those are all fairly text heavy by any reasonable standard, so I assume they were "punished" because they use JS? Would make sense that nearly all of them are way out of date.
Then I searched specifically for "Baldur's Gate Wiki" but still out of luck - some results, but nothing vaguely Wiki-like.
Finally I searched for "Baldur's Gate Fandom Wiki". This is basically "search engine easy mode", by giving essentially the name of the site I am looking for. I got ZERO results. At this point I gave up and decided that this thing is useless.
Look, I'm all for unearthing good long-form content (in fact I would say that much of the content around this specific game would qualify), and I do get as annoyed at modern SPAs as the next grumpy neckbeard.
I think considering both of those in a search engine is not a bad idea in and of itself. But I have to wonder what's the point of a search engine that weights some arbitrary aspect of web design higher than the relevancy of the subject matter (to the point of not returning any results at all)? In fact, considering that generally speaking more recent websites tend to include more scripting, you are intentionally skewing the results towards (very) old content, which is probably doing the user a disservice.
Here is the second result when you search for “cat food”. It takes you to some old dude's entire family tree with full history and biographies… it even uses sub domains and everything! Crazy!
I love the idea definitely, and I've long toyed around with building a similar thing that starts crawling off my own bookmarks (a personal small-deep-web if you wish).
I also love the "Small Web" name: this is the first I hear of it, and it's what I've long complained about — the web today hides all of the cool gems search engines of old would have given you!
I am also a bit split on the "www" prefix restriction (iiuc, domains which do not have "www" subdomain too are dropped from the index because many of them are spammy): it might for sure be a useful heuristic, but I've advocated for dropping "www" back in late 90s and early 2000s already (one reason being that for eg. Serbian, "w" is not in the alphabet, so you can't reasonably quote it as Serbian is otherwise a phonetic-language).
Searched for “Ramanujan”, one of my heroes.
Found this gem- https://math.ucr.edu/home/baez/ramanujan/
Ramanujan’s “easiest” formula.
Awesome!!
The second query was a nation+city (in the nation). Got a lot of results that were in no way related to either.
It seems to be biased towards IT topics (based uniquely on the two queries).
https://search.marginalia.nu/search?query=gan+charger
aside from nexperia none of this looks even remotely relevant.
As expected I am getting a lot of early 2000s sites which is something that I miss on regular Google.
Hilariously, searching for "array data structure" got me this little tiny page as one of the top results: http://infolab.stanford.edu/~backrub/google.html
I wonder if some browser engineers are trying to have some ideas on how to find a solution on this. Personally, I would just make a browser that breaks backward compatibility, remove old features, etc. I guess browsers would be much lighter, fast and simple if some hard choices were made.
Mozilla already decided to break some websites with the strict cookie policy. I wish they would do the same for everything else that sucks on the modern web.
I honestly don't think I have much respect for "web developers". In a way I want mobile apps to kill the modern web, just to prove a point.
That way I would not have to remember or bookmark just use my search bar as normal and choose which engine for this query or set it as default.
I think it would benefit from a responsive layout: allow the text to expand to 1000+ px wide and make the font smaller, so the excerpt can fit in one or two lines below the links.
Google has problems but their search results layout is easy to scan.
Otherwise I genuinely wish I would use it, because the Google search's "self referential reality bubble" is really annoying.
Disclaimer: I am not a native english speaker. English is my second language.
I've not only bookmarked it but also I've an icon linked to it on the taskbar. Will watch its progress with interest.
However, I searched "infiniband" and the results are far away from what I would expect or like to see. Most of the results that appear first are completely unrelated to the topic.
Maybe something like:
<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/"
xmlns:moz="http://www.mozilla.org/2006/browser/search/">
<ShortName>Marginalia</ShortName>
<Description>Marginalia Search - a search engine that favors text-heavy sites and punishes modern web design</Description>
<InputEncoding>UTF-8</InputEncoding>
<Image width="16" height="16" type="image/x-icon">https://search.marginalia.nu/favicon.ico</Image>
<Url type="text/html" template="https://search.marginalia.nu/search?query={searchTerms}" />
<moz:SearchForm>https://search.marginalia.nu/</moz:SearchForm>
</OpenSearchDescription>
and then add: <link rel="search"
type="application/opensearchdescription+xml"
title="Marginalia"
href="/opensearch.xml" />
to your page's <head>.
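For anyone wondering what the browser does with that description: it substitutes the URL-encoded query for `{searchTerms}` in the `Url` template. A rough sketch of that substitution, where `expand_template` is a made-up helper and encoding spaces as `+` is an assumption about how a given browser encodes the terms:

```python
from urllib.parse import quote_plus

# Template copied from the <Url> element in the description above.
TEMPLATE = "https://search.marginalia.nu/search?query={searchTerms}"


def expand_template(template, terms):
    """Fill the OpenSearch {searchTerms} placeholder with encoded terms."""
    return template.replace("{searchTerms}", quote_plus(terms))


url = expand_template(TEMPLATE, "rxjava 2 api docs")
print(url)
```

The result has the same shape as the search links people are pasting in this thread.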
- Evolutionist scientists say the theory is unscientific and worthless
- Seven Mysteries of Evolution
- OTHER EVIDENCE AGAINST EVOLUTION
- Evolution Falsified
Not a single result about the evolution of giraffes...
Author said they threw it together on consumer hardware. How big is the index (TB used or entries)? How is it implemented?
I'm pretty much interested in this since I myself am crawling some pages for my own "search index".
Oh and thx for making and posting. Added it as a keyword to firefox
Edit: Just realized that my question is a bit shallow. What I'm particularly interested in is the storage before the indexing. I'm trying to store the raw HTML so that I can reindex everything with better algorithms, but I'm hitting many limits. It takes a few minutes to get the size of a site directory (every site has its own dir), and I'm at a point where I can't reasonably manage the scrape versioning over git. I cycled through a few filesystems only to find that the metadata management kind of sucks for most of them. It's rather interesting how we store such files, and I'm thinking about storing a few sites in a simple SQLite format for easy access and search. I'm also thinking about a few low-overhead solutions like Facebook's Haystack project (implemented open source in SeaweedFS) or something similar... Hopefully this gives some context to the question of storage and sites that are indexed.
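To make the SQLite idea concrete, here is a rough sketch of one possible layout: one row per fetch, keyed by URL and fetch time, with the raw HTML compressed. The table and function names are invented for illustration, and this is not how Marginalia or SeaweedFS store anything:

```python
import sqlite3
import time
import zlib


def open_store(path=":memory:"):
    """Open (or create) a page store; ':memory:' keeps the demo self-contained."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS pages (
               url        TEXT NOT NULL,
               fetched_at REAL NOT NULL,
               body       BLOB NOT NULL,
               PRIMARY KEY (url, fetched_at)
           )"""
    )
    return db


def save_page(db, url, html, fetched_at=None):
    """Store one fetched version of a page as zlib-compressed UTF-8."""
    ts = time.time() if fetched_at is None else fetched_at
    db.execute(
        "INSERT INTO pages (url, fetched_at, body) VALUES (?, ?, ?)",
        (url, ts, zlib.compress(html.encode("utf-8"))),
    )
    db.commit()


def latest_page(db, url):
    """Return the most recently fetched HTML for a URL, or None."""
    row = db.execute(
        "SELECT body FROM pages WHERE url = ? ORDER BY fetched_at DESC LIMIT 1",
        (url,),
    ).fetchone()
    return None if row is None else zlib.decompress(row[0]).decode("utf-8")


db = open_store()
save_page(db, "http://example.org/", "<p>v1</p>", fetched_at=1.0)
save_page(db, "http://example.org/", "<p>v2</p>", fetched_at=2.0)
```

Keeping every fetch as its own row gives versioning without git, and a whole site lives in one file instead of thousands of small ones, which sidesteps the filesystem metadata problem.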
I'm experimenting with using it as my default ...
First result had a recipe I could see both recipe and directions in a single page, no ads, no scrolling, no fake seo anecdotes about kids and grandmas.
(Pls make the search query box fit small mobile devices)
Great project idea!
(It seems to me that if you want to build a good search engine, this is the question you need to address first.)
This works https://search.marginalia.nu/search?query=define%3Ahallucina... This doesn't, or at least it's empty https://search.marginalia.nu/search?query=define%3Ahallucina...
What set aside Google from its competitors was its use of eigenvalue weights. I don't sense a robust weighting system in use here.
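For context, the "eigenvalue weights" here are PageRank: a page's score is its entry in the dominant eigenvector of the damped link matrix, usually approximated by power iteration. A toy sketch follows, with a made-up four-page link graph and the commonly quoted damping factor of 0.85. It makes no claim about what Marginalia itself does:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets the "random jump" share first.
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[p] / n
                for q in pages:
                    new[q] += share
        rank = new
    return rank


# Hypothetical toy web: "c" is linked from three pages, "d" from none.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

In this toy graph the most-linked page ends up with the highest score and the page nobody links to ends up with the lowest.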
http://www.aliens-everything-you-want-to-know.com/
This is the Information Superhighway in all her glory.
Heh, I guess I'm getting old but I remember when this was the only way to search the web
For example, I'm a huge fan of obscure Brian Eno stories and interviews and articles.
Using the default blend for "Brian Eno", I found http://www.moredarkthanshark.org/eno_interviews.html which is truly a labor of love and the most comprehensive list of articles (with links to them, too) I've ever seen.
Not in a hundred years would I have found this using Google.
Thank you for building this!
Comments: