Hacker News
YaCy – your own search engine (yacy.net)
271 points by modinfo on Aug 25, 2022 | 93 comments



Related:

YaCy: Decentralized Web Search - https://news.ycombinator.com/item?id=22246732 - Feb 2020 (41 comments)

YaCy: a free distributed search engine - https://news.ycombinator.com/item?id=12433010 - Sept 2016 (24 comments)

YaCy – Peer to Peer Search Engine - https://news.ycombinator.com/item?id=11956268 - June 2016 (3 comments)

YaCy: Decentralized Web Search - https://news.ycombinator.com/item?id=8746883 - Dec 2014 (29 comments)

YaCy takes on Google with open source search engine - https://news.ycombinator.com/item?id=3288586 - Nov 2011 (17 comments)


If you haven't heard of Brave Goggles (https://github.com/brave/goggles-quickstart) I highly recommend checking it out. Just being able to create the search index is a massive task, so being able to apply rules server-side to their "expanded recall set" will give you what most people building search engines want, which is to control the algorithm. We weren't able to do that until now since applying rules client-side doesn't work well on a small search result set.

Related: I created a tool to create Goggles using subreddits as a signal source for domains: https://github.com/forcesunseen/narwhalizer
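For anyone wondering what "domains as a signal source" could look like in practice, here is a minimal sketch, not how narwhalizer actually works. It assumes the Goggle format described in the goggles-quickstart repo (`! name:` style metadata lines followed by `$boost=N,site=domain` rules); the function name, the 1 to 5 boost scale, the `example` author, and the sample domains are all made up for illustration.

```python
from collections import Counter


def build_goggle(name, description, domain_counts, max_boost=5):
    """Render a Goggle-style rules file that boosts domains by how often they were seen."""
    lines = [
        f"! name: {name}",
        f"! description: {description}",
        "! public: false",
        "! author: example",  # placeholder author, not a real account
        "",
    ]
    if domain_counts:
        top = max(domain_counts.values())
        ranked = sorted(domain_counts.items(), key=lambda kv: (-kv[1], kv[0]))
        for domain, count in ranked:
            # Scale counts into 1..max_boost so the most-cited domain gets the strongest boost.
            boost = max(1, round(max_boost * count / top))
            lines.append(f"$boost={boost},site={domain}")
    return "\n".join(lines) + "\n"


# Hypothetical domains pulled from link submissions.
submissions = [
    "example.org", "example.org", "example.net",
    "example.org", "example.com", "example.net",
]
print(build_goggle("Demo", "Domains seen in a hypothetical subreddit", Counter(submissions)))
```

The resulting text file is what you would host on GitHub or GitLab and submit to Brave Search; check the quickstart for the authoritative syntax before relying on this.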


Seems like you're burying the lede a bit since your "Basic Usage" involves running some Docker instance for some reason and you don't need to do that just to try it out?

It looks like Goggles are just text files hosted on GitHub or GitLab and you can try them out with Brave's search engine without installing anything. Some to try:

https://search.brave.com/goggles/discover

The netsec Goggle is here:

https://search.brave.com/goggles?goggles_id=https://github.c...


Kagi (https://kagi.com) has very similar tools with their "Lenses" and customizable prioritization of specific domains.


Kagi actually did it first, I think. Too bad everyone only knows about it via Brave, Kagi is an awesome search engine


Seconding, Kagi is great. I hope they succeed…


Kagi is a weird beast. I'd like to use it but I also don't understand how searches are private if I have to login. Not understanding that is definitely on me but I feel like it should be a frequent enough question that they try to make the answer obvious.


There's not much to understand about it, you just have to take their word for it that they're not storing anything. There's no way for the clients to enforce that, and I can't figure out any model that they could use to have a paid search engine that can guarantee privacy with zero trust. (Edit: a sibling points out Mullvad's payment model would help towards guaranteeing pseudonymity)

I pay for Kagi less as a privacy thing and more because it's a better search engine and consistently gets me better results than Google does.


This is an absolutely valid concern. The answer here, unfortunate as it is, is trust. There's probably things they can do here to remove the need for an actual identity (Mullvad's payment model comes to mind), but I suspect that takes more legal/financial infrastructure than they have access to at the moment.

PS: I am not a lawyer, but for what it's worth, if you _pay_ for Kagi and they advertise their privacy policy as a feature you pay for (see https://kagi.com/privacy), one could reasonably sue for false advertising or demand a refund for not providing a promised service. This would make any pushback subject to consumer protection laws in addition to whatever nascent privacy law they may be subject to.


Kagi is awesome, but unfortunately too expensive


I see Brave. I close tab. I don't trust them or anybody that pushes their offerings which are just crypto ponzi schemes.


I thought Brave was just a web browser with built-in adblock, but after your comment I decided to look it up on wikipedia. Holey moley, what a nightmare.


It's by far the best browser out there. Based on Chromium. Doesn't allow websites to abuse its technical capabilities (i.e. share your data via 3rd party ad servers). You can tip content creators. You can partake in the community and all the things Brave is building, or just use the browser in its default state. If you want a crypto + search laughing stock, check out https://coinmarketcap.com/currencies/presearch/ .


Depends on your definition of "best browser" I guess...

My best browser is Firefox, precisely because it is not based on Chromium.


Same. I wonder what the web would look like now if not for Mozilla fighting for open web standards.

I guess I don't really have to wonder. I could just factory reset any Pixel phone, swipe to the news feed, and click on any story. It's literally hard to find the content (drivel) between the advertisements and screen hijacking bullshit.


yeah, multiple times i have asked people "why isn't brave a fork of firefox" and they say "it's a pita" and you need a quick and easy way to customize, blah blah blah.

then you have half a dozen firefox forks running absolutely fine, each one has its good and bad, but the thing is, if these 1-2 man outfits can fork and release their own firefox forks with additional features/customizations, why can't brave? or do they not want to do the actual legwork and only want to take praise for "doing best", eh


They detail the reasons for moving to Chromium https://brave.com/development-plans-for-upcoming-release/ . I was also disappointed at first, but the benefits seem to be worth it, such as use of Chrome extensions.


sorry for not responding earlier..... my initial question still remains.

there is a "The benefits of moving from Muon to Chromium" so if you replace the word chromium with firefox, do you think any line would be wrong or a lie?

extensions are available in firefox just as easily so that sounds like a lame excuse.


Firefox + uBlock Origin is still superior to Brave's ad blocker though. Brave can't use uBlock Origin on mobile.


It's just a different revenue model than the usual ad garbage. You don't have to use it.


By using it, you make it relevant by increasing its market share, and therefore you support and promote the scam.


I understand that HN hates crypto but calling everything a ponzi scheme is just absurd. They just made a currency to pay users with.


The home made currency is the ponzi scheme.


I don't see how it's a scam or ponzi scheme. It's completely opt in. You don't put in any money yourself. There is no referral nonsense.

It's just a little token they pay you with should you decide to turn on ads. It's actually a rather interesting cryptocurrency given that it's actually being used for something.

BAT is even traded at Binance. At the height of the cryptocurrency bull market, it was worth almost 2 US dollars.


The crypto stuff is disabled by default, get a new talking point.


a deliberate ponzi enabling mechanism shouldn't even be available


at least they put the safety on before throwing you the gun


Interesting. Thank you for sharing. I’m excited to fork your repo.


Shameless self-plug, I've been building something similar that you can run locally as an app: https://github.com/a5huynh/spyglass

You can define some basic rules & it'll go out and crawl those particular sites. Or use one that someone else has built. It can also sync with your Chrome/Firefox bookmarks. Would love feedback from folks who get a chance to use it!


It's interesting that this uses a distributed P2P index. That's a very good idea and one of the things that has held me back from even thinking about trying to build my own tech-focused search engine.

One thing I was hoping to see in the FAQ was how they prevent rogue nodes from inserting spam or other kinds of mischief into the public index.


I hesitate to mention it, because I had a bad experience with them and it involves "crypto," which is always a hot-button, but https://presearch.com is also playing in the distributed search engine space, and their approach actually seems reasonably well thought out <https://www.presearch.io/vision.pdf>. Unfortunately that paper does not seem to address your rogue nodes observation, which makes me doubt their success beyond their availability issues when I tried it. All of that is above and beyond the "download this closed source binary, trust us!" which noped me right out from considering running a node


They don't really. You have to apply your own filtering.


I doubt that it is a problem yet, but if I were in charge of building a solution to it, I would try to build a distributed trust system where bad nodes could be flagged and that flag spread to the rest of the network. Nodes that trust your node would lower the trust ranking of the flagged nodes; the more flags against a rogue node, the lower its ranking would get.
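A minimal sketch of that flagging idea, purely illustrative and not anything YaCy implements. It assumes each node keeps a local trust score between 0 and 1 for its peers and that a flag counts in proportion to how much the local node trusts whoever raised it; the node names and the `penalty` value are invented.

```python
def apply_flags(trust, flags, penalty=0.2):
    """Return updated trust scores after applying flags received from the network.

    trust: dict mapping node -> score in [0, 1], as seen from the local node.
    flags: iterable of (flagger, flagged) pairs.
    """
    updated = dict(trust)
    for flagger, flagged in flags:
        # Weight each flag by the trust we had in the flagger before this round,
        # so an untrusted node cannot drag down others by flagging them.
        weight = trust.get(flagger, 0.0)
        current = updated.get(flagged, 0.0)
        updated[flagged] = max(0.0, current - penalty * weight)
    return updated


trust = {"alice": 1.0, "bob": 0.5, "mallory": 0.8, "sybil": 0.0}
flags = [("alice", "mallory"), ("bob", "mallory"), ("sybil", "alice")]
scores = apply_flags(trust, flags)
print(scores)
```

Here the two flags against `mallory` lower its score, while the flag from the untrusted `sybil` leaves `alice` untouched. It says nothing about what happens when a flagged node simply comes back under a new identity.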


Unless node identity were stable, which doesn't match my mental model of "crypto," then flagging Node 0x42e7a28dd454 doesn't prevent the Node owner from just starting a new one. See also: reddit spam accounts

For something as dynamic as "fetch HTML into the index," that actually turns into a really hard problem to solve since there's for sure not one agreed upon "correct" answer, and even with "the html must differ only by ${foo}%" or whatever, the amount a malicious actor would need to alter HTML to achieve bad outcomes is actually comparatively small


Depending on your requirements, you may be able to do it with a framework of your choice, a simple crawler and Elasticsearch.

I'm basically doing just that for my little side project with Symfony + DomCrawler + Elastic. But i'm only crawling the home page and the results are manually curated.
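The same shape in Python rather than Symfony + DomCrawler, as a rough sketch: pull the title and visible text out of one home page and turn it into the action/document line pair that the Elasticsearch bulk API takes. The index name `pages`, the field names, and the sample HTML are assumptions; fetching the page and POSTing the result to `_bulk` are left out.

```python
import json
from html.parser import HTMLParser


class TitleAndText(HTMLParser):
    """Collect the <title> and the visible text of one HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text = []
        self._in_title = False
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip and data.strip():
            self.text.append(data.strip())


def to_bulk_lines(url, html, index="pages"):
    """Build the two NDJSON lines (action + document) for one page."""
    parser = TitleAndText()
    parser.feed(html)
    action = {"index": {"_index": index, "_id": url}}
    doc = {"url": url, "title": parser.title.strip(), "body": " ".join(parser.text)}
    return json.dumps(action) + "\n" + json.dumps(doc) + "\n"


page = (
    "<html><head><title>Demo</title><style>p{}</style></head>"
    "<body><p>Hello</p><p>world</p></body></html>"
)
bulk = to_bulk_lines("https://example.org/", page)
print(bulk)
```

Using the URL as `_id` means re-crawling a page overwrites the old document instead of duplicating it.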


I use this as a personal knowledge base. Indexed my blog. Indexed a bookmarks export. Indexed a knowledge base. Works well. It also convinced me of the value of a power-user UI.


I keep everything on my home server: photos, music, home videos, movies, downloaded webpages, ebooks, instruction manuals, etc., all shared out over HTTP. Yacy basically gives me a centralized, private search engine for my house. Example searches: "Frigidaire manual" "living room collection:Photos" "London Philharmonic Orchestra collection:Music"

Of course, having things in an organized hierarchical file system, with good metadata, helps.


Have you considered using recoll for your use case?

I guess it should fit your "centralized private search engine for my house" application quite well.


I really liked that one time, when recoll brought up more than 10 year old photos I had mostly forgotten about. I had searched the name of a person and a few photos still had exif tags in them from when I had used Picasa back then to tag them with the person's name.


Do you expose things like photos/videos/pdfs through a public web server? I thought about using Yacy for something similar, but wasn't sure about wanting to just leave everything available unauthenticated on the local network.


I've configured my web server so that it's accessible over the Internet, but only if you have a client certificate issued from my certificate authority.
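The comment doesn't say which web server is involved, so this is only a sketch of the general idea using Python's standard `ssl` module: a server-side TLS context that refuses any client that doesn't present a certificate signed by your own CA. The function name and the file paths you would pass in are hypothetical.

```python
import ssl


def client_cert_context(server_cert=None, server_key=None, ca_file=None):
    """Build a server-side TLS context that requires a client certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Reject the handshake unless the client presents a certificate we can verify.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if server_cert and server_key:
        # The server's own certificate and private key.
        ctx.load_cert_chain(server_cert, server_key)
    if ca_file:
        # Only client certificates issued by this CA will be accepted.
        ctx.load_verify_locations(cafile=ca_file)
    return ctx


# Without file paths this only demonstrates the verification setting.
ctx = client_cert_context()
print(ctx.verify_mode)
```

A real deployment would pass all three paths and wrap the listening socket with `ctx.wrap_socket(sock, server_side=True)`; most people would set the equivalent option in their web server's config instead.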


Any pointers on how to do this at home? It’s the basic premise of “zero trust” but also managing phones etc seems a pain for home use.


Self plug - If you want to skip bookmarking and go straight to indexing, I have a firefox extension for it - https://github.com/tecoholic/yacy-it


> Adds the current webpage to the local YACY index. This assumes you are running YaCy in localhost:8090.

Is there a way to use this plugin with an instance of Yacy hosted somewhere other than localhost? I tried making a bookmarklet to do something similar, but never ended up getting it working.


They seem to have a configuration option for it: https://github.com/tecoholic/yacy-it/blob/main/options.html#... but that file isn't present in the most recent tag, so it's possible it just needs a release or you'd need to build the extension from source


The recent version supports configuring your own host from the extension's settings page. You don't have to build your own, just install from the Firefox Add-ons site.


That's what I'm doing - exporting bookmarks, links from my notes (markdown, etc), HN/Reddit upvoted links, starred github projects, etc - and then having YaCy index them.

In theory, this means I can use this as a search engine of things I found interesting / potentially useful.

In practice, I never search it, but that's more a limitation on my workflows than anything else.


That sounds promising! How often do you export your bookmarks, and in what format do you keep your knowledge base?


Firefox export as html then point yacy to it. My knowledge base is a bookstack instance
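If you would rather hand the crawler a plain URL list than the HTML file itself, a sketch like this pulls the links out of the export. It assumes the usual Netscape bookmark layout that Firefox writes (`<DT><A HREF="...">` entries); the sample content is invented.

```python
from html.parser import HTMLParser


class BookmarkLinks(HTMLParser):
    """Collect http(s) links from a bookmarks export."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            href = dict(attrs).get("href")
            # Skip internal entries such as Firefox's "place:" smart bookmarks.
            if href and href.startswith(("http://", "https://")):
                self.urls.append(href)


def extract_urls(bookmarks_html):
    """Return the bookmarked URLs in file order, without duplicates."""
    parser = BookmarkLinks()
    parser.feed(bookmarks_html)
    return list(dict.fromkeys(parser.urls))


sample = """<!DOCTYPE NETSCAPE-Bookmark-file-1>
<DL><p>
<DT><A HREF="https://yacy.net/" ADD_DATE="1">YaCy</A>
<DT><A HREF="place:sort=8">Most Visited</A>
<DT><A HREF="https://yacy.net/">Duplicate</A>
<DT><A HREF="https://example.org/page">Example</A>
</DL>
"""
urls = extract_urls(sample)
print("\n".join(urls))
```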


I love the idea of this, but I tried to spin up my own instance and was immediately overwhelmed by the million little knobs and settings for it.

It seems like a lot of fun if you understand all the tuning, but I feel like the current state alienates most users who want to use it in simple scenarios.


The default settings work well enough, but I agree 90% should be hidden behind an advanced settings check box. (I suspect the organization of features is more obvious in German.) There are also lots of other cool things one can do that are not in the interface but arguably should be.

That said, for what it is it is pretty epic already. As a proof of concept it's completely convincing.


There are lots of settings because it's very powerful software. I don't understand the part about being overwhelmed... surely the developers have chosen sane defaults for most things and you can just ignore the ones you don't understand?


That wasn't my experience. YaCy didn't do what I wanted out of the box, so I was just left with 100+ settings that I didn't know how to adjust to get to a desired state.


What did you want to use it for, if you don't mind me asking?


Recently installed YaCy on my Synology via the Docker image they provide. Already saved about 10 GB of content interesting to me. Now I have a personal search engine. Awesome.


So what's your workflow for using it? You mentioned it's saved "content interesting to me". Are you doing directed crawls or...?


Yeah, if it is just one article or a blog post I crawl at depth 0, and if it is the personal website of someone I always enjoy reading, no matter what they write, I do an infinite crawl on that specific domain.
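To make the two modes concrete, here is a toy breadth-first crawler, not YaCy's code: `max_depth=0` fetches only the start page, and `max_depth=None` keeps going but stays on the start page's domain. The `get_links` callback and the in-memory `SITE` map stand in for real fetching and are made up.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(start, get_links, max_depth=0, same_domain=True):
    """Breadth-first crawl; returns URLs in the order they were visited."""
    domain = urlparse(start).netloc
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        # At the depth limit we keep the page but do not follow its links.
        if max_depth is not None and depth >= max_depth:
            continue
        for link in get_links(url):
            if link in seen:
                continue
            if same_domain and urlparse(link).netloc != domain:
                continue
            seen.add(link)
            queue.append((link, depth + 1))
    return order


# Hypothetical link graph standing in for real HTTP fetches.
SITE = {
    "https://blog.example/": ["https://blog.example/a", "https://other.example/x"],
    "https://blog.example/a": ["https://blog.example/b", "https://blog.example/"],
    "https://blog.example/b": [],
}


def fake_links(url):
    return SITE.get(url, [])


single = crawl("https://blog.example/a", fake_links, max_depth=0)
whole = crawl("https://blog.example/", fake_links, max_depth=None)
print(single)
print(whole)
```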


Off-topic, but how do you like Synology? I'm familiar with one of their units for work, but I'm looking into a new NAS for my home, and I'm trying to decide between Synology or building my own and putting Nextcloud on it.


Also not OP. I've got a Synology 918+ that I've used for years, and as a file store, I'm quite pleased.

I've tried running apps on it, and the ones that are available are decent, but I pretty quickly got to where I needed to SSH in to make certain things happen, and that felt weird for an appliance like this. I added Docker and ran a bunch of stuff on that, and that was kind of a pain. They don't make it easy to update the images and the community's solution is to SSH in and install watchtower to do it.

I'm now just using it for network file storage and running all those services on a Linux box instead.

I thought about just putting the drives in the Linux box, but I did some network testing and the NAS was faster, and it provides a lot of storage-related niceties, so I'm keeping it in the mix. For instance, I recently decided to upgrade the drives to faster, larger ones, and it's been pretty easy.


Thanks! So are you running the first-party Synology Drive, Moments, etc. for file/photo syncing, or do you run something like Nextcloud on your Linux box? Or do you not use software like that?


I'm not using that kind of stuff. Mine is mostly about video with a little sorta-backup for files that don't matter a ton, but I'd still rather not lose.


I see. Thanks again!


Not OP, but I've been using a Synology NAS since 2013 and it's a great product. I bought a router from them as well, which is also superb. I think it's a fabulous investment.


Greatly depends on what you are expecting from it.

After $300 per unit S. has only two advantages:

1. Form-factor: you can build a comparable small enough unit from OTC/OTS parts but usually it costs at least $200 more

2. Basic functionality (ie filesharing eg with SMB) just works, with a nice webgui to configure it.

If you need something more...


Expectations: file/photo sync, media server, ad blocking (Pi-hole). I saw that Synology has first-party apps for most of this (Synology Drive, Moments, Video).


> file/photo sync

Its photo app is NPM garbage. Sure, I have an ancient ds115j (armv7 @ 800 MHz, 256 MB RAM), but I couldn't use it.

But despite its ancient-ness I was able to update it to the latest Synology OS (DSM), which tells you how S. is supporting their products.

So be sure to get a version with enough RAM. In your place I would go for 2 GB+ versions, so avoid the ds220j and ds218play and look at the ds218 (without the 'play' suffix) or ds220+. Oops, their site says VMM is supported only on DS220+, so I think you have no choice there, but the 220+ has an additional memory slot, which could be handy.

For literally $300[1] you can't do better.

However, there is a way to install DSM on non-Synology hardware, so if you have a desktop PC to run it on, I would advise you to test it out.

[0] https://www.synology.com/en-global/products?chassis=Desktop&...

[1] https://www.bhphotovideo.com/c/product/1570595-REG/synology_...


Thanks for the advice!


I used a small Synology NAS from 2012-2019, at which point I replaced it with small linux box because I wanted ZFS. Inability to support ZFS was really the only reason I replaced it; it was still working fine.


What software are you running, and how much time do you spend on maintenance?


Vanilla Ubuntu 18.04 LTS. Every couple of months or so I update all the packages and reboot. That's really all the maintenance I've ever done on it, apart from initial setup. I ought to set it up so that it can email me if a zfs scrub ever detects a problem, but I haven't done that yet.


Thanks! That's a valuable data point for my comparison.

By the way, do you run software like Nextcloud, or are you just using it as a storage tank?


I use syncthing and Plex, other than that it’s mostly just storage


Ok, thanks again!


Not OP but counter to what everyone else said, I don’t like mine. I bought the cheapest one available when it was on sale. It validated I wanted a NAS but it was too weak. Any usage would be slow and all the apps dragged it down. The apps are nice with how easy it is to install and get working, but if you wanna use it as a server not just NAS… you get what you pay for.

I still use it but plan to build a server this winter and gift the Synology to my father.


I like Synology a lot but mainly use it for storage/backup. It's a very expensive way to host containers IMO. I would look to a Mac Mini for something like that.


Thanks. My main use-cases are file/photo syncing, media server, etc. I saw that Synology has first-party apps for those things, so that would be the main draw for me.


Even as a media server I personally find it too expensive to buy one that can handle 4K transcoding, frequently needed for subtitles. I just use Synology to serve the files and run Plex on a separate machine.


Ah, I see. Roughly what kind of hardware do you need for 4K transcoding? (Sorry if it's a beginner question, this is totally new to me.)


Love it, have 0 complaints! I got DS220+


Happy w my DS-220+ too


Thanks!


I would like to use this. However, in the past when I've tried it I didn't like the results. It would be nice to hear about more competition in the P2P information retrieval (search engine) tech space. YaCy seems to be the only one I've consistently heard about over the years.


Has anyone tried LinkAce? I'd love to hear someone's thoughts on YaCy vs LinkAce.

This is great timing. After looking at YaCy for my Synology NAS a few weeks ago, I looked at some alternatives. I like the look of LinkAce, though it seems to be less popular and I haven't found much on how a setup on a Synology NAS works.

I'd love some advice, I have a massive number of bookmarks across dozens of folders. Something like this is exactly what I'm looking for.


They serve very different purposes. While a search engine can also archive sites, that isn't its only purpose. LinkAce is designed more for bookmarking and archiving sites, akin to a bookmark manager, not as a search engine.


I did that a couple of months ago. Was planning to write something up in the next month or so.


Too bad it's in Java, and thus a resource hog.


Copernic used to be a great way to do this. Register every search engine you like in the local software, apply rules, search all the web search engines at once. Until they went 100% corporate, it was awesome.


I remember really enjoying Copernic.


I have about 100,000 PDFs that I want indexed and searchable. They're on a website and I want people to be able to visit the website and search through the PDFs.

Should I use Yacy or Apache Solr?

All opinions and rants welcome.


Use Google Drive, set up a publicly shared folder, and drop the PDFs there. If you want you can even make a fancy search UX with google's search API.

https://developers.google.com/drive/api/guides/search-files

Also, YaCy is an automated web crawler that throws data into Solr, so your question doesn't make much sense.


I would recommend using Apache Tika to extract the text from the PDFs and using Solr (or Elasticsearch) to index and search them.
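Roughly what that pipeline looks like, as a sketch under a few assumptions: a Tika server on its default port 9998, a Solr core named `pdfs` (hypothetical), and a schema with the stock `*_s` and `*_txt` dynamic fields. The two builder functions only construct the HTTP requests; `index_pdf` is the part that would actually talk to the servers.

```python
import json
import urllib.request


def tika_request(pdf_bytes, tika_url="http://localhost:9998/tika"):
    """Build a PUT request asking a Tika server for the plain text of one PDF."""
    return urllib.request.Request(
        tika_url,
        data=pdf_bytes,
        method="PUT",
        headers={"Content-Type": "application/pdf", "Accept": "text/plain"},
    )


def solr_update_request(docs, core_url="http://localhost:8983/solr/pdfs"):
    """Build a POST request that adds a list of documents to a Solr core."""
    return urllib.request.Request(
        core_url + "/update?commit=true",
        data=json.dumps(docs).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def index_pdf(path, url):
    """Extract text from one PDF via Tika and send it to Solr (needs both servers running)."""
    with open(path, "rb") as fh:
        with urllib.request.urlopen(tika_request(fh.read())) as resp:
            text = resp.read().decode("utf-8", errors="replace")
    # Field names rely on dynamic fields; adjust them to your own schema.
    doc = {"id": url, "url_s": url, "content_txt": text}
    with urllib.request.urlopen(solr_update_request([doc])) as resp:
        return resp.status
```

For 100,000 files you would want to batch many documents per update and commit once at the end rather than per document.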


I've looked on the website but can't find the answer... Can YaCy index SMB file shares?




