YaCy – your own search engine

dang · on Aug 25, 2022

YaCy: a free distributed search engine - https://news.ycombinator.com/item?id=12433010 - Sept 2016 (24 comments)

YaCy – Peer to Peer Search Engine - https://news.ycombinator.com/item?id=11956268 - June 2016 (3 comments)

YaCy: Decentralized Web Search - https://news.ycombinator.com/item?id=8746883 - Dec 2014 (29 comments)

YaCy takes on Google with open source search engine - https://news.ycombinator.com/item?id=3288586 - Nov 2011 (17 comments)

alxjsn · on Aug 25, 2022

If you haven't heard of Brave Goggles (https://github.com/brave/goggles-quickstart) I highly recommend checking it out. Just being able to create the search index is a massive task, so being able to apply rules server-side to their "expanded recall set" will give you what most people building search engines want, which is to control the algorithm. We weren't able to do that until now since applying rules client-side doesn't work well on a small search result set.

Related: I created a tool to create Goggles using subreddits as a signal source for domains: https://github.com/forcesunseen/narwhalizer

skybrian · on Aug 25, 2022

Seems like you're burying the lead a bit since your "Basic Usage" involves running some Docker instance for some reason and you don't need to do that just to try it out?

It looks like Goggles are just text files hosted on GitHub or GitLab and you can try them out with Brave's search engine without installing anything. Some to try:

https://search.brave.com/goggles/discover

The netsec Goggle is here:

https://search.brave.com/goggles?goggles_id=https://github.c...

mimimi31 · on Aug 25, 2022

Kagi (https://kagi.com) has very similar tools with their "Lenses" and customizable prioritization of specific domains.

rtev · on Aug 25, 2022

Kagi actually did it first, I think. Too bad everyone only knows about it via Brave, Kagi is an awesome search engine

scrollaway · on Aug 25, 2022

Seconding, Kagi is great. I hope they succeed…

Entinel · on Aug 25, 2022

Kagi is a weird beast. I'd like to use it but I also don't understand how searches are private if I have to log in. Not understanding that is definitely on me but I feel like it should be a frequent enough question that they try to make the answer obvious.

lolinder · on Aug 25, 2022

There's not much to understand about it, you just have to take their word for it that they're not storing anything. There's no way for the clients to enforce that, and I can't figure out any model that they could use to have a paid search engine that can guarantee privacy with zero trust. (Edit: a sibling points out Mullvad's payment model would help towards guaranteeing pseudonymity)

I pay for Kagi less as a privacy thing and more because it's a better search engine and consistently gets me better results than Google does.

d4mi3n · on Aug 25, 2022

This is an absolutely valid concern. The answer here, unfortunate as it is, is trust. There's probably things they can do here to remove the need for an actual identity (Mullvad's payment model comes to mind), but I suspect that takes more legal/financial infrastructure than they have access to at the moment.

PS: I am not a lawyer, but for what it's worth, if you _pay_ for Kagi and they advertise their privacy policy as a feature you pay for (see https://kagi.com/privacy), one could reasonably sue for false advertising or demand a refund for not providing a promised service. This would make any pushback subject to consumer protection laws in addition to whatever nascent privacy law they may be subject to.

pt_PT_guy · on Aug 26, 2022

Kagi is awesome, but unfortunately too expensive

upupandup · on Aug 25, 2022

I see Brave. I close tab. I don't trust them or anybody that pushes their offerings which are just crypto ponzi schemes.

metalliqaz · on Aug 25, 2022

I thought Brave was just a web browser with built-in adblock, but after your comment I decided to look it up on wikipedia. Holey moley, what a nightmare.

andirk · on Aug 25, 2022

It's by far the best browser out there. Based on Chromium. Doesn't allow websites to abuse its technical capabilities (i.e. share your data via 3rd party ad servers). You can tip content creators. You can partake in the community and all the things Brave is building, or just use the browser in its default state. If you want a crypto + search laughing stock, check out https://coinmarketcap.com/currencies/presearch/ .

palata · on Aug 26, 2022

Depends on your definition of "best browser" I guess...

My best browser is Firefox, precisely because it is not based on Chromium.

metalliqaz · on Aug 26, 2022

Same. I wonder what the web would look like now if not for Mozilla fighting for open web standards.

I guess I don't really have to wonder. I could just factory reset any Pixel phone, swipe to the news feed, and click on any story. It's literally hard to find the content (drivel) between the advertisements and screen hijacking bullshit.

2Gkashmiri · on Aug 26, 2022

yeah, multiple times i have asked people "why isnt brave a fork of firefox" and they say "its pita" and you need quick and easy way to customize, blah blah blah.

then you have half a dozen firefox forks running absolutely fine, each one has its good and bad but the thing is, if these 1-2 man outfits can fork and release their own firefox forks with additional features/customizations, why can't brave? or do they not want to do the actual leg work and only want to take praise for "doing best", eh

andirk · on Aug 28, 2022

They detail the reasons for moving to Chromium https://brave.com/development-plans-for-upcoming-release/ . I was also disappointed at first, but the benefits seem to be worth it, such as use of Chrome extensions.

2Gkashmiri · on Aug 31, 2022

sorry for not responding earlier..... my initial question still remains.

there is a "The benefits of moving from Muon to Chromium" so if you replace the word chromium with firefox, do you think any line would be wrong or a lie?

extensions are available in firefox just as easily so that sounds like a lame excuse.

matheusmoreira · on Aug 26, 2022

Firefox + uBlock Origin is still superior to Brave's ad blocker though. Brave can't use uBlock Origin on mobile.

UberFly · on Aug 25, 2022

It's just a different revenue model than the usual ad garbage. You don't have to use it.

speedgoose · on Aug 26, 2022

By using it, you make it relevant by increasing its market share, and therefore you support and promote the scam.

matheusmoreira · on Aug 26, 2022

I understand that HN hates crypto but calling everything a ponzi scheme is just absurd. They just made a currency to pay users with.

remram · on Aug 26, 2022

The home made currency is the ponzi scheme.

matheusmoreira · on Aug 26, 2022

I don't see how it's a scam or ponzi scheme. It's completely opt in. You don't put in any money yourself. There is no referral nonsense.

It's just a little token they pay you with should you decide to turn on ads. It's actually a rather interesting cryptocurrency given that it's actually being used for something.

BAT is even traded at Binance. At the height of the cryptocurrency bull market, it was worth almost 2 US dollars.

hunterb123 · on Aug 25, 2022

The crypto stuff is disabled by default, get a new talking point.

upupandup · on Aug 25, 2022

a deliberate ponzi enabling mechanism shouldn't even be available

867-5309 · on Aug 25, 2022

at least they put the safety on before throwing you the gun

stuntkite · on Aug 27, 2022

Interesting. Thank you for sharing. I’m excited to fork your repo.

a5huynh · on Aug 25, 2022

Shameless self-plug, I've been building something similar that you can run locally as an app: https://github.com/a5huynh/spyglass

You can define some basic rules & it'll go out and crawl those particular sites. Or use one that someone else has built. It can also sync with your Chrome/Firefox bookmarks. Would love feedback from folks who get a chance to use it!

bityard · on Aug 25, 2022

It's interesting that this uses a distributed P2P index. That's a very good idea and one of the things that has held me back from even thinking about trying to build my own tech-focused search engine.

One thing I was hoping to see in the FAQ was how they prevent rogue nodes from inserting spam or other kinds of mischief into the public index.

mdaniel · on Aug 26, 2022

I hesitate to mention it, because I had a bad experience with them and it involves "crypto," which is always a hot-button, but https://presearch.com is also playing in the distributed search engine space, and their approach actually seems reasonably well thought out <https://www.presearch.io/vision.pdf>. Unfortunately that paper does not seem to address your rogue nodes observation, which makes me doubt their success beyond their availability issues when I tried it. All of that is above and beyond the "download this closed source binary, trust us!" which noped me right out from considering running a node

viraptor · on Aug 25, 2022

They don't really. You have to apply your own filtering.

smegger001 · on Aug 26, 2022

I doubt that it is a problem yet, but if I were in charge of building a solution to said problem I would try to build a distributed trust system where bad nodes could be flagged and that flag spread to the rest of the network. Those that trust your node would lower the trust ranking of the flagged nodes: the more flags against a rogue node, the lower its ranking would get.
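A minimal sketch of that flagging idea, in Python. This is not how YaCy works; the class name `TrustLedger`, the starting score of 1.0, and the halving penalty per flag are all assumptions made up for illustration. Each node keeps a ledger of who flagged whom, merges ledgers received from peers, and only counts flags from reporters it already trusts.

```python
from collections import defaultdict


class TrustLedger:
    """Hypothetical per-node record of flags raised against other nodes."""

    def __init__(self, initial_trust=1.0, penalty=0.5):
        self.initial_trust = initial_trust  # assumed starting score
        self.penalty = penalty              # assumed multiplier per trusted flag
        self.flags = defaultdict(set)       # flagged node -> set of reporters

    def flag(self, reporter, target):
        # A node cannot flag itself; repeated flags from one reporter count once.
        if reporter != target:
            self.flags[target].add(reporter)

    def merge(self, other_flags):
        # Spread flags through the network by merging a peer's ledger into ours.
        for target, reporters in other_flags.items():
            self.flags[target] |= set(reporters)

    def trust(self, node, trusted_reporters):
        # Only flags from reporters we already trust lower the score.
        count = len(self.flags.get(node, set()) & set(trusted_reporters))
        return self.initial_trust * (self.penalty ** count)
```

With these assumed numbers, a node flagged by two trusted reporters drops from 1.0 to 0.25, while flags from unknown reporters are ignored.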

mdaniel · on Aug 27, 2022

Unless node identity were stable, which doesn't match my mental model of "crypto," then flagging Node 0x42e7a28dd454 doesn't prevent the Node owner from just starting a new one. See also: reddit spam accounts

For something as dynamic as "fetch HTML into the index," that actually turns into a really hard problem to solve since there's for sure not one agreed upon "correct" answer, and even with "the html must differ only by ${foo}%" or whatever, the amount a malicious actor would need to alter HTML to achieve bad outcomes is actually comparatively small
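To put a rough number on "comparatively small": the sketch below uses Python's standard `difflib` to measure how much two copies of a page differ. The page content, the `example.org`/`example.net` URLs, and the helper names `diff_percent` and `within_tolerance` are invented for illustration; no real index uses this exact check as far as this thread shows.

```python
from difflib import SequenceMatcher


def diff_percent(html_a: str, html_b: str) -> float:
    """Return how much two documents differ, as a percentage."""
    # autojunk=False stops difflib from ignoring very frequent characters,
    # which would distort the ratio on repetitive markup.
    ratio = SequenceMatcher(None, html_a, html_b, autojunk=False).ratio()
    return (1.0 - ratio) * 100.0


def within_tolerance(html_a: str, html_b: str, max_diff_percent: float) -> bool:
    """Hypothetical acceptance rule: pages may differ by at most N percent."""
    return diff_percent(html_a, html_b) <= max_diff_percent


# A made-up page, and a copy where only the link target was swapped.
filler = "<p>filler paragraph</p>" * 20
honest = (
    "<html><body>" + filler
    + '<a href="https://example.org/">docs</a></body></html>'
)
tampered = honest.replace("example.org", "example.net")
```

Here `within_tolerance(honest, tampered, 1.0)` is true: redirecting the only link changes well under 1% of the document, so a percentage threshold alone would not catch it.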

BurningPenguin · on Aug 26, 2022

Depending on your requirements, you may be able to do it with a framework of your choice, a simple crawler and Elasticsearch.

I'm basically doing just that for my little side project with Symfony + DomCrawler + Elastic. But i'm only crawling the home page and the results are manually curated.

pacifika · on Aug 25, 2022

Use this as a personal knowledge base. Indexed my blog. Indexed a bookmarks export. Indexed a knowledge base. Works well. It also convinced me of power user ui

ThinkingGuy · on Aug 25, 2022

I keep everything on my home server: photos, music, home videos, movies, downloaded webpages, ebooks, instruction manuals, etc., all shared out over HTTP. Yacy basically gives me a centralized, private search engine for my house. Example searches: "Frigidaire manual" "living room collection:Photos" "London Philharmonic Orchestra collection:Music"

Of course, having things in an organized hierarchical file system, with good metadata, helps.

ephbit · on Aug 26, 2022

Have you considered using recoll for your use case?

I guess it should fit your "centralized private search engine for my house" application quite well.

ephbit · on Aug 26, 2022

I really liked that one time, when recoll brought up more than 10 year old photos I had mostly forgotten about. I had searched the name of a person and a few photos still had exif tags in them from when I had used Picasa back then to tag them with the person's name.

johntash · on Aug 26, 2022

Do you expose things like photos/videos/pdfs through a public web server? I thought about using Yacy for something similar, but wasn't sure about wanting to just leave everything available unauthenticated on the local network.

ThinkingGuy · on Aug 26, 2022

I've configured my web server so that it's accessible over the Internet, but only if you have a client certificate issued from my certificate authority.
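The comment does not say which web server is used, so the following is only an illustration of the same idea with Python's standard `ssl` module: a server-side TLS context that refuses any client without a certificate signed by your own CA. The function name `make_mtls_context` and the file paths you would pass in are hypothetical.

```python
import ssl


def make_mtls_context(server_cert=None, server_key=None, ca_file=None):
    """Build a server TLS context that requires a client certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Reject the handshake unless the client presents a verifiable certificate.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if server_cert:
        # The server's own certificate and private key.
        ctx.load_cert_chain(server_cert, server_key)
    if ca_file:
        # Only client certificates issued by this CA will be accepted.
        ctx.load_verify_locations(cafile=ca_file)
    return ctx
```

The returned context can wrap a listening socket with `ctx.wrap_socket(sock, server_side=True)`. Issuing and installing the client certificates on each phone or laptop is the part that takes the real effort.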

vineyardmike · on Aug 27, 2022

Any pointers on how to do this at home? It’s the basic premise of “zero trust” but also managing phones etc seems a pain for home use.

tecoholic · on Aug 25, 2022

Self plug - If you want to skip bookmarking and go straight to indexing, I have a firefox extension for it - https://github.com/tecoholic/yacy-it

johntash · on Aug 26, 2022

> Adds the current webpage to the local YACY index. This assumes you are running YaCy in localhost:8090.

Is there a way to use this plugin with an instance of Yacy hosted somewhere other than localhost? I tried making a bookmarklet to do something similar, but never ended up getting it working.

mdaniel · on Aug 27, 2022

They seem to have a configuration option for it: https://github.com/tecoholic/yacy-it/blob/main/options.html#... but that file isn't present in the most recent tag, so it's possible it just needs a release or you'd need to build the extension from source

tecoholic · on Aug 29, 2022

The recent version supports configuring your own host from the extension's settings page. You don't have to build your own, just install from the Firefox Add-ons site.

ydant · on Aug 26, 2022

That's what I'm doing - exporting bookmarks, links from my notes (markdown, etc), HN/Reddit upvoted links, starred github projects, etc - and then having YaCy index them.

In theory, this means I can use this as a search engine of things I found interesting / potentially useful.

In practice, I never search it, but that's more a limitation on my workflows than anything else.

gavmor · on Aug 25, 2022

That sounds promising! How often do you export your bookmarks, and in what format do you keep your knowledge base?

pacifika · on Aug 25, 2022

Firefox export as html then point yacy to it. My knowledge base is a bookstack instance
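If you want to see which URLs such an export contains before a crawler reads it, a small sketch with Python's standard `html.parser` could look like this. It assumes the export is the usual bookmarks HTML file where each bookmark is an `<A HREF="...">` element; the names `BookmarkLinkParser` and `extract_bookmark_urls` are made up here.

```python
from html.parser import HTMLParser


class BookmarkLinkParser(HTMLParser):
    """Collect http(s) link targets from a bookmarks HTML export."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            href = dict(attrs).get("href")
            # Skip non-web entries such as browser-internal "place:" URLs.
            if href and href.startswith(("http://", "https://")):
                self.links.append(href)


def extract_bookmark_urls(html_text: str) -> list:
    parser = BookmarkLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

Calling `extract_bookmark_urls(open("bookmarks.html", encoding="utf-8").read())` (file name assumed) returns the URLs in document order.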

mtlynch · on Aug 25, 2022

I love the idea of this, but I tried to spin up my own instance and was immediately overwhelmed by the million little knobs and settings for it.

It seems like a lot of fun if you understand all the tuning, but I feel like the current state alienates most users who want to use it in simple scenarios.

6510 · on Aug 25, 2022

Default settings work well enough but I agree 90% should be hidden behind an advanced settings check box. (I suspect the organization of features is more obvious in German.) There are also lots of other cool things one can do that are not in the interface but arguably should be.

That said, for what it is it is pretty epic already. As a proof of concept it's completely convincing.

bityard · on Aug 25, 2022

There are lots of settings because it's very powerful software. I don't understand the part about being overwhelmed... surely the developers have chosen sane defaults for most things and you can just ignore the ones you don't understand?

mtlynch · on Aug 25, 2022

That wasn't my experience. YaCy didn't do what I wanted out of the box, so I was just left with 100+ settings that I didn't know how to adjust to get to a desired state.

rasulkireev · on Aug 26, 2022

What did you want to use it for, if you don't mind me asking?

rasulkireev · on Aug 25, 2022

Recently installed YaCy on my Synology via the docker image they provide. Already saved about 10Gb of content interesting to me. Now, I have a personal Search Engine. Awesome.

BaseballPhysics · on Aug 25, 2022

So what's your workflow for using it? You mentioned it's saved "content interesting to me". Are you doing directed crawls or...?

rasulkireev · on Aug 25, 2022

Yeah, if it is just one article or a blog post I crawl at depth 0, and if it is someone's personal website who I enjoy reading always, no matter what they write, I do an infinite crawl on that specific domain.
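The difference between the two modes can be sketched as a breadth-first walk with a depth limit. This is not YaCy's crawler; the function `crawl`, its parameters, and the `get_links` callback (which you would back with a real fetcher) are assumptions for illustration. Depth 0 visits only the start page, and `max_depth=None` keeps following links as long as they stay on the start page's domain.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(start_url, get_links, max_depth=0, same_domain_only=True):
    """Return pages visited in breadth-first order.

    get_links(url) must return the list of URLs linked from that page.
    max_depth=0 visits only start_url; max_depth=None means no depth limit.
    """
    start_host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if max_depth is not None and depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link in seen:
                continue
            if same_domain_only and urlparse(link).netloc != start_host:
                continue  # stay on the starting domain
            seen.add(link)
            queue.append((link, depth + 1))
    return visited
```

The `seen` set is what makes the unlimited mode terminate on a finite site, since pages that link back to each other are only visited once.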

Tijdreiziger · on Aug 25, 2022

Off-topic, but how do you like Synology? I'm familiar with one of their units for work, but I'm looking into a new NAS for my home, and I'm trying to decide between Synology or building my own and putting Nextcloud on it.

wccrawford · on Aug 25, 2022

Also not OP. I've got a Synology 918+ that I've used for years, and as a file store, I'm quite pleased.

I've tried running apps on it, and the ones that are available are decent, but I pretty quickly got to where I needed to SSH in to make certain things happen, and that felt weird for an appliance like this. I added Docker and ran a bunch of stuff on that, and that was kind of a pain. They don't make it easy to update the images and the community's solution is to SSH in and install watchtower to do it.

I'm now just using it for network file storage and running all those services on a Linux box instead.

I thought about just putting the drives in the Linux box, but I did some network testing and the NAS was faster, and it provides a lot of storage-related niceties, so I'm keeping it in the mix. For instance, I recently decided to upgrade the drives to faster, larger ones, and it's been pretty easy.

Tijdreiziger · on Aug 25, 2022

Thanks! So are you running the first-party Synology Drive, Moments, etc. for file/photo syncing, or do you run something like Nextcloud on your Linux box? Or do you not use software like that?

wccrawford · on Aug 26, 2022

I'm not using that kind of stuff. Mine is mostly about video with a little sorta-backup for files that don't matter a ton, but I'd still rather not lose.

Tijdreiziger · on Aug 26, 2022

I see. Thanks again!

rpdillon · on Aug 25, 2022

Not OP, but I've been using a Synology NAS since 2013 and it's a great product. I bought a router from them as well, which is also superb. I think it's a fabulous investment.

justsomehnguy · on Aug 25, 2022

Greatly depends on what you are expecting from it.

After $300 per unit S. has only two advantages:

1. Form-factor: you can build a comparable small enough unit from OTC/OTS parts but usually it costs at least $200 more

2. Basic functionality (ie filesharing eg with SMB) just works, with a nice webgui to configure it.

If you need something more...

Tijdreiziger · on Aug 25, 2022

Expectations: file/photo sync, media server, ad blocking (Pi-hole). I saw that Synology has first-party apps for most of this (Synology Drive, Moments, Video).

justsomehnguy · on Aug 27, 2022

> file/photo sync

Its photo app is NPM garbage. Sure, I have an ancient ds115j (armv7@800Mhz, 256Mb RAM), but I couldn't use it.

But despite its ancient-ness I was able to update it to the latest Synology OS (DSM), which tells how S. is supporting their products.

So be sure to get a version with enough RAM, I would go for 2Gb+ versions in your place, so avoid ds220j and ds218play, look at ds218 (without 'play' suffix) or ds220+. Oops, their site says VMM is supported only on DS220+, so I think you have no choice there, but 220+ has an additional memory slot, which could be handy.

For literally $300[1] you can't do better.

However, there is a way to install DSM on non-Synology hardware, so if you have a desktop PC to run it on, I would advise you to test it out.

[0] https://www.synology.com/en-global/products?chassis=Desktop&...

[1] https://www.bhphotovideo.com/c/product/1570595-REG/synology_...

Tijdreiziger · on Aug 27, 2022

Thanks for the advice!

usefulcat · on Aug 25, 2022

I used a small Synology NAS from 2012-2019, at which point I replaced it with a small Linux box because I wanted ZFS. Inability to support ZFS was really the only reason I replaced it; it was still working fine.

Tijdreiziger · on Aug 25, 2022

What software are you running, and how much time do you spend on maintenance?

usefulcat · on Aug 25, 2022

Vanilla Ubuntu 18.04 LTS. Every couple of months or so I update all the packages and reboot. That's really all the maintenance I've ever done on it, apart from initial setup. I ought to set it up so that it can email me if a zfs scrub ever detects a problem, but I haven't done that yet.

Tijdreiziger · on Aug 25, 2022

Thanks! That's a valuable data point for my comparison.

By the way, do you run software like Nextcloud, or are you just using it as a storage tank?

usefulcat · on Aug 25, 2022

I use syncthing and Plex, other than that it’s mostly just storage

Tijdreiziger · on Aug 26, 2022

Ok, thanks again!

vineyardmike · on Aug 27, 2022

Not OP but counter to what everyone else said, I don’t like mine. I bought the cheapest one available when it was on sale. It validated I wanted a NAS but it was too weak. Any usage would be slow and all the apps dragged it down. The apps are nice with how easy it is to install and get working, but if you wanna use it as a server not just NAS… you get what you pay for.

I still use it but plan to build a server this winter and gift the Synology to my father.

drittich · on Aug 26, 2022

I like Synology a lot but mainly use it for storage/backup. It's a very expensive way to host containers IMO. I would look to a Mac Mini for something like that.

Tijdreiziger · on Aug 26, 2022

Thanks. My main use-cases are file/photo syncing, media server, etc. I saw that Synology has first-party apps for those things, so that would be the main draw for me.

drittich · on Aug 27, 2022

Even as a media server I personally find it too expensive to buy one that can handle 4K transcoding, frequently needed for subtitles. I just use Synology to serve the files and run Plex on a separate machine.

Tijdreiziger · on Aug 27, 2022

Ah, I see. Roughly what kind of hardware do you need for 4K transcoding? (Sorry if it's a beginner question, this is totally new to me.)

rasulkireev · on Aug 25, 2022

Love it, have 0 complaints! I got DS220+

chrisweekly · on Aug 25, 2022

Happy w my DS-220+ too

Tijdreiziger · on Aug 26, 2022

Thanks!

bobajeff · on Aug 25, 2022

I would like to use this. However, in the past when I've tried it I didn't like the results. It would be nice to hear about more competition in the P2P information retrieval (search engine) tech space. YaCy seems to be the only one I've consistently heard about over the years.

sciguy77 · on Aug 25, 2022

Has anyone tried LinkAce? I'd love to hear someone's thoughts on YaCy vs LinkAce.

This is great timing. After looking at YaCy for my Synology NAS a few weeks ago, I looked at some alternatives. I like the look of LinkAce, though it seems to be less popular and I haven't found much on how a setup on a Synology NAS works.

I'd love some advice, I have a massive number of bookmarks across dozens of folders. Something like this is exactly what I'm looking for.

encryptluks2 · on Aug 25, 2022

They serve very different purposes. While a search engine can in turn archive sites, that isn't its only purpose. LinkAce is designed more for bookmarking and archiving sites akin to a bookmark manager, not as a search engine.

rasulkireev · on Aug 25, 2022

I did that a couple of months ago. Was planning to write something up in the next month or so.

xvilka · on Aug 26, 2022

Too bad it's in Java and thus a resource hog.

10g1k · on Aug 25, 2022

Copernic used to be a great way to do this. Register every search engine you like in the local software, apply rules, search all the web search engines at once. Until they went 100% corporate, it was awesome.

Timothycquinn · on Aug 26, 2022

I remember really enjoying Copernic.

AndyMcConachie · on Aug 25, 2022

I have about 100,000 PDFs that I want indexed and searchable. They're on a website and I want people to be able to visit the website and search through the PDFs.

Should I use Yacy or Apache Solr?

All opinions and rants welcome.

px43 · on Aug 25, 2022

Use Google Drive, set up a publicly shared folder, and drop the PDFs there. If you want you can even make a fancy search UX with google's search API.

https://developers.google.com/drive/api/guides/search-files

Also, YaCy is an automated web crawler that throws data into Solr, so your question doesn't make much sense.

0bit · on Aug 26, 2022

I would recommend using Apache Tika to extract the text from the PDFs and using Solr (or Elasticsearch) to index and search them.
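To show the shape of that pipeline without guessing at Tika's or Solr's APIs, here is a plain-Python stand-in: it assumes the text has already been extracted from each PDF (Tika's job) and builds a toy inverted index with AND-style search (a tiny fraction of what Solr or Elasticsearch do). All function names and file names are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict) -> dict:
    """Map each term to the set of document ids containing it.

    docs maps a document id (e.g. a PDF file name) to its extracted text.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Return ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

For 100,000 PDFs you would want a real engine for ranking, stemming, and persistence; the sketch only illustrates why extraction and indexing are separate steps.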

Jaruzel · on Aug 26, 2022

I've looked on the website but can't find the answer... Can YaCy index SMB file shares?

mdaniel · on Aug 27, 2022

It certainly seems so, based on jcifs: https://github.com/yacy/yacy_search_server/blob/master/sourc...