KhoomeiK's comments | Hacker News

NVIDIA did something similar with GANs in 2020 [1], except users could actually play those games (unlike in this diffusion work which just plays back simulated video). Sentdex later adapted this to play GTA with a really cool demo [2].

[1] https://research.nvidia.com/labs/toronto-ai/gameGAN/

[2] https://www.youtube.com/watch?v=udPY5rQVoW0


Everything relevant in "program synthesis" moved to the new buzzword "codegen"


Depends on the extent to which you understand the Silicon Valley venture capitalism era as a permanent evolution, as opposed to an unsustainable trend!

In my nonacademic opinion, codegen is a way, way less meaningful and powerful perspective to take on the problem than Program Synthesis. That said, I'm curious: are you mostly referencing commercial and OSS work with this comment, or does this also include institutional textbooks, tutorials, and such?

FWIW, Program Synthesis is still much more popular in print according to Google Ngram - tho it’s losing ground fast!

https://i.imgur.com/bhDPzho.jpeg

Is there a similar tool that would accurately measure usage in blogs, manuals, and github pages, I wonder?


Interesting—LangChain seemed kinda like unnecessary abstractions over natural language (since everything is just string manipulation), but with AI video there are so many different abstractions I'd need to handle (images, puppeting, facegen, voicegen, etc.).

Seems like there might be room for a "LangChain for Video" in this space...


I agree! We were definitely motivated by the emergence of AI tools for video when building Revideo; as mentioned in the post, we think that a lot of video creation can be automated using AI. Currently, there are probably some higher-priority challenges related to the core rendering library that we need to solve, but we've already thought about building a universal client library for common AI services that are useful for generating videos (e.g. text-to-speech, text-to-image, text-to-video).


Vedic Hinduism had a similar concept of eternal fire. I recently wrote up a twitter thread [1] explaining how the modern interpretation of Vedic instructions on starting these sacred fires misunderstands the text.

Etymology tidbit: "Bhārata", India's Sanskrit name, refers to the forerunner clan that established India's first historically recorded political entity—the Kuru Kingdom—around 1200 BC near modern Delhi. The clan itself was named "Bhārata" due to their ardent bearing ("bhar-" in Sanskrit) of the sacred fire.

[1] https://x.com/khoomeik/status/1794082465398812770


That was an amazing read. So, what is likely to happen now? Does your discovery become canon? If so, is it a slow process? There won't be a schism of some kind over this, will there?


Thanks! I have no idea—unfortunately, very few Hindus maintain the Vedic fire rites. There are also no active central authorities on matters of Vedic ritual. The only plan for now is to use this interpretation in my own yajña practice.


> unfortunately, very few Hindus maintain the Vedic fire rites

Perhaps now that will change!



Is it possible to read this writeup somewhere else? I am curious, but Twitter only shows the first couple of sentences.



Thank you!


Maybe this project another commenter is working on?

https://news.ycombinator.com/item?id=40373310


Awesome pics! We love tarsiers too


Great question! See this thread:

https://news.ycombinator.com/item?id=40369713


Yes, it does work headless, and we do grab a full-page screenshot including scrolling (by resizing the viewport to the content height). We haven't had to deal with infinite scrolling much, but that's an interesting feature we'd appreciate a PR for.
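Not our exact implementation, but a minimal sketch of that viewport-resize trick, assuming Playwright from Python (the URL, width, and output path are placeholders):

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        page.goto("https://example.com")  # placeholder URL

        # Measure the rendered page height, then grow the viewport to match,
        # so one screenshot captures content that would otherwise need scrolling.
        height = page.evaluate("document.documentElement.scrollHeight")
        page.set_viewport_size({"width": 1280, "height": height})

        page.screenshot(path="fullpage.png")
        browser.close()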

We haven’t tried Apple’s OCR but hopefully will integrate Azure OCR soon based on others’ advice.


By Apple OCR, I don't mean calling an external cloud API that requires tokens, etc. I simply mean the Vision framework that runs on macOS. It can be done in about 30 lines of Swift code.


We have a lot more powerful use-cases for Tarsier in web data extraction at the moment. Stay tuned for a broader launch soon!


We run OCR on the screenshot and convert it to whitespace-structured text, which is then passed to the LLM. The images below (and the sketch after them) might make it clearer for you:

[1] https://github.com/reworkd/tarsier/blob/main/.github/assets/...

[2] https://github.com/reworkd/tarsier/blob/main/.github/assets/...
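For a rough idea of the whitespace-structuring step, here's a hypothetical sketch (not Tarsier's actual code; the (text, x, y) annotation format and grid size are assumptions):

    def whitespace_structure(words, cols=120, rows=60):
        # words: iterable of (text, x, y) tuples with x/y normalized to [0, 1],
        # e.g. taken from an OCR service's bounding boxes.
        grid = [[" "] * cols for _ in range(rows)]
        for text, x, y in words:
            row = min(rows - 1, int(y * rows))
            col = min(cols - 1, int(x * cols))
            for ch in text:
                if col >= cols:
                    break
                grid[row][col] = ch
                col += 1
        # Join rows so the page's 2D layout survives as plain whitespace.
        return "\n".join("".join(r).rstrip() for r in grid)

    # e.g. whitespace_structure([("[#4] Search", 0.05, 0.02), ("Sign in", 0.85, 0.02)])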


The provided screenshots below do not show textboxes, selects, or other input nodes with labels. Show me text output where inputs are correctly associated with their labels, and I will be shocked.


They do show textboxes with labels. From our readme:

"Keep in mind that Tarsier tags different types of elements differently to help your LLM identify what actions are performable on each element. Specifically:

[#ID]: text-insertable fields (e.g. textarea, input with textual type)

[@ID]: hyperlinks (<a> tags)

[$ID]: other interactable elements (e.g. button, select)

[ID]: plain text (if you pass tag_text_elements=True)"
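Since those tag formats are regular, an agent can recover each element's ID and type straight from the text. A rough, hypothetical sketch (not from the Tarsier codebase):

    import re

    # Map the tag prefix back to the element category described in the readme.
    TAG_TYPES = {"#": "text-insertable", "@": "hyperlink", "$": "interactable", "": "plain text"}

    def parse_tags(text):
        # Matches tags like [#4], [@12], [$7], or [3] and yields (id, element_type).
        for m in re.finditer(r"\[([#@$]?)(\d+)\]", text):
            yield int(m.group(2)), TAG_TYPES[m.group(1)]

    # e.g. list(parse_tags("[#4] Search products")) -> [(4, 'text-insertable')]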

Do you see the search boxes labeled [#4] and [#5] at the top? And before you say that the tag is on a different line from the placeholder text—yes, and our agent is smart enough to handle that minor idiosyncrasy. Are you shocked? :)


#4 and #5 are using placeholder attributes, and the text itself is contained within the node. Show me a simple form with labels external to the input nodes, then rearrange the labels so some are above and some below, and I will be shocked! No placeholders. Each label must be its own 'text' node.

Edit: I do not intend to come off as negative or disparaging; I already discussed this with some open-source projects I work on, as well as internally at work. You guys did something great, and I am just trying to point out gaps that could take it from great to unbelievable.


This problem isn't that hard; screen readers have had to handle these exact issues for years. Inaccessible websites where labels aren't properly associated with their respective form fields do exist, but they aren't that common.


Yes, if they are associated with accessibility attributes (ARIA). Many, many sites, including massive B2B ones, do not do this (a shame). So no, you are seriously minimizing the problem. This approach would also be architecturally poorly thought out: the solution needs to not depend upon ARIA, nor any other non-global approach (which this solution does so far).

Everything shown to me so far has been a problem solvable by scripts/XPath template/creation logic. I've handled all of this for over 10 years with one script. When I see it finding everything and associating elements with the correct external labels, then they have something. Otherwise I am concluding it non-functional, and a long-since-solved problem where ML is over-engineering.

