Hacker News

simlevesque yesterday at 7:18 PM (9 replies)

I've been making skills from arXiv papers for a while. I have one for multi-object tracking, for example. It has a SKILL.md describing all the important papers (over 30) on the subject and a folder with each paper's full content as reStructuredText.

To feed arXiv papers to LLMs, I've found that RST gives the best token count/fidelity ratio: Markdown lacks precision, and LaTeX is too verbose. I have a script with each paper's URL, name, and date that downloads the LaTeX archives from arXiv, extracts them, transforms them to RST, and adds them to the right folder. Then I ask an LLM to make a summary from the full text, and then give other LLMs the full paper along with the summary and ask them to improve and proofread it. While this goes on I read the papers myself, and at the end I read the summaries; if I approve them, I add them to the skill. I also add, for each paper, info on how well the algorithms described do on common benchmarks.
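The commenter's script isn't shown, so here's a minimal sketch of the download-and-convert step, assuming pandoc handles the LaTeX-to-RST conversion and using arXiv's `e-print` endpoint for the source archives; the function names and file layout are illustrative, not the commenter's actual code:

```python
# Hypothetical sketch of the arXiv -> RST pipeline described above.
# Assumes pandoc is installed and on PATH.
import subprocess
import tarfile
import urllib.request
from pathlib import Path


def eprint_url(arxiv_id: str) -> str:
    """Build the arXiv e-print URL that serves a paper's LaTeX source archive."""
    return f"https://arxiv.org/e-print/{arxiv_id}"


def fetch_and_convert(arxiv_id: str, out_dir: Path) -> None:
    """Download one paper's LaTeX source, extract it, and convert .tex files to RST."""
    out_dir.mkdir(parents=True, exist_ok=True)
    archive = out_dir / f"{arxiv_id}.tar"
    urllib.request.urlretrieve(eprint_url(arxiv_id), archive)

    src_dir = out_dir / arxiv_id
    with tarfile.open(archive) as tar:  # handles gzipped tars transparently
        tar.extractall(src_dir)

    # Convert every top-level .tex file; pandoc does the LaTeX -> RST step.
    for tex in src_dir.glob("*.tex"):
        subprocess.run(
            ["pandoc", str(tex), "-f", "latex", "-t", "rst",
             "-o", str(tex.with_suffix(".rst"))],
            check=True,
        )
```

One caveat with this approach: pandoc ignores custom LaTeX macros it can't resolve, so papers with heavy macro use may need their `.sty`/preamble definitions inlined before conversion.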

I highly recommend doing something similar if you're working in a cutting-edge domain. Also, I'd like to know if anyone has recommendations to improve what I do.


Replies

ctoth yesterday at 7:43 PM

I've been working on ctoth/research-papers-plugin, the pipeline to actually get LLMs to extract the notes. I really like your insight re: RST over Markdown! It sounds like we're working on similar stuff, and I'll absolutely reach out :)

3abiton today at 3:55 PM

I am surprised you found RST better than markdown.

paulluuk yesterday at 7:34 PM

This sounds like it would work, but honestly, if you've already read all 30 papers fully, what do you still need the LLM to do for you? Just the boilerplate?

alex000kim yesterday at 7:26 PM

Sounds similar to "LLM Knowledge Bases": https://xcancel.com/karpathy/status/2039805659525644595

gessha today at 12:49 AM

I’ve been meaning to build something similar. I will report back once I have something to show.

Thanks for sharing!

satvikpendem yesterday at 8:06 PM

Does that even fit in the context? It seems like 30 papers' worth of content would just overflow it.

MrLeap yesterday at 7:34 PM

What is RST?
