Hacker News

simlevesque yesterday at 7:18 PM (9 replies)

I've been making skills from arXiv papers for a while. I have one for multi-object tracking, for example. It has a SKILL.md describing all the important papers (over 30) on the subject and a folder with each paper's full content as reStructuredText.

To feed arXiv papers to LLMs, I've found that RST gives the best token count/fidelity ratio: Markdown lacks precision, and LaTeX is too verbose. I have a script with each paper's URL, name, and date that downloads the LaTeX archives from arXiv, extracts them, transforms them to RST, and adds them to the right folder. Then I ask an LLM to make a summary from the full text, and then give other LLMs the full paper along with the summary and ask them to improve and proofread it. While this goes on I read the papers myself, and at the end I read the summaries; if I approve them, I add them to the skill. I also add, for each paper, info on how well the algorithms described do on common benchmarks.
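The commenter's script isn't shown, so here's a minimal sketch of the download-and-convert step, assuming pandoc handles the LaTeX-to-RST conversion and using arXiv's `e-print` endpoint for the source archives; the function names and file layout are illustrative, not the commenter's actual code:

```python
# Hypothetical sketch of the arXiv -> RST pipeline described above.
# Assumes pandoc is installed and on PATH.
import subprocess
import tarfile
import urllib.request
from pathlib import Path


def eprint_url(arxiv_id: str) -> str:
    """Build the arXiv e-print URL that serves a paper's LaTeX source archive."""
    return f"https://arxiv.org/e-print/{arxiv_id}"


def fetch_and_convert(arxiv_id: str, out_dir: Path) -> None:
    """Download one paper's LaTeX source, extract it, and convert .tex files to RST."""
    out_dir.mkdir(parents=True, exist_ok=True)
    archive = out_dir / f"{arxiv_id}.tar"
    urllib.request.urlretrieve(eprint_url(arxiv_id), archive)

    src_dir = out_dir / arxiv_id
    with tarfile.open(archive) as tar:  # handles gzipped tars transparently
        tar.extractall(src_dir)

    # Convert every top-level .tex file; pandoc does the LaTeX -> RST step.
    for tex in src_dir.glob("*.tex"):
        subprocess.run(
            ["pandoc", str(tex), "-f", "latex", "-t", "rst",
             "-o", str(tex.with_suffix(".rst"))],
            check=True,
        )
```

One caveat with this approach: pandoc ignores custom LaTeX macros it can't resolve, so papers with heavy macro use may need their `.sty`/preamble definitions inlined before conversion.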

I highly recommend doing something similar if you're working in a cutting-edge domain. Also, I'd like to know if anyone has recommendations to improve what I do.


Replies

ctoth yesterday at 7:43 PM

I've been working on ctoth/research-papers-plugin, the pipeline to actually get LLMs to extract the notes. I really like your insight re: RST over Markdown! It sounds like we're working on similar stuff, and I'll absolutely reach out :)

3abiton today at 3:55 PM

I am surprised you found RST better than markdown.

paulluuk yesterday at 7:34 PM

This sounds like it would work, but honestly, if you've already read all 30 papers fully, what do you still need the LLM to do for you? Just the boilerplate?

alex000kim yesterday at 7:26 PM

Sounds similar to "LLM Knowledge Bases": https://xcancel.com/karpathy/status/2039805659525644595

gessha today at 12:49 AM

I’ve been meaning to build something similar. I will report back once I have something to show.

Thanks for sharing!

satvikpendem yesterday at 8:06 PM

Does that even fit in the context? It seems like 30 papers' worth of content would just overflow it.

MrLeap yesterday at 7:34 PM

What is RST?
