Hacker News

Show HN: I built an AI dataset generator

147 points by matthewhefferon yesterday at 2:58 PM | 30 comments

Comments

mritchie712 yesterday at 3:52 PM

I use this prompt to spin up demos for customers at https://www.definite.app/:

    @Web Do some research on https://somecompany.com and write up a detailed overview of what the company does. What might their database schema look like?

    I need you to build a mock database for them in duckdb for a demo

Then:

    Create a uv project and write a python script to add demo data. Use Faker.

    @Web research how many customers they have. Make the database to appropriate scale.

Only takes a few minutes in Cursor, should work just as well in Claude Code. It works really well for the company's core business, but I still need to create a prompt to populate 3rd party sources (e.g. Stripe, Salesforce, Hubspot, etc.).
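
For readers who want to see the shape of what the second prompt asks for, here is a minimal sketch of a mock database sized from a customer count. It is not the script Cursor produces: to stay dependency-free it uses the standard library's `sqlite3` in place of DuckDB and `random` in place of Faker, and the table names, name lists, and the `build_mock_db` helper are all made up for illustration.

```python
import random
import sqlite3

# Tiny hard-coded name pools standing in for Faker (assumption for this sketch).
FIRST_NAMES = ["Ada", "Grace", "Linus", "Margaret", "Ken"]
LAST_NAMES = ["Lovelace", "Hopper", "Torvalds", "Hamilton", "Thompson"]


def build_mock_db(n_customers, orders_per_customer=3, seed=0):
    """Build an in-memory demo database scaled to n_customers.

    sqlite3 stands in for DuckDB here; the schema is hypothetical.
    """
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
    )
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, "
        "customer_id INTEGER REFERENCES customers(id), "
        "amount_cents INTEGER)"
    )

    customers = []
    for cid in range(1, n_customers + 1):
        first, last = rng.choice(FIRST_NAMES), rng.choice(LAST_NAMES)
        # The id in the address keeps emails unique even when names repeat.
        email = f"{first}.{last}.{cid}@example.com".lower()
        customers.append((cid, f"{first} {last}", email))
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", customers)

    orders = []
    for cid in range(1, n_customers + 1):
        # Each customer gets between 0 and 2x the target average number of orders.
        for _ in range(rng.randint(0, 2 * orders_per_customer)):
            orders.append((cid, rng.randint(500, 50_000)))
    conn.executemany(
        "INSERT INTO orders (customer_id, amount_cents) VALUES (?, ?)", orders
    )
    conn.commit()
    return conn


conn = build_mock_db(n_customers=1000)
print(conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0], "customers")
print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0], "orders")
```

The only input that matters for scale is `n_customers`, which is the number the "@Web research how many customers they have" step would supply.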

matthewhefferon yesterday at 3:00 PM

I was tired of digging through Kaggle and writing prompts over and over just to get fake data for dashboards and demos. So I built a little tool to help me out.

It uses GPT-4o to generate a detailed schema and business rules based on a few dropdowns (like business type, schema structure, and row count). Then Faker fills in the rows using those rules, which keeps it fast and cheap.
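
A minimal sketch of that split, under stated assumptions: the JSON below is a hand-written stand-in for what the LLM might return (the tool's real schema format is not shown in this thread), the standard library's `random` stands in for Faker, and `generate_rows`, `PLAN_PRICE_USD`, and the `plan_price_times_seats` rule name are invented for illustration. The point is only that the model is called once for the schema and rules, and the rows are then produced locally.

```python
import json
import random

# Hypothetical LLM output: column definitions plus one business rule.
SCHEMA_JSON = """
{
  "table": "subscriptions",
  "columns": [
    {"name": "plan", "type": "choice",
     "values": ["free", "pro", "enterprise"], "weights": [70, 25, 5]},
    {"name": "seats", "type": "int", "min": 1, "max": 50},
    {"name": "mrr_usd", "type": "derived", "rule": "plan_price_times_seats"}
  ]
}
"""

# Made-up per-seat prices used by the derived column.
PLAN_PRICE_USD = {"free": 0, "pro": 20, "enterprise": 100}


def generate_rows(schema, n_rows, seed=0):
    """Fill n_rows locally from the schema, with no further LLM calls."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        row = {}
        # Columns are filled in schema order, so a derived column must be
        # listed after the columns it depends on.
        for col in schema["columns"]:
            kind = col["type"]
            if kind == "choice":
                row[col["name"]] = rng.choices(
                    col["values"], weights=col["weights"], k=1
                )[0]
            elif kind == "int":
                row[col["name"]] = rng.randint(col["min"], col["max"])
            elif kind == "derived" and col["rule"] == "plan_price_times_seats":
                row[col["name"]] = PLAN_PRICE_USD[row["plan"]] * row["seats"]
            else:
                raise ValueError(f"unsupported column spec: {col}")
        rows.append(row)
    return rows


rows = generate_rows(json.loads(SCHEMA_JSON), n_rows=5)
for row in rows:
    print(row)
```

Because the rule is applied per row, derived values stay consistent with the columns they depend on, which is what plain per-column random values would not give you.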

You can preview the data, export as CSV or SQL, or spin up Metabase with one click to explore the data. It's open source and still in early stages, but I wanted to share it, get feedback, and see how you'd improve it.

paxys yesterday at 5:13 PM

Feature request - make the URL for the OpenAI API configurable. That way one can swap it out with Anthropic or any other LLM provider of their choice that provides an OpenAI-compatible API.
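
A rough sketch of what that could look like, not how the project does it today: the base URL is read from an environment variable and falls back to OpenAI's. The variable name `LLM_BASE_URL` and the `chat_completions_url` helper are hypothetical; the `/chat/completions` path is the one OpenAI-compatible servers expose.

```python
import os

DEFAULT_BASE_URL = "https://api.openai.com/v1"


def chat_completions_url(env=None):
    """Return the chat completions endpoint for the configured provider.

    LLM_BASE_URL is a made-up variable name for this sketch. An unset or
    empty value falls back to the OpenAI default.
    """
    env = os.environ if env is None else env
    base = (env.get("LLM_BASE_URL") or DEFAULT_BASE_URL).rstrip("/")
    return f"{base}/chat/completions"


print(chat_completions_url({}))
print(chat_completions_url({"LLM_BASE_URL": "https://llm.example.com/v1/"}))
```

If the project uses the official `openai` Python client, the same value could be passed as its `base_url` argument instead of building the URL by hand.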

b0a04gl yesterday at 4:13 PM

seen this pattern before too. faker holds shape without flow. real tables come from actions: retry, decline, manual review, all that. if you just set col types, you might miss why the row even happened. gen needs to simulate behavior, not format
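
One way to read "simulate behavior, not format" is to generate rows as the trace of a small state machine instead of filling columns independently. The sketch below does that for a payment flow; the statuses come from the comment, while the probabilities, the three-attempt cap, and the function names are made up for illustration.

```python
import random


def simulate_payment(payment_id, rng, max_attempts=3):
    """Return the event rows produced by one payment's lifecycle.

    Each attempt is approved, sent to manual review, or declined.
    A decline triggers a retry until max_attempts is reached.
    """
    events = []
    for attempt in range(1, max_attempts + 1):
        roll = rng.random()
        if roll < 0.80:  # assumed approval rate
            status = "approved"
        elif roll < 0.90:  # assumed manual review rate
            status = "manual_review"
        else:
            status = "declined"
        events.append(
            {"payment_id": payment_id, "attempt": attempt, "status": status}
        )
        if status != "declined":
            break  # terminal state: no further retries
    return events


def simulate(n_payments, seed=0):
    """Simulate n_payments lifecycles and return all event rows in order."""
    rng = random.Random(seed)
    rows = []
    for payment_id in range(1, n_payments + 1):
        rows.extend(simulate_payment(payment_id, rng))
    return rows


for row in simulate(5):
    print(row)
```

Every row now has a reason to exist: a second attempt only appears after a decline, and nothing follows an approval or a manual review, which is the kind of constraint per-column fake data cannot express.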

MattSayar yesterday at 5:45 PM

I used Anthropic's new Claude API integration with artifacts to make a probably-worse version that you can play with (after logging in of course).

https://claude.ai/public/artifacts/eb7d8256-6d21-4c85-af9b-c...

I used this GitHub repo as context and Claude Opus 4 to create this artifact.

ChrisMarshallNY today at 1:41 AM

I wrote a Swift CLI app to generate dummy user profiles for an app we wrote (I needed many more than we’ll actually get, and I needed screenshots for the App Store that didn’t have real user data).

It was pretty “dumb,” and used thispersondoesnotexist.com for profile pics.

klntsky today at 10:25 AM

You absolutely do not need docker as a requirement here

jasonthorsness yesterday at 5:07 PM

AI is really good at this sort of thing; I've been using an LLM with Faker for some time to load data for demos into SingleStore: https://github.com/jasonthorsness/loadit

reedlaw yesterday at 9:04 PM

"Dataset" connotes training data, but this seems to generate sample data, maybe for testing an application. Is there any use for synthetic datasets in ML?

smcleod yesterday at 9:16 PM

This is a bit confusing. I sort of expected it to be a bit like Kiln https://github.com/Kiln-AI/Kiln and generate datasets for AI, but it looks like the outputs are more just data / files than datasets?

wiradikusuma yesterday at 5:17 PM

"Stack: OpenAI API (GPT-4o for data generation)" -- I wonder if someday we'll have a generic API like how it's done in Java (e.g., Servlet API implemented by Tomcat, JBoss etc), so everyone can use their favorite LLM instead of having to register each provider like streaming services e.g. Disney+, Netflix, etc.

ajar8087 today at 12:38 AM

I was thinking more synthetic data to fit models like https://whitelightning.ai/

jmsdnns yesterday at 7:22 PM

depending on what you're using the synthetic data for, it is sometimes called distillation. here is a robust example from some upenn students: https://datadreamer.dev/

margotli yesterday at 3:40 PM

Feels like a useful tool for anyone learning analytics or just needing sample data to test with.