5

I was browsing the internet for chatbots, just for fun. But now I like the subject so much that I want to develop my own chatbot.
The first step is to find a good way to manage the "brain" of my chatbot. I think the best solution is to save everything in an XML file, isn't it?
So the file type is settled. Next comes the relationship between different nouns. Take a noun such as "tree": what is the best way to store that a tree has leaves, branches and roots, and that a tree needs water and sunlight to survive?
Should I save it like this, or in some other way?

This would be my XML for this tree-example:

<nouns>
    <noun id="noun_0">
        <name>tree</name>
        <relationship>
            <has>noun_1</has>
            <has>noun_2</has>
            <has>noun_3</has>
            <need>noun_4</need>
            <need>noun_5</need>
        </relationship>
    </noun>
    <noun id="noun_1">
        <name>root</name>
    </noun>
    <noun id="noun_2">
        <name>branch</name>
        <relationship>
            <has>noun_3</has>
        </relationship>
    </noun>
    <noun id="noun_3">
        <name>leaf</name>
    </noun>
    <noun id="noun_4">
        <name>water</name>
    </noun>
    <noun id="noun_5">
        <name>light</name>
    </noun>

    . . .

</nouns>
bmargulies
Paul Warkentin

4 Answers

4

Data Storage Choices: It Depends

Simple, non-learning bots: XML is fine

It looks like you already have a basic XML structure worked out. For just starting out, I'd say that's fine, especially for support-chat style bots built on simple keyword rules (if userMsg.contains('lega') then print('TOS & Copyright...')).
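As a rough illustration of how little code the XML approach needs for a small bot, here is a minimal Python sketch that loads a trimmed copy of the question's XML and resolves the noun ids to names. The function name `load_nouns` and the shape of the returned dictionary are assumptions made for this example, not part of the question.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the XML from the question (tree, two parts, one need).
XML = """
<nouns>
    <noun id="noun_0">
        <name>tree</name>
        <relationship>
            <has>noun_1</has>
            <has>noun_2</has>
            <need>noun_4</need>
        </relationship>
    </noun>
    <noun id="noun_1"><name>root</name></noun>
    <noun id="noun_2"><name>branch</name></noun>
    <noun id="noun_4"><name>water</name></noun>
</nouns>
"""


def load_nouns(xml_text):
    """Return {noun name: {relation tag: [related noun names]}}.

    Hypothetical helper: the whole file is parsed into memory at once.
    """
    root = ET.fromstring(xml_text)
    # First pass: map each id attribute to the noun's readable name.
    id_to_name = {n.get("id"): n.findtext("name") for n in root.iter("noun")}
    brain = {}
    # Second pass: replace the ids inside <relationship> with names.
    for noun in root.iter("noun"):
        relations = {}
        for rel in noun.findall("relationship/*"):
            relations.setdefault(rel.tag, []).append(id_to_name[rel.text])
        brain[noun.findtext("name")] = relations
    return brain


brain = load_nouns(XML)
print(brain["tree"])  # {'has': ['root', 'branch'], 'need': ['water']}
```

Because everything is held in one dictionary, this only stays practical while the file is small.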

Of course, switching to any new format will take time and overhead.

Learning, Complicated bots: database!

If you're looking to do something much larger, especially if you have CleverBot in mind, I think you're going to need a database. A single file becomes gigantic, and keeping all of it available in memory is resource intensive. For this kind of project, I'd recommend a database.
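To make the database option concrete, here is a minimal sketch using SQLite through Python's standard `sqlite3` module. The table names (`noun`, `relation`) and the helper functions are assumptions chosen for this example; any relational database with a similar two-table layout would work the same way.

```python
import sqlite3


def build_brain(path=":memory:"):
    """Open (or create) the database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS noun (
            id   INTEGER PRIMARY KEY,
            name TEXT UNIQUE NOT NULL
        );
        CREATE TABLE IF NOT EXISTS relation (
            subject_id INTEGER NOT NULL REFERENCES noun(id),
            kind       TEXT    NOT NULL,
            object_id  INTEGER NOT NULL REFERENCES noun(id),
            UNIQUE (subject_id, kind, object_id)
        );
    """)
    return conn


def add_fact(conn, subject, kind, obj):
    """Store one fact such as ('tree', 'has', 'leaf'), creating nouns as needed."""
    for name in (subject, obj):
        conn.execute("INSERT OR IGNORE INTO noun (name) VALUES (?)", (name,))
    conn.execute(
        "INSERT OR IGNORE INTO relation (subject_id, kind, object_id) "
        "SELECT s.id, ?, o.id FROM noun s, noun o "
        "WHERE s.name = ? AND o.name = ?",
        (kind, subject, obj),
    )
    conn.commit()


def related(conn, subject, kind):
    """Return the names related to `subject` by `kind`, sorted alphabetically."""
    rows = conn.execute(
        "SELECT o.name FROM relation r "
        "JOIN noun s ON s.id = r.subject_id "
        "JOIN noun o ON o.id = r.object_id "
        "WHERE s.name = ? AND r.kind = ? ORDER BY o.name",
        (subject, kind),
    )
    return [name for (name,) in rows]


conn = build_brain()
for part in ("leaf", "branch", "root"):
    add_fact(conn, "tree", "has", part)
for need in ("water", "light"):
    add_fact(conn, "tree", "need", need)
print(related(conn, "tree", "has"))  # ['branch', 'leaf', 'root']
```

Only the rows a query asks for are read, so the whole "brain" never has to sit in memory at once.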

Why? English is Complicated

A while back I wrote a naive Bayes spam sorter. It took about 10,000 pieces of spam to "train" it at a 7% accuracy rate, which took about 6 hours and 1.5 GB of RAM to hold the data in memory. That's a lot of data. English is very hard and can't really be broken into rules like if 'pony' then 'saddle', so for a bot to "learn" the best responses, your database is going to become massive very quickly.
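For readers who have not seen one, a naive Bayes text classifier can be sketched in a few lines. This is not the spam sorter described above, only a minimal multinomial variant with add-one smoothing; the class name, the whitespace tokenizer and the four training messages are assumptions made for illustration.

```python
import math
from collections import Counter


class NaiveBayes:
    """Multinomial naive Bayes over whitespace-separated words."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training messages
        self.vocabulary = set()      # every word seen during training

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts.setdefault(label, Counter()).update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # log P(label) + sum over words of log P(word | label)
            score = math.log(self.doc_counts[label] / total_docs)
            # Add-one smoothing keeps unseen words from zeroing the product.
            denom = sum(counts.values()) + len(self.vocabulary)
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


nb = NaiveBayes()
nb.train("win free money now", "spam")
nb.train("free prize claim now", "spam")
nb.train("meeting at noon tomorrow", "ham")
nb.train("lunch tomorrow with the team", "ham")
print(nb.classify("free money"))
```

The model stores a count for every distinct word per label, which hints at why the data grows so quickly with a realistic training set.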

rlb.usa
  • Do you mean a database like MySQL, or do you think I can use Core Data in Cocoa, too? I know that's not really a database. But the structure above is good enough to build something like CleverBot, isn't it? – Paul Warkentin Sep 16 '11 at 21:45
  • I don't know about Cocoa's Core Data. But to do something like CleverBot, not only do you have to teach it relationships (like what you have), but also how to talk and make English sentences (not sure if CleverBot uses Bayesian methods here or not...) – rlb.usa Sep 16 '11 at 21:49
  • OK, I understand. Would it work if I teach my program to recognize and analyze sentences, and whenever the software doesn't know a word, I tell it what the word is (a part of a tree, for example), so that the bot learns? – Paul Warkentin Sep 16 '11 at 21:53
  • Well, are you prepared to type out responses to 10,000 pieces of data? – rlb.usa Sep 16 '11 at 21:55
  • 1
    Bottom line here is that teaching != learning, and teaching introduces your own errors. It's better to train it (give it a set of data, let it crunch and figure out relationships on its own). – rlb.usa Sep 16 '11 at 21:56
  • But how does a bot know the relationship between a tree and a leaf if it has no other information? It could figure out the relationship on its own if it knows that a tree has branches and that these have leaves on their ends. So the bot knows that trees have leaves. But what about other data where there is no such connection? – Paul Warkentin Sep 16 '11 at 22:04
2

I think we can model this information as an ontology. An ontology lets you encode much richer information in terms of relations, attributes, levels etc. Formats such as RDF and OWL exist for this, and they are supported in almost all languages.

Most importantly, managing the data will be easy if you use an ontology editor. I would recommend Protege (http://protege.stanford.edu/); take a look at it.
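To give a feel for the RDF idea, every fact becomes a subject, predicate, object triple. The sketch below writes the tree facts from the question as N-Triples lines in plain Python, without an RDF library. The namespace `http://example.org/bot/` and the predicate names `has` and `needs` are made up for this example and do not come from any published vocabulary.

```python
BASE = "http://example.org/bot/"  # hypothetical namespace for this sketch

# (subject, predicate, object) facts taken from the question's tree example.
FACTS = [
    ("tree", "has", "root"),
    ("tree", "has", "branch"),
    ("tree", "has", "leaf"),
    ("branch", "has", "leaf"),
    ("tree", "needs", "water"),
    ("tree", "needs", "light"),
]


def to_ntriples(facts, base=BASE):
    """Serialize facts as N-Triples, one '<s> <p> <o> .' statement per line."""
    return "\n".join(
        f"<{base}{s}> <{base}{p}> <{base}{o}> ." for s, p, o in facts
    )


def objects(facts, subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return [o for s, p, o in facts if s == subject and p == predicate]


print(to_ntriples(FACTS))
print(objects(FACTS, "tree", "needs"))  # ['water', 'light']
```

A file written this way is ordinary RDF data, so it should be loadable by an ontology editor or an RDF library later on.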

Monis
  • Hello, these formats look great. Are they available for Cocoa, too? I didn't find anything. – Paul Warkentin Sep 20 '11 at 16:32
  • Hi, I have no idea about the support for Cocoa, though you can check Redland, which mentions Cocoa bindings. On the other hand, these are just formats/frameworks for knowledge representation; you can always write your own handlers. – Monis Sep 20 '11 at 19:20
1

You could also try something like the graph database that Freebase uses to store relations between various entities. Basically, it is a graph of nodes and edges, and each node has attributes and values for those attributes. The edges also have attributes, similar to nodes, and an edge connecting two nodes defines a relationship between them.
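The node-and-edge model described above can be sketched in memory as follows. This is not how Freebase or any particular graph database is implemented; the `Graph` class, its method names and the `kind` edge attribute are assumptions made for illustration.

```python
from collections import deque


class Graph:
    """A tiny property graph: nodes and edges both carry attribute dicts."""

    def __init__(self):
        self.nodes = {}  # node name -> attribute dict
        self.edges = []  # list of (source, target, attribute dict)

    def add_node(self, name, **attrs):
        self.nodes.setdefault(name, {}).update(attrs)

    def add_edge(self, source, target, **attrs):
        # Make sure both endpoints exist before connecting them.
        self.add_node(source)
        self.add_node(target)
        self.edges.append((source, target, attrs))

    def neighbors(self, source, kind):
        """Targets of edges leaving `source` whose 'kind' attribute matches."""
        return [t for s, t, a in self.edges
                if s == source and a.get("kind") == kind]

    def reachable(self, start, kind):
        """Nodes reachable from `start` over edges of one kind (breadth-first)."""
        seen, queue = [], deque([start])
        while queue:
            node = queue.popleft()
            for nxt in self.neighbors(node, kind):
                if nxt not in seen and nxt != start:
                    seen.append(nxt)
                    queue.append(nxt)
        return seen


g = Graph()
g.add_node("tree", category="plant")
g.add_edge("tree", "root", kind="has")
g.add_edge("tree", "branch", kind="has")
g.add_edge("branch", "leaf", kind="has")
g.add_edge("tree", "water", kind="need")
print(g.reachable("tree", "has"))  # ['root', 'branch', 'leaf']
```

Following `has` edges transitively is what lets the bot conclude that a tree has leaves even when only "tree has branch" and "branch has leaf" were stored.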

London guy
0

You are probably looking at a database. Any serious NLP system would use one, unless it is a rule-based system that operates on a small set of rules. Think about whether you would want to write a piece of C code that handles a 5 MB XML file. I most definitely would not. Stanford University hosts a nice demo if you are interested in the linguistic side of it.

mbatchkarov