How do you find the definition of a function when all you have is a huge set of input/output pairs?

Question

Suppose that you were given a list of input/output pairs:

f 0 = 0
f 1 = 2
f 2 = 1
f 3 = -1
f 4 = 0
f 5 = 0
f 6 = -76
f 7 = -3
f 8 = 3
f 9 = -1
f 10 = -1
f 11 = -6
f 12 = -1
f 13 = -1
f 14 = 4
f 15 = -2
f 16 = -10
f 17 = 0
f 18 = 0
f 19 = -1
f 20 = 2
f 21 = 3
f 22 = 0
f 23 = 4
f 24 = 2
f 25 = -1
f 26 = 0
f 27 = 0
f 28 = -4
f 29 = -2
f 30 = -14

Now suppose you were asked to find the definition of f using a proper, small mathematical formula instead of an enumeration of values. That is, the answer should be f x = floor(tan(x*x-3)) (or similar), because that is a small formula that is correct for every input. How would you do it?

Well, there are infinite solutions, so presumably you're looking for some sort of optimal one? — Hamish, May 07 '14 at 02:49
That code in your question is a valid Haskell definition of that function. It looks like you already have one :-) — Lee Duhem, May 07 '14 at 02:58
@Hamish I'm just looking for an algorithm that can find the definition on the Spoil link by feeding it with input/output pairs. If there are many answers, maybe I can just feed it more input/output pairs? To be specific, the actual function I'm trying to find is the interpreter for an extension of Simply Typed Lambda Calculus. Coding it manually is very complicated and error-prone (my current implementation has bugs in a few rare cases), and finding the problem is far from trivial. — MaiaVictor, May 07 '14 at 03:00
@Viclib you aren't going to find a single algorithm that works in every case, otherwise we'd have perfect AI by now and programs would be written solely by providing inputs and outputs. Instead, you could write algorithms that can _approximate_ numeric functions through various fitting methods, such as Taylor series. — bheklilr, May 07 '14 at 03:09
But what if the function I am trying to find does not operate on numbers, but trees? — MaiaVictor, May 07 '14 at 03:10
@Viclib When you discover an algorithm that lets you give input/output pairs and have it provide a valid, accurate, and complete function definition for arbitrary input and output types, let me know. I'd like to use it to make my job unnecessary. That being said, there are techniques for taking input data and known outputs and analyzing it in particular domains, but nothing that works entirely in general. — bheklilr, May 07 '14 at 03:18
@Viclib, the closest thing you can get for mathematical functions is [interpolation](http://en.wikipedia.org/wiki/Interpolation) (to obtain functional expression inside of the points domain) and [extrapolation](http://en.wikipedia.org/wiki/Extrapolation) (to obtain functional expression outside of the points domain). These are all numerical methods, however. It is impossible to do what you want in general, especially when domain of your function is not numbers. — Vladimir Matveev, May 07 '14 at 03:28
@Viclib Anything can be encoded as numbers, including trees. — molbdnilo, May 07 '14 at 08:54
@molbdnilo Not entirely true, that would depend on the cardinality of your data type and what kind of number you're trying to encode them as. For example, there's no way to encode real numbers as integers since the reals are uncountably infinite while the integers are countably infinite. There are also different ["levels" of uncountably infinite](http://en.wikipedia.org/wiki/Aleph_number). So no, not everything can be coded directly as numbers. Even if it were possible, it would be very difficult to do so or interpret them, as you would want the encoding to be isomorphic. — bheklilr, May 07 '14 at 13:01
@bheklilr Gosh, you're even more of a nitpicker than me. ;-) When dealing with actual computers that actual people actually use, I usually don't worry about infinities too much. — molbdnilo, May 07 '14 at 13:17
@molbdnilo Sorry, it's my math background =) The problem is that while computers can't handle true infinite structures, Haskell can simulate them, and it can tell the difference between an infinite structure and a finite one that's still equal up to its last elements. If you asked if `[1] == [1..]`, Haskell would return `False`, so logically you would have to have `encode [1] /= encode [1..]`, otherwise your `encode` would not be an isomorphism. Even if you could write an isomorphic `encode`, how would it behave topologically? Would continuous functions still be continuous? continued... — bheklilr, May 07 '14 at 14:02
... What about other algebraic properties such as commutativity and associativity? There's a lot of questions to answer when trying to encode arbitrary, possibly infinite data types as numbers. — bheklilr, May 07 '14 at 14:03
@Viclib "I'm just looking for an algorithm that can find the definition on the Spoil link by feeding it with input/output pairs". You've "just" restated the problem -- what's the difference between the one in the Spoil link, and the other infinite number of solutions? Once you've defined that, you can start on some sort of algorithm, maybe a genetic approach, for example. But as it is, you haven't defined the problem sufficiently to solve it :) — Hamish, May 08 '14 at 02:15
@Hamish I've stated that the smallest satisfying solution is what I want, but a bigger solution is fine as long as it is correct. The only thing that is not correct is enumerating the limited set of input/output pairs because that is obviously huge and not generic. I've actually managed to solve the problem myself, refer to my answer below. Unfortunately it obviously won't scale. I'd like to see an improvement that does. — MaiaVictor, May 08 '14 at 03:31

score 8 · Accepted Answer · answered May 07 '14 at 04:25

So let's simplify. You want a function such that

f 1 = 10
f 2 = 3
f 3 = 8

There exists a formula for immediately finding a polynomial function which meets these demands. In particular

f x = 6 * x * x - 25 * x + 29

works. It turns out that if you have finitely many points from the graph of any function

{ (x_1, y_1), (x_2, y_2), ..., (x_i, y_i) }

you can immediately build a polynomial which exactly matches those inputs and outputs.
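One standard way to build such a polynomial is Lagrange interpolation. The following is a minimal sketch, assuming exact rational arithmetic and pairwise distinct x-values; the names `lagrange` and `samples` are made up for illustration and are not from the answer.

```haskell
-- Sketch: Lagrange interpolation over exact rationals.
-- `lagrange` and `samples` are hypothetical names used only for illustration.
-- Assumes the x-coordinates are pairwise distinct.
lagrange :: [(Rational, Rational)] -> Rational -> Rational
lagrange pts x = sum [ yi * basis xi | (xi, yi) <- pts ]
  where
    -- Basis polynomial: equals 1 at xi and 0 at every other sample point.
    basis xi = product [ (x - xj) / (xi - xj) | (xj, _) <- pts, xj /= xi ]

-- The three samples used in the answer: f 1 = 10, f 2 = 3, f 3 = 8.
samples :: [(Rational, Rational)]
samples = [(1, 10), (2, 3), (3, 8)]
```

Evaluating `lagrange samples` at 1, 2 and 3 reproduces 10, 3 and 8, and at 4 it agrees with `6 * x * x - 25 * x + 29`, since a quadratic through three points is unique.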

So, given that polynomials like this exist you're never going to solve your problem (finding a particular solution like floor(tan(x*x-3))) without enforcing more constraints. In particular, if you don't somehow outlaw or penalize polynomials then I'm always going to deliver them to you.

In general, what you'd like to do is (a) define a search space and (b) define a metric of fitness, also known as a loss function. If your search space is finite then you have yourself a solution immediately: rank every element of your search space according to your loss function and select randomly from the set of solutions which tie for best.
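As a minimal sketch of (a) and (b), assume a tiny hand-picked search space of named candidate functions and a sum-of-squared-errors loss. Both choices, and the names `candidates`, `loss` and `rank`, are assumptions made for illustration.

```haskell
import Data.List (sortOn)

-- (a) A hypothetical finite search space: a handful of named candidates.
candidates :: [(String, Double -> Double)]
candidates =
  [ ("x + 1",     \x -> x + 1)
  , ("2 * x",     \x -> 2 * x)
  , ("x * x",     \x -> x * x)
  , ("x * x + 1", \x -> x * x + 1)
  ]

-- (b) A loss function: sum of squared errors over the input/output pairs.
loss :: [(Double, Double)] -> (Double -> Double) -> Double
loss pairs f = sum [ (f x - y) * (f x - y) | (x, y) <- pairs ]

-- Rank the whole search space by loss, lowest (best) first.
rank :: [(Double, Double)] -> [(String, Double)]
rank pairs = sortOn snd [ (name, loss pairs f) | (name, f) <- candidates ]
```

For the pairs `[(0,1),(1,2),(2,5)]`, the candidate `"x * x + 1"` ranks first with loss 0. This sketch simply takes the sorted order; ties are left in list order rather than broken randomly.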

What it sounds like you're asking for is much harder though—if you're looking through the space of all possible programs then that space is unbelievably large. Searching it exhaustively is impossible unless we constrain ourselves heavily or accept approximation. Secondly, we must have very good understanding of your loss function and how it interacts with the search space as we'll want to make intelligent guesses to move forward through this vast space.
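To get a feel for how fast such a space grows, here is a rough count of expression trees by height. It is only a sketch: the grammar sizes (6 leaves, 4 unary operators, 5 binary operators) are assumed for illustration, and `countTerms` is a made-up name.

```haskell
-- Number of distinct expression trees of height at most d, for a grammar with
-- the given numbers of leaves, unary operators and binary operators.
-- A tree is either a leaf, a unary operator over one smaller tree,
-- or a binary operator over two smaller trees.
countTerms :: Integer -> Integer -> Integer -> Int -> Integer
countTerms leaves unary binary d = iterate step leaves !! d
  where
    step c = leaves + unary * c + binary * c * c
```

With 6 leaves, 4 unary and 5 binary operators, heights 0, 1 and 2 give 6, 210 and 221346 trees, and height 3 already exceeds 10^11. The count roughly squares with every extra level, which is why exhaustive search stalls so quickly.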

You mention genetic algorithms—they're often lauded for this kind of work and indeed they can be a method of driving search through a large space with an uncertain loss function, but they also fail as often as they succeed. Someone who is genuinely skilled at using genetic algorithms to solve problems will spend all of their time crafting the search space and the loss function to direct the algorithm toward meaningful answers.

Now this can be done for general programs if you're careful. In fact, this was the subject of last year's ICFP programming contest. In particular, search on this page for "Rules of the ICFP Contest 2013" to see the set up.

That is kind of a solution, but not what I was trying to get at. I've managed to find the correct function (the one with tan) with a brute-force search over mathematical expressions in a few seconds. I've also learned how much harder it becomes when the tree grows even slightly. Sad. If only I had a much, much more powerful computer, I would probably be able to automate a lot of my work this way (: — MaiaVictor, May 08 '14 at 01:04
You succeeded by constraining your search space ;) The problem with program ASTs is that they grow really quickly and that they don't explore "interesting" programs any more quickly than they explore boring or broken ones. — J. Abrahamson, May 08 '14 at 14:55
Uh huh! I am aware this is unfeasible for anything larger than the absolute trivial, but that was the point of the question (: — MaiaVictor, May 08 '14 at 18:39
Oh just noticed I forgot to mark/thank you. Great answer, thanks. — MaiaVictor, Nov 23 '14 at 15:06

score 2 · Answer 2 · answered Nov 22 '14 at 12:03

I think feed-forward neural networks (FFNN) and genetic programming (GP) are good techniques for approximating complicated functions. If you need the function as a polynomial, use GP; otherwise an FFNN is very simple, and MATLAB has a library for it.

Afshin Khoshraftar

This is 100% the solution I would go with. One caveat with FF ANNs may be that they are very good at interpolation, but not so much with extrapolation. So you may have trouble with something like f 50. – zcleghern Apr 07 '15 at 15:41

score 0 · Answer 3 · answered May 08 '14 at 03:29

I think the "interpolation" answers miss what I am asking. Maybe I was not clear enough, but fortunately I've managed to get a semi-satisfactory answer to my question myself, using a brute-force search algorithm. Using only a list of input/output pairs, as presented in the question, I was able to recover the original function. The comments in this snippet should explain it:

import Control.Monad.Omega

{- First we define a simple evaluator for mathematical expressions -}
data A =    Add A A | Mul A A | Div A A | Sub A A | Pow A A |
            Sqrt A | Tan A | Sin A | Cos A |
            Num Float | X deriving (Show)
eval :: A -> Float -> Float
eval (Add a b) x = eval a x + eval b x
eval (Mul a b) x = eval a x * eval b x
eval (Div a b) x = eval a x / eval b x
eval (Sub a b) x = eval a x - eval b x
eval (Pow a b) x = eval a x ** eval b x
eval (Sqrt a) x = sqrt (eval a x)
eval (Tan a) x = tan (eval a x)
eval (Sin a) x = sin (eval a x)
eval (Cos a) x = cos (eval a x)
eval (Num a) x = a
eval X x = x

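{- Now we enumerate all possible terms of that grammar.
   Omega interleaves the branches fairly, so every finite term is
   eventually reached. Note: the unary cases (7-10) also draw an unused y,
   which is harmless but wasteful. -}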
allTerms = do
  which <- each [1..15]
  if which == 1 then return X
  else if which == 2 then do { x <- allTerms; y <- allTerms; return (Add x y) }
  else if which == 3 then do { x <- allTerms; y <- allTerms; return (Mul x y) }
  else if which == 4 then do { x <- allTerms; y <- allTerms; return (Div x y) }
  else if which == 5 then do { x <- allTerms; y <- allTerms; return (Sub x y) }
  else if which == 6 then do { x <- allTerms; y <- allTerms; return (Pow x y) }
  else if which == 7 then do { x <- allTerms; y <- allTerms; return (Sqrt x) }
  else if which == 8 then do { x <- allTerms; y <- allTerms; return (Tan x) }
  else if which == 9 then do { x <- allTerms; y <- allTerms; return (Sin x) }
  else if which == 10 then do { x <- allTerms; y <- allTerms; return (Cos x) }
  else return (Num (which-10))

{- Then we create 20 input/output pairs of a random function -}
fun x = x+tan(x*x)
maps = let n=20 in zip [1..n] (map fun [1..n])

{- This tests a function in our language against a map of in/out pairs -}
check maps f = all test maps where
    test (a,b) = (eval f a) == b

{-  Now let's see if a brute-force search can recover the original program
    from the list of input/output pairs alone! -}
main = print $ take 1 $ filter (check maps) (runOmega allTerms)

{-  Output: [Add X (Tan (Mul X X))]
    Yay! Although there are infinitely many possible solutions,
    the first solution found is actually our initial program.
-}
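One caveat about `check`: it compares `Float` values with `==`, which works here because the samples were produced by the same floating-point operations the search replays. For pairs that come from elsewhere, a tolerance-based comparison may be safer. The following is only a sketch; `closeEnough`, `checkApprox` and the relative tolerance `eps` are assumptions, not part of the code above.

```haskell
-- Hypothetical tolerance-based comparison of two Floats.
-- NaN never matches; infinities must match exactly.
closeEnough :: Float -> Float -> Float -> Bool
closeEnough eps a b
  | isNaN a || isNaN b           = False
  | isInfinite a || isInfinite b = a == b
  | otherwise = abs (a - b) <= eps * max 1 (max (abs a) (abs b))

-- Hypothetical variant of `check` that works on any candidate function
-- and accepts outputs within the given relative tolerance.
checkApprox :: Float -> [(Float, Float)] -> (Float -> Float) -> Bool
checkApprox eps pairs f = all (\(a, b) -> closeEnough eps (f a) b) pairs
```

To use it with the evaluator above, one could pass `eval term` as the candidate function, for example `checkApprox 1e-4 maps (eval term)`.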

score -1 · Answer 4 · answered May 07 '14 at 03:02

One possible definition goes like this:

f 0 = 0
f 1 = 2
f 2 = 1
f 3 = -1
f 4 = 0
f 5 = 0
f 6 = -76
f 7 = -3
f 8 = 3
f 9 = -1
f 10 = -1
f 11 = -6
f 12 = -1
f 13 = -1
f 14 = 4
f 15 = -2
f 16 = -10
f 17 = 0
f 18 = 0
f 19 = -1
f 20 = 2
f 21 = 3
f 22 = 0
f 23 = 4
f 24 = 2
f 25 = -1
f 26 = 0
f 27 = 0
f 28 = -4
f 29 = -2
f 30 = -14

n. m. could be an AI

This fails for `(f 31)`. It is supposed to find `f x = floor(tan(x*x-3))` (and other equivalent functions, perhaps). – MaiaVictor May 07 '14 at 03:04

No requirement is stated for `f 31`. No requirement is stated for how the definition in question should look like. If you have such requirements, please include them in the question. – n. m. could be an AI May 07 '14 at 03:06

I don't know how to state this formally. I hoped the intention was clear from what I asked. I updated the question to be more specific. – MaiaVictor May 07 '14 at 03:13

@n.m. Your tone is not appropriate, and your answers are in no way helpful to the OP. Learn from J. Abrahamson how to write helpful, constructive, polite answers to questions that lack formal rigor. – Bolo May 07 '14 at 11:48
