Agent Smart

In a previous post, i looked at OpenFace. Even though it needs “little training data”, this is still way more than one would want to collect manually.

So i’ve begun researching web agents. There seem to be two good solutions:

  1. BeautifulSoup – great if the page is agent friendly. Quick and reliable.
  2. Selenium – good if the page attempts to block agents. It drives a full web browser, so it can run scripts and fake cursor movement, scrolling, etc.

Good sample: StackOverflow: Using Python and BeautifulSoup (Saved webpage source codes into a local file)

I’ve hacked up some code that can slurp a bunch of data and put it into a database. It is mostly focused on image / media data, and cuts down on duplicate data by means of content-addressed storage, using the SHA-256 hash of the content as its ID. It also has a tag system where any content can be tagged, which would normally happen through content identifiers. So roughly:

name * -> 1 ID 1 -> 1 content

tag * -> * content
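
The relations above could be sketched as three SQLite tables. This is only a minimal illustration, not the actual code: the table names, column names, and helper functions (`open_store`, `put`, `add_tag`, `find_by_tag`) are all made up here, and SQLite is an assumption since the post doesn't say which database is used.

```python
import hashlib
import sqlite3

# Hypothetical schema mirroring the relations above:
#   name * -> 1 ID 1 -> 1 content   (many names may point at one content ID)
#   tag  * -> * content             (many-to-many via a join table)
SCHEMA = """
CREATE TABLE IF NOT EXISTS content (
    id   TEXT PRIMARY KEY,   -- SHA-256 hex digest of the bytes
    data BLOB NOT NULL
);
CREATE TABLE IF NOT EXISTS name (
    name       TEXT PRIMARY KEY,   -- e.g. the URL the data came from
    content_id TEXT NOT NULL REFERENCES content(id)
);
CREATE TABLE IF NOT EXISTS tag (
    tag        TEXT NOT NULL,
    content_id TEXT NOT NULL REFERENCES content(id),
    PRIMARY KEY (tag, content_id)
);
"""


def open_store(path=":memory:"):
    """Open (or create) the store and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def put(conn, name, data):
    """Store bytes under their SHA-256 digest and point `name` at them."""
    content_id = hashlib.sha256(data).hexdigest()
    with conn:  # commits on success, rolls back on error
        # Identical bytes hash to the same ID, so duplicates are skipped.
        conn.execute(
            "INSERT OR IGNORE INTO content (id, data) VALUES (?, ?)",
            (content_id, data),
        )
        conn.execute(
            "INSERT OR REPLACE INTO name (name, content_id) VALUES (?, ?)",
            (name, content_id),
        )
    return content_id


def add_tag(conn, content_id, tag):
    """Attach a tag to a piece of content; repeating a tag is a no-op."""
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO tag (tag, content_id) VALUES (?, ?)",
            (tag, content_id),
        )


def find_by_tag(conn, tag):
    """Return the IDs of all content carrying the given tag."""
    rows = conn.execute(
        "SELECT content_id FROM tag WHERE tag = ? ORDER BY content_id",
        (tag,),
    )
    return [row[0] for row in rows]
```

With this layout, calling `put` twice with the same bytes under two different names leaves a single row in `content` and two rows in `name`, which is where the reduction of data comes from.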

It turns out that for anything but the most basic data, knowledge representation (KR) is a problem that is still heavily researched. After some initial prototyping, i’ve decided to put this on hold for now until we get more physical robotics working. Let me know if you’d like to help out here…

