February 2010

Spam Filtering by Learning a Pattern Language

The New Scientist describes a new method for spam detection by learning patterns. The method exploits the spammers’ most powerful weapon - the automated generation of many similar messages (i.e., from some grammar in a formal language) - and turns it against them. The article reports that a pattern can reliably be learned from about 1000 examples captured from a bot, allowing the method to classify new messages accurately and with zero false positives. This sounds really exciting given my full spam folder.

However, I’m cautious. The article is sparse on technical details, so I might be making some wrong assumptions here. First, the reported zero false positives seem to refer to discriminating spam generated by that particular spam grammar from other messages. At least that’s how I understand it. Second, it seems from the article that they only learn from positive examples. Overall, the technique sounds to me like they are learning a pattern language. Pattern languages are a class of languages that overlaps with the linear and context-sensitive languages (Chomsky hierarchy). Unfortunately they don’t have a real Wikipedia page, so I’ll try to give a bit of background. The closest example I can give right now would be regular expressions with back-references. I’m not sure if this is an accurate description of all possible patterns, but it’s close enough for an example.
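To make the back-reference analogy concrete, here is a minimal sketch in Python. It assumes the usual textbook reading of a pattern: a sequence of constants and variables, where every variable stands for one non-empty string and repeated occurrences of a variable must be replaced by the same string. The token format, the function names and the sample spam template are made up for illustration; nothing here is taken from the article.

```python
import re

def pattern_to_regex(pattern):
    """Translate a pattern into a regex that uses back-references.

    `pattern` is a list of ("const", text) and ("var", name) tokens.
    The first occurrence of a variable becomes a named group that matches
    any non-empty string; later occurrences become back-references to it.
    Variable names must be valid Python identifiers.
    """
    parts = []
    seen = set()
    for kind, value in pattern:
        if kind == "const":
            parts.append(re.escape(value))
        elif value in seen:
            parts.append("(?P=%s)" % value)
        else:
            seen.add(value)
            parts.append("(?P<%s>.+)" % value)
    return re.compile("".join(parts), re.DOTALL)

def matches(pattern, text):
    """True if the whole text can be generated from the pattern."""
    return pattern_to_regex(pattern).fullmatch(text) is not None

# Hypothetical spam template: "Buy x now at y, x is great"
TEMPLATE = [
    ("const", "Buy "), ("var", "x"),
    ("const", " now at "), ("var", "y"),
    ("const", ", "), ("var", "x"),
    ("const", " is great"),
]

if __name__ == "__main__":
    print(matches(TEMPLATE, "Buy pills now at shop.example, pills is great"))
    print(matches(TEMPLATE, "Buy pills now at shop.example, watches is great"))
```

The second message is rejected because both occurrences of `x` have to be filled with the same string, which is exactly what a plain regular expression without back-references cannot express.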

I don’t know how the specific technique mentioned in the article works in detail, but I’ve learned two things about learning grammars from text: (a) we can’t learn all linear or context-sensitive languages, only the pattern languages; (b) learning patterns without negative examples leads to over-generalization very quickly.

While I haven’t worked with learning grammars in a long while, the only algorithm of which I’m aware is the Lange-Wiehagen algorithm (Steffen Lange and Rolf Wiehagen; Polynomial-time inference of arbitrary pattern languages. New Generation Computing, 8(4):361-370, 1991). This algorithm is not a consistent learner, but can learn all pattern languages in polynomial time. There might be better ones available by now, but learning grammars is not that popular in the machine learning community right now. I’m sure there are some other interesting applications besides spam filtering. Maybe it’s time for a revival.

Overall, it sounds like a promising new anti-spam technique, but I’d like to see some more realistic testing done. There are some obvious ways for spammers to make learning these patterns harder, but either way I’m curious - maybe the inventors of this technique discovered a better way to learn patterns? Maybe by using some problem-specific domain knowledge?


Strong profiling is not mathematically optimal for discovering rare malfeasors (on rare event detection)

Just in time for the latest Christmas terror scare, I came across an interesting paper: “Strong profiling is not mathematically optimal for discovering rare malfeasors” (William H. Press; PNAS 106(6), p. 1716-1719 www.pnas.org/cgi/doi/10.1073/pnas.0813202106). In the paper, the author investigates whether profiling by nationality or ethnicity can be justified mathematically and tries to answer the question of how much screening we must do, on average, to catch the bad guys in the crowd. Rare-event detection is hard as it is, and it’s interesting to see it examined from the sampling perspective. It’s an interesting and short read. Long story short, it shows that screening on an indiscriminate feature like nationality or ethnicity is not optimal (nor is any other screening done at least in proportion to a prior probability) and wastes resources.
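A small back-of-the-envelope sketch of the sampling argument, in Python. The model is my reading of the paper and should be treated as an assumption: there is exactly one malfeasor, person i is the malfeasor with prior probability p_i, and each screening independently picks person i with probability q_i (no memory of earlier screenings). Under that model the expected number of screenings until the malfeasor is picked is the sum of p_i / q_i. The square-root rule compared below is, as far as I understand it, the alternative the paper recommends; the population and the priors are made up.

```python
import math

def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def expected_screenings(priors, sampling):
    """Mean number of screenings until the single malfeasor is picked.

    If person i is the malfeasor (probability p_i) and is picked with
    probability q_i per screening, it takes 1 / q_i screenings on average.
    """
    return sum(p / q for p, q in zip(priors, sampling))

# Made-up population: 10 "high prior" people, each 100 times more suspicious
# than each of the remaining 990 people.
PRIORS = normalize([100.0] * 10 + [1.0] * 990)

UNIFORM = normalize([1.0] * len(PRIORS))                # ignore the priors
PROPORTIONAL = PRIORS                                   # strong profiling
SQUARE_ROOT = normalize([math.sqrt(p) for p in PRIORS])

if __name__ == "__main__":
    for name, scheme in [("uniform", UNIFORM),
                         ("proportional", PROPORTIONAL),
                         ("square root", SQUARE_ROOT)]:
        print(name, round(expected_screenings(PRIORS, scheme), 1))
```

With these numbers, uniform sampling and sampling in proportion to the prior both need 1000 screenings on average, while the square-root scheme needs about 597. Profiling in proportion to the prior keeps re-screening the same few people, which is where the wasted resources come from.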


Starcraft AI competition

UCSC is holding a Starcraft AI competition. I wish I had the time to participate… Starcraft is one of my all time favorite games, and writing a better AI for a real-time strategy game is certainly interesting and challenging.


Random characters in text mode -> graphics card

Quick note: One of the strangest things I’ve seen in a while happened during my desktop’s boot-up today. There were random lines across the manufacturer’s BIOS logo, then all sorts of weird and random characters during the BIOS messages and in the boot manager. The monitor was fine, the power-on self test didn’t indicate anything fishy and even Linux would boot fine (but only in 640 x 480 resolution). If it had been the RAM or something, chances are the OS would have crashed or complained. Obviously it wasn’t a driver or OS issue, as the computer hadn’t even booted up yet. It turns out it was the graphics card (an old 7xxx nVidia), and replacing it with a newer one did the trick. I’m a bit puzzled how the graphics card could have caused all those weird characters to show up, but I’m guessing the graphics RAM might have died or something like that.


Programs stealing the input focus

Ok, this post is more of a rant. I’m one of those people who are a bit impatient when starting programs on their desktop. When I start up my Windows machine, I click on several buttons in the “quicklaunch” bar to fire up what I’ll need - Outlook, R / SPSS / SAS, Winamp etc. So why do all sorts of dialogs pop up in my face while I am typing? Why does Winamp have to pop up while I’m typing my email password? And why do they have to switch the input focus so that whatever I happen to be typing ends up in the wrong window? This is so annoying. Stealing the input focus is a known problem that has been written about countless times. It’s even against the GUI programming guidelines. “Do not steal the input focus” - what’s so difficult about that?

As a first consequence Norton Internet Security is now gone from my machine forever after it kept reminding me constantly - specifically with an uncanny accuracy when I was busy playing computer games - that I need to renew my anti-virus subscription or bad things will happen to my computer. And bad things did happen to my video game. But not anymore…

On the upside, there’s a carefully hidden option in the Windows XP Powertoys (TweakUI) that is supposed to prevent programs from stealing the input focus. It made things better, but doesn’t seem to work all the time.


Famous bugs in AI game engine caught on tape

Found this on aigamedev and some of them are really hilarious: AI game bugs caught on tape


Vundo?

My girlfriend caught a new (?) version of some malware on her machine; what a nuisance, and scanners don’t seem to recognize this thing… Some think it’s Vundo, others just complain that it’s packed. It doesn’t quite fit the Vundo description, though. MD5 8e06f428178cbfbf12a8372fa6b16d0d, size 50688 bytes. It registers the CLSID {721ee819-b263-42e0-a594-b82fd0f24bdf}, a browser helper object and various things for notifications by the LSA service, plus AppInit_DLLs. It constantly restores these keys, and it seems that even stomping out all the threads that this DLL-thing spawned everywhere won’t help. I overlooked something and it just comes back as soon as the next GUI app is started. As soon as I know how to get rid of it, I’ll update this post.
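If you want to check whether a suspicious DLL on your own machine is the same sample, comparing its size and MD5 against the values above is enough. A minimal Python sketch; the function names are made up and the path in the usage line is a placeholder:

```python
import hashlib
import os

# Values quoted in the post above.
KNOWN_MD5 = "8e06f428178cbfbf12a8372fa6b16d0d"
KNOWN_SIZE = 50688

def file_md5(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_sample(path):
    """True if the file has the same size and MD5 as the sample in the post."""
    return os.path.getsize(path) == KNOWN_SIZE and file_md5(path) == KNOWN_MD5

if __name__ == "__main__":
    import sys
    # Usage: python check_sample.py <path-to-suspicious-file>
    for candidate in sys.argv[1:]:
        print(candidate, matches_sample(candidate))
```

A mismatch only means the file is not byte-for-byte this sample; since the thing is packed, other variants will most likely have a different hash.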

Update 1:

It hooks AppInit, the Run key (using rundll32 to start itself) and the LSA notification (something HijackThis doesn’t check). I can kill all the threads that this thing creates in each executable with Process Explorer, and Regmon then shows that the constant checking of the AppInit key stops. However, as soon as the next GUI application is started, it is back. So I deleted all the event and mutex objects that this thing created in each executable (I found some clues in the strings in memory), again making sure that I didn’t miss anything, and this time it took a few seconds for it to come back. There’s “something” that uses OpenProcess to load the DLL into the process space. Since the strings in the DLL show that it opens and writes to process memory, this wouldn’t be surprising; the question is how I find the threads that do this. Other odd things: svchost starts a window-less iexplore.exe, presumably to upload some stuff to a server. It might have some sloppy rootkit (RootkitRevealer went nuts with file-system discrepancies), because I can’t find the DLL referenced in the keys (using “dir”), yet tab completion finds it and overwriting the non-existent file gets an access denied. Some interesting strings from the decrypted memory image of the DLL:

wscntfy.exe wscntfy_mtx mrt.exe explorer.exe iexplore.exe opera.exe firefox.exe Global\ mrt.exe explorer.exe iexplore.exe opera.exe firefox.exe dll .tmp exe rdl InprocServer32 \Internet Explorer\PhishingFilter Enabled Rundll32.exe ” ThreadingModel Both \Internet Explorer\ieuser.exe -Embedding tmp MS Juan cpm las SHELL32.dll ole32.dll OLEAUT32.dll vector<T> too long unknown ntoskrnl.exe ntkrnlmp.exe ntkrnlpa.exe ntkrpamp.exe Mozilla/4.0 (compatible; MSIE 6.0) WinNT 5.1 LoadLibraryW Kernel32 SeDebugPrivilege http://82.98.235.208/form/index.html exficale.com pancolp.com /frame.html url suid dnsapi.dll DnsQuery_A DnsRecordListFree Global\ wuauserv SYSTEM CURRENT_USER Advapi32.dll ConvertStringSidToSidA IsWow64Process kernel32 shell32.dll SHGetKnownFolderPath wininet.dll InternetOpenUrlA HttpOpenRequestA InternetCloseHandle InternetConnectA InternetOpenA InternetSetOptionA InternetQueryOptionA HttpQueryInfoA HttpSendRequestA InternetReadFile HttpAddRequestHeadersA HTTP/1.1 POST Content-Length ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/ InprocServer32 setupapi.dll IsUserAdmin BITS b’kJ SHGetFolderPathW CoCreateInstance CoTaskMemFree CoInitialize CoUninitialize CoCreateGuid __dllonexit _onexit _XcptFilter _initterm _amsg_exit _adjust_fdiv WriteFile FlushFileBuffers LocalFree CreateFileW GetFileSize VirtualAlloc ReadFile VirtualFree GetModuleFileNameW lstrcpyW CreateMutexW GetLastError WaitForMultipleObjects GetExitCodeThread lstrlenW OpenMutexW WaitForSingleObject GetProcAddress GetModuleHandleA OpenProcess VirtualAllocEx WriteProcessMemory CreateRemoteThread VirtualFreeEx CreateToolhelp32Snapshot Process32FirstW lstrcmpiW Process32NextW GetCurrentProcess OpenEventW SetEvent Sleep ResetEvent lstrcatW MoveFileW MoveFileExW SetFilePointer SetEndOfFile ReleaseMutex GetModuleFileNameA DisableThreadLibraryCalls ExitProcess LoadLibraryW InitializeCriticalSection DeleteCriticalSection EnterCriticalSection LeaveCriticalSection GetSystemTimeAsFileTime 
FreeLibrary LoadLibraryA GetLogicalDriveStringsW GetDriveTypeW DeleteFileW GetTickCount GetCurrentThreadId CreateDirectoryW GetSystemTime SystemTimeToFileTime SetFileTime GetWindowsDirectoryA GetVolumeInformationA CreateProcessW OpenMutexA OpenEventA GetCurrentThread GetCurrentProcessId TerminateProcess TerminateThread CreateEventW WideCharToMultiByte HeapAlloc GetProcessHeap HeapFree SetFileAttributesW InterlockedIncrement InterlockedDecrement GetVersion lstrcmpiA lstrcpynW InterlockedExchange InterlockedCompareExchange RtlUnwind QueryPerformanceCounter UnhandledExceptionFilter SetUnhandledExceptionFilter KERNEL32.dll CallNextHookEx SetWindowsHookExA PostMessageA UnhookWindowsHookEx GetSystemMetrics USER32.dll OpenProcessToken LookupPrivilegeValueA AdjustTokenPrivileges RegCreateKeyExW RegDeleteValueW RegFlushKey RegCloseKey RegDeleteKeyW RegQueryValueExW RegSetValueExW RegOpenKeyExW SetSecurityInfo RegEnumValueW GetTokenInformation IsValidSid ConvertSidToStringSidW OpenSCManagerA OpenServiceA ControlService ChangeServiceConfigA AllocateAndInitializeSid CheckTokenMembership FreeSid InitializeSecurityDescriptor SetSecurityDescriptorDacl ConvertStringSidToSidA SetEntriesInAclA DuplicateTokenEx SetTokenInformation GetLengthSid SetThreadToken RegQueryInfoKeyA RegEnumKeyExA RegOpenKeyExA RegQueryValueExA CloseServiceHandle QueryServiceConfigA QueryServiceStatusEx StartServiceA ADVAPI32.dll LocalAlloc RaiseException _except_handler3 222.dll DllCanUnloadNow DllGetClassObject Software\Microsoft\Windows\CurrentVersion\Run Software\Microsoft\Windows\CurrentVersion\Explorer\Browser Helper Objects CLSID SYSTEM\CurrentControlSet\Control\Lsa Notification Packages Software\Microsoft\Windows NT\CurrentVersion\Windows AppInit_DLLs LoadAppInit_DLLs Software\Microsoft\Internet Explorer\Main Check_Associations Software\Microsoft\Windows\CurrentVersion\Ext\Settings Software\Microsoft SYSTEM\CurrentControlSet\Control\Session Manager PendingFileRenameOperations 
PendingFileRenameOperations2 Software\Microsoft\Windows\CurrentVersion\Explorer\ShellExecuteHooks Software\Microsoft\Security Center UpdatesDisableNotify Software\Microsoft\Security Center\Svc EnableNotifications EnableNotifications\Ref Software\Microsoft\Windows NT\CurrentVersion DigitalProductId RegisteredOrganization RegisteredOwner C:\WINDOWS\system32\renobuda C:\WINDOWS\system32\calc.exe C:\WINDOWS\system32\defariha.dll C:\WINDOWS\system32\defariha.dll C:\WINDOWS\system32\dadeyisi.dll C:\WINDOWS\system32\vofehafi.dll {721ee819-b263-42e0-a594-b82fd0f24bdf} Global\vimegolatiturew Global\nifuseguji C:\WINDOWS\system32\mrt.exe own1 hdn_dsk .uroledup.com .uroledup.com .?AVCDownloader@@ .?AVCUrlStorage@@ .?AUIObjectWithSite@@ .?AVCConBHO@@ .?AUIUnknown@@ .?AUIClassFactory@@ .?AVCFactory@@
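For anyone who wants to produce a similar list from a memory dump: this is the general “strings” technique, sketched in Python. It is not the tool used for the dump above; the function name and the minimum length of 4 are arbitrary choices. It collects runs of printable ASCII as well as runs of UTF-16LE text, since the W variants of the Windows API calls work on wide strings.

```python
import re

def extract_strings(data, min_length=4):
    """Return printable strings found in a bytes object.

    Narrow strings are runs of printable ASCII bytes. Wide strings are runs
    of printable ASCII characters that are each followed by a zero byte
    (UTF-16LE). Narrow hits are listed first, then wide hits.
    """
    narrow = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    wide = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_length)
    found = [m.group().decode("ascii") for m in narrow.finditer(data)]
    found += [m.group().decode("utf-16-le") for m in wide.finditer(data)]
    return found

if __name__ == "__main__":
    import sys
    # Usage: python strings_sketch.py <dump-file>
    with open(sys.argv[1], "rb") as handle:
        for text in extract_strings(handle.read()):
            print(text)
```

On a packed DLL this only shows something useful once it runs against the unpacked image in memory, which is why the list above comes from the decrypted memory image and not from the file on disk.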

Update 2: Ok, I got rid of it. Turns out there’s no root-kit; the DLL was simply marked as hidden (I feel stupid…). Killing all the threads off, preventing it from re-loading and then re-installing the Service-Pack seems to have gotten rid of it for good.


Filler items for Amazon Super Saver shipping

Being only a few cents shy of free shipping, I found the following: Picasso Art Stickers ($1.50 and eligible for Amazon’s Super Saver shipping). Also: a list of filler items from other people. Google can also help you find stuff for the exact amount. Has anybody found some somewhat useful items?


Automation of Science

Two interesting articles in Science: The Automation of Science is about a robotic system that autonomously generated functional genomic hypotheses about a yeast. The second article, Distilling Free-Form Natural Laws from Experimental Data, is about a system learning from physics experiments and deriving a hypothesis from the data (this is along the lines of the general idea I’ve written about in the past). Cool stuff.


Torpedo-Reviews in Machine Learning Conferences

Interesting post over at hunch.net about reviewers bidding for papers in order to shoot them down. Make sure to read the comments… That state of mind of some reviewers might explain why the least-informative and most negative reviews always come with the highest confidence rating in ML conferences (specifically NIPS).