[Comp-neuro] Massive Brain Simulations - A / Part 1 of 7

Ovidiu Anghelidi raidvvan at yahoo.com
Wed Jun 29 03:35:11 CEST 2011


Dear Sir or Madam,

My name is Ovidiu Anghelidi and I ran the world's largest brain simulation on over 14,000 computers using the BOINC distributed computing architecture (see the following Discovery Channel article: http://news.discovery.com/tech/cat-brain-computer-hype.html). Over the next seven messages I am going to present my work and results, which span more than a decade.

Part 1. Massive Brain Simulations - A
Part 2. Massive Brain Simulations - B
Part 3. Neuron Types
Part 4. Object Types - Cellular and Informational
Part 5. Neuronal Generation
Part 6. Visualization
Part 7. Basic mechanisms of knowledge representation

Over the past decades there have been very few massive simulations that surpassed the one-billion-neuron mark, mainly because of the volume of effort involved and the lack of theoretical findings; a simulation of a few thousand neurons can provide the same insights, so there is little justification for going larger. I nevertheless found value in building such a simulation and pushing the knowledge frontier further.

Most of the existing biophysical simulators (e.g. Neuron, Genesis, BNSF_1_0, BrainSim, Catacomb, HHSim, NCS5, NEUROFIT, Nesim, Conical, Lifnet, Neurocad, Nc, ECell, SurfHippo, Neuralc, Neurocuda, BioSim, Mvaspike, Nengo, Nodus, Neosim...) perform their simulations in memory. Some of them can scale by using communication protocols like MPI/PVM and distributing the computations over many machines, but very few use databases for storage, and that is a limitation. While HDF5 was proposed by the INCF, I chose SQLite, a public-domain embedded database (see SQLite.org). I stored over 100 terabytes of neuronal data during the distributed computing version of the simulation. I designed and developed the simulator (over 60,000 lines of code) on Linux. SQLite supports large transactions, and I can group in excess of 250,000 SQL statements in a single transaction, using dynamic strings, to update a single file.
I created a novel algorithm that supports unlimited data storage per machine by splitting tables over many database files and creating a hierarchical structure. I imposed a soft limit of 64 MB per database file, as seen in Google's BigTable.
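
As a rough sketch of the transaction batching described above (in C, against the standard SQLite API; the table name and columns are illustrative, not the simulator's actual schema), grouping many statements between BEGIN and COMMIT lets SQLite write them in a single journal cycle:

    /* Sketch: batch many SQL statements in one SQLite transaction.
       The table and column names are illustrative only. */
    #include <stdio.h>
    #include <sqlite3.h>

    int batch_insert(sqlite3 *db, long n_rows)
    {
        char sql[128];
        long i;

        /* One journal cycle for the whole batch. */
        if (sqlite3_exec(db, "BEGIN TRANSACTION;", 0, 0, 0) != SQLITE_OK)
            return -1;

        for (i = 0; i < n_rows; i++) {
            sprintf(sql, "INSERT INTO property_values (object_id, value)"
                         " VALUES (%ld, 0.0);", i);
            if (sqlite3_exec(db, sql, 0, 0, 0) != SQLITE_OK) {
                sqlite3_exec(db, "ROLLBACK;", 0, 0, 0);
                return -1;
            }
        }
        return sqlite3_exec(db, "COMMIT;", 0, 0, 0) == SQLITE_OK ? 0 : -1;
    }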

I partitioned the entire neuronal 3D space into zones (e.g. neocortex, ventral thalamus, dentate gyrus, hippocampal CA1-4, ...), and each zone into cubes. A cube measures 10,000 um across and can have a variable number of layers (e.g. six for neocortex, three for dentate gyrus) based on data found in the literature. Each cube contains a specific number of neurons generated using distributions found, again, in the literature. A zone can have an unlimited number of cubes. Storing clinical data for a large number of patients would be feasible.
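
A minimal sketch of this partitioning as C data structures (the field names are my assumptions, not the simulator's definitions):

    /* Illustrative zone/cube partitioning structures. */
    typedef struct {
        double x, y, z;       /* cube origin in the 3D space, in um */
        int    num_layers;    /* e.g. 6 for neocortex, 3 for dentate gyrus */
        long   num_neurons;   /* drawn from distributions in the literature */
    } Cube;

    typedef struct {
        int   zone_id;        /* e.g. neocortex, ventral thalamus, ... */
        long  num_cubes;      /* a zone can have an unlimited number of cubes */
        Cube *cubes;          /* dynamically grown array */
    } Zone;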

I imported the entire NeuroLex.org database and assigned an integer ID to each name. I called that a bio-type and used it as a reference throughout the application.
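
A sketch of the lookup this implies, as SQLite DDL embedded in C (the table and column names are hypothetical):

    /* Hypothetical bio-type table: each NeuroLex name is assigned an
       integer ID that the rest of the application uses as a reference. */
    static const char *create_bio_types =
        "CREATE TABLE IF NOT EXISTS bio_types ("
        "  id   INTEGER PRIMARY KEY,"
        "  name TEXT UNIQUE NOT NULL);";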

I used object models for storing the neuronal data (i.e. cellular, sub-cellular and macro-molecular mechanisms) in a single simulation platform. I did not use the neuron as the single unit of computation; instead I used an abstract object that can contain any properties and that also holds a list of interactions with other objects and the properties shared in those interactions. I assigned long-integer IDs to the object type of each mechanism (cellular or informational) and also to each object property.
For an injection current (object type 10) that is interacting with a sigmoid neuron (object type 19), we have an interaction with one property value used: the injection current value.
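
A minimal sketch of this object model in C (all identifiers here are illustrative):

    /* An object is a bag of typed properties plus a list of interactions
       that name the properties shared with other objects. */
    typedef struct {
        long   property_id;      /* e.g. the injection current value */
        double value;
    } Property;

    typedef struct {
        long other_object_id;    /* the object interacted with */
        long shared_property_id; /* the property used in the interaction */
    } Interaction;

    typedef struct {
        long         object_id;
        long         object_type;    /* e.g. 10 = injection current,
                                        19 = sigmoid neuron */
        int          num_properties;
        Property    *properties;
        int          num_interactions;
        Interaction *interactions;
    } SimObject;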

I used different time steps for different objects. An injection current that has a positive value at time step 500 ms and that stays constant for the next 20 ms needs to be computed only at time steps 500 and 520, not in between. Some objects can have constant property values over an interval of time.
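
A sketch of this per-object scheduling in C (the names, and the hold_interval helper, are hypothetical):

    /* Each object carries the next time at which it must be computed;
       objects whose property values are constant are skipped in between. */
    typedef struct {
        long   object_id;
        double next_update_ms;   /* 500.0, then 520.0 in the example above */
    } Schedule;

    double hold_interval(long object_id);   /* hypothetical: how long the
                                               object's values stay constant */

    void step(Schedule *s, int n, double now_ms)
    {
        int i;
        for (i = 0; i < n; i++) {
            if (s[i].next_update_ms <= now_ms) {
                /* compute_object(s[i].object_id, now_ms); */
                s[i].next_update_ms = now_ms + hold_interval(s[i].object_id);
            }
        }
    }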

I used detailed compartmental models. 

UNLIMITED STORAGE ALGORITHM

(file)                              ENTITY (ROOT)
                                         |
(records)                           (SIMULATION)
                                         |
                    +--------------------+--------------------+
                    |                                         |
(files)   FILE OBJECT ENTITIES #1               FILE OBJECT ENTITIES #2
              |           |                         |           |
(records) OBJECTS #1   OBJECTS #2               OBJECTS #3   OBJECTS #4

Every time a new file is created, a record for it must be inserted in the file above it (i.e. its parent contains entries for all files at the same level); the rule applies to entity files as well.

The file tree is a true tree structure: no node (i.e. file or record) belongs to more than one parent.

The parent-children relation replaces the single foreign-key relation of a classical database model; it can also be thought of as a summary-details relation.

We always extend higher in the hierarchy with entities. As soon as a file is created, a record for it must be entered in a file above it, allowing unlimited growth in tree height. If the root file exceeds its limit we can assign a new UUID to it and define another file as root at the next level up.

ENTITY FILES WILL BE CREATED ON DEMAND. The advantages: this minimizes the number of files that must be created and opened to reach a file or record lower down the tree, and it also minimizes concurrent access. If we created them all at the beginning we would end up creating many more files than needed, because not all relations require entity files from the outset. The [Sensor] - [Sensor Data] relation is transformed into a [Sensor] - [Entity] - [Sensor Data] relation only once there are more than 100,000 records in the [Sensor Data] file, as sketched below.
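
A sketch of that on-demand split in C (the 100,000 threshold follows the text; the helper functions are hypothetical):

    #define MAX_RECORDS_PER_FILE 100000L

    long create_entity_file(long parent_id);              /* hypothetical */
    void reparent_sensor_data(long sensor_id, long eid);  /* hypothetical */

    /* When a sensor's data file passes the record limit, interpose an
       entity file: [Sensor] - [Entity] - [Sensor Data]. */
    void maybe_split(long sensor_id, long record_count)
    {
        if (record_count > MAX_RECORDS_PER_FILE) {
            long entity_id = create_entity_file(sensor_id);
            reparent_sensor_data(sensor_id, entity_id);
        }
    }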

Entity files contain a list of file names and their address locations. We keep the sequence ordered so that we can easily change an address location if needed. We start the ordering from the file upward so that we can use the address ID of the file as the parent ID for the records that belong to the same location. Having Entity_ID and Node Order allows us to expand an address by adding more entries if required (e.g. a rack field, or a subdomain location field).

The UUID of the file name and the Information.[Parent ID] property referenced from the parent file form a pair of values that is a UNIQUE address.

The Entities and Entity_Addresses tables allow us to define multiple entities pointing to the same address, thus conserving space.

The root entity is a separate entity file type; we can easily add entities on top of it and have another file become root if we need to extend above its size limit.

The Parent_ID and Parent_Entity_ID properties in the Information table make the Entity_Parents table obsolete, because we have a true tree structure and no file or record can have more than one parent. Parent_ID and Parent_Entity_ID act like a foreign-key relationship: they identify the parent record and the file in which it can be found. Parent_ID can point to an entity_id in a file above, if entities exist above for that file type.

The Entity_Children table lists the entity files in which a main-table record has data. Because the main table contains a list of records, and each record may have children, the [Record_ID] field defines the ID that is the parent of all the Entity_Children entries.

We use one file type per Entity_Children entry: the Child_Type listed in the Entity_Children table identifies it for each main-table record.
    Simulations.db file
    -----------------------------------------------------------------------
        Simulations table
        -------------------------------------------------------------------
            ID
            1 (simulation)

        Entity_Children table
        -------------------------------------------------------------------
            ID   Simulation_ID   Child_Type      Entity_ID
            1    1               2 (object)      1
            2    1               5 (time step)   2
            3    1               10 (sensor)     4
            ...............................................................
                 (other child types like controller, variation...
                  for the same simulation ID 1)
            ...............................................................
            4    1               2 (object)      5
            5    1               5 (time step)   7
            6    1               10 (sensor)     9
            ...............................................................
                 (other child types like controller, variation...
                  for the same simulation ID 1)
            ...............................................................
    Entities.Entity_Type can be either an entity type or some other file type
    (e.g. objects, sensors...).

	Entities.db file that has a sensor parent right above it
	-----------------------------------------------------------------------
		Entities table 
		-------------------------------------------------------------------
			ID 	Type		
			1 	19 (sensor data)
			2 	19 (sensor data)
	-----------------------------------------------------------------------

    Entities.db file that has a top sensor parent and an entity parent right above it
	-----------------------------------------------------------------------
		Entities table 
		-------------------------------------------------------------------
			ID 	Type		
			1 	1 (entity)
			2 	1 (entity)
	-----------------------------------------------------------------------

For an entity file type we do not need the Entity_Children table, because the children are listed in the Entities table.

ENTITIES FILE:
    Entity_Type is the File_Type. Each entity file contains references to a single file type only
    (e.g. entities for objects, entities for logs...).
    We have one entity for each entity type defined in the references table (i.e. an Entity_ID
    for objects, an Entity_ID for logs...).

    Each entity file belongs to one record only. If we have a file with 100,000 sensors
    we would need 100,000 entities for 100,000 sensor data files: one entity for each sensor's
    data. That may not always be the case, due to the on-demand entity creation, but if we had
    a lot of data for all the sensors we would end up with one entity file per sensor whenever
    a sensor has at least two sensor data files.

SIMULATIONS FILE:
    Parent_ID is the Entity_ID of the root/entity file above, if we defined a root file.

    Data consolidation is required when the number of entities referenced by a Simulation_ID is
    too large. If a Simulations file has, for one Simulation_ID, 1,000 entries in the Entities
    table of the same file, we can create a new entity file that points to those 1,000 files
    and have the Simulation_ID point to that one entity file instead of the 1,000 files.

    For all files that, like simulations, reference other data files (e.g. simulations
    reference time steps and objects; objects reference graphics and interaction objects) we define
    entity files by type only; each entity hierarchy is for a specific file type only
    (i.e. graphics will have entities that contain references to graphics files only). We
    end up with x additional entity files, where x is the number of reference types for each file type
    (i.e. if simulations reference objects and time steps, x = 2; if objects reference interactions,
    graphics and histories, x = 3). That is a small space cost, and we gain order in the entity
    hierarchies.

Sensors
---------------------
ID | Sensor_Type_ID |
---------------------
1  | 1              |
2  | 1              |
---------------------

Entities
-----------
ID | Type |
-----------
1  | 1    |
2  | 1    |
-----------

Parent_ID and Parent_Entity_ID properties point to either:
	an entity record and entity file above 
		or 
	a record id and record database file

Entity_Children
-----------------------------------------
ID | Sensor_ID | Child_Type | Entity_ID |
-----------------------------------------
1  | 1         | 19         | 3         |
2  | 1         | 19         | 5         |
3  | 2         | 19         | 7         |
4  | 2         | 19         | 8         |
-----------------------------------------
Child_Type is 19 for a Sensor Data type file.
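
The three tables above could be sketched as SQLite DDL (the column names are taken from the example; the types are assumptions):

    static const char *schema =
        "CREATE TABLE Sensors ("
        "  ID INTEGER PRIMARY KEY,"
        "  Sensor_Type_ID INTEGER);"
        "CREATE TABLE Entities ("
        "  ID INTEGER PRIMARY KEY,"
        "  Type INTEGER);"             /* file type the entity refers to */
        "CREATE TABLE Entity_Children ("
        "  ID INTEGER PRIMARY KEY,"
        "  Sensor_ID INTEGER,"         /* parent record in the main table */
        "  Child_Type INTEGER,"        /* 19 = sensor data file */
        "  Entity_ID INTEGER);";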



CROSS-PLATFORM COMPATIBILITY
I used the C language with ANSI specifications; I ran the application on Linux, Windows, Mac OS, Unix and other operating systems using BOINC.

ZERO ENTITY COLLISION
I stored the names of database files in UUID format in order to prevent collisions between files defined across different machines.

UNIVERSAL DATA STORAGE
I used a file addressing system.

ZERO DATABASE ADMINISTRATION
I used an embedded database for data storage.

ZERO DATA LOSS
I used ACID-compliant transactions that protect against hardware and software crashes; I can replicate the same data across multiple machines.

FAST DATA ACCESS
I used non-linear data access via B-trees, database connection caching and RAM-disk storage.

ZERO DATA CONCURRENCY
I used one writer and multiple readers for each database file.

HISTORICAL DATA STORAGE
I stored in the Property_Values table only the current property values that have changed since the last time step; some object property values never change after object creation, while others change at every time step (e.g. membrane voltage); the History_Property_Values table stores the property values that have changed over time.
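
A sketch of that two-table layout as SQLite DDL (the two table names come from the text; the columns are my assumptions):

    static const char *history_schema =
        "CREATE TABLE Property_Values ("          /* current values only */
        "  Object_ID INTEGER, Property_ID INTEGER, Value REAL);"
        "CREATE TABLE History_Property_Values ("  /* values changed over time */
        "  Object_ID INTEGER, Property_ID INTEGER,"
        "  Time_Step INTEGER, Value REAL);";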

ZERO DATA CORRUPTION IN MEMORY
I used Valgrind to identify memory leaks.

CELLULAR MODELLING SCALABILITY
I used generalized data structures for the storage of any cell model.

UNIVERSAL INTEGRATION METHODS
I used multiple integration methods (fourth-order Runge-Kutta, backward Euler...) when simulating different objects.
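
As an illustration, here is a generic fourth-order Runge-Kutta step for dy/dt = f(t, y) in C, shown for a single state variable (e.g. a membrane voltage); this is the textbook method, not the simulator's exact code:

    typedef double (*Deriv)(double t, double y);

    double rk4_step(Deriv f, double t, double y, double h)
    {
        double k1 = f(t,           y);
        double k2 = f(t + h / 2.0, y + h * k1 / 2.0);
        double k3 = f(t + h / 2.0, y + h * k2 / 2.0);
        double k4 = f(t + h,       y + h * k3);
        return y + h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0;
    }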

ONE INSTALLATION STEP
I used a self-contained executable that does not require external libraries; setup requires file copy only and minimal configuration.

UNIVERSAL GRAPHICS PLATFORM
I used OpenGL for visualization.

SINGLE-THREADED MULTIPLE PROCESSES
I used a single-threaded model for the application but ran multiple processes.

ZERO APPLICATION CHANGES FOR NEW CELL MODELS
I used a modular architecture.

SINGLE POINT OF CONTROL
I used a single point of control for the simulation.

COMPUTATIONAL PROCESSING SCALABILITY
I used automatic process creation and destruction based on each machine's available resources; the application can scale up to the maximum available 
resources by using a manager-type application.

ZERO DOWNTIME
I keep the core system working even while performing updates.

UNIVERSAL DATA VISUALIZATION
I can visualize cellular data on any machine and at any level, from macro-molecular to system level.

INTER-PROCESS COMMUNICATION
I communicate between processes on different machines using TCP/IP sockets (stream sockets) and between local processes using shared memory and semaphores.

KNOWLEDGE REPRESENTATION MECHANISMS
I employed the answer to "What is abstraction?".

I do apologize for using this list instead of sending the information to a journal, but somehow my background, or the lack of it, doesn't allow me to do that.

Thank you.


