
I am installing a fresh Ubuntu on our lab server. We have lots of massive genomes which need to be accessible to the Apache www-data user. I have backed all the data up on external drives. My goal is to do a fresh Ubuntu install, set up new web apps on it, and then import lots of old data so that Apache can serve it to users through these new apps. Users would also upload files. The priority is to keep things simple so that a future system administrator could easily catch up with how things work on the server. My current plan:

1) Have an in-lab person (I'm out of state) burn an Ubuntu ISO to CD, boot the machine from it, perform a basic Ubuntu installation, and set up SSH access for me. She would reformat the internal disk except for the /home folder, which is on a separate partition.

2) Migrate users from the old installation; manually clean unnecessary data out of the old /home folder, then replace the new /home folder with it.

3) Install LAMP, web apps, and other necessary software.

4) Create a /home/user/webdata folder and give the Apache user full permissions on it. Inside it, create an upload/ folder where website users would upload files. Next to it would be a genomes/ folder containing symbolic links to genomes physically located on the external drives. Apache would serve genomes to users from this folder. (A rough sketch of the commands I have in mind follows this list.)

5) Set up automatic backups of /home/user/webdata/ and put the whole thing online.
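To be concrete about steps 4 and 5, here is roughly what I have in mind. The "user" account name, the /media/genomes1 mount point, the hg19.fa filename, and the /media/backup destination are all placeholders, not our real paths:

    # Step 4: create the webdata tree and hand it to Apache (www-data on Ubuntu)
    sudo mkdir -p /home/user/webdata/upload /home/user/webdata/genomes
    sudo chown -R www-data:www-data /home/user/webdata

    # Symlink a genome from an external drive into the served folder
    # (/media/genomes1 and hg19.fa are made-up examples)
    sudo ln -s /media/genomes1/hg19.fa /home/user/webdata/genomes/hg19.fa

    # Step 5: nightly backup of webdata, e.g. via "sudo crontab -e"
    # (/media/backup is a placeholder for the real backup destination)
    # 0 2 * * * rsync -a --delete /home/user/webdata/ /media/backup/webdata/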

I don't have experience in system administration, so I have the following doubts:

a) Is keeping the data as described in step 4 inferior in any way? What would be the most common and efficient way to store and serve large genomes and user uploads? Should I have this webdata/ folder under /var/www/html instead? Or should I avoid symbolic links altogether and keep the genomes on the internal drive (under /home or /var)? One reason I don't like /var is that keeping everything under /home would be simple and safe.

b) Can any other steps be changed or added to make the process safer and more professional?

Thank you very much for the support, and let me know if I should provide any additional information.

zavidovych
    Can you provide some info on the hardware setup you're using? It sounds like you're storing your data on a lot of single disks. – ErnieTheGeek Jun 29 '12 at 16:41
  • Processor: Intel(R) Xeon(R) CPU X5460 @ 3.16 GHz (quad-core); Memory: 24.3 GB; Storage: 1 TB internal HD (2 partitions: /home and everything else), plus 4x2 TB external disks with genomic information. Thanks! – zavidovych Jun 29 '12 at 20:56

2 Answers


To me, the file structure of an uploads folder and a genomes folder sounds pretty standard, based on the web apps I've set up.

This is a really sysadmin-centric perspective, but while the organization of the file structure is important from a software/application standpoint, the physical setup will have a greater impact on redundancy, reliability, and performance - the things I would include when measuring the "professionalism" of a setup.

Some recommendations I might have:

1.) Buy a small NAS if you can. External drives don't have any redundancy, and speeds will vary, especially if you have multiple users reading and writing data on the same disk.

2.) Consider using mount points for externally attached data, and point Apache right at them. If you stick with the genomes/uploads structure, you could mount external storage directly onto those folders, or symlink to shares under /mnt (see the sketch after this list).

3.) Really think about reads and writes for your operations and the number of users you serve. If the genomes are large and you're going to have a lot of long, sequential reads, put that data on a separate volume/set of disks, keeping it apart from the more write-focused uploads folder. If you have to stick with single disks, or multiple individual disks, you could still split the data, putting genome data together on one set of disks and uploads on the other.
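To illustrate point 2, here's a minimal sketch of mounting an external disk directly at the served folder, assuming the disk already has a filesystem. Device names, the UUID, and the paths below are placeholders - check yours with lsblk:

    # Identify the external disk and its UUID
    lsblk -f

    # Mount the genome disk directly at the folder Apache serves
    sudo mount /dev/sdb1 /home/user/webdata/genomes

    # Make it permanent across reboots; "nofail" keeps boot from hanging
    # if the external drive happens to be unplugged
    echo 'UUID=xxxx-xxxx /home/user/webdata/genomes ext4 defaults,nofail 0 2' \
        | sudo tee -a /etc/fstab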

Univ426
  • Thanks a lot, this is very helpful! I'll probably mount an external HD to /home/user/webdata/genomes and keep /home/user/webdata/uploads on the internal HD, since uploads won't be very large. This way I'd keep everything within /home, wouldn't use any symbolic links, and could also detach the genomes at any time. And yes, a NAS is definitely worth considering. – zavidovych Jun 29 '12 at 21:02

Like John says, from a sysadmin perspective, the physical setup is more important than the "organization" of the files and folders, because that has the biggest impact on the things sysadmins care about - reliability, performance, scalability, manageability, monitoring, redundancy, DR/backups, etc.

The idea of getting something set up "right" and migrating users over is a good one. The first thing I'd do is get the data onto a RAID array, so you don't lose data or have downtime when a drive inevitably fails. I'm a proponent of hardware RAID, but Linux software RAID isn't completely horrible either - you're looking to add some level of redundancy at the server level and improve uptime. (And speaking of uptime, I hope there's a UPS feeding this server...)
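If you do go with Linux software RAID, the rough shape of it is below - a sketch only, with placeholder device names and mount point, and something you'd do on empty disks before loading data onto them:

    # Mirror two disks (RAID1); /dev/sdb and /dev/sdc are placeholders
    sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc

    # Put a filesystem on the array and mount it wherever the data will live
    sudo mkfs.ext4 /dev/md0
    sudo mount /dev/md0 /home/user/webdata/genomes

    # Persist the array definition and check its health
    sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
    cat /proc/mdstat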

Next, I'd set up a secondary server of some sort for this function, since this sounds customer-facing. In order of preference, I'd try to set up a cluster, or a failover, or even a hot-spare server (a server that's ready and waiting to be pressed into service if/when the original dies). Having data redundancy won't help when the power supply dies or your motherboard shorts out, etc.

Finally, a backup solution, which will vary widely based on your needs and constraints. If you can set up tape backup, or disk-to-disk backups on an array big enough to provide a reasonable data-retention period, that's great. If not, even a small consumer-grade NAS or two is better than nothing. Worst-case scenario, in situations with no budget, I've kept backups of important servers on my workstation drive, on consumer-grade external USB drives, and even on spindles of DVD-Rs. The important thing is to make sure you have some level of data retention. Having pristine backups from the previous night does you no good when you discover data corruption that started last week, or that you got rooted a month ago.
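As one cheap, concrete way to get that retention: rsync snapshots with hard links keep dated copies where unchanged files cost almost no extra space. A sketch you could run nightly from cron, with all paths as placeholders:

    #!/bin/bash
    # Dated snapshot backup; unchanged files are hard-linked, not re-copied
    SRC=/home/user/webdata/
    DEST=/media/backup/webdata
    TODAY=$(date +%F)

    # Copy against the previous snapshot, then repoint "latest" at today's
    rsync -a --delete --link-dest="$DEST/latest" "$SRC" "$DEST/$TODAY"
    ln -sfn "$DEST/$TODAY" "$DEST/latest"

    # Retention: prune snapshots older than 30 days
    find "$DEST" -maxdepth 1 -type d -name '20*' -mtime +30 -exec rm -rf {} +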

HopelessN00b
  • Thanks a lot, I'll definitely look more into adding redundancy and retention. I don't think we can easily afford a hot-spare server right now, but RAID and a NAS are great options. – zavidovych Jun 29 '12 at 21:09