Wednesday 20 June 2012

Another BIOS setting for IBM X Series ESX Host

I keep getting these errors:


0x806F050C  Error: Memory device X (DIMM X Status) correctable ECC memory error logging limit reached.  [Note: X = 1-12]


The suggested fixes aren't all that helpful, as it takes a long time for these errors to occur, so moving the memory to another slot to confirm whether the problem is with the DIMM or the slot is impractical.

A colleague helpfully remembered a problem on HP hosts that sounded similar. He got me looking and I found this BIOS/IMM setting:

Changing "Normal" mode to "Performance" mode affects the way that the DIMMs are refreshed. This results in a DIMM temperature message occurring at a 10-degree-lower temperature.

This article is not about my X3650, but IBM has verbally confirmed it applies to my server:


Change Thermal Mode setting (preferred method)
  1. Boot the blade into the F1 "System Configuration and Boot Management" screen. Highlight "System Settings." Press Enter and select Memory. Select Thermal Mode and change the setting to "Performance."
  2. Press the Esc key twice to get to "System Configuration and Boot Management" and then select Save Settings and Exit Setup.
  3. Follow the instructions on the next screen to exit the "Setup Utility."
  4. Power the blade off for the changes to take effect and restart.
Changing "Normal" mode to "Performance" mode affects the way that the Dual In-Line Memory Modules (DIMMs) are refreshed. This results in a DIMM temperature warning message occurring at a 10 degree lower temperature. This causes no impact in most industry standard data centers.


Again, I don't have a blade, but I seem to have guessed correctly that they run the same code on the X Series.  Odd that I haven't found much about this online.  It should be in a best-practices document for IBM servers, maybe even a vSphere document.  Props to "VTSUkanov" for finding and posting about this on the VMware forums.


OSSV Pre-exec (and Post-exec scripts)

NetApp Management Console:
    Protection:
      Overview (select the policy, e.g. OSSV, that you copied from a template)
        Edit
          Nodes and Connections
            Primary Data tab

Section:

Backup Script

Entries:
Path: c:\temp\ossv_vl112_test.bat
Run As: (left blank)

Oddly, it runs this script twice, once before and once after.  Silly; like every other proper backup software, they should have a field for "pre" and a field for "post".  Very unprofessional of NetApp not to document this better, methinks.  Or does it run four times?

My script echoed that the variable DP_BACKUP_STATUS is set to four different things, one for each of the four times the script gets run by my DFM OSSV backup job:


DP_BACKUP_STATUS=DP_BEFORE_TRANSFERS
DP_BACKUP_STATUS=DP_AFTER_TRANSFERS
DP_BACKUP_STATUS=DP_AFTER_BACKUP_REGISTRATION
DP_BACKUP_STATUS=DP_BEFORE_PRIMARY_SNAPSHOTS
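Since DFM hands the same script to every stage, the obvious workaround is to branch on DP_BACKUP_STATUS. Here's a minimal sketch: the stage names are the ones my test script logged, but the actions are just placeholders you'd swap for your own commands.

```shell
#!/bin/sh
# Hypothetical dispatcher for the single DFM "Backup Script" field.
# DFM sets DP_BACKUP_STATUS before each of the four invocations.
dispatch_stage() {
    case "$1" in
        DP_BEFORE_PRIMARY_SNAPSHOTS)  echo "pre: quiesce the database" ;;
        DP_BEFORE_TRANSFERS)          echo "pre-transfer: nothing to do" ;;
        DP_AFTER_TRANSFERS)           echo "post-transfer: unquiesce the database" ;;
        DP_AFTER_BACKUP_REGISTRATION) echo "post: clean up and log" ;;
        *)                            echo "unknown stage: $1" ;;
    esac
}

dispatch_stage "$DP_BACKUP_STATUS"
```

One script, four invocations, four different behaviours, which is as close to proper pre/post fields as the single field allows.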

Maybe it's different when it's scheduled versus run with "Protect Now".

The new version of DFM must have changed the variables, as they used to expect:


DP_BACKUP_STATUS=DP_BEFORE_SNAPSHOTS


and the post will have

DP_BACKUP_STATUS=DP_AFTER_SNAPSHOTS


Ah, thanks to Marlon on the NetApp forums:

c:\DFM_scripts\ssh_ossv_hostname_pre.sh (runs the ssh to quiesce database)

c:\DFM_scripts\

Still, I don't understand what this bit from the OSSV FAQ is on about:

Q: Does the pre/post scripting capability in DFM work with OSSV?
A: Yes, you can use the DFM pre/post script ability to run commands on the host prior
to, or following an OSSV transfer.  The scripts are installed on the DFM server using
a “zip” file.  The “zip” file must contain the script (in PERL), and a XML File named
package.xml.  The package.xml file must include packaging information (version, file
name…) and the privileges needed to run the script.  Once the “zip” file has been
created, it can be imported into DFM and ran either manually or via a schedule set in
DFM.

Limitations, Limitations, Limitations:

There's only one pre-exec/post-exec script field for each backup job, even though it makes sense to back up about 10 OSSV clients per job.  Plus, the backup job only runs a script on the DFM server, and obviously the OSSV clients need their databases quiesced.  That means you need to set up ssh and an ssh-key relationship between the DFM server and the OSSV client, and have the DFM script launch a script remotely on the OSSV client by way of ssh.  Whew!

I installed Cygwin for ssh, the ssh keys, and a Cygwin shell script to kick off the Linux script on the OSSV client from the DFM script, by way of OSSV in DFM.  Again, whew!
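Roughly, the DFM-side wrapper ends up looking like the sketch below. This is not anything NetApp documents: the host name, key path and remote script paths are all placeholders for your own setup, and the real ssh call is left commented out.

```shell
#!/bin/sh
# Sketch of the Cygwin wrapper the DFM job calls.  The host name, key
# path and remote script paths are placeholders, not NetApp defaults.
remote_cmd_for_stage() {
    case "$1" in
        DP_BEFORE_PRIMARY_SNAPSHOTS) echo "/usr/local/bin/db_quiesce.sh" ;;
        DP_AFTER_TRANSFERS)          echo "/usr/local/bin/db_unquiesce.sh" ;;
        *)                           echo "true" ;;   # no-op for other stages
    esac
}

OSSV_HOST=ossv-client01                 # placeholder OSSV client name
CMD=$(remote_cmd_for_stage "$DP_BACKUP_STATUS")
# With the ssh-key relationship in place, the real call would be:
# ssh -i /home/dfm/.ssh/id_rsa root@"$OSSV_HOST" "$CMD"
echo "would run on $OSSV_HOST: $CMD"
```

Quiesce before the primary snapshots, unquiesce after the transfers, and do nothing for the other two stages.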

Another limitation is that the OSSV backup environment variables only seem to track which stage of the OSSV backup job initiated the script.  I can find nothing about the name of the backup job, the OSSV client currently being backed up, or anything else I could use in my scripts to differentiate.  What we need to avoid is the same script running on all clients, needlessly and repeatedly.

CygWin & Windows Not Playing Nicely

You might use PowerShell to avoid some of this, but does PowerShell do ssh?  I had loads of problems whenever I copied my Cygwin shell scripts using Windows (drag and drop in Explorer, or a copy from the DOS command line), as that changes the line endings (carriage return/line feed).  The same happens if a DOS batch file is edited in vi or copied from a Cygwin shell.  Ouch!  This ate much of my time.  Don't let it get you.
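For the record, the cure is to strip the carriage returns before the shell sees the script; `tr -d '\r'` works everywhere, and `dos2unix` does the same job if it's installed in your Cygwin:

```shell
#!/bin/sh
# Strip the DOS carriage returns that Windows copies and editors add.
fix_line_endings() {
    tr -d '\r' < "$1" > "$1.unix" && mv "$1.unix" "$1"
}

# Demonstration on a throwaway file with CRLF endings:
printf 'echo hello\r\necho world\r\n' > /tmp/crlf_demo.sh
fix_line_endings /tmp/crlf_demo.sh
```

After this the script runs cleanly under any Unix-side shell.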

Monday 18 June 2012

I'm on slide 28 of 56 of the Networker Overview eLearning course

I'm on slide 28 of 56 of the Networker Overview eLearning course.  I had to take a break to avoid going crazy, so I thought I'd post my thoughts about this format and EMC's use of it.

When you log in to the account with the credits for the eCourse that has been booked, the EMC site has a link under Education for you to click on to get to the course.  Weirdly, when I clicked on that it just took me back to the main page of the site again (though it seems I was logged in with a temporary, system-generated username, perhaps in a virtual session of some kind).  There was no pointer, tip or explanation, but I guessed to navigate to the same area again.  This time it started "Saba" in my browser (I only tried Firefox) and gave me access to the course and the accompanying PDF to download.

Ok, the content is worthwhile.  They have information here that is more detailed and exhaustive than what I can find on their forums.  The info is slightly more user-friendly than the manuals.

But they really missed an opportunity to make this great and the envy of the IT world of training courses!  Come on EMC you're big enough, you have enough money.  You should be able to make these eCourses shine and rock!

What Sux:
The voice actors hired to read out the text of the materials obviously don't know or care anything about the material they're reading.  They sound like robots.  It's not quite as bad as text-to-speech programs, where you get a robot voice reading out your words with no clue about inflection or emphasis, but there isn't much more humanity here.

It's surprisingly like a bunch of PowerPoint slides with someone reading out the text, waiting for you to click Next so they can read out the next slide, and so on.  They actually say, "This module covers the topics on this slide.  Please take the time to familiarise yourself with the topics on this slide."  Man, that's cheating.


What They Should Do

They should aim to make these as good as real courses with real instructors.  They should add some personal anecdotes to help get the concept across, add some humour or at least humanity!

They should have someone draw some diagrams on a whiteboard and let us see what he's drawing while hearing him explain it.  Again, aim for the best part of real instructor-led courses and see how close you can come.

They should take some useful questions from the dozens and dozens of courses that have been held already, and interject them into the course.  Questions and answers.  See it as a way to review the material or explain it in a new way.  This takes thought and brains, but it's what would set apart anyone who did courses like this from rubbish like what I'm enduring right now.

VMware has surpassed their parent company EMC with their VMworld presentations, which are available online.  You can hear the presenter, a real human, someone with experience in the product and passion for the topic, while seeing the slides they're showing.  In a way it's lower tech, as I've not seen any animations, but the things I've mentioned in this paragraph far outweigh the difference between a Flash animation in the EMC eCourses and the simple PowerPoint-plus-audio used in VMworld presentations.

Friday 15 June 2012

Snapshots, Deduplication and QTrees

Having not been on a NetApp course yet, it makes it "fun" trying to understand how these three concepts work together:  Snapshots, Deduplication and QTrees (not to mention Volumes and SnapVault).

So here are my notes, and a place to write down anything I might figure out.

https://kb.netapp.com/support/index?page=content&id=1010363

https://library.netapp.com/ecm/ecm_get_file/ECMM1278402 (PDF)

The OSSV FAQ is not as easy to find as I'd expect.  It's got some good stuff, like Windows System State backup (registry, AD, etc.), excluding files/paths in OSSV backups, and:

Q: Anytime I restore even a single file, I have to perform a full baseline or
reinitialize of my primary file system?
A: No.  If you run a full D/R restore, you need to re-initialize.  If you drag/drop a file,
then it should behave reasonably. In that case, it's just as if the user
created/modified a file.  


Q: Is OSSV 2.2 able to address backing up Operating Systems?
A: OSSV 2.2 can backup Windows 2000/2003 but not the Unix platforms.


Q: How does OSSV actually transfer data from primary to secondary
system?
A: Data is moved via TCP/IP network using TCP port 10566. The communications
protocol is QSM (based on Qtree-SnapMirror). This is not to be confused with NDMP
protocol. NDMP is used by NDMP-based management applications (DFM) for
management and control of the SnapVault primary and secondary systems. The
NDMP TCP port is 10000.


So, how do snapshots keep straight the different hosts' data being SnapVaulted, since the snapshots are at the volume level?  A different snapshot must be created for each host/qtree, but how many SnapVault relationships can exist on one volume at a time?

Just posted a question on the NetApp forums, as I'm not having any luck creating OSSV relationships like the ones in the diagram above.

Wednesday 13 June 2012

Troubleshooting Checklist

We can't guess which order will get to the solution quickest, but here's a stab at some things to remember:

1)   What does the error say?  (I can jump to conclusions and miss the clues right in front of my eyes.)
      If it says the host file is missing an entry, then it might be.

2)   What has changed since things were working?  Look for the culprit to be the thing that was changed 5 minutes ago or last week before trying every single link in the chain.

3)   Which component could hold the root cause?  If other network connections are fine, then it's not the entire network (but it might be one port, one module, one switch or one data center that owns the problem).

4)   Draw a picture.  It's all about isolating what is not wrong: when you rule out all but one thing, the one remaining thing is the culprit!  So draw a picture to make clear the way the components connect and to see it visually.  Remember #1 above: try not to make assumptions.  That's where we usually miss the cause of the problem; it hides within the sensible, understandable, but incorrect assumptions!

5)   Two heads are better than one.  It might just be that explaining and drawing the problem for a colleague, which forces you to explain it simply and think clearly, will lead you to the "Aha!" moment.  Or they might see something you've missed, or get lucky where you're unlucky.

This can all be very frustrating.  Take a step back and try not to get angry.


People wouldn't like you when you get angry.