Thread: Extremely heavy I/O, forum slowness

  1. #1
    Simetrical, Former Chief Technician

    The forum seemed awfully slow, so I poked around a bit. I found unexpectedly high bo (blocks written per second) on vmstat, so I looked at iostat -kx 5. The output I got was like this:
    Code:
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               4.64    0.00    0.23   31.56    0.00   63.57
    
    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
    sda               0.00     4.30    0.60   12.80     6.00  5422.80   810.27    14.13 1728.73  21.34  28.60
    sdb               0.00     4.30    0.30   18.90     4.80  7608.00   793.00    94.38 5554.69  52.08 100.00
    That says that during that sampling interval, each disk was being written to at over 5 MB per second (wkB/s is a per-second rate), the second disk was 100% utilized, and the average wait for requests to complete (await) was 1.7 seconds on sda and 5.5 seconds on sdb! That's seriously problematic. svctm (the average time spent actually servicing a request) was reasonably low, so I'm not sure what that means. The service time multiplied by the number of requests doesn't seem to add up to anywhere near the await.
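
    The figures above can be cross-checked with a little arithmetic. The sketch below is only an illustration: it hard-codes the two device rows from the output, and it assumes the usual reading of the columns, namely that %util is roughly (r/s + w/s) × svctm and that await counts time spent queued while svctm does not. The names `ROWS`, `INTERVAL_S` and `summarize` are made up for this example.

```python
# Device rows copied from the iostat -kx 5 output above.
ROWS = {
    "sda": {"r_s": 0.60, "w_s": 12.80, "wkB_s": 5422.80,
            "await_ms": 1728.73, "svctm_ms": 21.34},
    "sdb": {"r_s": 0.30, "w_s": 18.90, "wkB_s": 7608.00,
            "await_ms": 5554.69, "svctm_ms": 52.08},
}
INTERVAL_S = 5  # the "5" in "iostat -kx 5"


def summarize(row):
    """Derive totals and estimates from one iostat device row."""
    requests_per_s = row["r_s"] + row["w_s"]
    return {
        # wkB/s is a per-second rate, so multiply by the interval for a total.
        "written_mb_in_interval": row["wkB_s"] * INTERVAL_S / 1024,
        # Share of each second the device spent servicing requests.
        "util_pct": requests_per_s * row["svctm_ms"] / 1000 * 100,
        # Assumption: await = time queued + service time.
        "queue_wait_ms": row["await_ms"] - row["svctm_ms"],
    }


for name, row in ROWS.items():
    s = summarize(row)
    print(f"{name}: {s['written_mb_in_interval']:.1f} MB written in the interval, "
          f"util ~{s['util_pct']:.1f}%, queue wait ~{s['queue_wait_ms']:.0f} ms")
```

    Under those assumptions the computed utilization lands very close to the 28.60 and 100.00 that iostat printed, and nearly all of the await is time spent sitting in the queue (avgqu-sz) rather than time the disk spent on each request.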

    Anyway, there's clearly a problem. I looked at sudo iotop -d10 and hit space until I got an interval with 50 MB/s of writes, then looked over the process list for the culprit. Interestingly, one php-cgi process was doing all the writing: it wasn't MySQL's fault. Unfortunately, I'm not sure how to figure out what that process was doing at the time, like what files it was actually writing to.
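
    One possible way to see which files a process has open for writing, as a sketch: on Linux, /proc/<pid>/fd holds one symlink per open descriptor, and /proc/<pid>/fdinfo/<fd> records the open flags in octal. The function name `writable_files` and the `proc_root` parameter are invented for this example, and looking at another user's process (such as php-cgi) would need root. It only shows what is open for writing at the moment you look, not how much is being written to each file.

```python
import os

# Low two bits of the open flags on Linux:
# 0 = read-only, 1 = write-only, 2 = read-write.
ACCESS_MODE_MASK = 0o3


def writable_files(pid, proc_root="/proc"):
    """Return sorted paths of files that process `pid` has open for writing."""
    pid_dir = os.path.join(proc_root, str(pid))
    paths = set()
    for fd in os.listdir(os.path.join(pid_dir, "fd")):
        try:
            target = os.readlink(os.path.join(pid_dir, "fd", fd))
            with open(os.path.join(pid_dir, "fdinfo", fd)) as info:
                fields = dict(line.split(":", 1) for line in info if ":" in line)
            flags = int(fields["flags"].strip(), 8)
        except (OSError, KeyError, ValueError):
            continue  # descriptor went away or has an unexpected format
        # Sockets and pipes have targets like "socket:[1234]"; skip them.
        if target.startswith("/") and (flags & ACCESS_MODE_MASK) != 0:
            paths.add(target)
    return sorted(paths)
```

    Calling `writable_files(pid)` with the PID that iotop flagged, while the heavy writing is going on, should narrow down the candidates.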

    The problem seems to have vanished for the moment, but I expect it will reoccur at some point. I can't think of any reason why a single php-cgi process would ever be doing 50 MB of writes at once. php-cgi is where the PHP programs execute from, like vBulletin or MediaWiki. Maybe it was an upload of some kind, or an attachment gone awry?

    If we're having too many writes, the best solution I can think of offhand is to upgrade to a more recent version of Ubuntu (like 9.10, which was just released) and then switch from ext3 to ext4 for our filesystem. ext4 should have much better performance under occasional heavy writes like this: ext3 more or less flushes all writes every five seconds by default, while ext4 will keep them in memory for much longer, only flushing them when it's convenient and won't disrupt anything.

    It might also be possible to improve performance on ext3 by reconfiguring it. I would poke at that, but it looks like any significant reconfiguration would require downtime to unmount and remount the filesystems, which probably isn't worth it.
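
    Before changing anything, it may help to check what the filesystems are currently mounted as. The sketch below parses text in the format of /proc/mounts (device, mount point, type, comma-separated options) and lists the ext3/ext4 entries. The function names are made up for this example. As far as I know the five-second flush corresponds to ext3's commit= mount option, and an entry that does not list commit= is probably just using the default.

```python
def ext_mounts(mounts_text):
    """Yield (device, mount_point, fs_type, options) for ext3/ext4 entries."""
    for line in mounts_text.splitlines():
        parts = line.split()
        # /proc/mounts fields: device, mount point, type, options, dump, pass.
        if len(parts) >= 4 and parts[2] in ("ext3", "ext4"):
            device, mount_point, fs_type, options = parts[:4]
            yield device, mount_point, fs_type, options.split(",")


def read_proc_mounts(path="/proc/mounts"):
    """Read the kernel's list of mounted filesystems as text."""
    with open(path) as f:
        return f.read()
```

    On the server, `for entry in ext_mounts(read_proc_mounts()): print(entry)` would show the current type and options for each ext3 or ext4 filesystem.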

    sdb seems to be doing a lot worse than sda, incidentally. Are they identical hardware?
    MediaWiki developer, TWC Chief Technician
    NetHack player (nao info)


    Risen from Prey

  2. #2
    GrnEyedDvl, Barackolypse Now

    Quote (originally posted by Simetrical):
        The problem seems to have vanished for the moment, but I expect it will reoccur at some point. I can't think of any reason why a single php-cgi process would ever be doing 50 MB of writes at once. php-cgi is where the PHP programs execute from, like vBulletin or MediaWiki. Maybe it was an upload of some kind, or an attachment gone awry?

    Maybe. If it was something like that then I doubt it was an actual file upload. File 894 was uploaded 6 days ago. File 895 was uploaded today but it's not that big (1.64 MB). I guess it's possible it got stuck, but I just downloaded it and everything worked normally. I didn't dig through file attachments for today.

    Quote (originally posted by Simetrical):
        If we're having too many writes, the best solution I can think of offhand is to upgrade to a more recent version of Ubuntu (like 9.10, which was just released) and then switch from ext3 to ext4 for our filesystem. ext4 should have much better performance under occasional heavy writes like this: ext3 more or less flushes all writes every five seconds by default, while ext4 will keep them in memory for much longer, only flushing them when it's convenient and won't disrupt anything.

        It might also be possible to improve performance on ext3 by reconfiguring it. I would poke at that, but it looks like any significant reconfiguration would require downtime to unmount and remount the filesystems, which probably isn't worth it.

    Given a choice I would say upgrade to 9.10 instead of just trying to reconfigure ext3. Probably more benefit for the amount of time involved. We knew we were going to have another upgrade at some point anyways.

    Quote (originally posted by Simetrical):
        sdb seems to be doing a lot worse than sda, incidentally. Are they identical hardware?

    They are. Maybe one is about to fail. I wouldn't expect it after less than a year online but it's certainly possible, and drives are cheap if that is the case. SMART reporting is enabled on the drives, so we can use smartmontools or something similar to get reports on the drives without shutting the machine down and checking it from the BIOS. I went ahead and installed that. Neither drive is reporting any errors.

    Report for sda:
    sudo smartctl -a /dev/sda
    Code:
    === START OF INFORMATION SECTION ===
    Device Model:     WDC WD5000AACS-00ZUB0
    Serial Number:    WD-WCASU6756871
    Firmware Version: 01.01B01
    User Capacity:    500,107,862,016 bytes
    Device is:        Not in smartctl database [for details use: -P showall]
    ATA Version is:   8
    ATA Standard is:  Exact ATA specification draft version not indicated
    Local Time is:    Fri Nov 13 15:52:04 2009 MST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    General SMART Values:
    Offline data collection status:  (0x84) Offline data collection activity
                                            was suspended by an interrupting command from host.
                                            Auto Offline Data Collection: Enabled.
    Self-test execution status:      (   0) The previous self-test routine completed
                                            without error or no self-test has ever 
                                            been run.
    Total time to complete Offline 
    data collection:                 (13200) seconds.
    Offline data collection
    capabilities:                    (0x7b) SMART execute Offline immediate.
                                            Auto Offline data collection on/off support.
                                            Suspend Offline collection upon new
                                            command.
                                            Offline surface scan supported.
                                            Self-test supported.
                                            Conveyance Self-test supported.
                                            Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                                            General Purpose Logging supported.
    Short self-test routine 
    recommended polling time:        (   2) minutes.
    Extended self-test routine
    recommended polling time:        ( 154) minutes.
    Conveyance self-test routine
    recommended polling time:        (   5) minutes.
    SCT capabilities:              (0x303f) SCT Status supported.
                                            SCT Feature Control supported.
                                            SCT Data Table supported.
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
      3 Spin_Up_Time            0x0003   171   168   021    Pre-fail  Always       -       4416
      4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       145
      5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000e   200   200   051    Old_age   Always       -       0
      9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       6872
     10 Spin_Retry_Count        0x0012   100   100   051    Old_age   Always       -       0
     11 Calibration_Retry_Count 0x0012   100   100   051    Old_age   Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       145
    192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       117
    193 Load_Cycle_Count        0x0032   183   183   000    Old_age   Always       -       53082
    194 Temperature_Celsius     0x0022   120   106   000    Old_age   Always       -       27
    196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
    200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0
    SMART Error Log Version: 1
    No Errors Logged
    SMART Self-test log structure revision number 1
    No self-tests have been logged.  [To run self-tests, use: smartctl -t]
     
    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.



    Report for sdb:
    smartctl -a /dev/sdb
    Code:
    smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
    Home page is http://smartmontools.sourceforge.net/
    === START OF INFORMATION SECTION ===
    Device Model:     WDC WD5000AACS-00ZUB0
    Serial Number:    WD-WCASU6725445
    Firmware Version: 01.01B01
    User Capacity:    500,107,862,016 bytes
    Device is:        Not in smartctl database [for details use: -P showall]
    ATA Version is:   8
    ATA Standard is:  Exact ATA specification draft version not indicated
    Local Time is:    Fri Nov 13 15:54:50 2009 MST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    General SMART Values:
    Offline data collection status:  (0x84) Offline data collection activity
                                            was suspended by an interrupting command from host.
                                            Auto Offline Data Collection: Enabled.
    Self-test execution status:      (   0) The previous self-test routine completed
                                            without error or no self-test has ever 
                                            been run.
    Total time to complete Offline 
    data collection:                 (13560) seconds.
    Offline data collection
    capabilities:                    (0x7b) SMART execute Offline immediate.
                                            Auto Offline data collection on/off support.
                                            Suspend Offline collection upon new
                                            command.
                                            Offline surface scan supported.
                                            Self-test supported.
                                            Conveyance Self-test supported.
                                            Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                                            General Purpose Logging supported.
    Short self-test routine 
    recommended polling time:        (   2) minutes.
    Extended self-test routine
    recommended polling time:        ( 158) minutes.
    Conveyance self-test routine
    recommended polling time:        (   5) minutes.
    SCT capabilities:              (0x303f) SCT Status supported.
                                            SCT Feature Control supported.
                                            SCT Data Table supported.
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
      3 Spin_Up_Time            0x0003   163   156   021    Pre-fail  Always       -       4850
      4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       145
      5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000e   200   200   051    Old_age   Always       -       0
      9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       6831
     10 Spin_Retry_Count        0x0012   100   100   051    Old_age   Always       -       0
     11 Calibration_Retry_Count 0x0012   100   100   051    Old_age   Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       145
    192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       171
    193 Load_Cycle_Count        0x0032   184   184   000    Old_age   Always       -       50957
    194 Temperature_Celsius     0x0022   122   109   000    Old_age   Always       -       25
    196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
    200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0
    SMART Error Log Version: 1
    No Errors Logged
    SMART Self-test log structure revision number 1
    No self-tests have been logged.  [To run self-tests, use: smartctl -t]
     
    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.
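
    Since the two reports are long, a quick way to spot where the drives differ is to compare the raw values of the attribute tables. This is a rough sketch: the parser assumes the ten-column layout shown above, the function names are invented for the example, and only a few rows from each report are pasted in to keep it short.

```python
# A few attribute rows copied from the two reports above.
SDA_ROWS = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       6872
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       117
193 Load_Cycle_Count        0x0032   183   183   000    Old_age   Always       -       53082
"""
SDB_ROWS = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       6831
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       171
193 Load_Cycle_Count        0x0032   184   184   000    Old_age   Always       -       50957
"""


def parse_attributes(report):
    """Map attribute name -> raw value (as text) from smartctl attribute rows."""
    attrs = {}
    for line in report.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(parts) >= 10 and parts[0].isdigit():
            attrs[parts[1]] = parts[9]
    return attrs


def differing(report_a, report_b):
    """Return {name: (raw_a, raw_b)} for attributes whose raw values differ."""
    a, b = parse_attributes(report_a), parse_attributes(report_b)
    return {name: (a[name], b[name])
            for name in a if name in b and a[name] != b[name]}


for name, (raw_a, raw_b) in differing(SDA_ROWS, SDB_ROWS).items():
    print(f"{name}: sda={raw_a} sdb={raw_b}")
```

    Run against the full reports, the only raw values that differ are things like spin-up time, power-on hours, retract and load-cycle counts, and temperature. None of the error counters differ, which fits with neither drive reporting errors.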



    Commands for using smartmontools are as follows:
    Code:
    Usage: smartctl [options] device
    ============================================ SHOW INFORMATION OPTIONS =====
      -h, --help, --usage
             Display this help and exit
      -V, --version, --copyright, --license
             Print license, copyright, and version information and exit
      -i, --info                                                       
             Show identity information for device
      -a, --all                                                        
             Show all SMART information for device
    ================================== SMARTCTL RUN-TIME BEHAVIOR OPTIONS =====
      -q TYPE, --quietmode=TYPE                                           (ATA)
             Set smartctl quiet mode to one of: errorsonly, silent, noserial
      -d TYPE, --device=TYPE
             Specify device type to one of: ata, scsi, marvell, sat, 3ware,N
      -T TYPE, --tolerance=TYPE                                           (ATA)
             Tolerance: normal, conservative, permissive, verypermissive
      -b TYPE, --badsum=TYPE                                              (ATA)
             Set action on bad checksum to one of: warn, exit, ignore
      -r TYPE, --report=TYPE
             Report transactions (see man page)
      -n MODE, --nocheck=MODE                                             (ATA)
             No check if: never, sleep, standby, idle (see man page)
    ============================== DEVICE FEATURE ENABLE/DISABLE COMMANDS =====
      -s VALUE, --smart=VALUE
            Enable/disable SMART on device (on/off)
      -o VALUE, --offlineauto=VALUE                                       (ATA)
            Enable/disable automatic offline testing on device (on/off)
      -S VALUE, --saveauto=VALUE                                          (ATA)
            Enable/disable Attribute autosave on device (on/off)
    ======================================= READ AND DISPLAY DATA OPTIONS =====
      -H, --health
            Show device SMART health status
      -c, --capabilities                                                  (ATA)
            Show device SMART capabilities
      -A, --attributes                                                         
            Show device SMART vendor-specific Attributes and values
      -l TYPE, --log=TYPE
            Show device log. TYPE: error, selftest, selective, directory,
                                   background, scttemp[sts,hist]
      -v N,OPTION , --vendorattribute=N,OPTION                            (ATA)
            Set display OPTION for vendor Attribute N (see man page)
      -F TYPE, --firmwarebug=TYPE                                         (ATA)
            Use firmware bug workaround: none, samsung, samsung2,
                                         samsung3, swapid
      -P TYPE, --presets=TYPE                                             (ATA)
            Drive-specific presets: use, ignore, show, showall
    ============================================ DEVICE SELF-TEST OPTIONS =====
      -t TEST, --test=TEST
            Run test. TEST: offline short long conveyance select,M-N
                            pending,N afterselect,[on|off] scttempint,N[,p]
      -C, --captive
            Do test in captive mode (along with -t)
      -X, --abort
            Abort any non-captive test on device
    =================================================== SMARTCTL EXAMPLES =====
      smartctl --all /dev/hda                    (Prints all SMART information)
      smartctl --smart=on --offlineauto=on --saveauto=on /dev/hda
                                                  (Enables SMART on first disk)
      smartctl --test=long /dev/hda          (Executes extended disk self-test)
      smartctl --attributes --log=selftest --quietmode=errorsonly /dev/hda
                                          (Prints Self-Test & Attribute errors)
      smartctl --all --device=3ware,2 /dev/sda
      smartctl --all --device=3ware,2 /dev/twe0
      smartctl --all --device=3ware,2 /dev/twa0
              (Prints all SMART info for 3rd ATA disk on 3ware RAID controller)
      smartctl --all --device=hpt,1/1/3 /dev/sda
              (Prints all SMART info for the SATA disk attached to the 3rd PMPort
               of the 1st channel on the 1st HighPoint RAID controller)



    We probably need to run an extended test on both drives but I am not going to do that on a Friday night.

  3. #3
    Simetrical, Former Chief Technician

    Okay, this is happening again and getting worse. I'll try to remember to ask some people I know how you can figure out what files are actually being written to heavily. Failing that, upgrading to 9.04 and remounting the filesystems as ext4 might cause the problem to disappear completely. The catch is that ext4, on its default settings, buffers writes much more heavily in memory: that gives much better write performance, but if the machine crashes you can lose up to two and a half minutes of data instead of five seconds. Worth it for us, I think.

    Incidentally, if you need to get anything done when the machine is this slow, you can use sudo renice -20 $$. That will set your terminal to -20, the lowest nice level/highest priority. But be careful; every process you start will also get this priority, so if you do /etc/init.d/lighttpd restart, say, you might make lighttpd and all the php-cgi processes super high-priority, which could cause trouble. renice 0 $$ will reverse the effect (you don't need sudo to restore your priority to normal).
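
    To illustrate the inheritance point, here is a small sketch of starting a command at an explicit nice value instead of whatever the calling shell happens to have. It assumes a Unix system; `run_with_niceness` and `report_niceness_argv` are names made up for this example. Making a process nicer (a higher number) does not need root, which is why going from -20 back to 0 works without sudo.

```python
import os
import subprocess
import sys


def run_with_niceness(argv, niceness):
    """Run argv at the given nice value rather than inheriting the caller's."""
    def set_priority():
        # Runs in the child just before exec, so only the child is affected.
        os.setpriority(os.PRIO_PROCESS, 0, niceness)

    return subprocess.run(argv, preexec_fn=set_priority,
                          capture_output=True, text=True, check=True)


def report_niceness_argv():
    """Command line for a child that just prints its own nice value."""
    code = "import os; print(os.getpriority(os.PRIO_PROCESS, 0))"
    return [sys.executable, "-c", code]
```

    From a shell that has been reniced to -20, something along the lines of `run_with_niceness(["/etc/init.d/lighttpd", "restart"], 0)` would be one way to keep the restarted daemon at normal priority.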

  4. #4
    Viking Prince, Horrible(ly cute)

    I do not know if this is related or not. I have attempted to edit the Eagle Standard as posted. I have some bad links within the first part that need to be replaced or at least deleted. The system times out on attempting to post the edit. When I attempt to edit Part 2 the edit seems to work OK, so I assume it is a file size issue.

    This was happening last night: at 4:36 AM MST, Part 2 was successfully edited. I attempted to edit Part 1 both before and after editing Part 2, and I tried again today just before posting here.
    Last edited by Viking Prince; November 30, 2009 at 03:12 PM.